
Hi,

This PR is submitted by @srividya-0001.

Issue:

Large PDFs are split into multiple chunks during processing, which results in multiple files with page-range suffixes. These chunks were being exported and evaluated as separate documents instead of being treated as a single original document.
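For context, chunked files typically carry a page-range suffix in their names. As an illustration only (the repo's actual naming convention may differ), a pattern like `report_pages_1-50.pdf` can be detected with a small regex:

```python
import re

# Hypothetical naming convention for illustration: a chunk file named
# "report_pages_1-50.pdf" has the stem "report_pages_1-50".
CHUNK_SUFFIX = re.compile(r"^(?P<stem>.+)_pages_(?P<start>\d+)-(?P<end>\d+)$")

def split_chunk_name(stem: str):
    """Return (original_stem, start_page) if the stem looks chunked, else None."""
    m = CHUNK_SUFFIX.match(stem)
    if m:
        return m.group("stem"), int(m.group("start"))
    return None
```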

This PR addresses this by (a rough sketch follows the list):

  • Detecting chunked files that belong to the same source document
  • Merging chunk contents in correct page order
  • Aggregating metadata across chunks
  • Generating a stable doc_id from the original filename
  • Ensuring source metadata resolution works correctly for chunked inputs
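
A minimal sketch of how these steps could fit together, assuming the hypothetical `split_chunk_name` helper from above and plain-text chunk contents (the real export pipeline will differ in detail):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def stable_doc_id(original_name: str) -> str:
    # Deterministic doc_id derived from the original filename, so re-exports
    # of the same source document always produce the same id.
    return hashlib.sha256(original_name.encode("utf-8")).hexdigest()[:16]

def merge_chunks(paths):
    """Group chunk files by original stem and merge contents in page order."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        parsed = split_chunk_name(path.stem)  # from the detection sketch above
        if parsed:
            stem, start = parsed
            groups[stem].append((start, path))
        else:
            groups[path.stem].append((0, path))  # unchunked file: single entry

    merged = {}
    for stem, chunks in groups.items():
        chunks.sort(key=lambda item: item[0])  # ascending start page restores order
        text = "\n".join(p.read_text(encoding="utf-8") for _, p in chunks)
        merged[stem] = {
            "doc_id": stable_doc_id(stem),
            "text": text,
            "num_chunks": len(chunks),  # example of aggregated metadata
        }
    return merged
```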

This addresses the chunk-handling issue raised in #73.

Testing:
I ran the JSONL export tests locally; all related tests pass, including cases involving chunked documents, which confirms that the merged output matches the expected structure.
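
As an illustration of the behavior being verified (using the hypothetical `merge_chunks` sketch above, not the project's actual test suite):

```python
def test_chunks_merge_into_single_document(tmp_path):
    # Two page-range chunks of the same source should collapse into one
    # document with a single stable doc_id.
    (tmp_path / "report_pages_1-50.txt").write_text("first half")
    (tmp_path / "report_pages_51-100.txt").write_text("second half")

    merged = merge_chunks(sorted(tmp_path.glob("*.txt")))

    assert len(merged) == 1
    assert merged["report"]["text"] == "first half\nsecond half"
    assert merged["report"]["num_chunks"] == 2
```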

If there is anything I should improve, refactor, or adjust to better fit the project direction, please let me know; I'd be very happy to work on it.

Thanks again for your guidance and for reviewing this PR.

Ensure large documents split into page-range chunks are merged back
into a single document during JSONL export, with aggregated metadata
and stable doc_id generation.