
Hi,

This PR is submitted by @srividya-0001.

Issue:

Large PDFs are split into multiple chunks during processing, which results in multiple files with page-range suffixes. These chunks were being exported and evaluated as separate documents instead of being treated as a single original document.
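For context, chunked files typically carry a page-range suffix in their names. As an illustration only (the repo's actual naming convention may differ), a pattern like `report_pages_1-50.pdf` can be detected with a small regex:

```python
import re

# Hypothetical naming convention for illustration: a chunk file named
# "report_pages_1-50.pdf" has the stem "report_pages_1-50".
CHUNK_SUFFIX = re.compile(r"^(?P<stem>.+)_pages_(?P<start>\d+)-(?P<end>\d+)$")

def split_chunk_name(stem: str):
    """Return (original_stem, start_page) if the stem looks chunked, else None."""
    m = CHUNK_SUFFIX.match(stem)
    if m:
        return m.group("stem"), int(m.group("start"))
    return None
```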

This PR addresses this by (a rough sketch follows the list):

  • Detecting chunked files that belong to the same source document
  • Merging chunk contents in correct page order
  • Aggregating metadata across chunks
  • Generating a stable doc_id from the original filename
  • Ensuring source metadata resolution works correctly for chunked inputs
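
A minimal sketch of how these steps could fit together, assuming the hypothetical `split_chunk_name` helper from above and plain-text chunk contents (the real export pipeline will differ in detail):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def stable_doc_id(original_name: str) -> str:
    # Deterministic doc_id derived from the original filename, so re-exports
    # of the same source document always produce the same id.
    return hashlib.sha256(original_name.encode("utf-8")).hexdigest()[:16]

def merge_chunks(paths):
    """Group chunk files by original stem and merge contents in page order."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        parsed = split_chunk_name(path.stem)  # from the detection sketch above
        if parsed:
            stem, start = parsed
            groups[stem].append((start, path))
        else:
            groups[path.stem].append((0, path))  # unchunked file: single entry

    merged = {}
    for stem, chunks in groups.items():
        chunks.sort(key=lambda item: item[0])  # ascending start page restores order
        text = "\n".join(p.read_text(encoding="utf-8") for _, p in chunks)
        merged[stem] = {
            "doc_id": stable_doc_id(stem),
            "text": text,
            "num_chunks": len(chunks),  # example of aggregated metadata
        }
    return merged
```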

This addresses the chunk-handling issue raised in #73.

Testing:
I ran the JSONL export tests locally; all related tests pass, including cases involving chunked documents, which confirms that the merged output matches the expected structure.
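
As an illustration of the behavior being verified (using the hypothetical `merge_chunks` sketch above, not the project's actual test suite):

```python
def test_chunks_merge_into_single_document(tmp_path):
    # Two page-range chunks of the same source should collapse into one
    # document with a single stable doc_id.
    (tmp_path / "report_pages_1-50.txt").write_text("first half")
    (tmp_path / "report_pages_51-100.txt").write_text("second half")

    merged = merge_chunks(sorted(tmp_path.glob("*.txt")))

    assert len(merged) == 1
    assert merged["report"]["text"] == "first half\nsecond half"
    assert merged["report"]["num_chunks"] == 2
```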

If there is anything I should improve, refactor, or adjust to better fit the project direction, please let me know; I'd be very happy to work on it.

Thanks again for your guidance and for reviewing this PR.

Ensure large documents split into page-range chunks are merged back
into a single document during JSONL export, with aggregated metadata
and stable doc_id generation.