Merge chunked documents during JSONL export #81
Hi,
This PR is submitted by @srividya-0001.
Issue:
Large PDFs are split into multiple chunks during processing, which results in multiple files with page-range suffixes. These chunks were being exported and evaluated as separate documents instead of being treated as a single original document.
This PR ensures that:
doc_id is derived from the original filename

This addresses the chunk-handling issue raised in #73.
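For reviewers, here is a minimal sketch of the merging idea, not the code in this PR. It assumes chunk filenames end in a page-range suffix of the form `_pages_<start>-<end>` and that each record has `filename` and `text` fields. The suffix pattern, the field names, and the helpers `split_chunk_name`, `merge_chunks`, and `to_jsonl` are all hypothetical and would need to match what the project actually produces.

```python
import json
import re
from collections import defaultdict

# Assumed suffix format (hypothetical): "report_pages_1-50.pdf" -> doc_id "report"
CHUNK_SUFFIX = re.compile(r"_pages_(\d+)-(\d+)$")


def split_chunk_name(filename):
    """Return (doc_id, first_page); first_page is 0 for unchunked files."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension, if any
    match = CHUNK_SUFFIX.search(stem)
    if match is None:
        return stem, 0
    return stem[:match.start()], int(match.group(1))


def merge_chunks(records):
    """Group chunk records by doc_id and join their text in page order."""
    grouped = defaultdict(list)
    for record in records:
        doc_id, first_page = split_chunk_name(record["filename"])
        grouped[doc_id].append((first_page, record["text"]))

    merged = []
    for doc_id, parts in grouped.items():
        parts.sort(key=lambda part: part[0])  # order chunks by starting page
        text = "\n".join(chunk_text for _, chunk_text in parts)
        merged.append({"doc_id": doc_id, "text": text})
    return merged


def to_jsonl(records):
    """Serialize merged documents as one JSON object per line."""
    return "\n".join(json.dumps(doc) for doc in merge_chunks(records))
```

With this approach, two chunks of the same PDF produce a single JSONL line keyed by the original filename's stem, so evaluation sees one document rather than one per chunk.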
Testing:
I ran the JSONL export tests locally, and all related tests are passing, including cases involving chunked documents. This helped confirm that the merged output matches the expected structure.
If there is anything I should improve, refactor, or adjust to better fit the project direction, please let me know — I’d be very happy to work on it.
Thanks again for your guidance and for reviewing this PR.