GH 3421 : Fix silent failures in PDF outline processing #3623
…line items Signed-off-by: WOONBE <[email protected]>
…h imperfect outlines and coordinate edge cases Signed-off-by: WOONBE <[email protected]>
…d Outline Signed-off-by: WOONBE <[email protected]>
Signed-off-by: WOONBE <[email protected]>
Please rearrange the import statements in the following order:
- java.* packages
- Other packages
- org.springframework.* packages
- Static imports
For reference, you can refer to the example in this file: OpenAiApiIT.java.
public String getTextBetweenParagraphs(Paragraph fromParagraph, Paragraph toParagraph) {

	if (fromParagraph.startPageNumber() < 1) {
Is this an intended line break?
if (h < 0)
	h = 0;
Please add curly braces to the if statement.
Signed-off-by: WOONBE <[email protected]>
Signed-off-by: WOONBE <[email protected]>
Please update the Copyright Notice year in the header of all modified files from 2023-2024 to 2023-2025.
Signed-off-by: WOONBE <[email protected]>
@WOONBE Thanks for the PR and @dev-jonghoonpark Thanks for the review! Rebased and merged as d92a2ea
As identified in issue #3421, the ParagraphPdfDocumentReader is not resilient to common imperfections in PDF outline structures. This can lead to an IndexOutOfBoundsException or, more critically, silent failures where valid sections of the document are skipped, resulting in an empty or incomplete list of documents.

This PR includes the following changes:
- Corrects the iteration logic in the get() method by replacing the buggy iterator-based loop, which failed to process the last outline item, with an indexed for loop that guarantees every paragraph is processed correctly.
- Adds a defensive fallback for page range calculation in getTextBetweenParagraphs().
- Refactors the text extraction area calculation, replacing the flawed logic that produced zero-height rectangles.
- Adds a unit test that reproduces the exact condition that caused the IndexOutOfBoundsException.

Fixed #3421
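
For readers who want to see the class of bug behind the first change, here is a minimal, self-contained sketch. It is not the code from this PR: Section, Range, buggyPairs and indexedPairs are hypothetical names, and the iterator loop shown is only one plausible shape of a loop that drops the last outline item. It assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class OutlineLoopSketch {

    // Hypothetical stand-in for an outline entry; not the real Paragraph type.
    public record Section(String title, int startPage) {}

    // Hypothetical pairing of a section with its successor (null for the last one).
    public record Range(Section from, Section to) {}

    // One plausible iterator-based loop: a range is emitted only when a
    // successor exists, so the final section is silently skipped.
    public static List<Range> buggyPairs(List<Section> sections) {
        List<Range> out = new ArrayList<>();
        Iterator<Section> it = sections.iterator();
        if (!it.hasNext()) {
            return out;
        }
        Section current = it.next();
        while (it.hasNext()) {
            Section next = it.next();
            out.add(new Range(current, next));
            current = next;
        }
        return out; // the last value of `current` was never emitted
    }

    // Indexed loop: every section yields exactly one range; the last one
    // simply has no successor.
    public static List<Range> indexedPairs(List<Section> sections) {
        List<Range> out = new ArrayList<>();
        for (int i = 0; i < sections.size(); i++) {
            Section from = sections.get(i);
            Section to = (i + 1 < sections.size()) ? sections.get(i + 1) : null;
            out.add(new Range(from, to));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Section> sections = List.of(
                new Section("Intro", 1),
                new Section("Body", 3),
                new Section("Appendix", 9));
        System.out.println("iterator loop ranges: " + buggyPairs(sections).size());
        System.out.println("indexed loop ranges:  " + indexedPairs(sections).size());
    }
}
```

With three sections, the iterator variant yields two ranges and the indexed variant yields three, which is the kind of silently missing final section described above.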