@WOONBE
Contributor

@WOONBE WOONBE commented Jun 20, 2025

As identified in issue #3421, the ParagraphPdfDocumentReader is not resilient to common imperfections in PDF outline structures. This can lead to an IndexOutOfBoundsException or, more critically, silent failures where valid sections of the document are skipped, resulting in an empty or incomplete list of documents.

This PR includes the following changes:

  • Corrects the iteration logic in the get() method by replacing the buggy iterator-based loop, which failed to process the last outline item, with an indexed for loop that guarantees every paragraph is processed correctly.

  • Adds a defensive fallback for page range calculation in getTextBetweenParagraphs().

  • Refactors the text extraction area calculation, replacing the flawed logic that produced zero-height rectangles.

  • Adds a unit test that reproduces the exact condition that caused the IndexOutOfBoundsException.
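
The iteration fix in the first bullet can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: the `Section` record, the `pairwise` and `indexed` helpers, and the sample outline are stand-ins and are not taken from ParagraphPdfDocumentReader. It only shows one common shape of this kind of bug, where a loop that walks an iterator in (current, next) pairs never emits the last item, while an indexed loop visits every item and falls back to an assumed last page when there is no successor.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class OutlineIterationSketch {

    // Hypothetical stand-in for an outline entry; not the library's Paragraph type.
    record Section(String title, int startPage) {}

    // Pairwise iterator loop: one range per (current, next) pair,
    // so the final section is never emitted as "current".
    static List<String> pairwise(List<Section> sections) {
        List<String> ranges = new ArrayList<>();
        Iterator<Section> it = sections.iterator();
        if (!it.hasNext()) {
            return ranges;
        }
        Section current = it.next();
        while (it.hasNext()) {
            Section next = it.next();
            ranges.add(current.title() + ":" + current.startPage() + "-" + next.startPage());
            current = next;
        }
        return ranges;
    }

    // Indexed loop: every section is visited; the last one ends at lastPage
    // (an assumed fallback for illustration).
    static List<String> indexed(List<Section> sections, int lastPage) {
        List<String> ranges = new ArrayList<>();
        for (int i = 0; i < sections.size(); i++) {
            Section current = sections.get(i);
            int end = (i + 1 < sections.size()) ? sections.get(i + 1).startPage() : lastPage;
            ranges.add(current.title() + ":" + current.startPage() + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Section> outline = List.of(
                new Section("Intro", 1),
                new Section("Usage", 3),
                new Section("Appendix", 7));
        System.out.println(pairwise(outline));   // two ranges, "Appendix" is missing
        System.out.println(indexed(outline, 9)); // three ranges, "Appendix" included
    }
}
```

With the sample outline, `pairwise` returns two ranges and omits "Appendix", while `indexed` returns all three.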

Fixed #3421

@WOONBE WOONBE closed this Jun 20, 2025
@WOONBE WOONBE reopened this Jun 20, 2025
@WOONBE WOONBE closed this Jun 20, 2025
@WOONBE WOONBE reopened this Jun 20, 2025

Please rearrange the import statements in the following order:

  1. java.* packages
  2. Other packages
  3. org.springframework.* packages
  4. Static imports

For an example of this ordering, see OpenAiApiIT.java.

public String getTextBetweenParagraphs(Paragraph fromParagraph, Paragraph toParagraph) {

if (fromParagraph.startPageNumber() < 1) {

Is this an intended line break?

Comment on lines 224 to 225
if (h < 0)
h = 0;

Please add curly braces to the if statement
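
For illustration, the braced form of the reviewed lines might look like the sketch below. The `RectHeightSketch` class and `clampHeight` helper are hypothetical and exist only so the example compiles on its own. The `int` type for `h` is also an assumption. Only the variable name `h` and the `h < 0` check come from the lines under review.

```java
public class RectHeightSketch {

    // Hypothetical helper: clamps a computed rectangle height so it is never negative.
    static int clampHeight(int h) {
        if (h < 0) {
            h = 0;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(clampHeight(-5)); // negative input is clamped to 0
        System.out.println(clampHeight(12)); // non-negative input is unchanged
    }
}
```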

@dev-jonghoonpark
Contributor

Please update the Copyright Notice year in the header of all modified files from 2023-2024 to 2023-2025.

@ilayaperumalg ilayaperumalg self-assigned this Jun 24, 2025
@ilayaperumalg ilayaperumalg added the bug and for: backport-to-1.0.x labels Jun 24, 2025
@ilayaperumalg ilayaperumalg added this to the 1.1.x milestone Jun 24, 2025
@ilayaperumalg
Member

@WOONBE, thanks for the PR, and @dev-jonghoonpark, thanks for the review!

Rebased and merged as d92a2ea

Development

Successfully merging this pull request may close these issues.

ParagraphPdfDocumentReader throws java.lang.IndexOutOfBoundsException
