ParagraphPdfDocumentReader throws java.lang.IndexOutOfBoundsException #3421

@torakiki

I'm using Spring Boot 3.4.5 with Spring AI 1.0.0. On a particular PDF document the ParagraphPdfDocumentReader throws:

java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index out of bounds: -1
	at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.getTextBetweenParagraphs(ParagraphPdfDocumentReader.java:248)
	at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.toDocument(ParagraphPdfDocumentReader.java:161)
	at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.get(ParagraphPdfDocumentReader.java:147)
	at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.get(ParagraphPdfDocumentReader.java:50)
	at org.springframework.ai.document.DocumentReader.read(DocumentReader.java:25)
	at org.pdfsam.spec.agent.service.DefaultPdfLoader.loadPdf(DefaultPdfLoader.java:59)
	at org.pdfsam.spec.agent.service.DefaultPdfLoader.loadPdfWithOutlineFrom(DefaultPdfLoader.java:54)
	at org.pdfsam.spec.agent.service.DefaultLoadService.loadPDFFilesWithOutline(DefaultLoadService.java:82)
	at org.pdfsam.spec.agent.service.DefaultLoadService.loadUnprocessed(DefaultLoadService.java:61)
	at org.pdfsam.spec.agent.ETLApplication.lambda$commandLineRunner$0(ETLApplication.java:43)
	at org.springframework.boot.SpringApplication.lambda$callRunner$5(SpringApplication.java:789)
	at org.springframework.util.function.ThrowingConsumer$1.acceptWithException(ThrowingConsumer.java:82)
	at org.springframework.util.function.ThrowingConsumer.accept(ThrowingConsumer.java:60)
	at org.springframework.util.function.ThrowingConsumer$1.accept(ThrowingConsumer.java:86)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:797)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:788)
	at org.springframework.boot.SpringApplication.lambda$callRunners$3(SpringApplication.java:773)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:186)
	at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:357)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:571)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:153)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:176)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:636)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:773)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:325)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1362)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1351)
	at org.pdfsam.spec.agent.ETLApplication.main(ETLApplication.java:34)
Caused by: java.lang.IndexOutOfBoundsException: Index out of bounds: -1
	at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:299)
	at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:263)
	at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1220)
	at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.getTextBetweenParagraphs(ParagraphPdfDocumentReader.java:196)
	... 29 common frames omitted

The cause is an outline item that has no page destination (neither a Dest nor an A entry in its dictionary). Its start page ends up as -1, and the outline item is printed as:
Bla [-1,17], children = 0, pos = 0

I cannot share the PDF file, but I can create a sample that reproduces the problem if needed.
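
To illustrate the kind of guard that would avoid the exception, here is a minimal sketch. It is not Spring AI's actual code: `OutlineGuard`, `OutlineEntry` and `keepResolvable` are hypothetical names, and I'm assuming page indexes are zero-based with -1 meaning "no destination resolved". The idea is simply to skip outline entries whose start page cannot be mapped to a real page before asking the document for that page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the reader's internal outline model; not a Spring AI class.
final class OutlineGuard {

    // Assumption: startPage and endPage are zero-based page indexes,
    // and -1 marks an outline item without a resolvable destination.
    record OutlineEntry(String title, int startPage, int endPage) {

        // True only when the start page points at an existing page
        // and the end page does not precede it.
        boolean hasResolvedPages(int pageCount) {
            return startPage >= 0 && startPage < pageCount && endPage >= startPage;
        }
    }

    // Returns the entries that can be read safely, preserving their order.
    static List<OutlineEntry> keepResolvable(List<OutlineEntry> entries, int pageCount) {
        List<OutlineEntry> kept = new ArrayList<>();
        for (OutlineEntry entry : entries) {
            if (entry.hasResolvedPages(pageCount)) {
                kept.add(entry);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<OutlineEntry> entries = List.of(
                new OutlineEntry("Intro", 0, 16),
                new OutlineEntry("Bla", -1, 17),      // the item from this report
                new OutlineEntry("Appendix", 17, 19));
        // Only "Intro" and "Appendix" survive for a 20-page document.
        for (OutlineEntry entry : keepResolvable(entries, 20)) {
            System.out.println(entry.title());
        }
    }
}
```

Skipping the item is only one option; the reader could instead fall back to the parent item's page or log a warning, which may be preferable if dropping text silently is a concern.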
