Garbled extraction for Amazon Sustainability Report #3594

@gtmtech

Description of the bug

Extracting text from the Amazon Sustainability Report with PyMuPDF or PyMuPDF4LLM produces largely unreadable output: the beginnings of words are garbled and the spacing is wrong. Something about this PDF is not being parsed correctly.

I have found other extractors do extract this report correctly.

How to reproduce the bug

import fitz  # PyMuPDF

d = fitz.open("file.pdf")
for p in d.pages():  # iterate over every page
    print(p.get_text())  # default plain-text extraction

07ef2453.pdf
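
To help narrow down which pages and fonts are affected, a rough diagnostic like the following might be useful. It is only a sketch: `suspicious_ratio` and `report` are hypothetical helper names, and the heuristic assumes that garbling shows up as replacement (U+FFFD), private-use, unassigned, or control characters. That may not hold for this PDF if the bad glyphs map to ordinary letters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like failed glyph mappings.

    Heuristic (an assumption, not a PyMuPDF feature): count U+FFFD plus
    private-use (Co), unassigned (Cn), and control (Cc) characters.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")
    )
    return bad / len(chars)


def report(path: str) -> None:
    """Print, per page, the suspicious-character ratio and the fonts in use."""
    import fitz  # PyMuPDF, imported lazily so suspicious_ratio works without it

    doc = fitz.open(path)
    for page in doc:
        text = page.get_text()
        # get_fonts() yields tuples; index 3 is the base font name
        fonts = sorted({f[3] for f in page.get_fonts()})
        print(page.number, f"{suspicious_ratio(text):.1%}", fonts)
```

Calling `report("file.pdf")` prints one line per page. Pages with a high ratio, and any fonts they share, would be the first place to look.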

PyMuPDF version

1.24.2

Operating system

macOS

Python version

3.11
