extractText() extracts broken text from pdf #3186

@spagliarini

Description of the bug

Hi,

I noticed a bug in PyMuPDF versions >= 1.23.9 when using get_text to extract text from PDF documents.

To reproduce the bug

  • Consider the attached PDF file: test_file.pdf

  • Extract text using the code below (see "How to reproduce the bug")

  • To reproduce the correct behavior, install a PyMuPDF version < 1.23.9 (e.g., 1.23.8). This yields the following complete text: doc_text_1238.txt

  • To reproduce the buggy behavior, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24). This yields the following broken text: doc_text_12324.txt
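
To see exactly where the two outputs diverge, the two attached text files can be compared with a unified diff. This is only a sketch using the standard library: the helper names `diff_texts` and `diff_files` are made up for illustration, and the files are assumed to be readable as UTF-8.

```python
import difflib
from pathlib import Path


def diff_texts(old_text: str, new_text: str,
               old_label: str = "1.23.8", new_label: str = "1.23.24") -> list[str]:
    """Return unified-diff lines between two extracted texts (hypothetical helper)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


def diff_files(old_path: str, new_path: str) -> list[str]:
    """Read two saved extraction outputs and diff them (hypothetical helper)."""
    # errors="replace" keeps the comparison going if a file is not valid UTF-8.
    old_text = Path(old_path).read_text(encoding="utf-8", errors="replace")
    new_text = Path(new_path).read_text(encoding="utf-8", errors="replace")
    return diff_texts(old_text, new_text)
```

For example, `print("\n".join(diff_files("doc_text_1238.txt", "doc_text_12324.txt")))` would list the lines that differ between the two versions, assuming both attachments are saved in the working directory.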

ADDITIONAL NOTES

  • The buggy behavior is only observed on certain documents (e.g., the one attached above)
  • extractBLOCKS, extractWORDS and extractDICT work fine; the bug seems to appear only with extractTEXT
  • We tried on both Windows and Linux, and the bug occurs on both
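
One way to check the second note on a given file is to compare the plain-text output against the word-level output page by page. The sketch below assumes that `page.get_text("text")` and `page.get_text("words")` correspond to the extractTEXT and extractWORDS paths mentioned above. The helper names are hypothetical, and the token comparison is a rough heuristic, not a precise test of the bug.

```python
def words_missing_from_text(plain_text: str, words: list[str]) -> list[str]:
    """Return words from the word-level extraction that do not occur in the plain text."""
    text_tokens = set(plain_text.split())
    return [w for w in words if w not in text_tokens]


def check_document(pdf_path: str) -> dict[int, list[str]]:
    """Map page number -> words seen by get_text("words") but absent from get_text("text")."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    report = {}
    with fitz.open(pdf_path) as doc:
        for page in doc:
            plain = page.get_text("text")
            # Each entry of get_text("words") is a tuple; item 4 is the word string.
            words = [w[4] for w in page.get_text("words")]
            missing = words_missing_from_text(plain, words)
            if missing:
                report[page.number] = missing
    return report
```

An empty result from `check_document("test_file.pdf")` would suggest the two extraction modes agree on that file; a non-empty result points at the pages where the plain-text output lost content.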

Thank you for your help

How to reproduce the bug

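Extract text using the following code:

import fitz  # PyMuPDF

pdf_path = "test_file.pdf"  # the PDF attached above
fitz_doc = fitz.open(pdf_path)

# Collect the plain text of every page
doc_text = list()
for page in fitz_doc:
    doc_text.append(page.get_text())

doc_text = ' '.join(doc_text)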

To reproduce the buggy behavior, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24).

PyMuPDF version

1.23.24

Operating system

Windows

Python version

3.10
