Closed
Description of the bug
Hi,
I noticed a bug in PyMuPDF versions >= 1.23.9 when using get_text to extract text from PDF documents.
To reproduce the bug
- Consider the attached PDF file: test_file.pdf
- Extract text using the code below (see "How to reproduce the bug").
- To reproduce the correct behavior, install a PyMuPDF version < 1.23.9 (e.g., 1.23.8). We obtain the following complete text: doc_text_1238.txt
- To reproduce the bug behavior, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24). We obtain the following broken text: doc_text_12324.txt
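Before comparing outputs, it may help to confirm which side of the 1.23.9 boundary the current environment is on. The following is only a sketch: the helper `is_affected` and the constant `FIRST_AFFECTED` are hypothetical names, the threshold is taken from the versions reported here, and the parsing assumes a plain numeric version string such as `1.23.24`.

```python
from importlib.metadata import PackageNotFoundError, version

# First version in which the broken output was observed in this report (assumption).
FIRST_AFFECTED = (1, 23, 9)


def is_affected(version_string: str) -> bool:
    """Return True if version_string is >= 1.23.9 (hypothetical helper)."""
    # Assumes purely numeric components, e.g. "1.23.24".
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= FIRST_AFFECTED


if __name__ == "__main__":
    try:
        installed = version("PyMuPDF")
    except PackageNotFoundError:
        print("PyMuPDF is not installed")
    else:
        status = "affected" if is_affected(installed) else "not affected"
        print(installed, status)
```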
ADDITIONAL NOTES
- The bug behavior can only be observed on certain documents (e.g., the one attached above)
- extractBLOCKS, extractWORDS and extractDICT work fine; the bug seems to show only for extractTEXT
- We tried on both Windows and Linux, and the bug occurs on both
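Since word-level extraction appears to work while plain-text extraction does not, one rough way to locate the missing text is to check which words from `get_text("words")` never show up in `get_text()`. This is a sketch under assumptions, not a definitive diagnostic: `missing_words` and `report` are hypothetical helper names, and the comparison simply matches whitespace-separated tokens.

```python
from collections import Counter


def missing_words(plain_text, words):
    """Return words found by word-level extraction but absent from plain text.

    `words` is expected in the shape returned by page.get_text("words"):
    tuples whose fifth element (index 4) is the word string.
    """
    remaining = Counter(plain_text.split())
    missing = []
    for entry in words:
        word = entry[4]
        if remaining[word] > 0:
            remaining[word] -= 1  # consume one occurrence from the plain text
        else:
            missing.append(word)
    return missing


def report(pdf_path):
    """Print, per page, the words that plain-text extraction dropped."""
    import fitz  # PyMuPDF; imported here so missing_words works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            lost = missing_words(page.get_text(), page.get_text("words"))
            if lost:
                print(f"page {page.number}: {len(lost)} words missing, e.g. {lost[:10]}")
```

Calling `report("test_file.pdf")` under both 1.23.8 and 1.23.24 should make it easier to see which pages differ between the two versions.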
Thank you for your help
How to reproduce the bug
Extract text using the following code
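import fitz  # PyMuPDF

pdf_path = "test_file.pdf"  # path to the attached PDF
fitz_doc = fitz.open(pdf_path)
doc_text = []
for page in fitz_doc:
    # Default get_text() mode is plain text ("text")
    doc_text.append(page.get_text())
doc_text = ' '.join(doc_text)
To reproduce the bug behavior, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24).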
PyMuPDF version
1.23.24
Operating system
Windows
Python version
3.10