Description of the bug
I found a bug in doc[0].get_text('words') that occurs when the page contains an unreadable character.
The PDF contains a table. Whenever a cell contains this special character, get_text('words') does not return the character, and the next word gets the wrong bbox values: it is assigned the coordinates of the unreadable character instead of its own.
In this example, the strange symbol (a vertical bar with an x on top) is not read at all, and the word -4.00 is assigned the x0, y0, x1, y1 coordinates of that unread character.
When there are several consecutive characters of this kind, only \u200d (zero-width joiner) is read, probably from between those characters. In one case (see picture below) with four such characters, two \u200d's are read as separate "words", and +200.00 gets the correct coordinates, probably because there are multiple such characters.
However, in every case where a cell contains exactly one such character, the coordinates of the following word are wrong.
So far I have noticed this only with this particular character (it tends to come up several times in such files).
I am using the latest version of PyMuPDF.
P.S. I cannot share the full file because it contains sensitive data (I could probably try to extract the part of the PDF that has the problematic characters).
How to reproduce the bug
Open a PDF that contains the problematic character and call get_text('words') on the affected page.
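Since the file itself cannot be shared, here is a minimal diagnostic sketch of how the mismatch could be inspected. It prints each tuple returned by get_text('words') next to a per-character dump from get_text('rawdict'), so a dropped character and a shifted bbox can be compared by eye. The helper names (is_invisible, flag_words, dump_page) and the path sample.pdf are hypothetical placeholders, not part of PyMuPDF, and treating Unicode category "Cf" as "invisible" is an assumption made only for this illustration.

```python
import unicodedata


def is_invisible(text):
    """True if text is non-empty and consists only of Unicode format
    characters (category "Cf"), e.g. U+200D ZERO WIDTH JOINER."""
    return bool(text) and all(unicodedata.category(ch) == "Cf" for ch in text)


def flag_words(words):
    """Take tuples shaped like get_text('words') output:
    (x0, y0, x1, y1, text, block_no, line_no, word_no)
    and return (text, bbox, invisible_flag) for each word."""
    return [(w[4], tuple(w[:4]), is_invisible(w[4])) for w in words]


def dump_page(pdf_path, page_no=0):
    """Hypothetical helper: print word-level and character-level boxes."""
    import pymupdf  # imported lazily so the helpers above work without it

    doc = pymupdf.open(pdf_path)
    page = doc[page_no]

    print("words:")
    for text, bbox, invisible in flag_words(page.get_text("words")):
        print(" ", repr(text), bbox, "INVISIBLE" if invisible else "")

    print("characters:")
    for block in page.get_text("rawdict")["blocks"]:
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                for ch in span["chars"]:
                    codes = " ".join(f"U+{ord(c):04X}" for c in ch["c"])
                    print(" ", codes, ch["bbox"])


if __name__ == "__main__":
    dump_page("sample.pdf")  # placeholder path
```

If the unreadable symbol shows up in the character dump with its own bbox but is missing from the word list, while the following word carries that same bbox, that would match the behaviour described above.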
PyMuPDF version
1.26.5
Operating system
Windows
Python version
3.11