Problems with unreadable characters #4716

@renchix

Description of the bug

I found a bug in doc[0].get_text('words') that occurs when the page contains an unreadable character.

There is a table in the PDF. Whenever a cell contains this special character, get_text('words') does not return the character, and the next word gets the wrong bbox: it has the coordinates of the unreadable character instead of its own.

Image

In this example, the strange symbol (a vertical bar with an x on top) is not read at all, and -4.00 is assigned the x0, y0, x1, y1 coordinates of that unread character.
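One way to narrow this down might be to look at the per-character output next to the word output. The sketch below flattens a page dictionary of the shape returned by page.get_text("rawdict") into one row per character, with its code point and bbox. list_chars is a hypothetical helper, and the sample dictionary is hand-made with invented values; whether the unreadable character shows up in rawdict at all is an open question here.

```python
def list_chars(rawdict):
    """Flatten a "rawdict"-style page dictionary into (char, codepoint, bbox) rows."""
    rows = []
    for block in rawdict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for ch in span.get("chars", []):
                    codepoint = f"U+{ord(ch['c']):04X}"
                    rows.append((ch["c"], codepoint, tuple(ch["bbox"])))
    return rows


# Hand-made stand-in for page.get_text("rawdict"); all values are invented.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {
                            "chars": [
                                {"c": "\u200d", "bbox": (10.0, 20.0, 10.0, 30.0)},
                                {"c": "-", "bbox": (12.0, 20.0, 16.0, 30.0)},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}

rows = list_chars(sample)
for row in rows:
    print(row[1], row[2])
```

With a real file, the input would come from page.get_text("rawdict") on the affected page, and the rows around -4.00 could then be compared with the bbox reported by get_text('words').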

When several such characters occur consecutively, only \u200d (zero width joiner) is read, probably from between those characters. In one case (see the picture below) with four such characters, two \u200d's are read as separate "words", and +200.00 gets the correct coordinates, probably because there are multiple such characters.

Image

However, in every case with a single such character in the cell, the coordinates of the following word get mixed up.
So far I have noticed this only with this particular character (it tends to come up several times in such files).
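To locate the affected cells in a longer page, the stray \u200d "words" can serve as markers. The following sketch assumes the usual tuple layout of get_text('words'), namely (x0, y0, x1, y1, text, block_no, line_no, word_no), and flags entries whose text consists only of Unicode format characters (category Cf, which includes U+200D). invisible_words is a hypothetical helper, and the sample tuples are invented.

```python
import unicodedata


def invisible_words(words):
    """Return word tuples whose text consists only of format (Cf) characters."""
    flagged = []
    for w in words:
        text = w[4]  # index 4 holds the word string
        if text and all(unicodedata.category(c) == "Cf" for c in text):
            flagged.append(w)
    return flagged


# Invented sample in the shape of page.get_text("words") output.
sample_words = [
    (100.0, 50.0, 100.0, 60.0, "\u200d", 0, 0, 0),
    (100.0, 50.0, 130.0, 60.0, "+200.00", 0, 0, 1),
]

flagged = invisible_words(sample_words)
for w in flagged:
    print(repr(w[4]), w[:4])
```

This only finds cells where a joiner was emitted; a cell with a single unreadable character and no \u200d would not be flagged this way.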

Using latest version of PyMuPDF.

P.S. I cannot share the full file because it contains sensitive data (I could probably try to extract the part of the PDF that has the problematic characters).

How to reproduce the bug

Open a PDF that contains the problematic character and call doc[0].get_text('words').

PyMuPDF version

1.26.5

Operating system

Windows

Python version

3.11
