Closed
Labels
enhancement, enhancement-upstream (to be implemented by MuPDF), fix developed (release schedule to be determined)
Description of the bug
When I use page.get_text('blocks'), I get two blocks with very similar text but different bboxes.
The output for page 5 (counting from 1) is as follows:

The associated page looks as follows:

The source PDF is
00b3ad2ad0af97ec4a85274510343e04.pdf
I think block 12 is the redundant one.
Also, my actual Python version is 3.8.19, but I selected 3.9 because the available choices start at 3.9.
How to reproduce the bug
import fitz  # PyMuPDF

# Read the PDF into memory and open it from the byte stream.
with open("./00b3ad2ad0af97ec4a85274510343e04.pdf", "rb") as f:
    pdf_bytes = f.read()
document = fitz.open(stream=pdf_bytes, filetype="pdf")

# Page indices are 0-based, so index 4 is page 5.
page = document.load_page(4)
blocks = page.get_text("blocks")
for block_index, block in enumerate(blocks):
    print(f"block {block_index}:", block)
    print('\n')
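To make the redundant block easier to spot in the output, the near-duplicate pairs can be flagged programmatically. The following is only a minimal sketch, not part of PyMuPDF: the helper name `find_similar_blocks` and the 0.9 similarity threshold are hypothetical, and it assumes each block is a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type == 0` for text blocks. The sample tuples are made-up data, not the output of the attached PDF.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_blocks(blocks, threshold=0.9):
    """Return (i, j, ratio) for pairs of text blocks with nearly identical text.

    Hypothetical helper: `threshold` is an arbitrary cut-off, not a PyMuPDF setting.
    """
    # Keep text blocks only (block_type 0) and collapse runs of whitespace,
    # so that line-break differences do not hide a duplicate.
    texts = [
        (i, " ".join(block[4].split()))
        for i, block in enumerate(blocks)
        if block[6] == 0
    ]
    pairs = []
    for (i, text_a), (j, text_b) in combinations(texts, 2):
        if not text_a or not text_b:
            continue  # skip blocks that contain only whitespace
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


# Made-up sample data in the assumed block layout, for illustration only.
sample_blocks = [
    (72.0, 100.0, 300.0, 112.0, "Table 1: Results\n", 0, 0),
    (72.0, 120.0, 300.0, 160.0, "Some unrelated paragraph.\n", 1, 0),
    (73.5, 101.2, 301.0, 113.0, "Table 1:  Results\n", 2, 0),
]
for i, j, ratio in find_similar_blocks(sample_blocks):
    print(f"blocks {i} and {j} look redundant (similarity {ratio:.2f})")
```

With the reproduction script, the same helper could be called on the list returned by `page.get_text("blocks")`. Comparing text only is a deliberate simplification: it reports pairs regardless of how far apart their bboxes are, so any reported pair still needs a manual check.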
PyMuPDF version
1.24.6
Operating system
Linux
Python version
3.9