page.get_text('blocks') outputs two pieces of very similar text with different bboxes #4026

@qianyue76

Description of the bug

When I call page.get_text('blocks'), I get two blocks with very similar text but different bboxes.
The output for page 5 (counting from 1) is as follows:
[image]
The corresponding page is as follows:
[image]

The source PDF is 00b3ad2ad0af97ec4a85274510343e04.pdf.
I think block 12 is the redundant one.

Also, my Python version is actually 3.8.19, but I selected 3.9 because the available choices start at 3.9.

How to reproduce the bug

import fitz  # PyMuPDF

# Read the PDF into memory and open it from the byte stream.
with open("./00b3ad2ad0af97ec4a85274510343e04.pdf", "rb") as f:
    pdf_bytes = f.read()
document = fitz.open(stream=pdf_bytes, filetype="pdf")

# Page 5 (counting from 1) has index 4.
page = document.load_page(4)
blocks = page.get_text("blocks")
for block_no, block in enumerate(blocks):
    print(f"block {block_no}:", block)
    print('\n')
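
To make the suspected duplication easier to spot than by reading the printed output, a minimal sketch like the one below could flag pairs of blocks whose text is nearly identical. It assumes the tuple layout that get_text("blocks") is documented to return, (x0, y0, x1, y1, text, block_no, block_type). The helper name find_similar_blocks, the 0.9 threshold, and the sample data are hypothetical and not taken from the attached PDF.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_blocks(blocks, threshold=0.9):
    """Return (i, j, ratio) for pairs of text blocks with near-identical text.

    Hypothetical helper: `blocks` is assumed to be a list of tuples shaped
    like (x0, y0, x1, y1, text, block_no, block_type).
    """
    # Keep only non-empty text blocks (block_type 0) and normalize whitespace
    # so that differing line breaks do not hide a duplicate.
    text_blocks = [
        (idx, " ".join(b[4].split()))
        for idx, b in enumerate(blocks)
        if b[6] == 0 and b[4].strip()
    ]
    pairs = []
    for (i, a), (j, b) in combinations(text_blocks, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


# Synthetic blocks for illustration only (not from the reported PDF).
sample = [
    (50.0, 100.0, 300.0, 120.0, "The quick brown fox jumps\n", 0, 0),
    (50.0, 130.0, 300.0, 150.0, "A completely different line\n", 1, 0),
    (50.5, 100.4, 301.0, 120.6, "The quick brown fox jumps \n", 2, 0),
]
for i, j, ratio in find_similar_blocks(sample):
    print(f"blocks {i} and {j} have similar text (ratio {ratio:.2f})")
```

Passing the `blocks` list from the reproduction script instead of `sample` would list the candidate duplicates on page 5. The threshold is a guess and may need tuning.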

PyMuPDF version

1.24.6

Operating system

Linux

Python version

3.9
