fitz.get_text ignores 'pages' kwarg #4524

@ananyablonko

Description of the bug

fitz.get_text ignores the optional keyword argument 'pages' when no 'method' is passed. Oddly enough, the argument is honored when the method used to read pages is 'mp' (multiprocessing).
The culprit is fitz.apply_pages.

How to reproduce the bug

    import fitz
    from pathlib import Path

    pdf_name = "C-standard"
    path = str(Path(f"~/Downloads/{pdf_name}.pdf").expanduser())
    pages = list(range(2, 3))  # a single page: [2]
    text1 = fitz.get_text(path, pages=pages)  # no method given: text of the entire file
    text2 = fitz.get_text(path, pages=pages, method='mp')  # only the requested page
    assert text1 == text2  # fails

I would expect text1 and text2 to be the same, but text1 includes text from the entire file whilst text2 is as expected (only required page(s)).

The culprit is the 'single' branch of fitz.apply_pages, lines 20633-20640:

    if method == 'single':
        if initfn:
            initfn(*initfn_args, **initfn_kwargs)
        ret = list()
        document = Document(path)
        for page in document:
            r = pagefn(page, *pagefn_args, **initfn_kwargs)
            ret.append(r)

Instead of this:

    ret = list()
    ...
    for page in document:
        r = pagefn(page, *pagefn_args, **initfn_kwargs)
        ret.append(r)

I would expect something like:

    ret = [pagefn(page, *pagefn_args, **initfn_kwargs) for page in document if page.number in pages]

I would create a pull request but I am unsure if there's a faster way to get only specific pages.
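
One possible answer to the question above, as a rough sketch rather than a proposed patch: index the document with the requested page numbers instead of iterating over every page and filtering. To keep the example runnable without a PDF, FakePage and FakeDocument are hypothetical stand-ins that only model len(), indexing and page.number; apply_pages_single is likewise a made-up name. The sketch also assumes that pages=None means "all pages" and passes a separate pagefn_kwargs to pagefn, whereas the quoted loop passes initfn_kwargs.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a page; only .number is modelled."""
    number: int


class FakeDocument:
    """Hypothetical stand-in for a document supporting len() and indexing."""

    def __init__(self, page_count):
        self._pages = [FakePage(i) for i in range(page_count)]

    def __len__(self):
        return len(self._pages)

    def __getitem__(self, index):
        return self._pages[index]


def apply_pages_single(document, pagefn, pages=None, pagefn_args=(), pagefn_kwargs=None):
    """Call pagefn on the selected pages only, in the order given by pages.

    Assumption: pages=None means every page of the document.
    """
    pagefn_kwargs = pagefn_kwargs or {}
    if pages is None:
        pages = range(len(document))
    # Index the document directly so that unrequested pages are never touched.
    return [pagefn(document[i], *pagefn_args, **pagefn_kwargs) for i in pages]


doc = FakeDocument(5)
result = apply_pages_single(doc, lambda page: f"text of page {page.number}", pages=[2])
print(result)
```

Compared with filtering on page.number inside a loop over the whole document, indexing visits only the requested pages and preserves the order given in pages. Whether this is actually faster on a real document would need to be measured.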

Thank you for your time.

PyMuPDF version

1.26.0

Operating system

Windows

Python version

3.13
