Closed
Description of the bug
The function fitz.get_text ignores the optional keyword argument 'pages' when it is called without a 'method' argument. Oddly enough, 'pages' is honoured if the method used to read pages is 'mp' (multiprocessing).
The culprit is fitz.apply_pages.
How to reproduce the bug
import fitz
from pathlib import Path
pdf_name = "C-standard"
path = str(Path(f"~/Downloads/{pdf_name}.pdf").expanduser())
pages = list(range(2, 3))
text1 = fitz.get_text(path, pages=pages)
text2 = fitz.get_text(path, pages=pages, method='mp')
assert text1 == text2
I would expect text1 and text2 to be the same, but text1 includes text from the entire file whilst text2 is as expected (only required page(s)).
The culprit:
lines 20633-20640:
if method == 'single':
    if initfn:
        initfn(*initfn_args, **initfn_kwargs)
    ret = list()
    document = Document(path)
    for page in document:
        r = pagefn(page, *pagefn_args, **initfn_kwargs)
        ret.append(r)
Instead of this:
ret = list()
...
for page in document:
    r = pagefn(page, *pagefn_args, **initfn_kwargs)
    ret.append(r)
I would expect something like:
ret = [pagefn(page, *pagefn_args, **initfn_kwargs) for page in document if page.number in pages]
I would create a pull request, but I am unsure if there's a faster way to get only specific pages.
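One possible shape for the 'single' branch, as a rough sketch only: index the requested pages directly rather than iterating over the whole document and filtering. The helper name apply_to_selected_pages and its parameters are hypothetical and not part of PyMuPDF. The sketch assumes the document object supports len() and integer indexing, which I believe a fitz Document does. A plain list stands in for the document here so the snippet runs without a PDF.

```python
def apply_to_selected_pages(document, pagefn, pages=None,
                            pagefn_args=(), pagefn_kwargs=None):
    """Hypothetical helper (not part of PyMuPDF): call pagefn on selected pages.

    `document` is anything that supports len() and integer indexing.
    Assumption: a fitz Document behaves this way, with document[n]
    loading page n.
    """
    if pagefn_kwargs is None:
        pagefn_kwargs = {}
    if pages is None:
        # No selection given: fall back to every page, in order.
        pages = range(len(document))
    # Index the requested pages directly instead of visiting every page
    # and discarding the ones that were not asked for.
    return [pagefn(document[n], *pagefn_args, **pagefn_kwargs) for n in pages]


# Stand-in "document": a list of strings plays the role of the pages.
fake_document = ["page 0", "page 1", "page 2", "page 3"]

selected = apply_to_selected_pages(fake_document, str.upper, pages=[2])
everything = apply_to_selected_pages(fake_document, str.upper)
```

With a real file, the same helper would presumably be called with an opened document and a function such as `lambda page: page.get_text()`. I have not measured whether indexing is actually faster than filtering during iteration.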
Thank you for your time.
PyMuPDF version
1.26.0
Operating system
Windows
Python version
3.13