Page.get_text results in AssertionError for epub files #3687

@arun-mani-j

Description of the bug

Page.get_text raises an AssertionError for all options except "blocks" and "words" when the document is an epub file. However, calling the corresponding TextPage methods directly works fine.

I think this only happens in 1.24.7. My distribution's package of 1.23.7 does not raise this error.

How to reproduce the bug

  1. Download an epub file. For context, I was able to reproduce the bug with https://www.gutenberg.org/ebooks/73987.
  2. Run the following code.
import pymupdf

doc = pymupdf.open("/home/arun-mani-j/Downloads/test.epub")

p = doc[0]

p.get_text("text")
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
----> 1 p.get_text("text")

~/Projects/aayra/lib/python3.12/site-packages/pymupdf/utils.py in ?(page, option, clip, flags, textpage, sort, delimiters)
    794     if clip is not None:
    795         clip = pymupdf.Rect(clip)
    796         cb = None
    797     elif type(page) is pymupdf.Page:
--> 798         cb = page.cropbox
    799     # pymupdf.TextPage with or without images
    800     tp = textpage
    801     #pymupdf.exception_info()

~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(self)
   8531     @property
   8532     def cropbox(self):
   8533         """The CropBox."""
   8534         CheckParent(self)
-> 8535         page = self._pdf_page()
   8536         if not page.m_internal:
   8537             val = mupdf.fz_bound_page(self.this)
   8538         else:

~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(self)
   8050     def _pdf_page(self):
-> 8051         return _as_pdf_page(self.this)

~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(page, required)
    333         return page
    334     elif isinstance(page, mupdf.FzPage):
    335         ret = mupdf.pdf_page_from_fz_page(page)
    336         if required:
--> 337             assert ret.m_internal
    338         return ret
    339     elif page is None:
    340         assert 0, f'page is None'

AssertionError: 
  3. Using TextPage methods directly works fine.
tp = p.get_textpage()
tp.extractText() # No errors raised
  4. Using "words" or "blocks" works fine.
p.get_text("words")
p.get_text("blocks")
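
Until this is fixed, the observations above suggest a possible workaround: try Page.get_text first and fall back to the TextPage route when the assertion fires. The following is only a minimal sketch, not an official fix. The helper name get_plain_text is hypothetical, it covers only the "text" option, and it assumes that extractText() on the object returned by get_textpage() gives output equivalent to get_text("text"), which this report does not verify.

```python
def get_plain_text(page):
    """Return the plain text of a page.

    Hypothetical helper: tries page.get_text("text") first and, if that
    raises AssertionError (as reported here for epub pages), falls back to
    the TextPage route that the report says works.
    """
    try:
        return page.get_text("text")
    except AssertionError:
        # Fallback taken from the report: build a TextPage and extract from it.
        textpage = page.get_textpage()
        return textpage.extractText()
```

With the reproduction above, this would be called as get_plain_text(p) in place of p.get_text("text"). Catching AssertionError this broadly could also hide unrelated assertion failures, so it is probably best treated as a temporary measure.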

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.12
