page.getText('html') returns empty string #726

@further-reading

Describe the bug (mandatory)

page.getText('html') returns an empty string for some files. For the same page, page.getText('text') returns content, so it is unclear why the HTML extraction fails.

To Reproduce (mandatory)

Code:

import fitz  # import pymupdf by importing fitz
from io import BytesIO
import requests


# Working file: text and html both have content
# url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/09/30/nen_price_202011.pdf'

# Broken file: html is empty while text has content
url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/06/29/nen_price_202008.pdf'

# Download the PDF into memory
res = requests.get(url)
data = BytesIO(res.content)
doc = fitz.open(stream=data, filetype="pdf")
page = doc[0]
text = page.getText('text')
html = page.getText('html')

With the URL tagged # Working file, both calls return content. With the URL tagged # Broken file, html is an empty string while text has content.

Expected behavior (optional)

I expected the page to be converted to HTML or, if there was a problem parsing the file, some sort of error message.
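
Until the cause is known, a caller could surface the problem explicitly instead of silently receiving an empty string. The following is only a minimal sketch of that idea: get_html_checked and EmptyHtmlError are hypothetical names, not part of PyMuPDF, and the helper assumes nothing about the page beyond the getText(...) calls already used in the reproduction code.

```python
class EmptyHtmlError(RuntimeError):
    """Raised when HTML extraction is empty although plain text exists."""


def get_html_checked(page):
    """Return page.getText('html'), raising if it is empty while text is not.

    `page` is any object exposing getText(kind), for example a fitz.Page
    as used in the reproduction code. Hypothetical helper, not a PyMuPDF API.
    """
    html = page.getText('html')
    text = page.getText('text')
    # The symptom reported here: no HTML output, but plain text is present.
    if not html.strip() and text.strip():
        raise EmptyHtmlError(
            "getText('html') returned an empty string although "
            "getText('text') returned %d characters" % len(text)
        )
    return html
```

In the reproduction code, html = get_html_checked(page) would replace the direct page.getText('html') call. A page that is genuinely empty (no text either) still returns an empty string without raising.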

Your configuration (mandatory)

  • Ubuntu 20.04.1 LTS
  • Python 3.8.5
  • PyMuPDF version 1.18.3 installed via pip
