Closed
Description
Describe the bug (mandatory)
page.getText('html') returns an empty string for some files. Interestingly, page.getText('text') returns content for the same page, so it is unclear why the HTML extraction fails.
To Reproduce (mandatory)
Code:
import fitz # import pymupdf by importing fitz
from io import BytesIO
import requests
# Working file
# url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/09/30/nen_price_202011.pdf'
# Broken file (active by default so the script reproduces the bug; swap with the URL above to compare)
url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/06/29/nen_price_202008.pdf'
res = requests.request('get', url)
data = BytesIO(res.content)
doc = fitz.open(stream=data, filetype="pdf")
page = doc[0]
text = page.getText('text')
html = page.getText('html')

When using the URL tagged "# Working file", everything works fine. When using the URL tagged "# Broken file", html is empty while text has content.
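To narrow down whether only the HTML output is affected, a small diagnostic could compare several output formats for the same page. This is only a minimal sketch: `empty_formats` is a hypothetical helper, not part of PyMuPDF, and it takes the extraction function as an argument (for example the bound method `page.getText`) so that it can be demonstrated here with a stand-in instead of the real file.

```python
def empty_formats(get_text, formats=("text", "html", "xhtml", "xml")):
    """Return the formats for which get_text(fmt) yields only whitespace.

    get_text is any callable that takes a format name and returns a string,
    e.g. page.getText from the reproduction code (assumption).
    """
    return [fmt for fmt in formats if not get_text(fmt).strip()]


# Stand-in that mimics the behaviour reported for the broken file:
# every format has content except 'html'.
def fake_get_text(fmt):
    return "" if fmt == "html" else "some content"


print(empty_formats(fake_get_text))  # ['html']
```

With the real document, `empty_formats(page.getText)` would show whether the other markup formats come back empty as well.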
Expected behavior (optional)
I expected the page to be converted to HTML or, if there was a problem parsing it, some sort of error message.
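In the meantime, the behavior I expected could probably be approximated on the caller's side. The following is a sketch under assumptions, not a fix: `html_or_raise` is a hypothetical wrapper, and it treats "HTML is empty although plain text is not" as the failure condition, which may not cover every case.

```python
def html_or_raise(get_text):
    """Return the HTML output, raising if it is empty while plain text is not.

    get_text is any callable taking a format name and returning a string,
    e.g. page.getText (assumption).
    """
    html = get_text("html")
    if not html.strip() and get_text("text").strip():
        raise ValueError("HTML output is empty although the page contains text")
    return html


# Stand-in for a page showing the reported symptom (hypothetical data).
def broken_page_get_text(fmt):
    return "" if fmt == "html" else "some content"


try:
    html_or_raise(broken_page_get_text)
except ValueError as exc:
    print("Extraction failed:", exc)
```

With the real document this would be called as `html_or_raise(page.getText)`, turning the silent empty string into an explicit error.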
Your configuration (mandatory)
- Ubuntu 20.04.1 LTS
- Python 3.8.5
- PyMuPDF version 1.18.3 installed via pip