Memory Leak with getText(“rawDICT”) #362

@sshustov

Description


Describe the bug (mandatory)

This bug was fixed in version 1.14.15, but for some reason the newest fitz version shows a similar issue. It is the same problem that was reported in the past issue #290.

To Reproduce (mandatory)

I used the same script from issue #290.

I tested two versions: 1.16.1 (newest) and 1.14.15, installed from the wheel PyMuPDF-1.14.15-cp27-cp27mu-manylinux1_x86_64.whl on this page: https://github.com/pymupdf/PyMuPDF/releases/tag/1.14.15

import os
import psutil
import gc
import fitz  # PyMuPDF; needed for fitz.open() below

file_path = 'any_pdf_document.pdf'
# Loop
niter = 300

memory_usage_start = [0] * niter
memory_usage_before_gc = [0] * niter
memory_usage_stop = [0] * niter

for i in range(niter):

    # Record initial memory usage
    process = psutil.Process(os.getpid())
    memory_usage_start[i] = process.memory_info().rss / 2 ** 20

    # Load file
    doc = fitz.open(file_path)
    page_count = doc.pageCount
    page = 0
    text = ''

    # Extract text
    while (page < page_count):
        p = doc.loadPage(page)
        page += 1
        # words = p.getTextWords()
        rawdict = p.getText('rawDict')
        # text += p.getText()

    # Record memory usage before cleanup
    memory_usage_before_gc[i] = process.memory_info().rss / 2 ** 20

    # Cleanup attempt
    # del text
    doc.close()
    gc.collect()

    # Record mem final usage
    memory_usage_stop[i] = process.memory_info().rss / 2 ** 20

# Viz mem usage VS iterations
from pylab import *
from matplotlib import pyplot as plt

iteration = list(range(0, niter))

plot(iteration[20:], memory_usage_start[20:])
plot(iteration[20:], memory_usage_before_gc[20:], 'r+')
plot(iteration[20:], memory_usage_stop[20:], 'g')

xlabel('iteration')
ylabel('memory usage (Mb)')
title('Memory usage over getText("rawDict")')
grid(True)

plt.savefig('memory_graph.png')

Expected behavior (optional)

The graphs for getText() look similar for both versions. But with getText('rawDict') the two versions produce different memory graphs. It seems the fix that was made in 1.14.15 is not present in the newest versions.
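
To compare the two versions by a number rather than by eye, one option is to fit a straight line to the post-cleanup samples and look at its slope, i.e. the average memory growth per iteration. The following is only a minimal sketch: `growth_per_iteration` is a hypothetical helper (not part of PyMuPDF or psutil), it assumes a list shaped like `memory_usage_stop` from the script above, and the sample data is made up for illustration.

```python
def growth_per_iteration(samples, warmup=20):
    """Least-squares slope of memory usage against iteration index.

    samples: per-iteration memory readings (e.g. memory_usage_stop).
    warmup:  number of leading iterations to skip, matching the [20:]
             slice used for the plots in the script.
    """
    ys = [float(v) for v in samples[warmup:]]
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples after warm-up")
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    # Ordinary least squares: slope = cov(x, y) / var(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Illustrative data only, not measurements from PyMuPDF.
flat = [100.0] * 50                              # no growth
leaking = [100.0 + 0.5 * i for i in range(50)]   # grows 0.5 per iteration

print(growth_per_iteration(flat))
print(growth_per_iteration(leaking))
```

A slope close to zero would suggest memory is being released after each iteration, while a clearly positive slope that persists across runs would be consistent with a leak.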

Screenshots (optional)

Attached 4 screenshots (memory usage graphs):

  • wheels 1.14.15 using getText()
  • 1.16.1 using getText()
  • wheels 1.14.15 using getText('rawDict')
  • 1.16.1 using getText('rawDict')

Your configuration (mandatory)

  • Ubuntu 16.04 x64
  • python 2.7.12
  • PyMuPDF versions: 1.14.15 (from wheel PyMuPDF-1.14.15-cp27-cp27mu-manylinux1_x86_64.whl) and 1.16.1 (from pip install)

Additional context (optional)

As I understand it, the memory usage graphs should look similar for both versions, but getText('rawDict') seems to consume much more memory than it should.
How long would it take to release a new version with this bug fixed?
