Performance issue with preparedHTML due to bs4 encoding detection #452

@digidigital

Bug Metadata

  • Version of extract_msg: 0.53.1
  • Your python version: Python 3.10.12
  • How did you launch extract_msg?
    • [X] I used the extract_msg package

Describe the bug
If you convert msg files that have many large images embedded in the HTML body, the performance of extract-msg degrades badly when you use the prepared HTML option.

I was able to track this down (I think) to the HTML being parsed multiple times with bs4 after the images are injected.
Since bs4 is not told the charset of the HTML, it tries to detect it every time bs4 is called.
With the injected images the document gets really large, and the character detection appears to be a byte-by-byte process that takes minutes.

msg.getSaveHtmlBody(prepared=True)

It seems the first call to bs4 happens in injectHtmlHeader -> self.htmlBodyPrepared.
This step is fast: bs4 parses only the (small) HTML, and the images are injected as base64-encoded strings after the encoding has been detected. From this point on, the HTML is far larger than before.

validateHTML then parses the HTML again (this time the large version).

  • If the validation fails the HTML is parsed a third time!

After validation (or correction of the HTML), getSaveHtmlBody parses the whole HTML a third (or fourth) time if you are in "prepared" mode.
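To check whether encoding detection really dominates the parse time, something like the following could be used. It is only a rough sketch: build_html and time_parse are hypothetical helper names, the HTML is synthetic (one random base64 image, no declared charset), and how big the difference is will probably depend on the bs4 version and on which detection libraries are installed.

```python
import base64
import os
import time

from bs4 import BeautifulSoup


def build_html(image_bytes: int) -> bytes:
    """Build a synthetic HTML body with one inline base64 image and no declared charset."""
    payload = base64.b64encode(os.urandom(image_bytes))
    return (b'<html><body><p>caf\xc3\xa9</p><img src="data:image/png;base64,'
            + payload + b'"></body></html>')


def time_parse(html: bytes, encoding=None) -> float:
    """Return the seconds bs4 needs to parse html; encoding=None leaves detection to bs4."""
    start = time.perf_counter()
    BeautifulSoup(html, 'html.parser', from_encoding=encoding)
    return time.perf_counter() - start


if __name__ == '__main__':
    html = build_html(100_000)
    print('auto-detect:', time_parse(html))
    print('utf-8 given:', time_parse(html, 'utf-8'))
```

Increasing the argument of build_html should show how the two timings scale with the size of the injected images.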

Possible Solution (?)
A possible solution could be to add self.original_encoding = None to the MessageBase init and to pass from_encoding=self.original_encoding to every bs4 call (including the one in validateHTML).

self.original_encoding could then be set to the encoding that bs4 detects in its first call, when self.htmlBodyPrepared is accessed.

That way the detection

  • only runs once
  • on the small HTML
