preparedHTML: Incorrect output of two-byte utf-8 characters in header data 

**Bug Metadata**
* Version of extract_msg: 0.49.0
* Your python version: Python 3.10
* How did you launch extract_msg?
  - [x] My command line or
  - [x] I used the extract_msg package

**Describe the bug**
If the header data contains a two-byte UTF-8 character (such as the German umlaut "ä") that is transmitted as an encoded-word (`=?UTF-8?Q?…?=`), the character comes out as two separate characters when `--prepared-html` is used.
Output for text or "regular" HTML is fine.
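
For reference, a minimal sketch of what a correctly decoded encoded-word looks like, using only the Python standard library. The sample value `=?UTF-8?Q?B=C3=A4r?=` is made up for illustration and is not taken from the attached message; this is not how extract_msg itself reads headers.

```python
from email.header import decode_header, make_header

# Hypothetical header value: "Bär" as a Q-encoded UTF-8 encoded-word.
raw = "=?UTF-8?Q?B=C3=A4r?="

# decode_header splits the value into (bytes, charset) parts;
# make_header joins them back into a single unicode string.
decoded = str(make_header(decode_header(raw)))

print(decoded)       # the umlaut is one character here
print(len(decoded))  # three characters in total
```

The two bytes `C3 A4` should end up as the single character "ä", which is what the text and regular HTML output show.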

**What code did you use or can we use to reproduce this error?**
Just try the attached text email with `--html --prepared-html`.

**Is there a message.msg file you want to share to help us reproduce this?**
- [x] Uploaded message (drag and drop on this window)
- [ForwardedMessage.zip](https://github.com/user-attachments/files/17221268/ForwardedMessage.zip)

**Additional context**
I tried to track down the issue, and I assume it is caused by passing the HTML as UTF-8-encoded bytes to Beautiful Soup (message_base.py, line 385). I assume Beautiful Soup then interprets the two-byte character as two separate characters:

    def getSaveHtmlBody(self, preparedHtml: bool = False, charset: str = 'utf-8', **_) -> bytes:
        """
        Returns the HTML body that will be used in saving based on the
        arguments.

        :param preparedHtml: Whether or not the HTML should be prepared for
            standalone use (add tags, inject images, etc.).
        :param charset: If the html is being prepared, the charset to use for
            the Content-Type meta tag to insert. This exists to ensure that
            something parsing the html can properly determine the encoding (as
            not having this tag can cause errors in some programs). Set this to
            ``None`` or an empty string to not insert the tag. (Default:
            'utf-8')
        :param _: Used to allow kwargs expansion in the save function.
            Arguments absorbed by this are simply ignored.
        """
        if self.htmlBody:
            # Inject the header into the data.
            data = self.injectHtmlHeader(prepared = preparedHtml)

            # If we are preparing the HTML, then we should
            if preparedHtml and charset:
                bs = bs4.BeautifulSoup(data, features = 'html.parser')

- `self.injectHtmlHeader` returns bytes.
- This is because the `replace` function in `injectHtmlHeader` encodes the string returned by `htmlInjectableHeader` as bytes.
- The string returned by `htmlInjectableHeader` has the umlauts in the correct form.

      def replace(bodyMarker):
          """
          Internal function to replace the body tag with itself plus the
          header.
          """

          # I recently had to change this and how it worked. Now we use a new
          # property of `MSGFile` that returns a special tuple of tuples to define
          # how to get all of the properties we are formatting. They are all
          # processed in the same way, making everything neat. By defining them
          # in each class, any class can specify a completely different set to be
          # used.
          return bodyMarker.group() + self.htmlInjectableHeader.encode('utf-8')
          # Use the previously defined function to inject the HTML header.
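
The suspected effect can be reproduced without extract_msg or Beautiful Soup. The sketch below assumes that the parser ends up decoding the UTF-8 header bytes with a single-byte codec such as cp1252; which encoding Beautiful Soup actually picks for the attached message has not been verified. The header text is a made-up example.

```python
# Hypothetical header line containing a two-byte UTF-8 character.
header_text = "Subject: Gerät"

# This mirrors what replace() does: the str is encoded to UTF-8 bytes.
header_bytes = header_text.encode("utf-8")

# Assumption: the bytes are later decoded with a single-byte codec.
# Each of the two bytes of "ä" (0xC3, 0xA4) becomes its own character.
misread = header_bytes.decode("cp1252")

# Decoding with the codec that was used for encoding round-trips cleanly.
correct = header_bytes.decode("utf-8")

print(misread)
print(correct)
```

The misread string is one character longer than the original, which matches the "two separate characters" seen in the prepared HTML.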

**Potential fix**
Decode the value passed to BeautifulSoup in `getSaveHtmlBody` with `.decode('utf-8')`, so that `data` is passed to Beautiful Soup as a regular string instead of UTF-8-encoded bytes:

      if self.htmlBody:
          # Inject the header into the data.
          data = self.injectHtmlHeader(prepared = preparedHtml).decode('utf-8')
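
One thing to consider with this fix: a plain `.decode('utf-8')` raises `UnicodeDecodeError` if the body bytes are not valid UTF-8. Whether that can happen in extract_msg has not been checked here. A rough sketch of a more defensive variant follows; `decode_for_parser` is a hypothetical helper, not part of extract_msg.

```python
def decode_for_parser(data: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: turn injected HTML bytes into str before parsing."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: losing a few undecodable bytes is preferable to
        # failing the whole save; they become U+FFFD replacement characters.
        return data.decode(encoding, errors="replace")


# Made-up HTML standing in for the result of injectHtmlHeader().
html_bytes = "<body><b>From:</b> Jörg Bär</body>".encode("utf-8")
html_text = decode_for_parser(html_bytes)

print(html_text)
```

The resulting `str` could then be handed to the parser, so no encoding detection is needed on the parser side.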


