Illegal multibyte sequence #462

@wuodar

Bug Metadata

  • Version of extract_msg: 0.54.0
  • Your python version: Python 3.11.6
  • How did you launch extract_msg?
    • My command line or
    • I used the extract_msg package

Describe the bug
For some .msg files, opening the file raises UnicodeDecodeError: 'XXX' codec can't decode byte (...): illegal multibyte sequence, where XXX is the codec name.
Codecs that have failed so far include:

  • windows-950
  • shift_jis
  • charmap
  • gb2312
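The error class itself can be reproduced without extract_msg. The sketch below uses made-up bytes (not taken from any of the failing files) that contain 0xfc, the same byte reported in the traceback, and decodes them with Python's built-in shift_jis codec. `decode_lenient` is a hypothetical helper written for this illustration, not part of extract_msg; it only shows how a strict decode differs from one that substitutes U+FFFD for undecodable bytes.

```python
def decode_lenient(raw: bytes, encoding: str) -> str:
    """Decode strictly first; on failure, replace bad bytes with U+FFFD.

    Hypothetical helper for illustration only, not an extract_msg API.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # errors="replace" keeps the readable part of the string instead of raising.
        return raw.decode(encoding, errors="replace")


# Made-up sample: 0xfc is not a valid lead byte for Python's shift_jis codec.
sample = b"report_\xfc.txt"

try:
    sample.decode("shift_jis")
except UnicodeDecodeError as exc:
    # Same exception type and reason as in the report.
    print(type(exc).__name__, exc.reason)

print(decode_lenient(sample, "shift_jis"))
```

Lenient decoding loses the original character, so it is only a way to keep going, not a fix for a wrong or too-narrow declared encoding.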

Traceback

  File "src/doc_parser/loaders.py", line 183, in get_msg_content
    with extract_msg.openMsg(path) as msg:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/open_msg.py", line 124, in openMsg
    return Message(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/msg_classes/message_base.py", line 83, in __init__
    super().__init__(path, **kwargs)
  File ".venv/lib/python3.11/site-packages/extract_msg/msg_classes/msg.py", line 221, in __init__
    self.attachments
  File "/Users/kacperwlodarczyk/.local/share/uv/python/cpython-3.11.6-macos-aarch64-none/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/msg_classes/msg.py", line 862, in attachments
    attachments.append(self.initAttachmentFunc(self, attachmentDir))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/attachments/__init__.py", line 108, in initStandardAttachment
    return EmbeddedMsgAttachment(msg, dir_, propStore)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/attachments/emb_msg_att.py", line 38, in __init__
    self.__data = openMsg(self.msg.path, prefix = self.__prefix, parentMsg = self.msg, treePath = self.treePath, **self.msg.kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/open_msg.py", line 90, in openMsg
    msg = MSGFile(path, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/msg_classes/msg.py", line 206, in __init__
    filename = self.getStringStream(prefixl[:-1] + ['__substg1.0_3001'], prefix = False)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/extract_msg/msg_classes/msg.py", line 738, in getStringStream
    return None if tmp is None else tmp.decode(self.stringEncoding)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xfc in position 74: illegal multibyte sequence
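
The traceback shows the exception escaping from openMsg, so a caller can at least catch it and retry. The following is a minimal sketch of such a retry loop, not a fix inside the library. `open_with_encoding_fallback` and its parameters are hypothetical names; the opener is passed in so the function does not depend on extract_msg directly. It assumes the opener accepts an `overrideEncoding` keyword, which I believe extract_msg.openMsg does, but that should be checked against the installed version. The default fallback codecs are guesses (wider Microsoft/GB variants of the codecs listed above), and a codec that merely succeeds may still produce wrong text.

```python
from typing import Any, Callable, Iterable


def open_with_encoding_fallback(
    path: str,
    opener: Callable[..., Any],
    encodings: Iterable[str] = ("cp932", "gb18030", "cp950"),
) -> Any:
    """Call opener(path); on UnicodeDecodeError, retry with other encodings.

    Hypothetical helper. Assumes opener accepts an `overrideEncoding` keyword.
    """
    try:
        return opener(path)
    except UnicodeDecodeError as first_error:
        last_error = first_error

    for encoding in encodings:
        try:
            return opener(path, overrideEncoding=encoding)
        except UnicodeDecodeError as exc:
            last_error = exc

    # Every attempt failed: surface the most recent decode error.
    raise last_error
```

With extract_msg installed, the intended call would look like `open_with_encoding_fallback(path, extract_msg.openMsg)`, again under the assumption about `overrideEncoding` stated above.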
