UTF8Encoding drops bytes during decoding some input sequences #29017

@BCSharp

Description

For some input byte sequences, System.Text.UTF8Encoding loses, or silently drops, some bytes. That is, the bytes are neither decoded by the internal decoder nor passed to the installed DecoderFallback.

Example: the encoded input is 3 valid ASCII characters, followed by 3 bytes encoding a surrogate code point (an invalid sequence in UTF-8), followed by another 3 valid ASCII characters. The default Encoding.UTF8 singleton instance uses a decoder replacement fallback, which converts every invalid byte to U+FFFD ('�').

using System;
using System.Text;

byte[] encoded = new byte[] { 
    (byte)'a', (byte)'b', (byte)'c', 
    0xED, 0xA0, 0x90, 
    (byte)'x', (byte)'y', (byte)'z' 
};
char[] decoded;
decoded = Encoding.UTF8.GetChars(encoded);
Console.WriteLine(decoded);

Produced output:

abc��xyz

Expected output:

abc���xyz

The produced output is only 8 characters long, although the three invalid bytes should produce three replacement characters. This is not visible in the example above, but further debugging with a custom DecoderFallback implementation reveals that the first two invalid bytes (0xED, 0xA0) are passed to the fallback, while the third byte (0x90) is skipped entirely.
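
The custom fallback used for that debugging is not included here, but a minimal diagnostic fallback along the following lines (a sketch; the names LoggingDecoderFallback and LoggingDecoderFallbackBuffer are illustrative) makes the problem visible: it prints every byte sequence the decoder hands to the fallback and then emits U+FFFD, mimicking DecoderReplacementFallback.

using System;
using System.Text;

// Diagnostic fallback (sketch): logs every invalid byte sequence the decoder
// reports, then substitutes U+FFFD just like DecoderReplacementFallback.
class LoggingDecoderFallback : DecoderFallback
{
    public override int MaxCharCount => 1;

    public override DecoderFallbackBuffer CreateFallbackBuffer()
        => new LoggingDecoderFallbackBuffer();
}

class LoggingDecoderFallbackBuffer : DecoderFallbackBuffer
{
    private int _remaining;

    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        // Record which invalid bytes the decoder actually reports.
        Console.WriteLine($"Fallback at index {index}: {BitConverter.ToString(bytesUnknown)}");
        _remaining = 1;  // emit a single replacement character for this call
        return true;
    }

    public override char GetNextChar()
    {
        if (_remaining > 0) { _remaining--; return '\uFFFD'; }
        return '\0';
    }

    public override bool MovePrevious() => false;

    public override int Remaining => _remaining;
}

Plugging this fallback into a UTF-8 encoding instance shows which bytes reach the fallback; with the input above, 0x90 never shows up in the log:

Encoding utf8 = Encoding.GetEncoding("utf-8", EncoderFallback.ReplacementFallback, new LoggingDecoderFallback());
decoded = utf8.GetChars(encoded);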

Continuing the example, compare this with the correct behaviour of ASCIIEncoding, which also uses the default replacement fallback.

decoded = Encoding.ASCII.GetChars(encoded);
Console.WriteLine(decoded);

Produced correct output (9 characters):

abc???xyz
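
A quick way to quantify the discrepancy (assuming the encoded array from the first snippet) is to compare the decoded lengths directly:

Console.WriteLine(Encoding.UTF8.GetChars(encoded).Length);   // 8, but 9 is expected
Console.WriteLine(Encoding.ASCII.GetChars(encoded).Length);  // 9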

Related issue: #14785
