Description
For some input byte sequences, System.Text.UTF8Encoding silently drops bytes: they are neither decoded by the internal decoder nor passed to the installed DecoderFallback.
Example: the encoded input consists of 3 valid ASCII characters, 3 bytes that encode a surrogate code point (which is not valid in UTF-8), and 3 more valid ASCII characters. The default Encoding.UTF8 singleton uses a replacement decoder fallback, which converts every invalid byte to U+FFFD ('�').
byte[] encoded = new byte[] {
(byte)'a', (byte)'b', (byte)'c',
0xED, 0xA0, 0x90,
(byte)'x', (byte)'y', (byte)'z'
};
char[] decoded;
decoded = Encoding.UTF8.GetChars(encoded);
Console.WriteLine(decoded);
Produced output:
abc��xyz
Expected output:
abc���xyz
The produced output is only 8 characters long. Although it is not visible in the example above, further debugging with a custom DecoderFallback implementation reveals that the first two invalid bytes (0xED, 0xA0) are being passed to the fallback, but the byte 0x90 is skipped.
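The debugging step described above can be reproduced with a small recording fallback. The following is only a minimal sketch: `RecordingDecoderFallback`, `RecordingBuffer`, and `Seen` are hypothetical names introduced here for illustration and are not taken from the original investigation. It relies only on the public `DecoderFallback` and `DecoderFallbackBuffer` extension points, and what it prints depends on the runtime version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (an assumption, not part of the original report):
// records every byte run handed to the fallback and substitutes one U+FFFD per call.
public sealed class RecordingDecoderFallback : DecoderFallback
{
    // Copies of all byte runs the decoder reported as invalid.
    public List<byte[]> Seen { get; } = new List<byte[]>();

    public override int MaxCharCount => 1;

    public override DecoderFallbackBuffer CreateFallbackBuffer()
        => new RecordingBuffer(this);

    private sealed class RecordingBuffer : DecoderFallbackBuffer
    {
        private readonly RecordingDecoderFallback _owner;
        private int _remaining; // replacement chars not yet handed out

        public RecordingBuffer(RecordingDecoderFallback owner)
        {
            _owner = owner;
        }

        public override bool Fallback(byte[] bytesUnknown, int index)
        {
            _owner.Seen.Add((byte[])bytesUnknown.Clone());
            _remaining = 1;
            return true;
        }

        public override char GetNextChar()
        {
            if (_remaining == 0) return '\0';
            _remaining--;
            return '\uFFFD';
        }

        // Backing up is not supported in this sketch.
        public override bool MovePrevious() => false;

        public override int Remaining => _remaining;

        public override void Reset() => _remaining = 0;
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] encoded =
        {
            (byte)'a', (byte)'b', (byte)'c',
            0xED, 0xA0, 0x90,
            (byte)'x', (byte)'y', (byte)'z'
        };

        var fallback = new RecordingDecoderFallback();
        Encoding utf8 = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ReplacementFallback, fallback);

        char[] decoded = utf8.GetChars(encoded);
        Console.WriteLine(decoded.Length);

        // A run may be listed more than once, because the counting pass and
        // the decoding pass can each consult the fallback.
        foreach (byte[] run in fallback.Seen)
            Console.WriteLine(BitConverter.ToString(run));
    }
}
```

If every invalid byte reaches the fallback, the bytes 0xED, 0xA0, and 0x90 should all appear among the recorded runs; a missing 0x90 corresponds to the skipped byte described above.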
For comparison, ASCIIEncoding, also with its default replacement fallback, handles the same encoded array correctly:
decoded = Encoding.ASCII.GetChars(encoded);
Console.WriteLine(decoded);
Produced correct output (9 characters):
abc???xyz
Related issue: #14785