UTF8Encoding drops bytes during decoding some input sequences #29017

@BCSharp

Description

For some input byte sequences, System.Text.UTF8Encoding loses, or silently drops, some bytes. That is, the bytes are neither decoded by the internal decoder nor passed to the installed DecoderFallback.

Example: the encoded input is 3 valid ASCII characters, followed by 3 bytes encoding a surrogate code point (an invalid sequence in UTF-8), followed by another 3 valid ASCII characters. The default Encoding.UTF8 singleton instance uses a decoder replacement fallback, which converts every invalid byte to U+FFFD ('�').

using System;
using System.Text;

byte[] encoded = new byte[] { 
    (byte)'a', (byte)'b', (byte)'c', 
    0xED, 0xA0, 0x90, 
    (byte)'x', (byte)'y', (byte)'z' 
};
char[] decoded;
decoded = Encoding.UTF8.GetChars(encoded);
Console.WriteLine(decoded);

Produced output:

abc��xyz

Expected output:

abc���xyz

The produced output is only 8 characters long, although the three invalid bytes should produce three replacement characters. This is not visible in the example above, but further debugging with a custom DecoderFallback implementation reveals that the first two invalid bytes (0xED, 0xA0) are passed to the fallback, while the third byte (0x90) is skipped entirely.
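
The custom fallback used for that debugging is not included here, but a minimal diagnostic fallback along the following lines (a sketch; the names LoggingDecoderFallback and LoggingDecoderFallbackBuffer are illustrative) makes the problem visible: it prints every byte sequence the decoder hands to the fallback and then emits U+FFFD, mimicking DecoderReplacementFallback.

using System;
using System.Text;

// Diagnostic fallback (sketch): logs every invalid byte sequence the decoder
// reports, then substitutes U+FFFD just like DecoderReplacementFallback.
class LoggingDecoderFallback : DecoderFallback
{
    public override int MaxCharCount => 1;

    public override DecoderFallbackBuffer CreateFallbackBuffer()
        => new LoggingDecoderFallbackBuffer();
}

class LoggingDecoderFallbackBuffer : DecoderFallbackBuffer
{
    private int _remaining;

    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        // Record which invalid bytes the decoder actually reports.
        Console.WriteLine($"Fallback at index {index}: {BitConverter.ToString(bytesUnknown)}");
        _remaining = 1;  // emit a single replacement character for this call
        return true;
    }

    public override char GetNextChar()
    {
        if (_remaining > 0) { _remaining--; return '\uFFFD'; }
        return '\0';
    }

    public override bool MovePrevious() => false;

    public override int Remaining => _remaining;
}

Plugging this fallback into a UTF-8 encoding instance shows which bytes reach the fallback; with the input above, 0x90 never shows up in the log:

Encoding utf8 = Encoding.GetEncoding("utf-8", EncoderFallback.ReplacementFallback, new LoggingDecoderFallback());
decoded = utf8.GetChars(encoded);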

Continuing the example, compare this with the correct behaviour of ASCIIEncoding, which also uses the default replacement fallback.

decoded = Encoding.ASCII.GetChars(encoded);
Console.WriteLine(decoded);

Produced correct output (9 characters):

abc???xyz
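
A quick way to quantify the discrepancy (assuming the encoded array from the first snippet) is to compare the decoded lengths directly:

Console.WriteLine(Encoding.UTF8.GetChars(encoded).Length);   // 8, but 9 is expected
Console.WriteLine(Encoding.ASCII.GetChars(encoded).Length);  // 9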

Related issue: #14785
