- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.9k
Windows: Don't error on broken non UTF-8 output #134534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
Also update the tests to avoid testing implementation details.
| pub struct Stdout { | ||
| incomplete_utf8: IncompleteUtf8, | ||
| } | ||
| pub struct Stdout {} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for the removal of incomplete UTF-8 handling at the end of the string is not clear to me from the commit description. Why was that removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it now simply truncates the write to remove incomplete UTF-8 from the end and instead leaves the buffering to buffer types, i.e. LineWriter in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A LineWriter will flush an incomplete line if its buffer capacity is exceeded. If that happens, the output must support partial UTF-8 writes, or non-ASCII characters might get lost or replaced with the replacement character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That can only result in broken UTF-8 if the user writes incomplete UTF-8 to LineWriter themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, digging through the source code, BufWriter makes sure to not split writes that the user issued.
What is the motivation for truncating invalid UTF-8 at the end of the string?
All else being equal, I'd rather expect the previous behavior, that I can construct UTF-8 output byte-by-byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than having a secret stack buffer that can't be inspected or flushed, I'd strongly prefer buffering be done at a higher level. It's also a lot of added complexity for an edge case where the better solution is to set the console code page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, if that behavior is wanted, it should probably be documented in the commit message so that it is clear to future readers that this change was on purpose.
Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.
That's what happens in this code. No bytes are ever lost. Either the caller is informed that less bytes were written than were provided or, if there is only an incomplete code point, then that is written to the console (which will be converted to replacement characters when lossy translating to UTF-16).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that makes sense. I didn't realize the caller would be informed by the return value of write.
What is the motivation for special casing trailing invalid UTF-8? It seems to increase the code complexity a little as well, and is not necessary for std's own use cases.
Is it for supporting a potential non-std buffered writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it could be removed. Stderr is not buffered by us though and there have been proposals for unbuffered stdout.
| assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | ||
|  | ||
| // The only way an error can happen here is if we've messed up. | ||
| debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | |
| assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | 
I think this should be an assert since this isn't performance critical — we've just done a syscall.
| I'm going to split off the LineWriter and  | 
| ☔ The latest upstream changes (presumably #134822) made this pull request unmergeable. Please resolve the merge conflicts. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of moving the buffering the the outer buffer, especially since we already wrap Stdout/Stderr in a LineWriter. trim_last_char_boundary looks sound. I haven't reviewed the rest as closely, but nothing stands out as wrong. I'm not yet convinced that LineWriter won't endlessly try to write the same incomplete single-char sequence, as I haven't reviewed that type with this invariant in mind.
Currently, on Windows, the standard library will error if you try to write invalid UTF-8. Whereas other platforms allow this. The issue arises because the console uses UTF-16 so Rust has to re-encode the output.
This PR fixes it in two ways:
Fixes #116871