- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.9k
Add a check for ASCII characters in to_upper and to_lower #81358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. | 
| This should be accompanied with some benchmark outputs. | 
| I didn't see to_upper / to_lower being benchmarked anywhere, so I added two more benches for them. Below are the results before and after this PR's changes. EDIT: To be clear, these benchmarks test both ASCII and non-ASCII characters combined. Maybe it would be better to separate them, but I'll wait for feedback before doing so. BeforeAfter | 
| Just linking this up with the internals chat on this subject: https://internals.rust-lang.org/t/to-upper-speed/13896/12 | 
        
          
                library/core/benches/char/methods.rs
              
                Outdated
          
        
      
      
        
              This comment was marked as outdated.
        
          
      
    
    This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to say anything about benchmarks results on mixed and non-ascii text? I expect branch prediction would mean that on pure non-ascii, there is virtually no change(?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect so. What do you think of adding two more benchmarks for a total of three?
- all ASCII characters
- no ASCII characters
- 50/50 mixed characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be slightly careful here. The non-ascii benchmark is always one byte unicode is it not? It is non-ascii but above 192 it's going to be indicating another code point so there's going to be pleanty of malformed unicode in here.
It might be slightly better to create a random slice of u8 and then use String::from_utf8_lossy() to make it valid first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a benchmark I would want to make the "random chars" constant so runs can be compared.
What do you think of this? Playground Each test case can use the same list of chars and filter for what it's testing for (where mixed does no filtering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random was the wrong choice of words here, sorry! It looks like a fair few other benchmarks / tests don't worry about putting in malformed utf sequences so I think it's fine to define the benchmarks this way for now.
| Additional BenchmarksBeforeAfter | 
| That's quite definitive! It's a performance win for strings containing ASCII, even if they're not entirely ASCII, and it appears entirely neutral for strings containing non-ASCII. There are likely other potential performance wins to be had here, but this is an obvious win with no apparent downside. Let's see if it affects any of the existing benchmarks. @bors try @rust-timer queue | 
| Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf | 
| ⌛ Trying commit c6dae69c489b509f675035e9a255112a66587266 with merge cd94273159f60d0494f0803385204b8aa90b0e55... | 
| ☀️ Try build successful - checks-actions | 
| Queued cd94273159f60d0494f0803385204b8aa90b0e55 with parent d95d4f0, future comparison URL. | 
| Finished benchmarking try commit (cd94273159f60d0494f0803385204b8aa90b0e55): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying  Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never | 
| Looks reasonable to merge, to me. | 
| Does this have an impact on  | 
| Yes - a positive one. At the moment the string to_uppercase calls the char fn for each char and though I think there's room for optimistion there too, that's a tail for another PR. | 
| FYI: Also in the same area: #81837 | 
This extra check has better performance. See discussion here: https://internals.rust-lang.org/t/to-upper-speed/13896
c6dae69    to
    229fdf8      
    Compare
  
    | @bors r+ | 
| 📌 Commit 229fdf8 has been approved by  | 
| ☀️ Test successful - checks-actions | 
| Please note,  | 
Update to Unicode 14.0 The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly. This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.
Update to Unicode 14.0 The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly. This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.
This extra check has better performance. See discussion here:
https://internals.rust-lang.org/t/to-upper-speed/13896
Thanks to @gilescope for helping discover and test this.