-
Notifications
You must be signed in to change notification settings - Fork 468
improve text encoder encode performance #5448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #5448 will degrade performances by 25.86%Comparing Summary
Benchmarks breakdown
Footnotes
|
30c0921 to
2259fce
Compare
|
Looks like there are still some bugs to work out here with the failing CI... but even then, my preference would be to settle on #5449 first before landing any changes here. Also, since we do utf8 conversions everywhere, not just in TextEncoder::encode, my preference would be to address this more generally. That is, the optimized encoding path -- if we decide it's worthwhile -- should go into |
98afb46 to
f1bbfe6
Compare
7fac631 to
681bf71
Compare
3a6ea76 to
6e3972e
Compare
|
I don't think we need to speed optimize for the broken UTF-16 case unless and until someone shows it matters. The only reason to space-optimize would be to avoid throwing OOM, so if we can guarantee that doesn't happen I'm OK with some temporary blowup in space too. |
|
This helps a lot. I'll push with your changes. Thanks @erikcorry |
src/workerd/api/encoding.c++
Outdated
|
|
||
| // Calculate UTF-8 length from UTF-16 with potentially invalid surrogates. | ||
| // Invalid surrogates are counted as U+FFFD (3 bytes in UTF-8). | ||
| size_t utf8LengthFromInvalidUtf16(const char16_t* input, size_t length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned this in another comment that I believe got resolved somewhere... this really does not belong in encoding.c++. Any improvement we make should be usable by the entire runtime. Let's move the optimization into jsg::JsString so that all of the APIs benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this last. I want to make sure we don't regress with your recommendations first.
src/workerd/api/encoding.c++
Outdated
| if (pendingSurrogate) { | ||
| if (isTrailSurrogate(c)) { | ||
| // Valid surrogate pair = 4 bytes in UTF-8 | ||
| utf8Length += 4; | ||
| pendingSurrogate = false; | ||
| } else { | ||
| // Unpaired lead surrogate = U+FFFD (3 bytes) | ||
| utf8Length += 3; | ||
| if (!isLeadSurrogate(c)) { | ||
| utf8Length += utf8BytesForCodeUnit(c); | ||
| pendingSurrogate = false; | ||
| } | ||
| } | ||
| } else if (isLeadSurrogate(c)) { | ||
| pendingSurrogate = true; | ||
| } else { | ||
| if (isTrailSurrogate(c)) { | ||
| // Unpaired trail surrogate = U+FFFD (3 bytes) | ||
| utf8Length += 3; | ||
| } else { | ||
| utf8Length += utf8BytesForCodeUnit(c); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (pendingSurrogate) { | ||
| utf8Length += 3; // Trailing unpaired lead surrogate | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not an actual suggestion)
Just noting that this whole logic should be identical to the following:
| if (pendingSurrogate) { | |
| if (isTrailSurrogate(c)) { | |
| // Valid surrogate pair = 4 bytes in UTF-8 | |
| utf8Length += 4; | |
| pendingSurrogate = false; | |
| } else { | |
| // Unpaired lead surrogate = U+FFFD (3 bytes) | |
| utf8Length += 3; | |
| if (!isLeadSurrogate(c)) { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| pendingSurrogate = false; | |
| } | |
| } | |
| } else if (isLeadSurrogate(c)) { | |
| pendingSurrogate = true; | |
| } else { | |
| if (isTrailSurrogate(c)) { | |
| // Unpaired trail surrogate = U+FFFD (3 bytes) | |
| utf8Length += 3; | |
| } else { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| } | |
| } | |
| } | |
| if (pendingSurrogate) { | |
| utf8Length += 3; // Trailing unpaired lead surrogate | |
| } | |
| if (isLeadSurrogate(c)) { | |
| utf8Length += 3; | |
| pendingSurrogate = true; | |
| } else if (isTrailSurrogate(c)) { | |
| utf8Length += pendingSurrogate ? 1 : 3; | |
| pendingSurrogate = false; | |
| } else { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| pendingSurrogate = false; | |
| } | |
| } |
Not sure which is more performant (or if that is significant at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps some form of:
size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<const char16_t> input) {
size_t utf8Length = 0;
bool pendingSurrogate = false;
for (size_t i = 0; i < input.size(); i++) {
char16_t c = input[i];
if (c < 0xD800 || c > 0xDFFF) {
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
} else if (c < 0xDC00) {
// Lead surrogate
utf8Length += 3;
pendingSurrogate = true;
} else {
// Trail surrogate
utf8Length += pendingSurrogate ? 1 : 3;
pendingSurrogate = false;
}
}
return utf8Length;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is even different from str.utf8Length(js) though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because utf8Length flattens the string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the things we're trying to avoid is the additional GC pressure caused by string flattening
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is only called in the // Two-byte string path codepath which has a note about string being already flattened
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. unless something is done with https://github.com/cloudflare/workerd/pull/5448/files#r2488008265, this isn't better than utf8length?
cbe01b6 to
738e03a
Compare
738e03a to
5f816bc
Compare
5f816bc to
216aa9d
Compare
small experiment with v8::String::ValueView and simdutf for TextEncoder::encode method.