@anonrig
Member

@anonrig anonrig commented Oct 31, 2025

A small experiment using v8::String::ValueView and simdutf in the TextEncoder::encode method.
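For readers unfamiliar with the one-byte path: V8 one-byte strings hold Latin-1 code units, so transcoding them to UTF-8 is mechanical. The following is a minimal scalar sketch of that conversion, not the code in this PR. The names `utf8LengthFromLatin1` and `latin1ToUtf8` are hypothetical, and a SIMD library such as simdutf would do the same work in vectorized form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: UTF-8 byte count for a Latin-1 (one-byte) string.
// Bytes below 0x80 stay one byte; bytes at or above 0x80 become two bytes.
size_t utf8LengthFromLatin1(std::string_view input) {
  size_t length = input.size();
  for (unsigned char c : input) {
    length += c >> 7;  // adds 1 for every byte with the high bit set
  }
  return length;
}

// Hypothetical helper: scalar Latin-1 to UTF-8 transcoding.
std::string latin1ToUtf8(std::string_view input) {
  std::string out;
  out.reserve(utf8LengthFromLatin1(input));  // exact size, single allocation
  for (unsigned char c : input) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  assert(out.size() == utf8LengthFromLatin1(input));
  return out;
}
```

Because the output length is known exactly before writing, the destination buffer can be allocated once, which is the property the one-byte benchmarks are likely to be sensitive to.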

@anonrig anonrig requested review from a team as code owners October 31, 2025 14:50
@codspeed-hq

codspeed-hq bot commented Oct 31, 2025

CodSpeed Performance Report

Merging #5448 will degrade performance by 25.86%

Comparing yagiz/experiment-value-view (2e8686a) with main (92b6fbf)

Summary

⚡ 16 improvements
❌ 1 regression
✅ 36 untouched
⏩ 9 skipped (see footnote 1)

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| EncodeInto_ASCII_8192[1/0/8192] | 3.2 ms | 4.4 ms | -25.86% |
| EncodeInto_OneByte_1024[1/1/1024] | 10.5 ms | 3.4 ms | ×3.1 |
| EncodeInto_OneByte_256[1/1/256] | 4.5 ms | 2.8 ms | +59.87% |
| EncodeInto_OneByte_8192[1/1/8192] | 66.8 ms | 9.6 ms | ×7 |
| EncodeInto_TwoByte_1024[1/2/1024] | 14.1 ms | 4.1 ms | ×3.4 |
| EncodeInto_TwoByte_256[1/2/256] | 5.4 ms | 3.2 ms | +69.86% |
| EncodeInto_TwoByte_8192[1/2/8192] | 96.7 ms | 13.8 ms | ×7 |
| Encode_ASCII_1024[0/0/1024] | 3.7 ms | 3.1 ms | +19.63% |
| Encode_ASCII_256[0/0/256] | 3.2 ms | 2.6 ms | +23.78% |
| Encode_ASCII_32[0/0/32] | 3 ms | 2.3 ms | +29.62% |
| Encode_ASCII_8192[0/0/8192] | 13.1 ms | 7.6 ms | +73.87% |
| Encode_OneByte_1024[0/1/1024] | 12.1 ms | 5 ms | ×2.4 |
| Encode_OneByte_256[0/1/256] | 5.3 ms | 3.5 ms | +51.97% |
| Encode_OneByte_8192[0/1/8192] | 85.2 ms | 18.9 ms | ×4.5 |
| Encode_TwoByte_1024[0/2/1024] | 19.6 ms | 5.1 ms | ×3.8 |
| Encode_TwoByte_256[0/2/256] | 7.1 ms | 3.2 ms | ×2.2 |
| Encode_TwoByte_8192[0/2/8192] | 146.2 ms | 23.6 ms | ×6.2 |

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 30c0921 to 2259fce on October 31, 2025 17:18
@anonrig anonrig changed the title from "experiment with value view and simdutf" to "improve text encoder encode performance" on Oct 31, 2025
@jasnell
Collaborator

jasnell commented Nov 3, 2025

Looks like there are still some bugs to work out here with the failing CI... but even then, my preference would be to settle on #5449 first before landing any changes here. Also, since we do UTF-8 conversions everywhere, not just in TextEncoder::encode, my preference would be to address this more generally. That is, the optimized encoding path -- if we decide it's worthwhile -- should go into jsg::JsString so that it can be used everywhere rather than just in TextEncoder::encode.

@anonrig anonrig force-pushed the yagiz/experiment-value-view branch 2 times, most recently from 98afb46 to f1bbfe6 on November 3, 2025 17:15
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 7fac631 to 681bf71 on November 3, 2025 17:50
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch 3 times, most recently from 3a6ea76 to 6e3972e on November 3, 2025 20:04
@erikcorry
Contributor

I don't think we need to speed-optimize for the broken UTF-16 case unless and until someone shows it matters. The only reason to space-optimize would be to avoid throwing OOM, so if we can guarantee that doesn't happen I'm OK with some temporary blowup in space too.
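To make the space trade-off concrete: every UTF-16 code unit produces at most 3 UTF-8 bytes (a BMP character is at most 3 bytes, a surrogate pair is 4 bytes for 2 units, and an unpaired surrogate becomes U+FFFD, which is 3 bytes). A one-pass encoder can therefore over-allocate 3 bytes per code unit and trim afterwards, instead of scanning twice. The sketch below illustrates that idea only. `encodeUtf16Lossy` is a hypothetical name, and this is not the code from the paste linked in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: encode possibly ill-formed UTF-16 as UTF-8, replacing
// each unpaired surrogate with U+FFFD (bytes EF BF BD).
std::string encodeUtf16Lossy(std::u16string_view input) {
  std::string out;
  out.reserve(input.size() * 3);  // worst case: 3 bytes per code unit

  auto append = [&out](char32_t cp) {
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  };

  for (size_t i = 0; i < input.size(); i++) {
    char16_t c = input[i];
    bool isLead = c >= 0xD800 && c <= 0xDBFF;
    bool nextIsTrail = i + 1 < input.size() &&
                       input[i + 1] >= 0xDC00 && input[i + 1] <= 0xDFFF;
    if (isLead && nextIsTrail) {
      // Valid pair: combine both code units into one supplementary code point.
      append(0x10000 + ((c - 0xD800) << 10) + (input[i + 1] - 0xDC00));
      i++;
    } else if (c >= 0xD800 && c <= 0xDFFF) {
      append(0xFFFD);  // unpaired lead or trail surrogate
    } else {
      append(c);
    }
  }

  assert(out.size() <= input.size() * 3);
  out.shrink_to_fit();  // release the temporary over-allocation
  return out;
}
```

The cost is a temporary buffer up to three times the input length in code units, which is only a concern if that allocation itself can fail.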

@erikcorry
Contributor

https://paste.cfdata.org/GKKdyFGFqSks

@anonrig
Member Author

anonrig commented Nov 3, 2025

> https://paste.cfdata.org/GKKdyFGFqSks

This helps a lot. I'll push with your changes. Thanks, @erikcorry.


// Calculate UTF-8 length from UTF-16 with potentially invalid surrogates.
// Invalid surrogates are counted as U+FFFD (3 bytes in UTF-8).
size_t utf8LengthFromInvalidUtf16(const char16_t* input, size_t length) {
Collaborator

I mentioned this in another comment that I believe got resolved somewhere... this really does not belong in encoding.c++. Any improvement we make should be usable by the entire runtime. Let's move the optimization into jsg::JsString so that all of the APIs benefit.

Member Author

I'll do this last. I want to make sure we don't regress with your recommendations first.

Comment on lines 490 to 520
    if (pendingSurrogate) {
      if (isTrailSurrogate(c)) {
        // Valid surrogate pair = 4 bytes in UTF-8
        utf8Length += 4;
        pendingSurrogate = false;
      } else {
        // Unpaired lead surrogate = U+FFFD (3 bytes)
        utf8Length += 3;
        if (!isLeadSurrogate(c)) {
          utf8Length += utf8BytesForCodeUnit(c);
          pendingSurrogate = false;
        }
      }
    } else if (isLeadSurrogate(c)) {
      pendingSurrogate = true;
    } else {
      if (isTrailSurrogate(c)) {
        // Unpaired trail surrogate = U+FFFD (3 bytes)
        utf8Length += 3;
      } else {
        utf8Length += utf8BytesForCodeUnit(c);
      }
    }
  }

  if (pendingSurrogate) {
    utf8Length += 3;  // Trailing unpaired lead surrogate
  }

@ChALkeR ChALkeR Nov 5, 2025

(Not an actual suggestion)
Just noting that this whole logic should be equivalent to the following:

Suggested change
    if (isLeadSurrogate(c)) {
      utf8Length += 3;
      pendingSurrogate = true;
    } else if (isTrailSurrogate(c)) {
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    } else {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    }
  }

Not sure which is more performant (or whether the difference is significant at all).
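The equivalence claim can be checked mechanically. The sketch below compares the two loop bodies over every sequence (up to a chosen length) drawn from a small alphabet that covers each class of code unit. Several parts are assumptions rather than the PR's code: the helper definitions (`isLeadSurrogate`, `isTrailSurrogate`, `utf8BytesForCodeUnit`) are guesses at what the PR defines, the loop header of the first variant is not shown in the thread and is assumed to visit each code unit in order, and the names `lengthDeferred`, `lengthEager`, and `implementationsAgree` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Assumed helpers: the PR's real definitions are not shown in this thread.
constexpr bool isLeadSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool isTrailSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }
constexpr size_t utf8BytesForCodeUnit(char16_t c) {
  return c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
}

// Variant 1: defer counting a lead surrogate until the next code unit is seen.
size_t lengthDeferred(std::u16string_view input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;
  for (char16_t c : input) {
    if (pendingSurrogate) {
      if (isTrailSurrogate(c)) {
        utf8Length += 4;  // valid pair
        pendingSurrogate = false;
      } else {
        utf8Length += 3;  // previous lead was unpaired
        if (!isLeadSurrogate(c)) {
          utf8Length += utf8BytesForCodeUnit(c);
          pendingSurrogate = false;
        }
      }
    } else if (isLeadSurrogate(c)) {
      pendingSurrogate = true;
    } else if (isTrailSurrogate(c)) {
      utf8Length += 3;  // unpaired trail
    } else {
      utf8Length += utf8BytesForCodeUnit(c);
    }
  }
  if (pendingSurrogate) utf8Length += 3;  // trailing unpaired lead
  return utf8Length;
}

// Variant 2: count 3 bytes for every lead immediately, then add 1 more byte
// if a trail follows, so that a valid pair totals 4.
size_t lengthEager(std::u16string_view input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;
  for (char16_t c : input) {
    if (isLeadSurrogate(c)) {
      utf8Length += 3;
      pendingSurrogate = true;
    } else if (isTrailSurrogate(c)) {
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    } else {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    }
  }
  return utf8Length;
}

// Compare both variants on every sequence of up to maxLength code units.
bool implementationsAgree(size_t maxLength) {
  static constexpr char16_t kAlphabet[] = {0x0041, 0x00E9, 0x20AC, 0xD800,
                                           0xDBFF, 0xDC00, 0xDFFF};
  constexpr size_t kAlphabetSize = sizeof(kAlphabet) / sizeof(kAlphabet[0]);
  size_t combinations = 1;  // kAlphabetSize raised to the current length
  for (size_t length = 0; length <= maxLength; length++) {
    for (size_t index = 0; index < combinations; index++) {
      std::u16string input;
      size_t rest = index;
      for (size_t i = 0; i < length; i++) {
        input.push_back(kAlphabet[rest % kAlphabetSize]);
        rest /= kAlphabetSize;
      }
      if (lengthDeferred(input) != lengthEager(input)) return false;
    }
    combinations *= kAlphabetSize;
  }
  return true;
}
```

Calling `assert(implementationsAgree(5));` exercises roughly twenty thousand inputs, including every ordering of lead, trail, and non-surrogate units up to five code units long. This only addresses correctness. Which variant is faster would still need a benchmark.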


@ChALkeR ChALkeR Nov 5, 2025

Or perhaps some form of:

size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<const char16_t> input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;

  for (size_t i = 0; i < input.size(); i++) {
    char16_t c = input[i];

    if (c < 0xD800 || c > 0xDFFF) {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    } else if (c < 0xDC00) {
      // Lead surrogate
      utf8Length += 3;
      pendingSurrogate = true;
    } else {
      // Trail surrogate
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    }
  }

  return utf8Length;
}

Why is this even different from str.utf8Length(js), though?

Member Author

Because utf8Length flattens the string.

Collaborator

One of the things we're trying to avoid is the additional GC pressure caused by string flattening.

This function is only called in the "// Two-byte string path" code path, which has a note saying the string is already flattened.

I.e., unless something is done about https://github.com/cloudflare/workerd/pull/5448/files#r2488008265, this isn't better than utf8Length?

@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from cbe01b6 to 738e03a on November 5, 2025 23:48
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 738e03a to 5f816bc on November 11, 2025 15:19
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 5f816bc to 216aa9d on November 12, 2025 17:23
4 participants