improve text encoder encode performance #5448

anonrig · 2025-10-31T14:50:27Z

Small experiment with v8::String::ValueView and simdutf for the TextEncoder::encode method.

src/workerd/api/encoding.c++

codspeed-hq · 2025-10-31T15:10:31Z

CodSpeed Performance Report

Merging #5448 will degrade performance by 25.86%

Comparing yagiz/experiment-value-view (2e8686a) with main (92b6fbf)

Summary

⚡ 16 improvements
❌ 1 regression
✅ 36 untouched
⏩ 9 skipped¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
❌	`EncodeInto_ASCII_8192[1/0/8192]`	3.2 ms	4.4 ms	-25.86%
⚡	`EncodeInto_OneByte_1024[1/1/1024]`	10.5 ms	3.4 ms	×3.1
⚡	`EncodeInto_OneByte_256[1/1/256]`	4.5 ms	2.8 ms	+59.87%
⚡	`EncodeInto_OneByte_8192[1/1/8192]`	66.8 ms	9.6 ms	×7
⚡	`EncodeInto_TwoByte_1024[1/2/1024]`	14.1 ms	4.1 ms	×3.4
⚡	`EncodeInto_TwoByte_256[1/2/256]`	5.4 ms	3.2 ms	+69.86%
⚡	`EncodeInto_TwoByte_8192[1/2/8192]`	96.7 ms	13.8 ms	×7
⚡	`Encode_ASCII_1024[0/0/1024]`	3.7 ms	3.1 ms	+19.63%
⚡	`Encode_ASCII_256[0/0/256]`	3.2 ms	2.6 ms	+23.78%
⚡	`Encode_ASCII_32[0/0/32]`	3 ms	2.3 ms	+29.62%
⚡	`Encode_ASCII_8192[0/0/8192]`	13.1 ms	7.6 ms	+73.87%
⚡	`Encode_OneByte_1024[0/1/1024]`	12.1 ms	5 ms	×2.4
⚡	`Encode_OneByte_256[0/1/256]`	5.3 ms	3.5 ms	+51.97%
⚡	`Encode_OneByte_8192[0/1/8192]`	85.2 ms	18.9 ms	×4.5
⚡	`Encode_TwoByte_1024[0/2/1024]`	19.6 ms	5.1 ms	×3.8
⚡	`Encode_TwoByte_256[0/2/256]`	7.1 ms	3.2 ms	×2.2
⚡	`Encode_TwoByte_8192[0/2/8192]`	146.2 ms	23.6 ms	×6.2

¹ 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

src/workerd/api/encoding.c++

jasnell · 2025-11-03T17:01:52Z

Looks like there are still some bugs to work out here with the failing CI... but even then, my preference would be to settle on #5449 first before landing any changes here. Also, since we do utf8 conversions everywhere, not just in TextEncoder::encode, my preference would be to address this more generally. That is, the optimized encoding path -- if we decide it's worthwhile -- should go into jsg::JsString so that it can be used everywhere rather than just in TextEncoder::encode.

src/workerd/api/encoding.c++

erikcorry · 2025-11-03T20:45:07Z

I don't think we need to speed optimize for the broken UTF-16 case unless and until someone shows it matters. The only reason to space-optimize would be to avoid throwing OOM, so if we can guarantee that doesn't happen I'm OK with some temporary blowup in space too.
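
A rough illustration of the space trade-off discussed above, not code from this PR: each UTF-16 code unit expands to at most 3 UTF-8 bytes, so reserving `3 * length` bytes up front can never be too small, at the cost of a temporary over-allocation. The name `encodeUtf16Lossy` and the use of `std::string` are assumptions made to keep the sketch self-contained; the PR itself experiments with v8::String::ValueView and simdutf rather than a scalar loop like this one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not workerd code): converts UTF-16 to UTF-8 and
// replaces every unpaired surrogate with U+FFFD.
std::string encodeUtf16Lossy(std::u16string_view input) {
  // Worst case is 3 bytes per code unit: a BMP code unit needs 1 to 3 bytes,
  // a valid surrogate pair needs 4 bytes for 2 units, and an unpaired
  // surrogate becomes U+FFFD, which is 3 bytes.
  const size_t worstCase = input.size() * 3;
  std::string out;
  out.reserve(worstCase);

  // Appends one Unicode scalar value as UTF-8.
  auto append = [&out](char32_t cp) {
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  };

  for (size_t i = 0; i < input.size(); i++) {
    char16_t c = input[i];
    if (c < 0xD800 || c > 0xDFFF) {
      append(c);  // Not a surrogate.
    } else if (c < 0xDC00 && i + 1 < input.size() &&
               input[i + 1] >= 0xDC00 && input[i + 1] <= 0xDFFF) {
      // Lead surrogate followed by a trail surrogate: one supplementary code point.
      append(0x10000 + ((c - 0xD800) << 10) + (input[i + 1] - 0xDC00));
      i++;
    } else {
      append(0xFFFD);  // Unpaired lead or trail surrogate.
    }
  }

  assert(out.size() <= worstCase);  // The bound argued above.
  out.shrink_to_fit();              // Non-binding request to release the slack.
  return out;
}
```

With this shape no separate length pass is needed for broken input; the price is that the reservation can be up to three times larger than the final result until it is trimmed.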

erikcorry · 2025-11-03T21:05:55Z

https://paste.cfdata.org/GKKdyFGFqSks

anonrig · 2025-11-03T21:28:36Z

https://paste.cfdata.org/GKKdyFGFqSks

This helps a lot. I'll push with your changes. Thanks @erikcorry

jasnell · 2025-11-03T22:01:18Z

src/workerd/api/encoding.c++

+
+// Calculate UTF-8 length from UTF-16 with potentially invalid surrogates.
+// Invalid surrogates are counted as U+FFFD (3 bytes in UTF-8).
+size_t utf8LengthFromInvalidUtf16(const char16_t* input, size_t length) {


I mentioned this in another comment that I believe got resolved somewhere... this really does not belong in encoding.c++. Any improvement we make should be usable by the entire runtime. Let's move the optimization into jsg::JsString so that all of the APIs benefit.

I'll do this last. I want to make sure we don't regress with your recommendations first.

src/workerd/api/encoding.c++

src/workerd/jsg/jsvalue.h

src/workerd/jsg/buffersource.h

src/workerd/api/encoding.c++

ChALkeR · 2025-11-05T01:50:31Z

src/workerd/api/encoding.c++

+    if (pendingSurrogate) {
+      if (isTrailSurrogate(c)) {
+        // Valid surrogate pair = 4 bytes in UTF-8
+        utf8Length += 4;
+        pendingSurrogate = false;
+      } else {
+        // Unpaired lead surrogate = U+FFFD (3 bytes)
+        utf8Length += 3;
+        if (!isLeadSurrogate(c)) {
+          utf8Length += utf8BytesForCodeUnit(c);
+          pendingSurrogate = false;
+        }
+      }
+    } else if (isLeadSurrogate(c)) {
+      pendingSurrogate = true;
+    } else {
+      if (isTrailSurrogate(c)) {
+        // Unpaired trail surrogate = U+FFFD (3 bytes)
+        utf8Length += 3;
+      } else {
+        utf8Length += utf8BytesForCodeUnit(c);
+      }
+    }
+  }
+
+  if (pendingSurrogate) {
+    utf8Length += 3;  // Trailing unpaired lead surrogate
+  }


(Not an actual suggestion)
Just noting that this whole logic should be identical to the following:

Suggested change

    if (isLeadSurrogate(c)) {
      utf8Length += 3;
      pendingSurrogate = true;
    } else if (isTrailSurrogate(c)) {
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    } else {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    }
  }

Not sure which is more performant (or if that is significant at all)

Or perhaps some form of:

size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<const char16_t> input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;
  for (size_t i = 0; i < input.size(); i++) {
    char16_t c = input[i];
    if (c < 0xD800 || c > 0xDFFF) {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    } else if (c < 0xDC00) {
      // Lead surrogate
      utf8Length += 3;
      pendingSurrogate = true;
    } else {
      // Trail surrogate
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    }
  }
  return utf8Length;
}
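
To back up the claim that the two formulations are identical, here is a small standalone check, offered as a sketch rather than as part of the PR. It uses `std::u16string_view` instead of `kj::ArrayPtr`, and the bodies of `utf8BytesForCodeUnit`, `isLeadSurrogate` and `isTrailSurrogate` are assumptions, since their definitions are not shown in this thread. The names `utf8LengthOriginal`, `utf8LengthCompact` and `variantsAgree` are made up for the comparison.

```cpp
#include <cassert>  // for the usage shown after this block
#include <cstddef>
#include <string>
#include <string_view>

// Assumed helper bodies; the real ones are not shown in the review thread.
inline size_t utf8BytesForCodeUnit(char16_t c) {
  return c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
}
inline bool isLeadSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isTrailSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// The state machine from the diff: a lead surrogate is counted only once it
// is known whether a trail surrogate follows.
size_t utf8LengthOriginal(std::u16string_view input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;
  for (char16_t c : input) {
    if (pendingSurrogate) {
      if (isTrailSurrogate(c)) {
        utf8Length += 4;  // Valid pair.
        pendingSurrogate = false;
      } else {
        utf8Length += 3;  // Previous lead was unpaired.
        if (!isLeadSurrogate(c)) {
          utf8Length += utf8BytesForCodeUnit(c);
          pendingSurrogate = false;
        }
      }
    } else if (isLeadSurrogate(c)) {
      pendingSurrogate = true;
    } else if (isTrailSurrogate(c)) {
      utf8Length += 3;  // Unpaired trail.
    } else {
      utf8Length += utf8BytesForCodeUnit(c);
    }
  }
  if (pendingSurrogate) utf8Length += 3;  // Trailing unpaired lead.
  return utf8Length;
}

// The compact form: count 3 for every lead surrogate right away, then add
// only 1 more when a trail surrogate completes the pair (3 + 1 = 4).
size_t utf8LengthCompact(std::u16string_view input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;
  for (char16_t c : input) {
    if (c < 0xD800 || c > 0xDFFF) {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    } else if (c < 0xDC00) {
      utf8Length += 3;
      pendingSurrogate = true;
    } else {
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    }
  }
  return utf8Length;
}

// Compares both variants on every sequence of up to maxLen code units drawn
// from a sample alphabet covering 1-, 2- and 3-byte units and both surrogate
// ranges. Returns false on the first mismatch.
bool variantsAgree(size_t maxLen) {
  constexpr char16_t alphabet[] = {0x41, 0xE9, 0x20AC, 0xD800, 0xDBFF, 0xDC00, 0xDFFF};
  constexpr size_t n = sizeof(alphabet) / sizeof(alphabet[0]);
  for (size_t len = 0; len <= maxLen; len++) {
    size_t total = 1;
    for (size_t i = 0; i < len; i++) total *= n;
    for (size_t code = 0; code < total; code++) {
      std::u16string s(len, u'\0');
      size_t rest = code;
      for (size_t i = 0; i < len; i++) {
        s[i] = alphabet[rest % n];
        rest /= n;
      }
      if (utf8LengthOriginal(s) != utf8LengthCompact(s)) return false;
    }
  }
  return true;
}
```

A call such as `assert(variantsAgree(6));` enumerates every sequence of up to six code units from the sample alphabet. It only checks that the counts match; it says nothing about which variant is faster.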

Why is this even different from str.utf8Length(js), though?

because utf8Length flattens the string

One of the things we're trying to avoid is the additional GC pressure caused by string flattening

This function is only called in the "// Two-byte string path" code path, which has a note about the string already being flattened.

I.e., unless something is done about https://github.com/cloudflare/workerd/pull/5448/files#r2488008265, this isn't better than utf8Length?

src/workerd/jsg/jsvalue.c++

anonrig requested review from a team as code owners October 31, 2025 14:50

jasnell reviewed Oct 31, 2025
