Skip to content

Fix str encode in ractors #674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

casperisfine
Copy link

Author is @luke-gruber, opening a draft to ease review.

encoding.c Outdated
Comment on lines 859 to 876
size_t input_len = strlen(name);
switch(input_len) {
case 5:
if (strncmp(name, string_UTF_8, 5) == 0) {
return ENCINDEX_UTF_8;
}
case 8:
if (strncmp(name, string_US_ASCII, 8) == 0) {
return ENCINDEX_US_ASCII;
}
case 10:
if (strncmp(name, string_ASCII_8BIT, 10) == 0) {
return ENCINDEX_ASCII_8BIT;
}
default:
break;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? If one care about perf, they're use Encoding objects (e.g. Encoding::UTF_8), not encoding names. Also all these encodings have tons of aliases so I'm not even sure this opt will match often.

>> Encoding.find("utf-8")
=> #<Encoding:UTF-8>
>> Encoding.find("UTF-8")
=> #<Encoding:UTF-8>

Copy link

@luke-gruber luke-gruber Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out it was going through this path fairly often because when doing any kind of transcoding, only the names of the encodings are passed, not the rb_encodings . I added an internal API (static functions) that does allow passing encodings now, so this should be less necessary. However, other files still reference these other functions, so this is still a win imo, especially with ractors.
Edit: I changed it to use STRCASECMP

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to use STRCASECMP

I still really don't like this. We should fix whatever API is calling this, to pass a rb_encoding * instead.

It's not just about capitalization:

>> Encoding.find("ascii")
=> #<Encoding:US-ASCII>
>> Encoding.find("us-ascii")
=> #<Encoding:US-ASCII>

I understand that this short circuit pays off, but it's really not clean.

@luke-gruber luke-gruber force-pushed the fix_str_encode_in_ractors branch 4 times, most recently from e7c1df8 to ff72419 Compare August 6, 2025 18:40
@luke-gruber
Copy link

@Shopify/byroot It's ready for a re-review. I split my original branch into 2, this PR now includes only deadlock fixes (no perf gains). The other branch is built on top of this and includes perf. gains. I will make a PR for the other branch when it's ready if this gets merged. 🙇

@@ -6684,14 +6690,15 @@ env_shift(VALUE _)
VALUE result = Qnil;
VALUE key = Qnil;

rb_encoding *enc = env_encoding();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So env_encoding calls rb_locale_encoding which calls rb_locale_encindex(), which acquire the VM lock, so this isn't helping much.

I'd start by making rb_locale_encindex() lock-free in most case. A simple atomic check to see if the locale encoding was initialized should suffice.

Copy link

@luke-gruber luke-gruber Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was more that env_encoding could load the encoding, and that needed to be done outside the VM lock or else it deadlocks but I'll look into making it atomic 👍 It shouldn't load the encoding often, but native exts could call setlocale() in which case it could.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. That makes sense.

Perhaps it doesn't matter, as this is only used when accessing ENV, which hopefulyl wouldn't be in a hotspot. But yeah, the amount of work done every time is surprising.

I'll look into making it atomic

No need. Let's make if safe. We'll see about optimizing later.

@luke-gruber luke-gruber force-pushed the fix_str_encode_in_ractors branch 2 times, most recently from 51fe205 to c5b37eb Compare August 7, 2025 17:22
Make sure VM lock is not held when calling `load_transcoder_entry`, as
that causes deadlock inside ractors. `String#encode` now works inside
ractors, among others.
@luke-gruber luke-gruber force-pushed the fix_str_encode_in_ractors branch from c5b37eb to 1c9856c Compare August 7, 2025 17:57
Without this, wbcheck would sometimes hit a missing write barrier.

Co-authored-by: John Hawthorn <[email protected]>
@luke-gruber luke-gruber force-pushed the fix_str_encode_in_ractors branch from 49d1da5 to 6f5a722 Compare August 7, 2025 19:31
This st_table can be inserted into at runtime when autoloading
encodings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants