Safer stackallocs #110864
Conversation
I don't know how much of a win it is to use less stack, but in general I think it is preferable to keep a const in for the stack size. It makes it much easier to reason about "how much is going to be stack allocated" during code review and static analysis. cc @GrabYourPitchforks since we were just chatting about this.
I don't have a strong preference here, but in some cases it's possible that e.g. …

Related discussion: #97895
#110843 is still not fully fixed. It changes a potential stack overflow into an access violation. This computation is done unchecked (Lines 208 to 209 in b6038a4). Since it is unsigned and unchecked, it can overflow to a small positive value. When we try to access memory here (Line 234 in b6038a4), the address will be invalid. Can be reproduced on this PR with:

[Fact]
public static void BigNumberOfCounters()
{
    CounterSet counterSet = new(Guid.NewGuid(), Guid.NewGuid(), CounterSetInstanceType.Single);
    for (int i = 0; i <= 134_217_726; i++)
    {
        counterSet.AddCounter(i, CounterType.ElapsedTime);
    }
    counterSet.CreateCounterSetInstance("potato");
}
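The wraparound at the heart of this repro can be sketched with illustrative numbers; the per-counter size and header below are hypothetical placeholders, not the actual perf-counter layout:

```csharp
using System;

class UncheckedOverflowSketch
{
    static void Main()
    {
        // Hypothetical layout for illustration only: 32 bytes per counter plus
        // a 40-byte header. 134_217_727 counters * 32 bytes is just under 2^32,
        // so the unchecked unsigned total wraps to a tiny value and a buffer
        // sized from it is far smaller than the data later written into it.
        uint numCounters = 134_217_727;
        uint bytesPerCounter = 32;
        uint header = 40;

        uint total = unchecked(numCounters * bytesPerCounter + header);
        Console.WriteLine(total); // prints 8: wrapped past 2^32
    }
}
```

Checked arithmetic (or widening to ulong plus an explicit bounds check) turns this silent wraparound into an exception instead of an invalid address.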
Thanks. Just wrapping the computation into …
Seems reasonable.
I don't see a problem with it. Many will probably be unnecessary, but I don't think it introduces any overhead that is problematic.
Hardcoded ad-hoc policies like this are never going to be "right". The proper fix would be an API like #52065 so that one can just ask for a scratch buffer and leave the choice of strategy to the system.
I think this is a case where we should be ensuring our code is correctly tuned. If we know the vast majority of cases will be …

Dynamic stackalloc is also "expensive": it introduces a loop that has to probe pages. All our usages will be a single page probe, which means that will be mispredicted by the CPU in the default scenario, incurring an approx 20 cycle penalty. Since stackalloc should not be getting used in known recursive functions or directly in loops, it is unlikely the predictor will ever get trained correctly and this will be a "permanent" cost. Dynamic stackalloc also means the JIT cannot easily optimize around it, like it currently does with some small fixed allocation sizes. I also don't think the JIT implicitly doing things to the primitive …

Notably, dynamic length stackallocs are often considered dangerous as well. The pattern we're using here is safe (relatively speaking), but as the pattern gets copied around and used, it's also easy for people to misunderstand and do the wrong thing. Since we're trying to reduce unsafe code, I think being explicit and adding documentation around these areas and why certain thresholds are used/correct is a better investment; as is pushing for the helper intrinsic that generally simplifies the pattern and abstracts away the "right" sizes to use so that it can be correctly tuned, potentially participate in PGO, etc.
Reverted that part. Added …
  }
- fixed (byte* pOutput = &MemoryMarshal.GetReference(notNullOutput))
+ fixed (byte* pOutput = notNullOutput)
I think this change is introducing a bug. It looks like the code used MemoryMarshal.GetReference to avoid normalization of empty spans to null. You would have to also delete .Slice(1) above to make this change work.
Addressed
If the only requirement is that the pointer is non-null, wouldn't it make sense to use MemoryMarshal.GetNonNullPinnableReference? It's used in Encoding for the same reason.
Thanks! Replaced with GetNonNullPinnableReference
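For context, the normalization being worked around can be demonstrated with public APIs; GetNonNullPinnableReference itself is an internal CoreLib helper, so this sketch uses MemoryMarshal.GetReference to show the same non-null-pointer effect:

```csharp
using System;
using System.Runtime.InteropServices;

class EmptySpanPinningSketch
{
    static unsafe void Main()
    {
        byte[] backing = new byte[4];
        Span<byte> empty = backing.AsSpan(2, 0); // empty, but backed by real memory

        // Span<T>.GetPinnableReference normalizes empty spans to a null pointer.
        fixed (byte* p = empty)
        {
            Console.WriteLine(p == null); // True
        }

        // MemoryMarshal.GetReference bypasses that normalization and exposes the
        // span's underlying reference, so the pinned pointer stays non-null.
        fixed (byte* p = &MemoryMarshal.GetReference(empty))
        {
            Console.WriteLine(p == null); // False
        }
    }
}
```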
Co-authored-by: Miha Zupan <[email protected]>
This is ready for review. Here is the code analyzer I used to find all unbound stackallocs: https://github.com/EgorBo/UnsafeCodeAnalyzer/blob/main/src/Analyzer/UnboundStackallocAnalyzer.cs (reports 263 places in Libraries and Corelib excluding tests)
I think if you already use "stackalloc", you should already have a good reason, you work low-level, and should have knowledge about when it's reasonable to do so. Could maybe make an analyzer that warns on stackallocing unverified dynamic lengths. IMO it's better long-term to improve cases where normal new operations are recognized by the JIT to not escape, do the length check there, and automatically stackalloc them. And leave stackalloc as-is, a programmer-controlled thing. Benefits existing code without changing heuristics. If this gets good enough in the future, at some point an ImmutableArray.Builder with simple value types populated by a loop could occur only on the stack up to a certain number of items, even putting the builder class on the stack. I don't think simplifying ops like this is needed, except maybe being able to create scratch buffers inside methods that get allocated at the callsite (such as ArrayBuilder.Create(int initialCapacity), which could return a ref struct and decide internally if put on stack or heap), instead of passing in a scratch area.
Presumably, that requires quite complex range check analysis, like nullability. Our general strategy is to avoid spending effort on low-level/unsafe things; if someone decides to go the unsafe route, they're on their own.
Nobody argues with that, but it might take a while till we get there; the most complicated part is the inter-procedural analysis on top of functions which might change their behavior at any time (arguments suddenly start escaping), e.g. due to profilers with ReJIT attached in the middle of the flight.
I think till we get to the proper escape analysis it could be a reasonable API to simplify these patterns; I personally don't have a strong opinion on it. Another "con" of the escape analysis is that it might be fragile and you will never know whether it kicks in or not.
Your reasoning makes very much sense, I agree. For me personally, I think the example below would increase my productivity/safety most when writing low-level / high-performance code. It would help more than any language keyword or implicit API. Being able to write methods like this:

static MyRefStructArrayBuilder MyRefStructArrayBuilder.CreateOnStackOrHeap();
static MyRefStructArrayBuilder MyRefStructArrayBuilder.CreateOnStackOrArrayPool();

Basically moving responsibility for making the stack scratch area, and for where to allocate if it doesn't fit the stack, into a common reusable place: within the helper/builder that's used elsewhere, rather than having this allocation at each callsite and passing it in. No idea how it would be solved though, as you can't stackalloc inside those and return that area without some weird caller/callee contract. It'd have to be like a macro that's always inlined to the callsite or something. Just some thoughts, maybe someone else can spin off on it, or confirm if they have the same use case as me.
  else
  {
      byte* pUtf8Name = stackalloc byte[cUtf8Name];
      list = GetListByName(pName, cNameLen, pUtf8Name, cUtf8Name, listType, cacheType);
Does GetListByName still need to accept pointers?
Addressed! Although, Filter and its MdUtf8String still expect pointers, I'll look into cleaning it up separately
- bool* overrides = stackalloc bool[numVirtuals];
- new Span<bool>(overrides, numVirtuals).Clear();
+ Span<bool> overrides = (uint)numVirtuals > 512 ? new bool[numVirtuals] : stackalloc bool[numVirtuals];
Is 512 here just a heuristic?
Yes, like everywhere else? Do you mean it should be a named constant?
I've tended to leave a short comment about it being an arbitrary cutoff. I wasn't sure in this particular case if there was meaning to 512 with regards to how many virtuals could exist.
Added a comment
  {
      int length = GetAlpnProtocolListSerializedLength(applicationProtocols);
- Span<byte> buffer = length <= 256 ? stackalloc byte[256].Slice(0, length) : new byte[length];
+ Span<byte> buffer = (uint)length <= 256 ? stackalloc byte[256].Slice(0, length) : new byte[length];
When are we deciding to do stackalloc T[const].Slice(0, length) vs stackalloc T[length]? Both show up in this PR.
I decided to preserve the current behavior in this PR. Although, we agreed that stackalloc T[const].Slice(0, length) pattern is generally better.
Although, we agreed that stackalloc T[const].Slice(0, length) pattern is generally better.
Which is unfortunate as it's not intuitive or canonical: I'd hope we could make "stackalloc T[length]" do the "right thing".
Although, we agreed that stackalloc T[const].Slice(0, length) pattern is generally better.
Which is unfortunate as it's not intuitive or canonical: I'd hope we could make "stackalloc T[length]" do the "right thing".
@stephentoub we had a discussion in this PR about it (starting from #110864 (comment) and below).
TL;DR:
stackalloc T[length]:
Pros:
- less stack consumption (if the length is less than the threshold)
- simpler code
Cons:
- Performance impact (unknown length leads to additional stack probes, a lot slower initialization if SkipLocalsInit is not set)
- Constant-sized stackalloc reads better/safer
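A side-by-side sketch of the two patterns summarized above; the 256-byte threshold is illustrative, not taken from any specific call site:

```csharp
using System;

class StackallocPatternsSketch
{
    // Constant-size pattern: fixed frame growth, single page probe, and the JIT
    // can optimize around the known size; always reserves the full 256 bytes.
    static void ConstSized(int length)
    {
        Span<byte> buffer = (uint)length <= 256
            ? stackalloc byte[256].Slice(0, length)
            : new byte[length];
        buffer.Fill(0xFF); // placeholder work
    }

    // Dynamic-size pattern: consumes only `length` bytes and reads naturally,
    // but needs a probe loop, and without SkipLocalsInit the zeroing cost is
    // proportional to the runtime length.
    static void DynamicSized(int length)
    {
        Span<byte> buffer = (uint)length <= 256
            ? stackalloc byte[length]
            : new byte[length];
        buffer.Fill(0xFF); // placeholder work
    }
}
```

In both variants the (uint) cast makes one comparison reject negative lengths as well, since a negative int reinterprets as a huge unsigned value and falls into the heap path.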
Thanks. I understand the current tradeoffs. I'm suggesting we should think through how to avoid the perf cost of the simpler approach.
Thanks. I understand the current tradeoffs. I'm suggesting we should think through how to avoid the perf cost of the simpler approach.
I guess performance was not even the main concern in that debate? I agree with Jan that we either want an API that does what is better under the hood, or we invest into escape analysis and just rely on JIT
If we can avoid having to write stackalloc altogether, all the better.
But until then, I don't think we should be proliferating changing stackalloc T[length] to stackalloc T[const].Slice(0, length). That's just creating less readable / maintainable debt that will also proliferate through the ecosystem as a "best practice" when it's really just a workaround for a perf limitation, and a small one at that. I also disagree with the latter being much safer, in particular in all of the relevant cases here, where it's immediately guarded by a length check such that it's obvious just from looking at that one line that the length is in-bounds.
- Span<IntPtr> buffer = stackalloc IntPtr[1];
+ Span<byte> buffer = stackalloc byte[IntPtr.Size];
The codegen for:

IntPtr p = default;
Span<byte> buffer = MemoryMarshal.AsBytes(new Span<IntPtr>(ref p));

Is there any way to write the equivalent without using MemoryMarshal.AsBytes? Maybe with some JIT tweaks?
From my understanding, the codegen is better because your snippet doesn't require a GS cookie while we always emit it for stackallocs (whether we need it or not for such small stackallocs has been discussed here: #52979)
I'll see if we can safely remove the GS cookie when the JIT sees that the stackalloc is consumed into a Span
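The alternative under discussion can be sketched as follows; it sidesteps stackalloc (and therefore the GS cookie) by viewing a single pointer-sized local as bytes:

```csharp
using System;
using System.Runtime.InteropServices;

class PointerSizedBufferSketch
{
    static void Main()
    {
        // Reinterpret one stack local as a byte span instead of stackalloc-ing.
        // MemoryMarshal.CreateSpan builds a span over the local; AsBytes views
        // its storage as bytes. No variable-size stack allocation is involved.
        IntPtr p = default;
        Span<byte> buffer = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref p, 1));
        Console.WriteLine(buffer.Length == IntPtr.Size); // True
    }
}
```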
  int numberBase = IPv4AddressHelper.Decimal;
  int ch = 0;
- long* parts = stackalloc long[3]; // One part per octet. Final octet doesn't have a terminator, so is stored in currentValue.
+ Span<long> parts = stackalloc long[3]; // One part per octet. Final octet doesn't have a terminator, so is stored in currentValue.
I seem to remember this resulted in some small regressions the last time it was attempted. Can we confirm that's no longer an issue?
I've just checked - no extra bounds checks generated
  ret);
  Span<byte> part2 = stackalloc byte[ret.Length];
+ Debug.Assert(ret.Length == 48);
Where does "48" come from?
Added a constant
- (buffer = ArrayPool<byte>.Shared.Rent((int)_size));
- span = span.Slice(0, (int)_size);
+ (buffer = ArrayPool<byte>.Shared.Rent((int)size));
+ span = span.Slice(0, (int)size);
If the size could actually be greater than Array.MaxLength, is this logic not busted?
To be fair, no clue - I presume my changes don't change the behavior here?
  }
- fixed (byte* pOutput = &MemoryMarshal.GetReference(notNullOutput))
+ // We can't pass null down to the native shim, so create a valid pointer if we have an empty span,
We own the code for the native shim, so this looks like a self-inflicted wound. Would it be better to relax the argument checking in the native shim instead?
Good point! Addressed
/azp run runtime-androidemulator
Azure Pipelines successfully started running 1 pipeline(s).
/ba-g "timeouts on
First batch of changes around various stackalloc expressions in the BCL. The rules are:

1) Add (uint) casts to handle negative values/overflows as well (I've checked all places that it's only applied to int, no long, etc.). In many places it's redundant, but it shouldn't affect performance and it reduces the number of false positives reported by automated tooling. Also, add Debug.Assert to make it more clear.

2) For patterns like len <= CONST ? stackalloc T[CONST] : new T[len], we change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 1024 bytes in order to consume less stack. It shouldn't hurt performance since our libs are compiled with SkipLocalsInit (although a small overhead is still there, so we leave it as is for small const sizes).

UPD: although, maybe we should do it only when it's saved to a Span to make sure nobody relies on some minimal size.
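A sketch of rule 1; the 64-element threshold, char element type, and method name are illustrative only:

```csharp
using System;
using System.Diagnostics;

class UintCastGuardSketch
{
    // One unsigned comparison covers both "too large" and "negative" lengths,
    // because a negative int reinterprets as a value near uint.MaxValue.
    static void Process(int len)
    {
        Debug.Assert(len >= 0);
        Span<char> buffer = (uint)len <= 64
            ? stackalloc char[64].Slice(0, len)
            : new char[len]; // a negative len falls through to here and throws,
                             // instead of reaching the stackalloc with a bogus size
        // ... use buffer ...
    }
}
```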
Closes #110843