
Conversation

@EgorBo
Member

@EgorBo EgorBo commented Dec 20, 2024

First batch of changes around various stackalloc expressions in BCL.

The rules are:

  1. Make sure unbound stackallocs have upper (and lower!) bound checks. Add (uint) casts to handle negative values and overflow as well (I've checked that all these places only apply it to int, not long, etc.). In many places it's redundant, but it shouldn't affect performance, and it reduces the number of false positives reported by automated tooling. Also, add Debug.Assert to make the bounds more clear.
  2. For patterns like len <= CONST ? stackalloc T[CONST] : new T[len], change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 1024 bytes, in order to consume less stack. It shouldn't hurt performance since our libs are compiled with SkipLocalsInit (although a small overhead is still there, so we leave small const sizes as is).
     UPD: although, maybe we should do it only when the result is saved to a Span, to make sure nobody relies on some minimal size.
  3. Flag suspiciously large allocations.
  4. Change unmanaged pointers to Spans on the left side of the stackalloc expression (this PR doesn't do much of this).
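Rule 1 boils down to the following shape; this is a hedged sketch, not code from the PR — the method name and the 256 threshold are illustrative:

```csharp
using System;
using System.Diagnostics;

class GuardDemo
{
    // Illustrative example of the guarded-stackalloc pattern from rule 1.
    static int HexEncodedLength(ReadOnlySpan<byte> data)
    {
        int len = data.Length * 2;
        Debug.Assert(len >= 0);

        // The (uint) cast also rejects negative/overflowed lengths: a negative
        // int reinterpreted as uint becomes a huge value, so it fails the
        // upper-bound check and falls through to the heap path.
        Span<char> buffer = (uint)len <= 256
            ? stackalloc char[256]
            : new char[len];
        buffer = buffer.Slice(0, len);
        // ... fill buffer with hex digits ...
        return buffer.Length;
    }

    static void Main()
    {
        Console.WriteLine(HexEncodedLength(new byte[3])); // 6
    }
}
```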

Closes #110843

@ghost ghost added the needs-area-label label Dec 20, 2024
@vcsjones
Member

> change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 512 bytes in order to consume less stack.

I don't know how much of a win it is to use less stack, but in general I think it is preferable to keep a const in for the stack size. It makes it much easier to reason about "how much is going to be stack allocated" during code review and static analysis.

cc @GrabYourPitchforks Since we were just chatting about this.

@EgorBo
Member Author

EgorBo commented Dec 20, 2024

> > change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 512 bytes in order to consume less stack.
>
> I don't know how much of a win it is to use less stack, but in general I think it is preferable to keep a const in for the stack size. It makes it much easier to reason about "how much is going to be stack allocated" during code review and static analysis.

I don't have a strong preference here, but in some cases it's possible that, e.g., len is 4 while we still unconditionally allocate 1024 bytes for it.

@EgorBo
Member Author

EgorBo commented Dec 20, 2024

Related discussion: #97895

@vcsjones
Member

vcsjones commented Dec 20, 2024

#110843 is still not fully fixed. It changes a potential stack overflow into an access violation. This computation is done unchecked:

uint CounterSetInfoSize = (uint)sizeof(Interop.PerfCounter.PerfCounterSetInfoStruct)
    + (uint)_idToCounter.Count * (uint)sizeof(Interop.PerfCounter.PerfCounterInfoStruct);

Since it is unsigned and unchecked, it can overflow to a small positive value.

When we try to access memory here:

CounterInfo = (Interop.PerfCounter.PerfCounterInfoStruct*)(CounterSetBuffer + CounterSetInfoUsed);

The address will be invalid.

Can be reproduced on this PR with

[Fact]
public static void BigNumberOfCounters()
{
    CounterSet counterSet = new(Guid.NewGuid(), Guid.NewGuid(), CounterSetInstanceType.Single);

    for (int i = 0; i <= 134_217_726; i++)
    {
        counterSet.AddCounter(i, CounterType.ElapsedTime);
    }

    counterSet.CreateCounterSetInstance("potato");
}
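The failure mode is plain uint wraparound. Here is a standalone sketch; the 32-byte sizes are hypothetical stand-ins for the real interop struct sizes:

```csharp
using System;

class WrapDemo
{
    static void Main()
    {
        const uint headerSize = 32; // hypothetical sizeof(PerfCounterSetInfoStruct)
        const uint entrySize = 32;  // hypothetical sizeof(PerfCounterInfoStruct)
        uint count = 134_217_727;   // the counter count from the repro above

        // Default (unchecked) context: 32 + 134_217_727 * 32 needs exactly 2^32
        // bytes, which wraps to 0 in 32-bit unsigned arithmetic.
        uint size = headerSize + count * entrySize;
        Console.WriteLine(size); // 0
    }
}
```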

@EgorBo
Member Author

EgorBo commented Dec 20, 2024

> #110843 is still not fully fixed. It changes a potential stack overflow into an access violation. This computation is done unchecked:

Thanks. Would just wrapping the computation into a checked context be enough? There are many patterns where the stackalloc size is computed like that; presumably, we should check them all?
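The proposed fix looks like this; a minimal sketch with hypothetical fixed sizes standing in for sizeof(...) on the interop structs:

```csharp
using System;

class CheckedDemo
{
    // Hypothetical stand-ins for sizeof(PerfCounterSetInfoStruct) / sizeof(PerfCounterInfoStruct).
    const uint HeaderSize = 32;
    const uint EntrySize = 32;

    // checked: overflow throws OverflowException instead of silently wrapping.
    static uint ComputeSize(uint count) =>
        checked(HeaderSize + count * EntrySize);

    static void Main()
    {
        Console.WriteLine(ComputeSize(4)); // 160

        try { ComputeSize(134_217_727); }
        catch (OverflowException) { Console.WriteLine("overflow caught"); }
    }
}
```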

@vcsjones
Member

> Just wrapping the computation into a checked context should be enough?

Seems reasonable.

> There are many patterns where the stackalloc size is computed like that; presumably, we should check them all?

I don't see a problem with it. Many will probably be unnecessary, but I don't think it introduces any overhead that is problematic.

@jkotas
Member

jkotas commented Dec 20, 2024

> For patterns like len <= CONST ? stackalloc T[CONST] : new T[len], we change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 1024 bytes in order to consume less stack.

Hardcoded ad-hoc policies like this are never going to be "right". The proper fix would be an API like #52065 so that one can just ask for a scratch buffer and leave the choice of strategy to the system.

@tannergooding
Member

> I don't have a strong preference here, but in some cases it's possible that e.g. len is 4, but we still unconditionally allocate 1024 bytes for it

I think this is a case where we should be ensuring our code is correctly tuned.

If we know the vast majority of cases will be <= n bytes, then that's a much better threshold to use than the default max threshold (which is generally 1024, coming from the typical threshold used by native malloca implementations). If we effectively have two thresholds to handle (most are small, some are slightly larger), then I think doing 2 explicit stackalloc thresholds is better (if (size <= 64) { stackalloc byte[64]; } else if (size <= 1024) { stackalloc byte[1024]; } else { RentOrAllocateNewArray; }).
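The two-threshold idea above, written out as a compilable sketch; 64 and 1024 are illustrative cutoffs rather than vetted numbers, and the heap fallback could equally be an ArrayPool rent:

```csharp
using System;

class ThresholdDemo
{
    static int BufferFor(int size)
    {
        Span<byte> buffer;
        if ((uint)size <= 64)
            buffer = stackalloc byte[64];    // common case: tiny, constant-sized frame
        else if ((uint)size <= 1024)
            buffer = stackalloc byte[1024];  // occasional case: still constant-sized
        else
            buffer = new byte[size];         // rare case: heap (or ArrayPool) fallback
        buffer = buffer.Slice(0, size);
        // ... use buffer ...
        return buffer.Length;
    }

    static void Main()
    {
        Console.WriteLine(BufferFor(4));    // 4
        Console.WriteLine(BufferFor(500));  // 500
        Console.WriteLine(BufferFor(5000)); // 5000
    }
}
```

Because both stackalloc sizes are constants, the frame size stays fixed and no dynamic probe loop is needed.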

Dynamic stackalloc is also "expensive": it introduces a loop that has to probe pages. All our usages will be a single page probe, which means it will be mispredicted by the CPU in the default scenario, incurring an approx 20 cycle penalty. Since stackalloc should not be getting used in known recursive functions or directly in loops, it is unlikely the predictor will ever get trained correctly, and this will be a "permanent" cost. Dynamic stackalloc also means the JIT cannot easily optimize around it, like it currently does with some small fixed allocation sizes.

I also don't think the JIT implicitly doing things to the primitive stackalloc pattern is goodness; it makes code that is likely being used in low-level/tuned scenarios less predictable, meaning it can't be used for its intended scenario. I'd rather see us invest in actually exposing our own StackallocOrRent helper that can then do such optimizations itself, giving users the option of convenience vs control.


Notably, dynamic-length stackallocs are often considered dangerous as well. The pattern we're using here is safe (relatively speaking), but as the pattern gets copied around and used, it's also easy for people to misunderstand it and do the wrong thing. Since we're trying to reduce unsafe code, I think being explicit and adding documentation around these areas and why certain thresholds are used/correct is a better investment; as is pushing for the helper intrinsic that generally simplifies the pattern and abstracts away the "right" sizes to use, so that it can be correctly tuned, potentially participate in PGO, etc.

@EgorBo
Member Author

EgorBo commented Dec 20, 2024

Reverted that part. Added checked everywhere the length is calculated near a stackalloc.
Happy to replace all these places with a helper call, but it seems like that helper won't be there soon.

}

fixed (byte* pOutput = &MemoryMarshal.GetReference(notNullOutput))
fixed (byte* pOutput = notNullOutput)
Member

I think this change is introducing a bug. It looks like the code used MemoryMarshal.GetReference to avoid normalization of empty spans to null. You would have to also delete .Slice(1) above to make this change work.
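The normalization being referred to can be sketched like this (a standalone demo, not code from the PR; the span fixed-pattern calls GetPinnableReference, which returns a null reference for empty spans):

```csharp
using System;
using System.Runtime.InteropServices;

class PinDemo
{
    static unsafe void Main()
    {
        Span<byte> empty = Span<byte>.Empty;

        // The span fixed-pattern normalizes an empty span to a null pointer:
        fixed (byte* p = empty)
        {
            Console.WriteLine(p == null); // True
        }

        // Pinning the ref from MemoryMarshal.GetReference bypasses that
        // normalization: for an empty slice of a real buffer it yields a
        // pointer to the slice start, which is what the original code relied on.
        byte[] data = new byte[8];
        fixed (byte* p = &MemoryMarshal.GetReference(data.AsSpan(2, 0)))
        {
            Console.WriteLine(p != null); // True
        }
    }
}
```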

Member Author

Addressed

Contributor

If the only requirement is that the pointer is non-null, wouldn't it make sense to use MemoryMarshal.GetNonNullPinnableReference? It's used in Encoding for the same reason.

Member Author

Thanks! Replaced with GetNonNullPinnableReference

@teo-tsirpanis teo-tsirpanis added the area-Meta label and removed the needs-area-label label Dec 20, 2024
@EgorBo EgorBo marked this pull request as ready for review December 20, 2024 21:13
@EgorBo
Member Author

EgorBo commented Dec 20, 2024

This is ready for review. Here is the code analyzer I used to find all unbound stackallocs: https://github.com/EgorBo/UnsafeCodeAnalyzer/blob/main/src/Analyzer/UnboundStackallocAnalyzer.cs (reports 263 places in Libraries and Corelib excluding tests)

@ChrML

ChrML commented Dec 25, 2024

> > For patterns like len <= CONST ? stackalloc T[CONST] : new T[len], we change stackalloc T[CONST] to stackalloc T[len] if that CONST is >= 1024 bytes in order to consume less stack.
>
> Hardcoded ad-hoc policies like this are never going to be "right". The proper fix would be an API like #52065 so that one can just ask for a scratch buffer and leave the choice of strategy to the system.

I think if you already use stackalloc, you should have a good reason: you work low-level and should know when it's reasonable to do so. One could maybe make an analyzer that warns on stackallocing unverified dynamic lengths.

Imo it's better long-term to improve the cases where normal new operations are recognized by the JIT as non-escaping, do the length check there, and stackalloc them automatically, leaving stackalloc as-is, a programmer-controlled thing. That benefits existing code without changing heuristics. If this gets good enough in the future, at some point an ImmutableArray.Builder with simple value types populated by a loop could live only on the stack up to a certain number of items, even putting the builder class on the stack.

I don't think simplifying ops like this is needed, except maybe being able to create scratch buffers inside methods that get allocated at the callsite (such as ArrayBuilder.Create(int initialCapacity), which could return a ref struct and decide internally whether to put it on the stack or the heap), instead of passing in a scratch area.

@EgorBo
Member Author

EgorBo commented Dec 25, 2024

> Could maybe make an analyzer that warns on stackallocing unverified dynamic lengths.

Presumably, that requires quite complex range-check analysis, like nullability. Our general strategy is to avoid spending effort on low-level/unsafe things; if someone decides to go the unsafe route, they're on their own.

> Imo it's better longterm to improve cases where normal new operations are recognized by JIT to not escape, do the length check there and automatically stackalloc them.

Nobody argues with that, but it might take a while till we get there. The most complicated part is the inter-procedural analysis on top of functions which might change their behavior at any time (arguments suddenly start escaping), e.g. due to profilers with ReJIT attaching in the middle of the flight.

> I don't think simplifying ops like this is needed

I think till we get proper escape analysis, such an API could be a reasonable way to simplify these patterns; I personally don't have a strong opinion on it. Another con of escape analysis is that it might be fragile, and you will never know whether it kicks in or not.

@ChrML

ChrML commented Dec 25, 2024


Your reasoning makes a lot of sense; I agree.

For me personally, I think the example below would increase my productivity and safety the most when writing low-level, high-performance code. It would help more than any language keyword or implicit API. Being able to write methods like this:

static MyRefStructArrayBuilder MyRefStructArrayBuilder.CreateOnStackOrHeap();
static MyRefStructArrayBuilder MyRefStructArrayBuilder.CreateOnStackOrArrayPool();

Basically, this moves the responsibility for creating the stack scratch area, and for deciding where to allocate when it doesn't fit on the stack, into a common reusable place: the helper/builder that's used elsewhere, rather than having this allocation at each callsite and passing it in.

No idea how it would be solved though, as you can't stackalloc inside those methods and return that area without some weird caller/callee contract. It'd have to be like a macro that's always inlined into the callsite or something.

Just some thoughts; maybe someone else can spin off on it, or confirm they have the same use case as me.

else
{
byte* pUtf8Name = stackalloc byte[cUtf8Name];
list = GetListByName(pName, cNameLen, pUtf8Name, cUtf8Name, listType, cacheType);
Member

Does GetListByName still need to accept pointers?

Member Author

Addressed! Although, Filter and its MdUtf8String still expect pointers, I'll look into cleaning it up separately


bool* overrides = stackalloc bool[numVirtuals];
new Span<bool>(overrides, numVirtuals).Clear();
Span<bool> overrides = (uint)numVirtuals > 512 ? new bool[numVirtuals] : stackalloc bool[numVirtuals];
Member

Is 512 here just a heuristic?

Member Author

Yes, like everywhere else? Do you mean it should be a named constant?

Member

I've tended to leave a short comment about it being an arbitrary cutoff. I wasn't sure in this particular case if there was meaning to 512 with regards to how many virtuals could exist.

Member Author

Added a comment

{
int length = GetAlpnProtocolListSerializedLength(applicationProtocols);
Span<byte> buffer = length <= 256 ? stackalloc byte[256].Slice(0, length) : new byte[length];
Span<byte> buffer = (uint)length <= 256 ? stackalloc byte[256].Slice(0, length) : new byte[length];
Member

@stephentoub stephentoub Jan 6, 2025

When are we deciding to do stackalloc T[const].Slice(0, length) vs stackalloc T[length]? Both show up in this PR.

Member Author

I decided to preserve the current behavior in this PR, although we agreed that the stackalloc T[const].Slice(0, length) pattern is generally better.

Member

> Although, we agreed that stackalloc T[const].Slice(0, length) pattern is generally better.

Which is unfortunate as it's not intuitive or canonical: I'd hope we could make "stackalloc T[length]" do the "right thing".

Member Author

> > Although, we agreed that stackalloc T[const].Slice(0, length) pattern is generally better.
>
> Which is unfortunate as it's not intuitive or canonical: I'd hope we could make "stackalloc T[length]" do the "right thing".

@stephentoub we had a discussion in this PR about it (starting from #110864 (comment) and below).

TL;DR for stackalloc T[length]:

Pros:

  • less stack consumption (if the length is less than the threshold)
  • simpler code

Cons:

  • performance impact (an unknown length leads to additional stack probes, and much slower initialization if SkipLocalsInit is not set)
  • constant-sized stackalloc reads better/safer
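The two patterns from the list above, side by side in a standalone sketch (256 is just the threshold this particular file happens to use):

```csharp
using System;

class PatternDemo
{
    static int ConstSized(int length)
    {
        // Constant-sized frame: the stackalloc amount is fixed; only the
        // Slice depends on the runtime length. No dynamic stack probing.
        Span<byte> buffer = (uint)length <= 256
            ? stackalloc byte[256].Slice(0, length)
            : new byte[length];
        return buffer.Length;
    }

    static int DynamicSized(int length)
    {
        // Dynamic-sized frame: uses only `length` bytes of stack, but the
        // variable size costs a probe and slower zero-init without SkipLocalsInit.
        Span<byte> buffer = (uint)length <= 256
            ? stackalloc byte[length]
            : new byte[length];
        return buffer.Length;
    }

    static void Main()
    {
        Console.WriteLine(ConstSized(10));   // 10
        Console.WriteLine(DynamicSized(10)); // 10
    }
}
```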

Member

@stephentoub stephentoub Jan 6, 2025

Thanks. I understand the current tradeoffs. I'm suggesting we should think through how to avoid the perf cost of the simpler approach.

Member Author

> Thanks. I understand the current tradeoffs. I'm suggesting we should think through how to avoid the perf cost of the simpler approach.

I guess performance was not even the main concern in that debate? I agree with Jan that we either want an API that does the better thing under the hood, or we invest in escape analysis and just rely on the JIT.

Member

@stephentoub stephentoub Jan 6, 2025

If we can avoid having to write stackalloc altogether, all the better.

But until then, I don't think we should be proliferating changes from stackalloc T[length] to stackalloc T[const].Slice(0, length). That's just creating less readable/maintainable debt that will also proliferate through the ecosystem as a "best practice" when it's really just a workaround for a perf limitation, and a small one at that. I also disagree with the latter being much safer, in particular in all of the relevant cases here, where it's immediately guarded by a length check such that it's obvious just from looking at that one line that the length is in bounds.



Span<IntPtr> buffer = stackalloc IntPtr[1];
Span<byte> buffer = stackalloc byte[IntPtr.Size];
Member

Member Author

From my understanding, the codegen is better because your snippet doesn't require a GS cookie, while we always emit one for stackallocs (whether we need it for such small stackallocs has been discussed here: #52979).

Member Author

I'll see if we can safely remove the GS cookie when the JIT sees that the stackalloc is consumed into a Span.

int numberBase = IPv4AddressHelper.Decimal;
int ch = 0;
long* parts = stackalloc long[3]; // One part per octet. Final octet doesn't have a terminator, so is stored in currentValue.
Span<long> parts = stackalloc long[3]; // One part per octet. Final octet doesn't have a terminator, so is stored in currentValue.
Member

I seem to remember this resulted in some small regressions the last time it was attempted. Can we confirm that's no longer an issue?

Member Author

I've just checked: no extra bounds checks are generated.

ret);

Span<byte> part2 = stackalloc byte[ret.Length];
Debug.Assert(ret.Length == 48);
Member

Where does "48" come from?

Member Author

Added a constant

(buffer = ArrayPool<byte>.Shared.Rent((int)_size));
span = span.Slice(0, (int)_size);
(buffer = ArrayPool<byte>.Shared.Rent((int)size));
span = span.Slice(0, (int)size);
Member

If the size could actually be greater than Array.MaxLength, is this logic not busted?

Member Author

To be fair, no clue - I presume my changes don't change the behavior here?

}

fixed (byte* pOutput = &MemoryMarshal.GetReference(notNullOutput))
// We can't pass null down to the native shim, so create a valid pointer if we have an empty span,
Member

We own the code for the native shim, so this looks like a self-inflicted wound. Would it be better to relax the argument checking in the native shim instead?

Member Author

Good point! Addressed

@EgorBo
Member Author

EgorBo commented Jan 16, 2025

/azp run runtime-androidemulator

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@EgorBo
Member Author

EgorBo commented Jan 16, 2025

/ba-g "timeouts on Build android-x86 Release AllSubsets_Mono and Build Libraries Test Run release coreclr windows x64 Debug"

@EgorBo EgorBo merged commit 7790117 into dotnet:main Jan 16, 2025
148 of 152 checks passed
@EgorBo EgorBo deleted the remove-nullcheck-struct branch January 16, 2025 18:34
@github-actions github-actions bot locked and limited conversation to collaborators Feb 16, 2025


Successfully merging this pull request may close these issues.

CounterSet.CreateCounterSetInstance can stack overflow with excessive counters
