Incomplete read ordering in CastCache::Get #35597
Conversation
Is there any documentation on how this works? At first glance the
@benaadams - the cache is a very simple table that maps type handle pairs {source, target} to a 3-value CastResult. Simplicity is a "feature" here. Casting is relatively fast, so the cache must be faster. Occasionally having to compute a value is not a big deal though, so that is the trade-off here.

The table is available to both managed (corelib) and native (VM) code. We only have the Get part on the managed side. It is isomorphic to the native implementation of Get, where possible, for maintainability. Native has both Get and Set. That is because Set is typically done after nontrivial type analysis, and access to the type system on the managed side is limited (this might change some day).

The table is an internal implementation detail which could change, thus no explicit documentation. It is well commented (I hope), but it may be easier to see how it works by looking at the native side.

The size doubles up when a full bucket is encountered by Set (statistically ~50% occupancy). There is an upper bound. It was chosen mostly from a "how much can we afford for this" consideration. The table should not get into degenerate behaviors when reaching the limit, just do more preemption of old data vs. expansion.

I plan to add an ETW event to the resize to see what real apps use. It is not a high priority though, since it is not expected to be actionable - just to have a datapoint on how the cache behaves.
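To make that shape concrete, here is a much-simplified sketch of such a table (names, hashing, and layout are hypothetical, not the actual corelib/VM code):

```cs
// Hypothetical, much-simplified sketch of the cast cache described above.
public enum CastResult { CannotCast, CanCast, MaybeCast }

public sealed class TinyCastCache
{
    private struct Entry
    {
        public nuint Source;          // source type handle
        public nuint Target;          // target type handle
        public CastResult Result;
        public bool Valid;
    }

    // Power-of-two size; the real table doubles when Set hits a full bucket,
    // up to a fixed upper bound, after which old entries are simply overwritten.
    private readonly Entry[] _table = new Entry[256];

    public bool TryGet(nuint source, nuint target, out CastResult result)
    {
        int index = (int)(((source * 0x9E3779B9u) ^ target) & (nuint)(_table.Length - 1));
        Entry e = _table[index];
        if (e.Valid && e.Source == source && e.Target == target)
        {
            result = e.Result;        // hit
            return true;
        }
        result = default;             // miss: the caller computes the cast and may cache it
        return false;
    }
}
```

The real Get additionally reads a version field around the payload reads, which is exactly the ordering this PR tightens.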
```cs
/// </summary>
[Intrinsic]
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern void LoadBarrier();
```
Should we turn this into a public API?
Perhaps. Maybe add StoreBarrier for symmetry as well. I did not have a need for store fences in this change, so I did not add it.
Nit: other similar managed APIs use Read/Write - should this follow the convention? i.e. ReadMemoryBarrier / WriteMemoryBarrier?
ReadMemoryBarrier sounds good.
These things are often called differently - Load, Read, Fence, Barrier. I was not sure what would be more consistent with the rest of APIs.
Adding a public API for this seems reasonable, and I agree it'd be good to have symmetry. Might be worth checking other places we use multiple volatile reads to see if it could be used there for similar gains on arm. Read/WriteMemoryBarrier sounds fine to me.
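If it were exposed, a Read/WriteMemoryBarrier pair is the likely shape. Until something like that exists publicly, the only portable stand-in is the existing full barrier; a hedged sketch with hypothetical helper names:

```cs
using System.Threading;

static class MemoryBarriers
{
    // Hypothetical helpers: until a public read/write barrier exists, the portable
    // fallback is the full two-way barrier, which is stronger (and costlier) than needed.
    public static void ReadMemoryBarrier() => Interlocked.MemoryBarrier();
    public static void WriteMemoryBarrier() => Interlocked.MemoryBarrier();
}
```

On Arm64 the point of a dedicated read barrier would be to emit the cheaper `dmb ishld` rather than a full `dmb ish`.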
What does the JIT compile the
@TamarChristinaArm Do you have any thoughts about what the correct and most performant way to implement this on ARM64 should be?
There is code in the JIT to fall back to an ordinary load followed by a load barrier for cases where
cc @dotnet/jit-contrib for the JIT changes.
I'd suggest renaming the method and leaving it internal as part of this PR, and then opening an issue about exposing it and the write variant publicly. Once approved, fixing that issue would presumably require making this one public, adding it to mono, adding the write variant to both, adding tests, etc.
I'm afraid I don't know the memory model in enough detail to be able to answer this one definitively. But I am equally surprised that option #1 ended up being slower. I can however try to chase this up and find out why. Which core was this tested on, @VSadov?
CarolEidt
left a comment
I have a couple of minor comments, but one overriding concern. It is confusing that there is a separate CORINFO_INTRINSIC_MemoryBarrierLoad that generates the load barrier on Arm64. However, there are places in the code where we insert a load barrier on Arm64, but a full barrier on Xarch. It almost seems that you want three types of barrier:
- The load barrier that on Xarch is only an ordering constraint.
- The barrier that generates a load barrier on Arm64 but a full barrier on Xarch.
- The full barrier.
I'm not certain what you'd call the middle one, but if you declared 3 enum values you could eliminate some of the #ifdefs and it would be more clear and consistent.
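To make that concrete, the three kinds might be spelled out roughly like this (illustrative only; the real enum would live in the native JIT sources and these names are made up):

```cs
// Illustration of the three barrier kinds discussed above (hypothetical names).
enum BarrierKind
{
    LoadOnly,    // no machine fence on xarch (compiler ordering only); dmb ishld on Arm64
    LoadOrFull,  // full barrier on xarch, but only a load barrier on Arm64 - the "middle" case
    Full,        // full two-way barrier on every architecture
}
```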
@CarolEidt - I am surprised that we have cases where we use a full barrier on Xarch while on the weaker arm64 a load barrier is sufficient. My guess would be that on Xarch these cases would be ok with just a compiler fence (a reordering constraint), but due to the lack of expressiveness a full barrier was used instead.

Basically - I am not sure if "the barrier that generates a load barrier on Arm64 but a full barrier on Xarch" is a thing. I can see, however, the usefulness of "the barrier that is just a compiler fence". On Xarch a load barrier could be used for that purpose, indeed, but explicitly asking for a compiler fence would aid expressiveness and self-documentation.
Me too, but that seems to be what's being done.
That could be the explanation, but I'm not intimately familiar with this code.
It may not be, but the code as it stands appears to handle it as "a thing" just not explicitly.
That makes sense, though it would probably best be a separate change from this one.
Could you give an example of such a case? I'm not sure I follow.
@CarolEidt for an example of a place where a compiler barrier would have been useful, see https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs. In particular, see VolatileReadMemory, which implements the C++ semantics for a volatile read (which is just a compiler barrier). It was implemented by using Volatile.Read on X86/X64 platforms, and via a NoInline function on Arm.
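Roughly, that per-platform pattern looks like the following sketch (hypothetical names, not the actual ObjectHeader code; TARGET_X86/TARGET_AMD64 stand in for whatever build symbols the project defines):

```cs
using System.Runtime.CompilerServices;
using System.Threading;

static class CompilerOnlyFence
{
    // A read with C++-volatile semantics: only a compiler reordering constraint, no CPU fence.
    internal static int VolatileReadMemory(ref int location)
    {
#if TARGET_X86 || TARGET_AMD64
        // On x86/x64 plain loads already have acquire semantics, so Volatile.Read costs
        // nothing extra and acts purely as a compiler fence.
        return Volatile.Read(ref location);
#else
        // Elsewhere, an opaque non-inlined call is a crude stand-in that the JIT
        // cannot reorder memory accesses across.
        return ReadNoInline(ref location);
#endif
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int ReadNoInline(ref int location) => location;
}
```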
A simpler example could be:

```
if (a)
{
    flag = 1; // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    flag = 1;
    bar();
}
```

But the JIT, in theory, could optimize this into:

```
register = a;
flag = 1; // oops! now this can reorder
if (register)
{
    foo();
}
else
{
    bar();
}
```

To prevent that, user or synthetic code could do the following:

```
if (a)
{
    CompilerFence();
    flag = 1; // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    CompilerFence();
    flag = 1;
    bar();
}
```

I am not saying that we actually use a pattern like the above. It is just an example where compiler reordering may interfere.
Anyways - my goal was adding the load barrier in an additive way without changing anything else. Any kind of rationalization or changes to existing barriers was not a goal of this change. That is definitely better done separately. And we will have to revisit this anyways if/when we make
But the 2-value implementation is confusing and, IMO, inconsistent, because it's unclear why we're generating full barriers on xarch when we're generating only load barriers on arm64, and then when we have an actual load barrier we emit nothing on xarch. Beyond that, having a 3-value enum is both clearer and avoids the #ifdefs.
Ah, the preexisting code was: That is not between Xarch and others; that is all ARM-specific code that handles both arm32 and arm64. On Xarch the corresponding codepath would not emit any barriers (and we are actually in the emit phase, so code reordering is no longer a concern). I did not want to change the code too much for these cases and handle arm32 specifics inside
Thanks @VSadov - I was indeed confused that that code was handling only Arm32 and Arm64. Yes, I think it would be reasonable to handle the arm32 specifics inside.

Thanks for the clarification!
@TamarChristinaArm - the testing was on Qualcomm Centriq 2400.
CarolEidt
left a comment
Awesome - thanks for the restructuring and all the comments!
Entered an issue to follow up on making this public.
I think I have addressed all the concerns/questions on this PR.
Thanks!!!
While looking at perf of the cast cache, I noticed that the load ordering in Get is not sufficient.

We want to do:

- read `version`
- read `source` and read `target+result` (mutual order of these reads is unimportant)
- read `version`

Later we compare the values of `version` and, if they are not the same, we know that the entry was concurrently changing and handle this as a cache miss.

The issue is that acquire-reads (`Volatile.Read`) only order the actual read vs. subsequent memory accesses. That means all the reads inside the "version sandwich" must be acquire. If we use acquire only for `target+result`, nothing formally prevents the read of `source` from being delayed until after we read `version` for the second time.

Practically, I think observing such reordering is unlikely because of CPU cache granularity. Besides, to cause an incorrect result the racing update would need to have a different `target` and the same `source` and yet hash into the same table location. It is still possible, in theory.

We can have two solutions:

1. use acquire reads for both `source` and `target+result`.
2. use ordinary reads for `source` and `target+result` and issue a load barrier before reading `version` for the second time.

On microbenchmarks option #2 is faster and overall does not change the performance of Get compared to baseline. Option #1, to my surprise, causes a ~15% regression on the arm64 hardware that I could try (surprising, because the acquires that are already there did not have as much impact when they were introduced). I went with option #2 here.

By itself it is a simple change, but it needs support for load barriers, which was a bit involved on the managed side.