
Conversation

@VSadov (Member) commented Apr 29, 2020

While looking at the perf of the cast cache, I noticed that the load ordering in Get is not sufficient.

We want to do:

  • read version
  • read source and read target+result (mutual order of these reads is unimportant)
  • read version

Afterwards we compare the two values of version; if they are not the same, we know the entry was concurrently changing and treat this as a cache miss.

The issue is that acquire-reads (Volatile.Read) only order the read itself against subsequent memory accesses. That means all the reads inside the "version sandwich" must be acquire. If we use acquire only for target+result, nothing formally prevents the read of source from being delayed until after we read the version for the second time.

Practically, I think observing such a reordering is unlikely because of CPU cache granularity.
Besides, to cause an incorrect result, the racing update would need to have a different target and the same source, and yet hash into the same table location.

It is still possible, in theory.

There are two possible solutions:

  1. use acquire reads for both source and target+result.
  2. use ordinary reads for source and target+result and issue a load barrier before reading version for the second time.

On microbenchmarks, option #2 is faster and overall does not change the performance of Get compared to the baseline.
Option #1, to my surprise, causes a ~15% regression on the arm64 hardware I could try it on (surprising, because the acquire reads that are already there did not have as much impact when they were introduced).

I went with option #2 here.
By itself it is a simple change, but it needs support for load barriers, which was a bit involved on the managed side.
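Roughly, the resulting Get sequence looks like this. This is a sketch with hypothetical field names (the real code is CastCache in corelib and castcache.cpp in the VM); the barrier call is the intrinsic this change adds, shown under its eventual name Interlocked.ReadMemoryBarrier:

using System.Threading;

internal struct CastCacheEntry           // illustrative layout, not the actual one
{
    internal int _version;               // bumped by Set around updates
    internal nuint _source;              // source type handle
    internal nuint _targetAndResult;     // target type handle with the result folded in
}

internal static class CastCacheSketch
{
    internal static bool TryGet(ref CastCacheEntry e, nuint source, out nuint targetAndResult)
    {
        // acquire read: none of the reads below can move above this one
        int version = Volatile.Read(ref e._version);

        // ordinary reads; their mutual order is unimportant
        nuint entrySource = e._source;
        targetAndResult = e._targetAndResult;

        // load barrier: the ordinary reads above cannot be delayed past this point
        Interlocked.ReadMemoryBarrier();

        // a changed version means a concurrent update; treat it as a miss
        return version == e._version && entrySource == source;
    }
}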

@benaadams (Member)

Is there any documentation on how this works? At first glance, s_table is a shared static in a non-generic class, so the cast table would be shared across all types and all casts, while also needing to be a concurrent structure. What is the size of s_table; does it resize, etc.?

@VSadov added the NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons) and NO-REVIEW (Experimental/testing PR, do NOT review it) labels Apr 29, 2020
@VSadov (Member, Author) commented Apr 29, 2020

@benaadams - the cache is a very simple table that maps type handle pairs {source, target} to a 3-value CastResult. Simplicity is a "feature" here: casting is relatively fast, so the cache must be faster. Occasionally having to compute a value is not a big deal though, so that is the trade-off here.

The table is available to both managed (corelib) and native (VM) code. We only have the Get part on the managed side. It is isomorphic to the native implementation of Get, where possible, for maintainability. Native has both Get and Set. That is because Set is typically done after nontrivial type analysis, and access to the type system from the managed side is limited (this might change some day).

The table is an internal implementation detail that could change, thus there is no explicit documentation. It is well commented (I hope), but it may be easier to see how it works by looking at the native side: castcache.[cpp|h].

The size doubles when Set encounters a full bucket (statistically at ~50% occupancy). There is an upper bound, chosen mostly from a "how much can we afford for this" consideration. The table should not get into degenerate behaviors when reaching the limit; it just does more preemption of old data instead of expansion.
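For reference, a rough sketch of the shapes described above; the names and the cap value are illustrative, the real definitions are in CastCache and castcache.[cpp|h]:

// the 3-value result: a miss comes back as "maybe", meaning "go compute it"
internal enum CastResult
{
    CannotCast = 0,
    CanCast = 1,
    MaybeCast = 2
}

// growth policy: double on a full bucket until a fixed budget is reached,
// after which new entries preempt old data instead of expanding the table
internal static class TableGrowthSketch
{
    private const int MaxTableSize = 4096;   // hypothetical cap

    internal static int NextTableSize(int currentSize)
        => currentSize < MaxTableSize ? currentSize * 2 : currentSize;
}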

I plan to add an ETW event on resize to see what real apps use. It is not a high priority though, since it is not expected to be actionable; it is just to have a datapoint on how the cache behaves.

@VSadov removed the NO-MERGE and NO-REVIEW labels May 1, 2020
@VSadov marked this pull request as ready for review May 1, 2020 01:25
@VSadov requested review from davidwrighton and jkotas May 1, 2020 01:27
Review thread on the new intrinsic:

/// </summary>
[Intrinsic]
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern void LoadBarrier();
Member:

Should we turn this into a public API?

Member Author:

Perhaps. Maybe add StoreBarrier for symmetry as well. I did not have a need for store fences in this change, so I did not add it.

Member:

Nit: Other similar managed APIs use Read/Write - should this follow the convention, i.e. ReadMemoryBarrier / WriteMemoryBarrier?

Member Author:

ReadMemoryBarrier sounds good.

These things are often called differently: Load, Read, Fence, Barrier. I was not sure what would be more consistent with the rest of the APIs.

Member:

Adding a public API for this seems reasonable, and I agree it'd be good to have symmetry. Might be worth checking other places where we use multiple volatile reads to see if it could be used there for similar gains on arm. Read/WriteMemoryBarrier sounds fine to me.

@jkotas requested a review from stephentoub May 1, 2020 02:31
@jkotas (Member) commented May 1, 2020

> Volatile.Read

What does the JIT compile the Volatile.Read to these days on ARM64?

@jkotas (Member) commented May 1, 2020

@TamarChristinaArm Do you have any thoughts about what the correct and most performant way to implement this on ARM64 should be?

@VSadov (Member, Author) commented May 1, 2020

Volatile.Read is typically ldar.

There is code in the JIT to fall back to an ordinary load followed by a load barrier for cases where ldar cannot be used (register, alignment, and similar requirements). That seems uncommon; I have not seen it in practice.
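For illustration, the two codegen shapes being described; the arm64 instructions in the comments are typical output as far as I know, not a guarantee:

using System.Threading;

internal class VolatileReadCodegen
{
    private int _field;

    internal int ReadAcquire() => Volatile.Read(ref _field);
    // common arm64 shape: materialize the address, then a single acquiring load
    //     add  x1, x0, #8
    //     ldar w0, [x1]
    // fallback shape when ldar cannot be used directly:
    //     ldr  w0, [x0, #8]
    //     dmb  ishld
}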

@jkotas (Member) commented May 1, 2020

cc @dotnet/jit-contrib for the JIT changes.

@stephentoub (Member) left a comment

I'd suggest renaming the method and leaving it internal as part of this PR, and then opening an issue about exposing it and the write variant publicly. Once approved, fixing that issue would presumably require making this one public, adding it to mono, adding the write variant to both, adding tests, etc.

@TamarChristinaArm (Contributor)

> @TamarChristinaArm Do you have any thoughts about what the correct and most performant way to implement this on ARM64 should be?

I'm afraid I don't know the memory model in enough detail to answer this one definitively. But I am equally surprised that option #1 ended up being slower. I can, however, try to chase this up and find an answer as to why. Which core was this tested on, @VSadov?

@CarolEidt (Contributor) left a comment

I have a couple of minor comments, but one overriding concern. It is confusing that there is a separate CORINFO_INTRINSIC_MemoryBarrierLoad that generates the load barrier on Arm64. However, there are places in the code where we insert a load barrier on Arm64, but a full barrier on Xarch. It almost seems that you want three types of barrier:

  • The load barrier that on Xarch is only an ordering constraint.
  • The barrier that generates a load barrier on Arm64 but a full barrier on Xarch.
  • The full barrier.

I'm not certain what you'd call the middle one, but if you declared 3 enum values you could eliminate some of the #ifdefs, and it would be clearer and more consistent.

@VSadov (Member, Author) commented May 1, 2020

@CarolEidt - I am surprised that we have cases where we use a full barrier on Xarch while on the weaker arm64 a load barrier is sufficient. My guess would be that on Xarch these cases would be fine with just a compiler fence (a reordering constraint), but due to the lack of expressiveness a full barrier was used instead.

Basically, I am not sure whether "the barrier that generates a load barrier on Arm64 but a full barrier on Xarch" is a thing. I can see, however, the usefulness of "the barrier that is just a compiler fence". On Xarch a load barrier could indeed be used for that purpose, but explicitly asking for a compiler fence would aid expressiveness and self-documentation.
On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

@CarolEidt (Contributor)

> I am surprised that we have cases where we use a full barrier on Xarch while on the weaker arm64 a load barrier is sufficient.

Me too, but that seems to be what's being done.

> My guess would be that on Xarch these cases would be fine with just a compiler fence (a reordering constraint), but due to the lack of expressiveness a full barrier was used instead.

That could be the explanation, but I'm not intimately familiar with this code.

> Basically, I am not sure whether "the barrier that generates a load barrier on Arm64 but a full barrier on Xarch" is a thing.

It may not be, but the code as it stands appears to handle it as "a thing", just not explicitly.

> I can see, however, the usefulness of "the barrier that is just a compiler fence". On Xarch a load barrier could indeed be used for that purpose, but explicitly asking for a compiler fence would aid expressiveness and self-documentation.

That makes sense, though it would probably best be a separate change from this one.

> On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

Could you give an example of such a case? I'm not sure I follow.

@davidwrighton (Member)

@CarolEidt - for an example of a place where a compiler barrier would have been useful, see https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs. In particular, see VolatileReadMemory, which implements the C++ semantics of a volatile read (which is just a compiler barrier). It was implemented by using Volatile.Read on X86/X64 platforms, and via a NoInline function on Arm.
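A minimal sketch of that pattern; the platform defines and names below are illustrative, not the actual corert source. A read with C++ volatile semantics needs only a compiler barrier, so x86/x64 can reuse Volatile.Read (free at the hardware level there), while on Arm an opaque non-inlined call serves as a boundary the JIT will not reorder memory accesses across:

using System.Runtime.CompilerServices;
using System.Threading;

internal static class CompilerOnlyVolatile
{
#if ARM || ARM64   // hypothetical defines standing in for the real build configuration
    // a NoInlining call is opaque to the JIT, so it acts as a compiler-only fence
    internal static int Read(ref int location) => ReadNoInline(ref location);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int ReadNoInline(ref int location) => location;
#else
    // on x86/x64 an acquire load costs nothing extra at the hardware level
    internal static int Read(ref int location) => Volatile.Read(ref location);
#endif
}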

@VSadov (Member, Author) commented May 1, 2020

> On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

> Could you give an example of such a case? I'm not sure I follow.

A simpler example could be:

if (a)
{
    flag = 1;   // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    flag = 1;
    bar();
}

But the JIT, in theory, could optimize this into:

register = a;
flag = 1;   // oops !  now this can reorder
if (register)
{
    foo();
}
else
{
    bar();
}

User or synthetic code could do the following (assuming the JIT would not CSE compiler fences):

if (a)
{
    CompilerFence();
    flag = 1;   // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    CompilerFence();
    flag = 1;
    bar();
}

I am not saying that we actually use a pattern like the above. It is just an example of where compiler reordering may interfere.

@VSadov (Member, Author) commented May 1, 2020

Anyways, my goal was adding the load barrier in an additive way without changing anything else.
That is why I went with just the two-value enum, since that is minimally sufficient and can be done in an additive way.
If there are cases where we would emit different code for existing scenarios, that would be my mistake. I will review the changes carefully to make sure that does not happen.

Any kind of rationalization of, or change to, existing barriers was not a goal of this change; that is definitely better done separately. And we will have to revisit this anyway if/when we make ReadMemoryBarrier public, since it seems we would want to add WriteMemoryBarrier as well.

@CarolEidt (Contributor)

> Anyways, my goal was adding the load barrier in an additive way without changing anything else.
> That is why I went with just the two-value enum, since that is minimally sufficient and can be done in an additive way.

But the 2-value implementation is confusing and, IMO, inconsistent, because it's unclear why we're generating full barriers on xarch when we're generating only load barriers on arm64, and then, when we have an actual load barrier, we emit nothing on xarch. Beyond that, having a 3-value enum is both clearer and avoids the #ifdefs.

@VSadov (Member, Author) commented May 1, 2020

Ah, the #ifdefs. I think I know where the confusion comes from. It is about cases like:

#ifdef TARGET_ARM64
        instGen_MemoryBarrier(BARRIER_LOAD_ONLY);
#else
        instGen_MemoryBarrier();
#endif

The preexisting code was:

#ifdef TARGET_ARM64
        instGen_MemoryBarrier(INS_BARRIER_ISHLD);
#else
        instGen_MemoryBarrier();
#endif

That is not between Xarch and others; that is all ARM-specific code that handles both arm32 and arm64.
arm32 does not have half-barriers, so any barrier ultimately has to be emitted as a full barrier. Since the barrier instruction (INS_BARRIER_ISHLD, etc.) was explicit in the API, an #ifdef was necessary. Not any more.

On Xarch the corresponding codepath would not emit any barriers (and we are actually in the emit phase, so code reordering is no longer a concern).

I did not want to change the code too much for these cases.
I guess I could change it to just

        instGen_MemoryBarrier(BARRIER_LOAD_ONLY);

and handle the arm32 specifics inside instGen_MemoryBarrier.
Would that be cleaner?

@CarolEidt (Contributor)

Thanks @VSadov - I had indeed not realized that that code handles only Arm32 and Arm64. Yes, I think it would be reasonable to handle the arm32 specifics inside instGen_MemoryBarrier; it would also be clearer that the reason for the difference is just that arm32 doesn't have half-barriers.

Thanks for the clarification!

@VSadov (Member, Author) commented May 1, 2020

@TamarChristinaArm - the testing was on Qualcomm Centriq 2400.

@CarolEidt (Contributor) left a comment

Awesome - thanks for the restructuring and all the comments!

@VSadov (Member, Author) commented May 2, 2020

Entered an issue to follow up on making Interlocked.ReadMemoryBarrier a public API: #35761

@VSadov (Member, Author) commented May 3, 2020

I think I have addressed all the concerns/questions on this PR.
Just want to make sure I am not missing something.

@VSadov (Member, Author) commented May 4, 2020

Thanks!!!

@VSadov merged commit 8527a99 into dotnet:master May 4, 2020
@VSadov deleted the fences branch May 4, 2020 00:10
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020