
Conversation

@VSadov (Member) commented Apr 29, 2020

While looking at the perf of the cast cache, I noticed that the load ordering in Get is not sufficient.

We want to do:

  • read version
  • read source and read target+result (mutual order of these reads is unimportant)
  • read version

Afterwards we compare the two values of version; if they are not the same, we know the entry was concurrently changing and treat this as a cache miss.

The issue is that acquire-reads (Volatile.Read) only order the read itself against subsequent memory accesses. That means all the reads inside the "version sandwich" must be acquire. If we use acquire only for target+result, nothing formally prevents the read of source from being delayed until after we read the version for the second time.

Practically, I think observing such a reordering is unlikely because of CPU cache granularity.
Besides, to cause an incorrect result, the racing update would need to have a different target and the same source, and yet hash into the same table location.

It is still possible, in theory.

There are two possible solutions:

  1. use acquire reads for both source and target+result.
  2. use ordinary reads for source and target+result and issue a load barrier before reading version for the second time.

On microbenchmarks, option #2 is faster and overall does not change the performance of Get compared to the baseline.
Option #1, to my surprise, causes a ~15% regression on the arm64 hardware I could try it on (surprising, because the acquire reads that are already there did not have as much impact when they were introduced).

I went with option #2 here.
By itself it is a simple change, but it needs support for load barriers, which was a bit involved on the managed side.
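Roughly, the resulting Get sequence looks like this. This is a sketch with hypothetical field names (the real code is CastCache in corelib and castcache.cpp in the VM); the barrier call is the intrinsic this change adds, shown under its eventual name Interlocked.ReadMemoryBarrier:

using System.Threading;

internal struct CastCacheEntry           // illustrative layout, not the actual one
{
    internal int _version;               // bumped by Set around updates
    internal nuint _source;              // source type handle
    internal nuint _targetAndResult;     // target type handle with the result folded in
}

internal static class CastCacheSketch
{
    internal static bool TryGet(ref CastCacheEntry e, nuint source, out nuint targetAndResult)
    {
        // acquire read: none of the reads below can move above this one
        int version = Volatile.Read(ref e._version);

        // ordinary reads; their mutual order is unimportant
        nuint entrySource = e._source;
        targetAndResult = e._targetAndResult;

        // load barrier: the ordinary reads above cannot be delayed past this point
        Interlocked.ReadMemoryBarrier();

        // a changed version means a concurrent update; treat it as a miss
        return version == e._version && entrySource == source;
    }
}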

@benaadams (Member)

Is there any documentation on how this works? At first glance, s_table is a shared static in a non-generic class, so the cast table would be shared across all types and all casts, while also needing to be a concurrent structure. What is the size of s_table; does it resize, etc.?

@VSadov added the NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons) and NO-REVIEW (Experimental/testing PR, do NOT review it) labels Apr 29, 2020
@VSadov (Member, Author) commented Apr 29, 2020

@benaadams - the cache is a very simple table that maps type handle pairs {source, target} to a 3-value CastResult. Simplicity is a "feature" here: casting is relatively fast, so the cache must be faster. Occasionally having to compute a value is not a big deal though, so that is the trade-off here.

The table is available to both managed (corelib) and native (VM) code. We only have the Get part on the managed side. It is isomorphic to the native implementation of Get, where possible, for maintainability. Native has both Get and Set. That is because Set is typically done after nontrivial type analysis, and access to the type system from the managed side is limited (this might change some day).

The table is an internal implementation detail that could change, thus there is no explicit documentation. It is well commented (I hope), but it may be easier to see how it works by looking at the native side: castcache.[cpp|h].

The size doubles when Set encounters a full bucket (statistically at ~50% occupancy). There is an upper bound, chosen mostly from a "how much can we afford for this" consideration. The table should not get into degenerate behaviors when reaching the limit; it just does more preemption of old data instead of expansion.
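For reference, a rough sketch of the shapes described above; the names and the cap value are illustrative, the real definitions are in CastCache and castcache.[cpp|h]:

// the 3-value result: a miss comes back as "maybe", meaning "go compute it"
internal enum CastResult
{
    CannotCast = 0,
    CanCast = 1,
    MaybeCast = 2
}

// growth policy: double on a full bucket until a fixed budget is reached,
// after which new entries preempt old data instead of expanding the table
internal static class TableGrowthSketch
{
    private const int MaxTableSize = 4096;   // hypothetical cap

    internal static int NextTableSize(int currentSize)
        => currentSize < MaxTableSize ? currentSize * 2 : currentSize;
}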

I plan to add an ETW event on resize to see what real apps use. It is not a high priority though, since it is not expected to be actionable; it is just to have a datapoint on how the cache behaves.

@VSadov removed the NO-MERGE and NO-REVIEW labels May 1, 2020
@VSadov marked this pull request as ready for review May 1, 2020 01:25
@VSadov requested review from davidwrighton and jkotas May 1, 2020 01:27
Review thread on the new intrinsic:

/// </summary>
[Intrinsic]
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern void LoadBarrier();
Member:

Should we turn this into a public API?

Member Author:

Perhaps. Maybe add StoreBarrier for symmetry as well. I did not have a need for store fences in this change, so I did not add it.

Member:

Nit: Other similar managed APIs use Read/Write - should this follow the convention, i.e. ReadMemoryBarrier / WriteMemoryBarrier?

Member Author:

ReadMemoryBarrier sounds good.

These things are often called differently: Load, Read, Fence, Barrier. I was not sure what would be more consistent with the rest of the APIs.

Member:

Adding a public API for this seems reasonable, and I agree it'd be good to have symmetry. Might be worth checking other places where we use multiple volatile reads to see if it could be used there for similar gains on arm. Read/WriteMemoryBarrier sounds fine to me.

@jkotas requested a review from stephentoub May 1, 2020 02:31
@jkotas (Member) commented May 1, 2020

> Volatile.Read

What does the JIT compile the Volatile.Read to these days on ARM64?

@jkotas (Member) commented May 1, 2020

@TamarChristinaArm Do you have any thoughts about what the correct and most performant way to implement this on ARM64 should be?

@VSadov (Member, Author) commented May 1, 2020

Volatile.Read is typically ldar.

There is code in the JIT to fall back to an ordinary load followed by a load barrier for cases where ldar cannot be used (register, alignment, and similar requirements). That seems uncommon; I have not seen it in practice.
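For illustration, the two codegen shapes being described; the arm64 instructions in the comments are typical output as far as I know, not a guarantee:

using System.Threading;

internal class VolatileReadCodegen
{
    private int _field;

    internal int ReadAcquire() => Volatile.Read(ref _field);
    // common arm64 shape: materialize the address, then a single acquiring load
    //     add  x1, x0, #8
    //     ldar w0, [x1]
    // fallback shape when ldar cannot be used directly:
    //     ldr  w0, [x0, #8]
    //     dmb  ishld
}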

@jkotas (Member) commented May 1, 2020

cc @dotnet/jit-contrib for the JIT changes.

@stephentoub (Member) left a comment

I'd suggest renaming the method and leaving it internal as part of this PR, and then opening an issue about exposing it and the write variant publicly. Once approved, fixing that issue would presumably require making this one public, adding it to mono, adding the write variant to both, adding tests, etc.

@TamarChristinaArm (Contributor)

> @TamarChristinaArm Do you have any thoughts about what the correct and most performant way to implement this on ARM64 should be?

I'm afraid I don't know the memory model in enough detail to answer this one definitively. But I am equally surprised that option #1 ended up being slower. I can, however, try to chase this up and find an answer as to why. Which core was this tested on, @VSadov?

@CarolEidt (Contributor) left a comment

I have a couple of minor comments, but one overriding concern. It is confusing that there is a separate CORINFO_INTRINSIC_MemoryBarrierLoad that generates the load barrier on Arm64. However, there are places in the code where we insert a load barrier on Arm64, but a full barrier on Xarch. It almost seems that you want three types of barrier:

  • The load barrier that on Xarch is only an ordering constraint.
  • The barrier that generates a load barrier on Arm64 but a full barrier on Xarch.
  • The full barrier.

I'm not certain what you'd call the middle one, but if you declared 3 enum values you could eliminate some of the #ifdefs, and it would be clearer and more consistent.

@VSadov (Member, Author) commented May 1, 2020

@CarolEidt - I am surprised that we have cases where we use a full barrier on Xarch while on the weaker arm64 a load barrier is sufficient. My guess would be that on Xarch these cases would be fine with just a compiler fence (a reordering constraint), but due to the lack of expressiveness a full barrier was used instead.

Basically, I am not sure whether "the barrier that generates a load barrier on Arm64 but a full barrier on Xarch" is a thing. I can see, however, the usefulness of "the barrier that is just a compiler fence". On Xarch a load barrier could indeed be used for that purpose, but explicitly asking for a compiler fence would aid expressiveness and self-documentation.
On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

@CarolEidt (Contributor)

> I am surprised that we have cases where we use a full barrier on Xarch while on the weaker arm64 a load barrier is sufficient.

Me too, but that seems to be what's being done.

> My guess would be that on Xarch these cases would be fine with just a compiler fence (a reordering constraint), but due to the lack of expressiveness a full barrier was used instead.

That could be the explanation, but I'm not intimately familiar with this code.

> Basically, I am not sure whether "the barrier that generates a load barrier on Arm64 but a full barrier on Xarch" is a thing.

It may not be, but the code as it stands appears to handle it as "a thing", just not explicitly.

> I can see, however, the usefulness of "the barrier that is just a compiler fence". On Xarch a load barrier could indeed be used for that purpose, but explicitly asking for a compiler fence would aid expressiveness and self-documentation.

That makes sense, though it would probably best be a separate change from this one.

> On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

Could you give an example of such a case? I'm not sure I follow.

@davidwrighton (Member)

@CarolEidt - for an example of a place where a compiler barrier would have been useful, see https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs. In particular, see VolatileReadMemory, which implements the C++ semantics of a volatile read (which is just a compiler barrier). It was implemented by using Volatile.Read on X86/X64 platforms, and via a NoInline function on Arm.
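A minimal sketch of that pattern; the platform defines and names below are illustrative, not the actual corert source. A read with C++ volatile semantics needs only a compiler barrier, so x86/x64 can reuse Volatile.Read (free at the hardware level there), while on Arm an opaque non-inlined call serves as a boundary the JIT will not reorder memory accesses across:

using System.Runtime.CompilerServices;
using System.Threading;

internal static class CompilerOnlyVolatile
{
#if ARM || ARM64   // hypothetical defines standing in for the real build configuration
    // a NoInlining call is opaque to the JIT, so it acts as a compiler-only fence
    internal static int Read(ref int location) => ReadNoInline(ref location);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int ReadNoInline(ref int location) => location;
#else
    // on x86/x64 an acquire load costs nothing extra at the hardware level
    internal static int Read(ref int location) => Volatile.Read(ref location);
#endif
}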

@VSadov (Member, Author) commented May 1, 2020

> On arm64 a compiler fence could be used intentionally where order is guaranteed by other means but changes in program order could mess that up.

> Could you give an example of such a case? I'm not sure I follow.

A simpler example could be:

if (a)
{
    flag = 1;   // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    flag = 1;
    bar();
}

But the JIT, in theory, could optimize this into:

register = a;
flag = 1;   // oops !  now this can reorder
if (register)
{
    foo();
}
else
{
    bar();
}

User or synthetic code could do the following (assuming the JIT would not CSE compiler fences):

if (a)
{
    CompilerFence();
    flag = 1;   // assignment happens after reading "a", since stores are never speculative.
    foo();
}
else
{
    CompilerFence();
    flag = 1;
    bar();
}

I am not saying that we actually use a pattern like the above. It is just an example of where compiler reordering may interfere.

@VSadov (Member, Author) commented May 1, 2020

Anyways, my goal was adding the load barrier in an additive way without changing anything else.
That is why I went with just the two-value enum, since that is minimally sufficient and can be done in an additive way.
If there are cases where we would emit different code for existing scenarios, that would be my mistake. I will review the changes carefully to make sure that does not happen.

Any kind of rationalization of, or change to, existing barriers was not a goal of this change; that is definitely better done separately. And we will have to revisit this anyway if/when we make ReadMemoryBarrier public, since it seems we would want to add WriteMemoryBarrier as well.

@CarolEidt (Contributor)

> Anyways, my goal was adding the load barrier in an additive way without changing anything else.
> That is why I went with just the two-value enum, since that is minimally sufficient and can be done in an additive way.

But the 2-value implementation is confusing and, IMO, inconsistent, because it's unclear why we're generating full barriers on xarch when we're generating only load barriers on arm64, and then, when we have an actual load barrier, we emit nothing on xarch. Beyond that, having a 3-value enum is both clearer and avoids the #ifdefs.

@VSadov (Member, Author) commented May 1, 2020

Ah, the #ifdefs. I think I know where the confusion comes from. It is about cases like:

#ifdef TARGET_ARM64
        instGen_MemoryBarrier(BARRIER_LOAD_ONLY);
#else
        instGen_MemoryBarrier();
#endif

The preexisting code was:

#ifdef TARGET_ARM64
        instGen_MemoryBarrier(INS_BARRIER_ISHLD);
#else
        instGen_MemoryBarrier();
#endif

That is not between Xarch and others; that is all ARM-specific code that handles both arm32 and arm64.
arm32 does not have half-barriers, so any barrier ultimately has to be emitted as a full barrier. Since the barrier instruction (INS_BARRIER_ISHLD, etc.) was explicit in the API, an #ifdef was necessary. Not any more.

On Xarch the corresponding codepath would not emit any barriers (and we are actually in the emit phase, so code reordering is no longer a concern).

I did not want to change the code too much for these cases.
I guess I could change it to just

        instGen_MemoryBarrier(BARRIER_LOAD_ONLY);

and handle the arm32 specifics inside instGen_MemoryBarrier.
Would that be cleaner?

@CarolEidt (Contributor)

Thanks @VSadov - I had indeed not realized that that code handles only Arm32 and Arm64. Yes, I think it would be reasonable to handle the arm32 specifics inside instGen_MemoryBarrier; it would also be clearer that the reason for the difference is just that arm32 doesn't have half-barriers.

Thanks for the clarification!

@VSadov (Member, Author) commented May 1, 2020

@TamarChristinaArm - the testing was on Qualcomm Centriq 2400.

@CarolEidt (Contributor) left a comment

Awesome - thanks for the restructuring and all the comments!

@VSadov (Member, Author) commented May 2, 2020

Entered an issue to follow up on making Interlocked.ReadMemoryBarrier a public API: #35761

@VSadov (Member, Author) commented May 3, 2020

I think I have addressed all the concerns/questions on this PR.
Just want to make sure I am not missing something.

@VSadov (Member, Author) commented May 4, 2020

Thanks!!!

@VSadov merged commit 8527a99 into dotnet:master May 4, 2020
@VSadov deleted the fences branch May 4, 2020 00:10
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020