[NativeAOT] Using the same CastCache implementation as in CoreClr #84430
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details
Fixes: #75111
Reasons:
    #if TARGET_64BIT
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static ulong RotateLeft(ulong value, int offset)
This should be deleted and replaced by BitOperations.RotateLeft.
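For reference, a minimal sketch of the suggested replacement, assuming the helper is only used for hash mixing (the mixing shown here is illustrative, not the actual CastCache code):

    using System.Numerics;
    using System.Runtime.CompilerServices;

    internal static class HashMixing
    {
        // Illustrative only: uses the framework-provided rotate instead of a
        // private RotateLeft helper. BitOperations.RotateLeft compiles to a
        // single rotate instruction on x64/arm64.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong Mix(ulong source, ulong target)
        {
            return BitOperations.RotateLeft(source, 32) ^ target;
        }
    }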
I introduced this because NativeAOT was not building in the test configuration, which was missing a lot of things like Volatile, Interlocked, and BitOperations.
Later I moved the custom implementations under Test.CoreLib, but missed this one.
We may want to omit this cache for Test.CoreLib. Test.CoreLib has a simplistic implementation of a number of other subsystems.
At this point it would be easy to move these helpers to Test.CoreLib as custom BitOperations implementations. Then we can have the cache in Test.CoreLib and get extra test coverage.
However, if having a minimal implementation is the point, then I can instead add a Test.CoreLib-specific implementation of the cache that does not do anything.
> Then we can have the cache in Test.CoreLib and get extra test coverage.

There are no relevant tests running against Test.CoreLib. All tests that matter for this change run against the real CoreLib.
I reverted the additional API implementations for Test.CoreLib and added a trivial no-op cache instead.
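For context, a trivial no-op cache along these lines might look roughly like the sketch below (type and member names are illustrative, not the actual Test.CoreLib code): every lookup misses, so casting always takes the slow path.

    // Illustrative no-op stand-in: every lookup reports a miss and every
    // insert is dropped, so the minimal corelib never depends on the real cache.
    internal static class CastCache
    {
        public static bool TryGet(nuint source, nuint target, out bool result)
        {
            result = false;
            return false;   // always a miss; callers fall back to the slow path
        }

        public static void TrySet(nuint source, nuint target, bool result)
        {
            // intentionally empty: no caching in the minimal corelib
        }
    }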
    // - issue a load barrier before reading _version
    // benchmarks on available hardware (Jan 2020) show that use of a read barrier is cheaper.

    #if CORECLR
If this is required for correctness, it should not be under an ifdef.
Either a load fence is needed here, or the two loads above need to be acquires. When this was measured at the time it was implemented, one fence was noticeably cheaper.
We need to add Interlocked.ReadMemoryBarrier() to NativeAOT, and it needs to be an intrinsic; there is no point in implementing it as an internal call.
I will log an issue on Interlocked.ReadMemoryBarrier(). Once that is added, we can remove the ifdefs here and around the two acquires above.
Logged an issue for adding Interlocked.ReadMemoryBarrier() intrinsic - #84445
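A rough sketch of the trade-off being discussed (not the actual CastCache code): either both payload loads are acquire loads, or they stay plain loads and a single load barrier is issued before re-checking the version. Interlocked.MemoryBarrier() below is a conservative stand-in for the cheaper read-only barrier requested in #84445.

    using System.Threading;

    internal struct VersionedEntry
    {
        public int Version;
        public long Source;
        public long TargetAndResult;
    }

    internal static class VersionedRead
    {
        // Returns true only if the payload was read without a concurrent update.
        public static bool TryRead(ref VersionedEntry e, out long source, out long targetAndResult)
        {
            int version = Volatile.Read(ref e.Version);   // acquire: orders the loads below

            source = e.Source;                            // plain load
            targetAndResult = e.TargetAndResult;          // plain load

            // Stand-in for a read-only barrier: keeps the payload loads from
            // moving past the version re-check.
            Interlocked.MemoryBarrier();

            return version == Volatile.Read(ref e.Version);
        }
    }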
    }

    // the rest is the support for updating the cache.
    // in CoreClr the cache is only updated in the native code
Would it make sense for CoreCLR to just call this managed implementation instead?
Interesting idea. I think what could prevent that is:

- If we need to use the cache before we can run managed code. The JIT calls into the casting machinery, so it is hard to tell whether we may need the cache too early.
- If we need to use the cache before the managed cctor has run. I'd guess the cctor could be forced to run early enough.
- If we need to use the cache from some mode that does not allow managed code. Most likely this is not a requirement, but I am not sure.
- If calling managed code from native has high overhead. Adding to the cache is reasonably fast; we do not see the cost of adding entries even at startup, when we'd expect cache misses. I do not have a good sense of how expensive it is to call managed code from the native runtime in this context.

Would we want to call managed TryGet as well? That would be a lot more sensitive to the overhead, and TryGet is GC_NOTRIGGER; I vaguely remember there were some reasons for that.
I think this is worth considering, but it should be a separate change. It does not look like a straightforward tweak and may need some experimenting.
Also, the impact of such a change would be mostly in CoreClr, while this PR mostly affects NativeAOT, so in terms of watching for failures or stress/perf consequences, it would be better to have separate changes.
Logged an issue to follow up on switching CoreCLR to use managed TrySet and maybe TryGet - #84448
I have run some simple scenarios with this change to check that there are no obvious performance regressions. It looks like cached casting is slightly faster with the new cache, but not by much. The key point is to have a cache in the first place, and NativeAOT already had one.

I also noticed that CoreClr is faster than NativeAOT. I think the reason is the code that runs before we hit the cache, that is, the code that lives in … It may be worth looking at the cast entry points. Perhaps there are some opportunities there.

The microbenchmark that I used to see the impact of the cache in the "easy" case:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    internal class Program
    {
        const int iters = 1000000;

        static void Main(string[] args)
        {
            for (; ; )
            {
                Time(TestLStringToIROCstring);
            }
        }

        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }
            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static object o = new List<string>();

        static void TestLStringToIROCstring()
        {
            for (int i = 0; i < iters; i++)
            {
                if (o as IReadOnlyCollection<object> == null)
                    throw null;
                if (o as IReadOnlyCollection<string> == null)
                    throw null;
                if (o as IEnumerable<object> == null)
                    throw null;
                if (o as IEnumerable<string> == null)
                    throw null;
            }
        }
    }

On my machine (x64), I see:
More code sharing. Yay!
IsCloned support is not used. I would not feel bad about deleting it. I think it is unlikely that we will ever use it for anything. CoreCLR has the MethodTable flags specifically shaped to make casting fast. We may want to copy/unify that shape.
Right. Also, the CoreClr entry points were carefully crafted to minimize branches and benefit from tail calling where possible. Perhaps at some cost to readability, but for that code it is acceptable. I will look at what we can borrow from there.
We also have some unnecessary checks, like "is this null" or "is this exactly the same type", for things that RyuJIT already tests for. The code was written for UTC, not RyuJIT. We just need to validate that all the non-codegen callees (i.e. when we call into casting from our regular C# code) also do the appropriate checks, and delete the ifs.
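To illustrate the shape of the redundancy (a hypothetical helper, not the actual entry points, which operate on MethodTable pointers rather than System.Type): when RyuJIT has already emitted the null and exact-type tests at the call site, the first two checks below are dead weight on the codegen path, while non-codegen callers still need them.

    using System;

    internal static class CastHelpersSketch
    {
        public static object IsInstanceOfIllustration(object obj, Type target)
        {
            // Redundant when the JIT already emitted a null check at the call site.
            if (obj == null)
                return null;

            // Also redundant for codegen callers: the JIT tests the exact type first.
            if (obj.GetType() == target)
                return obj;

            // Cache lookup and the slow path would go here.
            return target.IsInstanceOfType(obj) ? obj : null;
        }
    }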
Thanks!!