Skip to content

Conversation

@VSadov
Copy link
Member

@VSadov VSadov commented Apr 6, 2023

Fixes: #75111

Reasons:

  • code sharing
  • CoreClr implementation does not require a Crst , or any kind of locks.

@VSadov VSadov requested a review from MichalStrehovsky as a code owner April 6, 2023 17:29
@ghost ghost assigned VSadov Apr 6, 2023
@ghost
Copy link

ghost commented Apr 6, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #75111

Reasons:

  • code sharing
  • CoreClr implementation does not require a Crst , or any kind of locks.
Author: VSadov
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -


#if TARGET_64BIT
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ulong RotateLeft(ulong value, int offset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be deleted and replaced by BitOperations.RotateLeft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've introduced this because NativeAOT was not building in the test configuration that was missing a lot of things - like Volatile, Interlocked, BitOperations.
Later I moved custom implementations under Test.CoreLib, but missed this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to omit this cache for Test.CoreLib. Test.CoreLib has simplistic implementation of number of other subsystems.

Copy link
Member Author

@VSadov VSadov Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point it would be easy to move these helpers to Test.CoreLib - as custom BitOperations implementations. Then we can have the cache in Test.CorLib and have extra test coverage.

However, if having a minimal implementation is the point, then I can instead add a Test.CoreLib - specific implementation of the cache that does not do anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we can have the cache in Test.CorLib and have extra test coverage.

There are no relevant tests running against Test.CorLib. All tests that matter for this run against real CoreLib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted additional API implementations for Test.CorLib and added a trivial no-op cache instead.

// - issue a load barrier before reading _version
// benchmarks on available hardware (Jan 2020) show that use of a read barrier is cheaper.

#if CORECLR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is required for correctness, it should not be under ifdef.

Copy link
Member Author

@VSadov VSadov Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is either a load fence here or the two loads above need to be acquires. As it was measured, when this was implemented, one fence was noticeably cheaper.

We need to add Interlocked.ReadMemoryBarrier() to NativeAOT. And it needs to be an intrinsic - no point in implementing it as an internal call.

I will log an issue on Interlocked.ReadMemoryBarrier(). Once that is added, we can remove ifdefs here and around two acquires above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logged an issue for adding Interlocked.ReadMemoryBarrier() intrinsic - #84445

}

// the rest is the support for updating the cache.
// in CoreClr the cache is only updated in the native code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense for CoreCLR to just call this managed implementation instead?

Copy link
Member Author

@VSadov VSadov Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. I think what could prevent that is:

  • if we need to use the cache before we can run managed code.
    JIT calls into casting machinery, so it is hard to tell if we may need the cache too early.

  • if we need to use the cache before managed cctor has run.
    I'd guess the cctor could be forced to run early enough.

  • if we need to use cache from some mode that does not allow managed code.
    Most likely this is not a requirement, but I am not sure.

  • if calling managed code from native has high overhead.
    Adding to the cache is reasonably fast. We do not see the cost of adding stuff even at start up, when we'd expect cache misses.
    I do not have a good sense of how expensive it is to call managed code from the native runtime in this context.

Would we want to call managed TryGet as well?
(that would be a lot more sensitive to the overhead and TryGet is GC_NOTRIGGER, I vaguely remember there were some reasons for that.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worth considering, but it should be a separate change.

It does not look like a straightforward tweak and may need some experimenting.
Also the impact of such change will be mostly in CoreClr, while this PR mostly affects NativeAOT, so in terms of watching for failures or stress/perf consequences, it would be better to have separate changes.

Copy link
Member Author

@VSadov VSadov Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logged an issue to follow up on switching CoreCLR to use managed TrySet and maybe TryGet - #84448

@VSadov
Copy link
Member Author

VSadov commented Apr 6, 2023

I have run some simple scenarios with this change to see if there are no obvious performance regressions.

It looks like cached casting is slightly faster with the new cache, but not by much. The key point is to have a cache in the first place and NativeAOT already had it.

I also noticed that CoreClr is faster than NativeAOT. I think the reason is the code that runs before we hit the cache. That is - the code that lives in nativeaot\Runtime.Base\src\System\Runtime\TypeCast.cs. It looks like the entry points are a bit more complex than in CoreClr cast helpers.
Cached casts are fast and couple extra calls, indirections or branches may have an impact.
Also it could be that some extra complexity is necessary. Dealing with IsCloned appears to be specific to NativeAOT.
There are also couple TODOs about IsIDynamicInterfaceCastable that seem relevant to the overall perf.

It may be worth looking at the cast entry points. Perhaps there are some opportunities there.

The microbenchmark that I used to see impact of the cache in "easy" case:

    internal class Program
    {
        const int iters = 1000000;

        static void Main(string[] args)
        {
            for(; ; )
            {
                Time(TestLStringToIROCstring);
            }
        }
        
        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }

            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static object o = new List<string>();

        static void TestLStringToIROCstring()
        {
            for (int i = 0; i < iters; i++)
            {
                if (o as IReadOnlyCollection<object> == null)
                    throw null;

                if (o as IReadOnlyCollection<string> == null)
                    throw null;

                if (o as IEnumerable<object> == null)
                    throw null;

                if (o as IEnumerable<string> == null)
                    throw null;
            }
        }
    }

On my machine (x64), I see:
(smaller is better)

==== before change:
871
868
874
870
870
870

==== after the change
858
860
863
864
859
856

==== CoreClr (JIT)
586
589
583
587
583
585

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More code sharing. Yay!

@jkotas
Copy link
Member

jkotas commented Apr 7, 2023

Dealing with IsCloned appears to be specific to NativeAOT.

IsCloned support is not used. I would not feel bad about deleting it. I think it is unlikely that we will ever use it for anything.

CoreCLR has the methodtable flags specifically shaped to make casting fast. We may want to copy/unify the shape.

@VSadov
Copy link
Member Author

VSadov commented Apr 7, 2023

CoreCLR has the methodtable flags specifically shaped to make casting fast. We may want to copy/unify the shape.

Right. Also the CoreClr entry points were carefully crafted to minimize branches, benefit from tail calling, if possible. Perhaps at some costs to readability, but for that code it is acceptable.

I will look at what we can borrow from there.

@MichalStrehovsky
Copy link
Member

Right. Also the CoreClr entry points were carefully crafted to minimize branches, benefit from tail calling, if possible. Perhaps at some costs to readability, but for that code it is acceptable.

We also have some unnecessary checks like "is this null" or "is this exactly the same type" for things that RyuJIT already tests for. The code was written for UTC, not RyuJIT. We just need to validate all the non-codegen callees (i.e. when we call into casting from our regular C# code) also do the appropriate checks and delete the ifs.

@VSadov
Copy link
Member Author

VSadov commented Apr 7, 2023

Thanks!!

@VSadov VSadov merged commit 38b81ba into dotnet:main Apr 7, 2023
@VSadov VSadov deleted the castCacheN branch April 7, 2023 02:04
@ghost ghost locked as resolved and limited conversation to collaborators May 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NativeAOT] Consider porting/sharing CastCache form CoreClr

3 participants