Speed up calling interface methods #93100
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details

Continuation of #86238.

Looks like we might already have all we need on the RyuJIT side, because once I started to look at it, what we want seems similar to how generic virtual methods are done.

Calling interface methods (that could not be devirtualized) currently works like this:

* At the callsite, codegen loads all arguments for the call and, in addition to those, loads the address of an "interface dispatch cell" into a non-argument register. It then performs an indirect call to the address stored at the beginning of the interface dispatch cell. This dispatch cell is a data structure generated by the compiler.

This is pretty simple. All the magic happens in the interface dispatch cell. This cell has a couple pieces of metadata:

* In the first field, there needs to be a callable code address (because this is where codegen calls into). Initially, each dispatch cell has this set to `RhpInitialDynamicInterfaceDispatch`.
* In the second field, there is the address of the interface MethodTable (with a low bit set for a purpose that will become clear later).

You probably noticed there's no slot number. Slot numbers are stored in a compact way: all dispatch cells are grouped together based on slot number, and every 32 dispatch cells there's a special dispatch cell with a null code address and the interface slot number in the second field. Finding the slot number is just a matter of walking the dispatch cell group until this null code address is found.

Computing interface destinations on each call would be pretty expensive, but the above setup allows for a very interesting caching strategy. When `RhpInitialDynamicInterfaceDispatch` runs and computes the target of the dispatch, it allocates a small cache holding the type of `this` and the address of the method that implements the interface method. It then _modifies_ the code address stored in the dispatch cell (which was pointing to `RhpInitialDynamicInterfaceDispatch`) to call a different helper, and also updates the second field to point to the cache.

The next time the dispatch cell is invoked, the specialized helper compares the type of `this` with the previously seen `this`; if they match, it simply calls the precomputed destination. If they don't match, a lookup is performed, the cache is resized, and the pointer to the helper method is updated to point to a helper that handles two cached values. This can continue up to 32 or so different types of `this`.

The main problem with this strategy is in the helpers: they are specialized for different cache sizes, but not for different callsites. This means the CPU will very likely mispredict the target of the call from the helper 100% of the time.

This PR explores a different option:

* At the callsite, codegen loads the target object and the address of the dispatch cell and calls a helper.
* Magic happens in the helper, and the helper returns the address to call.
* Codegen generates a call to the returned value.

This moves the call to the target method to the original callsite (it's no longer indirected through a shared helper). The hope is that the branch predictor getting the target right will make things a lot faster.

Cc @dotnet/ilc-contrib
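For illustration, here is a minimal C# sketch of the cell layout and the slot-number recovery walk described above. The type and field names are hypothetical, not the actual runtime definitions:

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
unsafe struct InterfaceDispatchCell // hypothetical name and layout
{
    public IntPtr CodeAddress; // initially RhpInitialDynamicInterfaceDispatch
    public IntPtr CellInfo;    // interface MethodTable (low bit set), later a cache
                               // pointer; in the group terminator cell, the slot number

    // Recover the slot number by walking forward to the terminator cell
    // (the one with a null code address); its second field holds the slot.
    public static int GetSlotNumber(InterfaceDispatchCell* cell)
    {
        while (cell->CodeAddress != IntPtr.Zero)
            cell++;
        return (int)cell->CellInfo;
    }
}
```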
Draft because there are many unresolved issues and I don't expect CI to pass, but a hello world works.
The main question at this point is what the dispatch cell should look like. I'm open to suggestions, as I don't want to have to redo it too many times.
The code for the super-efficient monomorphic path can be inlined with this plan, similar to how we inline it for guarded devirtualization. This monomorphic path inlining can be skipped when optimizing for size or on cold paths. Are you thinking about having a large static hashtable for the polymorphic case, similar to how it is done in regular CoreCLR?
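To give a sense of the codegen shape being suggested, here is a hedged C# sketch of an inlined monomorphic guard, analogous to a guarded-devirtualization check. The Cell struct and SlowLookup helper are hypothetical placeholders, not the PR's actual definitions:

```csharp
using System;

static unsafe class InlinedDispatchSketch
{
    // Hypothetical per-callsite cell extended with a monomorphic cache
    // (the {lastObjType, lastMatchingMethod} pair discussed later in the thread).
    struct Cell
    {
        public IntPtr LastObjType;        // MethodTable last seen at this callsite
        public IntPtr LastMatchingMethod; // target it resolved to
    }

    // The shape the JIT could inline at the callsite: one compare, then a
    // call whose target the branch predictor can train per callsite.
    static IntPtr ResolveTarget(IntPtr objMethodTable, Cell* cell)
    {
        if (objMethodTable == cell->LastObjType)
            return cell->LastMatchingMethod;     // monomorphic hit, no helper call
        return SlowLookup(objMethodTable, cell); // resolves and refreshes the cell
    }

    static IntPtr SlowLookup(IntPtr mt, Cell* cell)
        => throw new NotImplementedException(); // stands in for the shared helper
}
```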
I'm thinking about how that would work: we'd need to load the address of the cache. Then we'd need to check the cache inline (dereference …
Large hashtable - like we do for generic virtual methods. I'm not sure about the "static" part - we need to be able to account for interface types loaded at runtime by the type loader.
I meant a singleton, similar to what we have for GVMs. I did not mean that the cache would be pre-initialized at build time.
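To illustrate the singleton idea, here is a deliberately naive sketch: one process-wide table keyed by (concrete type, dispatch cell), populated lazily at runtime. The real GVM-style cache is a specialized lock-free hashtable, not a ConcurrentDictionary; this only shows the keying and lazy fill:

```csharp
using System;
using System.Collections.Concurrent;

static class GlobalDispatchCache // hypothetical
{
    private static readonly ConcurrentDictionary<(IntPtr TypeMT, IntPtr Cell), IntPtr> s_cache = new();

    // Resolve once per (type, cell) pair; later calls hit the shared table.
    public static IntPtr GetOrResolve(IntPtr typeMT, IntPtr cell, Func<IntPtr, IntPtr, IntPtr> resolve)
        => s_cache.GetOrAdd((typeMT, cell), k => resolve(k.TypeMT, k.Cell));
}
```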
Force-pushed from 4111af4 to de632c6.
```asm
// the universal transition thunk as an argument to RhpCidResolve
mov     r11, r10
mov     r10, [rip + REL_C_FUNC(RhpCidResolve)]
jmp     qword ptr [rip + REL_C_FUNC(RhpUniversalTransition_DebugStepTailCall)]
```
UniversalTransition and everything related to it (including conservative GC support) can go too.
i386/StubDispatch.S can be deleted as well.
The whole i386 directory could probably be deleted, since NativeAOT doesn't support it.
Yes, you would need a cell per call site. This is where {lastObjType, lastMatchingMethod} would live, which would work as a monomorphic cache.
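In cell terms, the layout could look something like the following sketch. This is hypothetical; the real cell would still need whatever fields slot recovery requires:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical cell layout for the new scheme: the code-address field is no
// longer needed (the callsite calls the helper directly), but the cell grows
// to hold the monomorphic cache pair.
[StructLayout(LayoutKind.Sequential)]
struct InterfaceDispatchCellWithCache
{
    public IntPtr CellInfo;           // interface MethodTable / slot info, as before
    public IntPtr LastObjType;        // MethodTable of the last 'this' dispatched here
    public IntPtr LastMatchingMethod; // target that LastObjType resolved to
}
```

Growing the cell from two pointers to three or four is what feeds the footprint and cache-line concerns raised later in the thread.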
```csharp
public static unsafe IntPtr InterfaceLookupForSlot(object obj, IntPtr slot)
{
    // TODO: do something more optimal for monomorphic case
```
I assume here you will dereference the slot/cell and, if obj.GetMethodTable() matches the last seen MT, just return the last used callee?
Perhaps if the lookup in the bigger cache and the cell update are in a separate method, the monomorphic check may even be small enough to be inlineable into the caller. Can these kinds of helper calls be inlined by the JIT?
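A sketch of what that split could look like, building on the quoted helper and the hypothetical cell layout sketched earlier. GetMethodTable stands in for the runtime's internal accessor, and InterfaceLookupForSlotSlow is the out-of-line part:

```csharp
public static unsafe IntPtr InterfaceLookupForSlot(object obj, IntPtr cell)
{
    var c = (InterfaceDispatchCellWithCache*)cell;

    // Monomorphic fast path: if this callsite last dispatched on the same
    // MethodTable, reuse the previously resolved target.
    IntPtr mt = GetMethodTable(obj); // stand-in for the internal accessor
    if (mt == c->LastObjType)
        return c->LastMatchingMethod;

    // Keep everything else (bigger-cache lookup, cell update) out of line so
    // this method stays small enough to be inlineable into the caller.
    return InterfaceLookupForSlotSlow(obj, cell);
}
```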
> Can these kinds of helper calls be inlined by the JIT?

These helpers look the same to the JIT as e.g. casting helpers. IIRC the JIT doesn't know they're in managed code and that it could ask for the IL.
```csharp
private static unsafe IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr slot)
{
    // TODO: this is wrong for IDynIntfCastable
```
Perhaps just do not cache the lookup result for IDynIntfCastable, and only update the monomorphic data when there is a cache hit in the larger cache.
It would mean that IDynIntfCastable always results in two misses - first in the monomorphic check, then in the larger cache - and then it has to resolve. However, the non-IDynIntfCastable case will be fast.
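A sketch of that policy, with TryGetFromLargerCache, ResolveDispatch, and AddToLargerCache as hypothetical stand-ins for the real lookup and resolution helpers:

```csharp
private static unsafe IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr cell)
{
    var c = (InterfaceDispatchCellWithCache*)cell;
    IntPtr mt = GetMethodTable(obj); // stand-in for the internal accessor

    // Hit in the larger shared cache: this is the only place the monomorphic
    // pair gets refreshed, per the suggestion above.
    if (TryGetFromLargerCache(mt, cell, out IntPtr target))
    {
        c->LastObjType = mt;
        c->LastMatchingMethod = target;
        return target;
    }

    // Full resolution. IDynamicInterfaceCastable results are returned but
    // never cached, so they always pay both misses; everything else is added
    // to the larger cache for next time.
    target = ResolveDispatch(obj, cell, out bool usedDynamicInterfaceCastable);
    if (!usedDynamicInterfaceCastable)
        AddToLargerCache(mt, cell, target);
    return target;
}
```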
I ran a couple benchmarks. Although for Jan's test cases https://gist.github.com/jkotas/2d8feea834bdbc7c24c8f9c395117e1d and https://gist.github.com/jkotas/d140ad43d9766134f36c15395706878c what's in this PR absolutely crushes the old cache, we're obviously slower for the less polymorphic cases.

Trying something more real-world (JSON serialization), I see an 8% slowdown. (I gimped the baseline to have similar characteristics - disabled whole-program-based devirtualization and hobbled inlining of methods that do an interface call - to make it more apples-to-apples, since these are known problems in this PR.) I also applied Jakob's old JIT change to allow the JIT to tail call the slow helper; I'm going to assume that is going to be fixed, as we'll need every instruction saving we can get.

Polymorphism around 2 to maybe 8 possible implementing classes is probably a lost cause; we won't be as fast as the existing cache there.

I'll look into the monomorphic case. Implementing it is a bit annoying, because doing the primitive thing (just add the space for two pointers!) makes it impractical to use the cell address as the cache key (we could use it, but the cache would grow by a lot because cells are per callsite). I'm also struggling to align things so that the dispatch cells don't straddle CPU cache lines (since we'll need to read their contents). So I'm meditating on this...
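To make the straddling concern concrete, here is the arithmetic as a tiny hedged check, assuming 64-byte cache lines and the hypothetical cell sizes from earlier:

```csharp
// A cell straddles a cache line iff its start offset within the line plus its
// size exceeds the line. 16-byte cells on 16-byte alignment never straddle a
// 64-byte line; 24- or 32-byte cells can, unless the allocator over-aligns them.
static bool StraddlesCacheLine(ulong cellAddress, ulong cellSize, ulong lineSize = 64)
    => (cellAddress % lineSize) + cellSize > lineSize;
```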
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.