
Conversation

@MichalStrehovsky (Member) commented Oct 6, 2023

Continuation of #86238.

Looks like we might already have all we need on the RyuJIT side: once I started to look at it, what we want appears similar to how generic virtual methods are done.

Calling interface methods (that could not be devirtualized) currently works like this:

  • At the callsite, codegen loads all arguments for the call and, in addition to those, loads the address of an "interface dispatch cell" into a non-argument register. It then performs an indirect call to the address stored at the beginning of the interface dispatch cell. This dispatch cell is a data structure generated by the compiler.

This is pretty simple. All the magic happens in the interface dispatch cell. The cell has a couple of pieces of metadata:

  • In the first field, there needs to be a callable code address (because this is where codegen calls into). Initially, each dispatch cell has this set to `RhpInitialDynamicInterfaceDispatch`.
  • In the second field, there is the address of the interface MethodTable (with a low bit set for a purpose that will become clear later).

You probably noticed there's no slot number. Slot numbers are stored in a compact way: all dispatch cells are grouped together by slot number, and every 32 dispatch cells there's a special sentinel cell with a null code address and the interface slot number in the second field. So finding the slot number is just a matter of walking the dispatch cell group until a cell with a null code address is found.
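To make the layout concrete, here is a rough C# model of a cell and the slot-number walk; this is purely illustrative (the real cells are unmanaged data emitted by the compiler, and the field and method names here are invented):

```csharp
using System;

// Illustrative model only; the real dispatch cell is compiler-emitted data,
// and these names are invented for exposition.
unsafe struct InterfaceDispatchCell
{
    public IntPtr Target;     // callable code address; starts as RhpInitialDynamicInterfaceDispatch
    public IntPtr CacheOrMt;  // interface MethodTable (low bit set); later a cache pointer

    // Walk the cell group until the sentinel cell (null code address) is
    // found; its second field holds the slot number.
    public static int GetSlotNumber(InterfaceDispatchCell* cell)
    {
        while (cell->Target != IntPtr.Zero)
            cell++;
        return (int)cell->CacheOrMt;
    }
}
```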

Computing interface destinations on each call would be pretty expensive. The above setup actually allows for a very interesting caching strategy.

When `RhpInitialDynamicInterfaceDispatch` runs and computes the target of the dispatch, it will allocate a small cache that holds the type of `this` and the address of the method that implements the interface method. It will modify the code address stored in the dispatch cell (which was pointing to `RhpInitialDynamicInterfaceDispatch`) to call a different method, and also update the second field to point to the cache.

The next time the dispatch cell is invoked, the specialized helper will compare the type of `this` with the previously seen `this` and, if they match, simply call the precomputed destination. If they don't match, a lookup is performed, the cache is resized, and the pointer to the helper method is updated to point to a helper that handles two cached values. This can continue up to 32 or so different types of `this`.
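A small managed model of this promotion scheme, purely as a sketch (the real helpers are shared assembly stubs; all names below are invented):

```csharp
using System;

// Sketch of the existing scheme: the cell's code pointer moves through
// helpers specialized for growing cache sizes. This models the compare
// loop and the shared indirect call at the end.
sealed class DispatchCellModel
{
    (Type SeenType, Action<object> Target)[] _cache = Array.Empty<(Type, Action<object>)>();

    public void Dispatch(object obj, Func<Type, Action<object>> slowResolve)
    {
        Type t = obj.GetType();
        foreach (var (seenType, target) in _cache)   // the specialized helper's comparisons
            if (seenType == t) { target(obj); return; }

        Action<object> resolved = slowResolve(t);    // miss: full interface lookup
        if (_cache.Length < 32)                      // grow the cache, up to ~32 entries
        {
            Array.Resize(ref _cache, _cache.Length + 1);
            _cache[^1] = (t, resolved);
        }
        resolved(obj);  // indirect call made from inside the shared helper --
                        // the branch the CPU has little chance of predicting
    }
}
```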

The main problem with this strategy is in the helpers: they are specialized for different cache sizes, but not for different callsites. This means the CPU will very likely mispredict the target of the call from the helper 100% of the time.

This PR explores a different option:

  • At the callsite, codegen loads the target object and the address of the dispatch cell and calls a helper.
  • Magic happens in the helper, and the helper returns the address to call.
  • Codegen generates a call to the returned value.

This moves the call to the target method to the original callsite (it's no longer indirected through a shared helper). The hope is that the branch predictor getting the target right will make things a lot faster.
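In pseudo-C#, the callsite shape this moves to looks roughly like the following; the resolver name and signature are placeholders, not the PR's actual helper:

```csharp
using System;

static unsafe class NewDispatchShape
{
    // What codegen conceptually emits at each callsite under the new scheme:
    // the helper only returns the target, and the call itself stays at the
    // callsite, giving each callsite its own indirect-branch-prediction history.
    public static void CallInterfaceMethod(object obj, void* dispatchCell)
    {
        IntPtr target = ResolveDispatch(obj, dispatchCell);  // "magic happens in the helper"
        ((delegate*<object, void>)(void*)target)(obj);       // call emitted at the original callsite
    }

    // Placeholder for the runtime's resolution helper.
    static IntPtr ResolveDispatch(object obj, void* cell) => throw new NotImplementedException();
}
```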

Cc @dotnet/ilc-contrib

@ghost commented Oct 6, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Author: MichalStrehovsky · Assignees: - · Labels: area-NativeAOT-coreclr · Milestone: -

@ghost ghost assigned MichalStrehovsky Oct 6, 2023
@MichalStrehovsky (Member, Author) commented:

Draft because there are many unresolved issues and I don't expect CI to pass, but a hello world works.

  • What do we want the new dispatch cell to look like? Currently it looks similar to the old one, except we don't update it. If it stays like this, we don't even need a dispatch cell per callsite, just one per interface+slot pair. But then I don't see how we could make a super-efficient monomorphic path (I think we do want a super-efficient monomorphic path).
  • The cache key is the dispatch cell address which may or may not be what we want. It's not what we want in the current shape of the prototype.
  • The slow helper is not getting tail-called. This is a pre-existing issue: the slow helper for GVM dispatch is not tail-called either, but we probably really want a tail call for this.
  • There is some logic in RyuJIT that aborts inlining of anything with this kind of dispatch (I put a comment on it).
  • IDynamicInterfaceCastable should not end up in the cache.
  • Do we want to keep sharing the cache with GVM?
  • Type loader work.
  • Delete tons of code associated with the previous approach and managing the caches.
  • JitInterface GUID, general JIT review. From what I saw, we might not be doing devirtualizations anymore for this shape.

The main question at this point is what the dispatch cell should look like. I'm open to suggestions, as I don't want to have to redo it too many times.

@jkotas (Member) commented Oct 6, 2023

> If it stays like this, we don't even need a dispatch cell per callsite, just one per interface+slot pair. But then I don't see how we could make a super-efficient monomorphic path (I think we do want a super-efficient monomorphic path).

The code for the super-efficient monomorphic path can be inlined with this plan, similar to how we inline it for guarded devirtualization. This monomorphic-path inlining can be skipped when optimizing for size or on cold paths.

Are you thinking about having a large static hashtable for the polymorphic case, similar to how it is done in regular CoreCLR?

@MichalStrehovsky (Member, Author) commented:

> The code for the super-efficient monomorphic path can be inlined with this plan, similar to how we inline it for guarded devirtualization. This monomorphic-path inlining can be skipped when optimizing for size or on cold paths.

I'm thinking about how that would work:

We'd need to load the address of the cache. Then we'd need to check the cache inline (dereference `this`, compare with the first field of the cache, and if equal, use the address in the second field). If they're not equal, we'd need to load another data structure (one that captures the interface type and slot number) and call another helper that takes this as an argument, along with the pointer to the cache so that it can be updated. It feels like this would increase the size of the callsite by quite a lot compared to what we have there now. It might be better to just go with storing all of this in the dispatch cell for now (keep the dispatch cell per callsite and keep the cache accessible from there).
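For reference, the inline expansion being weighed would look something like this sketch (all names are invented; the MethodTable access and the resolver are stand-ins for runtime internals):

```csharp
using System;

static unsafe class InlineCacheShape
{
    // Per-callsite cache the cell would need to carry for an inlined check.
    public struct CallsiteCache
    {
        public void* LastSeenType;  // MethodTable of the last `this` seen here
        public IntPtr LastTarget;   // implementation it resolved to
    }

    // Rough shape of what codegen would expand at every callsite:
    // dereference `this`, compare against the cached type, call on a hit,
    // otherwise fall back to a helper that also updates the cache.
    public static void CallWithInlineCache(object obj, CallsiteCache* cache, void* interfaceAndSlot)
    {
        void* mt = GetMethodTable(obj);
        IntPtr target = (mt == cache->LastSeenType)
            ? cache->LastTarget
            : ResolveAndUpdate(obj, interfaceAndSlot, cache);
        ((delegate*<object, void>)(void*)target)(obj);
    }

    // Placeholders for runtime internals.
    static void* GetMethodTable(object obj) => throw new NotImplementedException();
    static IntPtr ResolveAndUpdate(object obj, void* interfaceAndSlot, CallsiteCache* cache)
        => throw new NotImplementedException();
}
```

The sketch makes the size concern visible: a cache-address load, a MethodTable load, a compare-and-branch, plus a second data structure for the miss path, all at every callsite.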

> Are you thinking about having a large static hashtable for the polymorphic case, similar to how it is done in regular CoreCLR?

Large hashtable - like we do for generic virtual methods. I'm not sure about the "static" part - we need to be able to account for interface types loaded at runtime by the type loader.

@jkotas (Member) commented Oct 8, 2023

> I'm not sure about the "static" part - we need to be able to account for interface types loaded at runtime by the type loader.

I meant singleton, similar to what we have for GVMs. I did not mean that the cache would be pre-initialized at build time.
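A minimal sketch of that reading: one process-wide table shared by all callsites, populated lazily at runtime (the key shape and resolver are placeholders, not the actual GVM cache design):

```csharp
using System;
using System.Collections.Concurrent;

static class GlobalDispatchCache
{
    // Singleton cache shared by all callsites, filled in at runtime -- nothing
    // is pre-initialized at build time. Key and resolver are placeholders.
    static readonly ConcurrentDictionary<(IntPtr TypeHandle, IntPtr Cell), IntPtr> s_cache = new();

    public static IntPtr LookupOrResolve(
        IntPtr typeHandle, IntPtr cell, Func<IntPtr, IntPtr, IntPtr> resolve)
        => s_cache.GetOrAdd((typeHandle, cell), k => resolve(k.TypeHandle, k.Cell));
}
```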

// the universal transition thunk as an argument to RhpCidResolve
mov r11, r10
mov r10, [rip + REL_C_FUNC(RhpCidResolve)]
jmp qword ptr [rip + REL_C_FUNC(RhpUniversalTransition_DebugStepTailCall)]
Member:

UniversalTransition and everything related to it (including conservative GC support) can go too.

Member:

i386/StubDispatch.S can be deleted as well.

Contributor:

Whole i386 could probably be deleted since NativeAOT doesn't support it.

@VSadov (Member) commented Oct 10, 2023

> What do we want the new dispatch cell to look like? Currently it looks similar to the old one, except we don't update it. If it stays like this, we don't even need a dispatch cell per callsite, just one per interface+slot pair. But then I don't see how we could make a super-efficient monomorphic path (I think we do want a super-efficient monomorphic path).

Yes, you would need a cell per call site. This is where {lastObjType, lastMatchingMethod} would live, which would work as the monomorphic cache.


public static unsafe IntPtr InterfaceLookupForSlot(object obj, IntPtr slot)
{
    // TODO: do something more optimal for monomorphic case
@VSadov (Member) commented Oct 10, 2023:

I assume here you will dereference the slot/cell and if obj.GetMethodTable() matches the last seen MT, just return the last used callee?

Perhaps if the lookup in the bigger cache and the cell update are in a separate method, the monomorphic check may even be a small enough method to be inlineable into the caller. Can these kinds of helper calls be inlined by the JIT?
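Sketched out, that split might look like this (building on the PR's InterfaceLookupForSlot/InterfaceLookupForSlotSlow names; the cell layout and MethodTable access are stand-ins):

```csharp
using System;
using System.Runtime.CompilerServices;

static unsafe class SplitLookupSketch
{
    struct Cell
    {
        public void* LastSeenType;
        public IntPtr LastTarget;
    }

    // Tiny fast path: small enough that it could plausibly be inlined into callers.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static IntPtr InterfaceLookupForSlot(object obj, IntPtr slot)
    {
        var cell = (Cell*)(void*)slot;
        if (GetMethodTable(obj) == cell->LastSeenType)
            return cell->LastTarget;                   // monomorphic hit
        return InterfaceLookupForSlotSlow(obj, slot);  // big-cache lookup + cell update, out of line
    }

    // Placeholders for the out-of-line slow path and runtime internals.
    static IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr slot) => throw new NotImplementedException();
    static void* GetMethodTable(object obj) => throw new NotImplementedException();
}
```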

@MichalStrehovsky (Member, Author):

> Can these kinds of helper calls be inlined by the JIT?

These helpers look the same to the JIT as, e.g., casting helpers. IIRC the JIT doesn't know they're in managed code, and it could ask for the IL.


private static unsafe IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr slot)
{
    // TODO: this is wrong for IDynIntfCastable
@VSadov (Member) commented Oct 10, 2023:

Perhaps just do not cache the lookup result for IDynIntfCastable and only update the monomorphic data when there is a cache hit in the larger cache.

It would mean that IDynIntfCastable will always result in two misses (first in the monomorphic check, then in the larger cache) and will then have to resolve. However, the non-IDynIntfCastable case will be fast.
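In code, the suggestion might look like this inside the slow path (hypothetical cell layout and helper names; IDynamicInterfaceCastable is the real runtime interface):

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe class SlowPathSketch
{
    struct Cell
    {
        public void* LastSeenType;
        public IntPtr LastTarget;
    }

    // Sketch of the slow path with the suggestion applied: results for
    // IDynamicInterfaceCastable objects are resolved every time and never
    // promoted into the per-callsite monomorphic slot; ordinary resolutions
    // through the larger cache refresh the monomorphic data.
    public static IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr slot)
    {
        if (obj is IDynamicInterfaceCastable)
            return ResolveViaIDynamicInterfaceCastable(obj, slot);  // never cached

        IntPtr target = LookupInBigCache(obj, slot);
        var cell = (Cell*)(void*)slot;
        cell->LastSeenType = GetMethodTable(obj);  // refresh monomorphic data
        cell->LastTarget = target;
        return target;
    }

    // Placeholders for runtime internals.
    static IntPtr ResolveViaIDynamicInterfaceCastable(object obj, IntPtr slot) => throw new NotImplementedException();
    static IntPtr LookupInBigCache(object obj, IntPtr slot) => throw new NotImplementedException();
    static void* GetMethodTable(object obj) => throw new NotImplementedException();
}
```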

@MichalStrehovsky (Member, Author) commented:

I ran a couple of benchmarks. Although what's in this PR absolutely crushes the old cache for Jan's test cases (https://gist.github.com/jkotas/2d8feea834bdbc7c24c8f9c395117e1d and https://gist.github.com/jkotas/d140ad43d9766134f36c15395706878c), we're obviously slower for the less polymorphic cases.

Trying something more real-world (JSON serialization), I see an 8% slowdown (I gimped the baseline to have similar characteristics: disabled whole-program-based devirtualization and hobbled inlining of methods that do an interface call, to make it more apples-to-apples, since these are known problems in this PR). I also applied Jakob's old JIT change to allow the JIT to tail call the slow helper. I'm going to assume this is going to be fixed - we'll need every instruction saving we can get.

Polymorphism in the range of 2 to maybe ~8 possible implementing classes is probably a lost cause; we won't be as fast as the existing cache.

I'll look into the monomorphic case. Implementing this is a bit annoying because doing the primitive thing (just add space for two pointers!) makes it impractical to use the cell address as the cache key (we could use it, but the cache is going to grow by a lot because these are per callsite). I'm also struggling to align things so that the dispatch cells don't straddle CPU cache lines (since we'll need to read their contents). So I'm meditating on this...

@ghost ghost closed this Nov 10, 2023
@ghost commented Nov 10, 2023

This draft pull request was automatically closed after 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2023
@MichalStrehovsky MichalStrehovsky deleted the intfinvoke branch January 29, 2024 07:25
This pull request was closed.
