Speed up calling interface methods #93100
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details

Continuation of #86238.

Looks like we might already have all we need on the RyuJIT side, because once I started to look at it, what we want seems similar to how generic virtual methods are done.

Calling interface methods (that could not be devirtualized) currently works like this:

* At the callsite, codegen loads all arguments for the call and, in addition to those, loads the address of an "interface dispatch cell" into a non-argument register. It then performs an indirect call to the address stored at the beginning of the interface dispatch cell. This dispatch cell is a data structure generated by the compiler.

This is pretty simple. All the magic happens in the interface dispatch cell. This cell has a couple pieces of metadata:

* In the first field, there needs to be a callable code address (because this is where codegen calls into). Initially, each dispatch cell has this set to `RhpInitialDynamicInterfaceDispatch`.
* In the second field, there is the address of the interface MethodTable (with a low bit set for a purpose that will become clear later).

You probably noticed there's no slot number. Slot numbers are stored in a compact way: all dispatch cells are grouped together based on slot number, and every 32 dispatch cells there's a special dispatch cell with a null code address and the interface slot number in the second field. Finding the slot number is just a matter of walking the dispatch cell group until this null code address is found.

Computing interface destinations on each call would be pretty expensive, but the above setup allows for a very interesting caching strategy. When `RhpInitialDynamicInterfaceDispatch` runs and computes the target of the dispatch, it allocates a small cache holding the type of `this` and the address of the method that implements the interface method. It then _modifies_ the code address stored in the dispatch cell (which was pointing to `RhpInitialDynamicInterfaceDispatch`) to call a different helper, and also updates the second field to point to the cache.

The next time the dispatch cell is invoked, the specialized helper compares the type of `this` with the previously seen `this`; if they match, it simply calls the precomputed destination. If they don't match, a lookup is performed, the cache is resized, and the pointer to the helper method is updated to point to a helper that handles two cached values. This can continue up to 32 or so different types of `this`.

The main problem with this strategy is in the helpers: they are specialized for different cache sizes, but not for different callsites. This means the CPU will very likely mispredict the target of the call from the helper 100% of the time.

This PR explores a different option:

* At the callsite, codegen loads the target object and the address of the dispatch cell and calls a helper.
* Magic happens in the helper, and the helper returns the address to call.
* Codegen generates a call to the returned value.

This moves the call to the target method to the original callsite (it's no longer indirected through a shared helper). The hope is that the branch predictor getting the target right will make things a lot faster.

Cc @dotnet/ilc-contrib
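For illustration, here is a minimal C# sketch of the cell layout and the slot-number recovery walk described above. The type and field names are hypothetical, not the actual runtime definitions:

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
unsafe struct InterfaceDispatchCell // hypothetical name and layout
{
    public IntPtr CodeAddress; // initially RhpInitialDynamicInterfaceDispatch
    public IntPtr CellInfo;    // interface MethodTable (low bit set), later a cache
                               // pointer; in the group terminator cell, the slot number

    // Recover the slot number by walking forward to the terminator cell
    // (the one with a null code address); its second field holds the slot.
    public static int GetSlotNumber(InterfaceDispatchCell* cell)
    {
        while (cell->CodeAddress != IntPtr.Zero)
            cell++;
        return (int)cell->CellInfo;
    }
}
```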
Draft because there are many unresolved issues and I don't expect CI to pass, but a hello world works.
The main question at this point is what the dispatch cell should look like. I'm open to suggestions, as I don't want to have to redo it too many times.
The code for the super-efficient monomorphic path can be inlined with this plan, similar to how we inline it for guarded devirtualization. This monomorphic path inlining can be skipped when optimizing for size or on cold paths. Are you thinking about having a large static hashtable for the polymorphic case, similar to how it is done in regular CoreCLR?
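To give a sense of the codegen shape being suggested, here is a hedged C# sketch of an inlined monomorphic guard, analogous to a guarded-devirtualization check. The Cell struct and SlowLookup helper are hypothetical placeholders, not the PR's actual definitions:

```csharp
using System;

static unsafe class InlinedDispatchSketch
{
    // Hypothetical per-callsite cell extended with a monomorphic cache
    // (the {lastObjType, lastMatchingMethod} pair discussed later in the thread).
    struct Cell
    {
        public IntPtr LastObjType;        // MethodTable last seen at this callsite
        public IntPtr LastMatchingMethod; // target it resolved to
    }

    // The shape the JIT could inline at the callsite: one compare, then a
    // call whose target the branch predictor can train per callsite.
    static IntPtr ResolveTarget(IntPtr objMethodTable, Cell* cell)
    {
        if (objMethodTable == cell->LastObjType)
            return cell->LastMatchingMethod;     // monomorphic hit, no helper call
        return SlowLookup(objMethodTable, cell); // resolves and refreshes the cell
    }

    static IntPtr SlowLookup(IntPtr mt, Cell* cell)
        => throw new NotImplementedException(); // stands in for the shared helper
}
```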
I'm thinking about how that would work: we'd need to load the address of the cache. Then we'd need to check the cache inline (dereference …
Large hashtable - like we do for generic virtual methods. I'm not sure about the "static" part - we need to be able to account for interface types loaded at runtime by the type loader.
I meant a singleton, similar to what we have for GVMs. I did not mean that the cache would be pre-initialized at build time.
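To illustrate the singleton idea, here is a deliberately naive sketch: one process-wide table keyed by (concrete type, dispatch cell), populated lazily at runtime. The real GVM-style cache is a specialized lock-free hashtable, not a ConcurrentDictionary; this only shows the keying and lazy fill:

```csharp
using System;
using System.Collections.Concurrent;

static class GlobalDispatchCache // hypothetical
{
    private static readonly ConcurrentDictionary<(IntPtr TypeMT, IntPtr Cell), IntPtr> s_cache = new();

    // Resolve once per (type, cell) pair; later calls hit the shared table.
    public static IntPtr GetOrResolve(IntPtr typeMT, IntPtr cell, Func<IntPtr, IntPtr, IntPtr> resolve)
        => s_cache.GetOrAdd((typeMT, cell), k => resolve(k.TypeMT, k.Cell));
}
```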
Force-pushed from 4111af4 to de632c6.
```asm
// the universal transition thunk as an argument to RhpCidResolve
mov     r11, r10
mov     r10, [rip + REL_C_FUNC(RhpCidResolve)]
jmp     qword ptr [rip + REL_C_FUNC(RhpUniversalTransition_DebugStepTailCall)]
```
UniversalTransition and everything related to it (including conservative GC support) can go too.
i386/StubDispatch.S can be deleted as well.
The whole i386 directory could probably be deleted, since NativeAOT doesn't support it.
Yes, you would need a cell per call site. This is where {lastObjType, lastMatchingMethod} would live, which would work as a monomorphic cache.
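In cell terms, the layout could look something like the following sketch. This is hypothetical; the real cell would still need whatever fields slot recovery requires:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical cell layout for the new scheme: the code-address field is no
// longer needed (the callsite calls the helper directly), but the cell grows
// to hold the monomorphic cache pair.
[StructLayout(LayoutKind.Sequential)]
struct InterfaceDispatchCellWithCache
{
    public IntPtr CellInfo;           // interface MethodTable / slot info, as before
    public IntPtr LastObjType;        // MethodTable of the last 'this' dispatched here
    public IntPtr LastMatchingMethod; // target that LastObjType resolved to
}
```

Growing the cell from two pointers to three or four is what feeds the footprint and cache-line concerns raised later in the thread.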
```csharp
public static unsafe IntPtr InterfaceLookupForSlot(object obj, IntPtr slot)
{
    // TODO: do something more optimal for monomorphic case
```
I assume here you will dereference the slot/cell and, if obj.GetMethodTable() matches the last seen MT, just return the last used callee?
Perhaps if the lookup in the bigger cache and the cell update are in a separate method, the monomorphic check may even be small enough to be inlineable into the caller. Can these kinds of helper calls be inlined by the JIT?
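A sketch of what that split could look like, building on the quoted helper and the hypothetical cell layout sketched earlier. GetMethodTable stands in for the runtime's internal accessor, and InterfaceLookupForSlotSlow is the out-of-line part:

```csharp
public static unsafe IntPtr InterfaceLookupForSlot(object obj, IntPtr cell)
{
    var c = (InterfaceDispatchCellWithCache*)cell;

    // Monomorphic fast path: if this callsite last dispatched on the same
    // MethodTable, reuse the previously resolved target.
    IntPtr mt = GetMethodTable(obj); // stand-in for the internal accessor
    if (mt == c->LastObjType)
        return c->LastMatchingMethod;

    // Keep everything else (bigger-cache lookup, cell update) out of line so
    // this method stays small enough to be inlineable into the caller.
    return InterfaceLookupForSlotSlow(obj, cell);
}
```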
> Can these kinds of helper calls be inlined by the JIT?

These helpers look the same to the JIT as e.g. casting helpers. IIRC the JIT doesn't know they're in managed code and that it could ask for the IL.
```csharp
private static unsafe IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr slot)
{
    // TODO: this is wrong for IDynIntfCastable
```
Perhaps just do not cache the lookup result for IDynIntfCastable, and only update the monomorphic data when there is a cache hit in the larger cache.
It would mean that IDynIntfCastable always results in two misses - first in the monomorphic check, then in the larger cache - and then it has to resolve. However, the non-IDynIntfCastable case will be fast.
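A sketch of that policy, with TryGetFromLargerCache, ResolveDispatch, and AddToLargerCache as hypothetical stand-ins for the real lookup and resolution helpers:

```csharp
private static unsafe IntPtr InterfaceLookupForSlotSlow(object obj, IntPtr cell)
{
    var c = (InterfaceDispatchCellWithCache*)cell;
    IntPtr mt = GetMethodTable(obj); // stand-in for the internal accessor

    // Hit in the larger shared cache: this is the only place the monomorphic
    // pair gets refreshed, per the suggestion above.
    if (TryGetFromLargerCache(mt, cell, out IntPtr target))
    {
        c->LastObjType = mt;
        c->LastMatchingMethod = target;
        return target;
    }

    // Full resolution. IDynamicInterfaceCastable results are returned but
    // never cached, so they always pay both misses; everything else is added
    // to the larger cache for next time.
    target = ResolveDispatch(obj, cell, out bool usedDynamicInterfaceCastable);
    if (!usedDynamicInterfaceCastable)
        AddToLargerCache(mt, cell, target);
    return target;
}
```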
I ran a couple benchmarks. Although for Jan's test cases https://gist.github.com/jkotas/2d8feea834bdbc7c24c8f9c395117e1d and https://gist.github.com/jkotas/d140ad43d9766134f36c15395706878c what's in this PR absolutely crushes the old cache, we're obviously slower for the less polymorphic cases.

Trying something more real-world (JSON serialization), I see an 8% slowdown. (I gimped the baseline to have similar characteristics - disabled whole-program-based devirtualization and hobbled inlining of methods that do an interface call - to make it more apples-to-apples, since these are known problems in this PR.) I also applied Jakob's old JIT change to allow the JIT to tail call the slow helper; I'm going to assume that is going to be fixed, as we'll need every instruction saving we can get.

Polymorphism around 2 to maybe 8 possible implementing classes is probably a lost cause; we won't be as fast as the existing cache there.

I'll look into the monomorphic case. Implementing it is a bit annoying, because doing the primitive thing (just add the space for two pointers!) makes it impractical to use the cell address as the cache key (we could use it, but the cache would grow by a lot because cells are per callsite). I'm also struggling to align things so that the dispatch cells don't straddle CPU cache lines (since we'll need to read their contents). So I'm meditating on this...
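To make the straddling concern concrete, here is the arithmetic as a tiny hedged check, assuming 64-byte cache lines and the hypothetical cell sizes from earlier:

```csharp
// A cell straddles a cache line iff its start offset within the line plus its
// size exceeds the line. 16-byte cells on 16-byte alignment never straddle a
// 64-byte line; 24- or 32-byte cells can, unless the allocator over-aligns them.
static bool StraddlesCacheLine(ulong cellAddress, ulong cellSize, ulong lineSize = 64)
    => (cellAddress % lineSize) + cellSize > lineSize;
```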
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.