Add support for delegate GDV and method-based vtable GDV #68703
Allow method handle histograms in .mibc files and in the PGO text format. Contributes to dotnet#44610.
…ethod-handle-instrumentation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
Add support for instrumenting delegate calls and vtable calls into method handle histograms. Use these histograms to do GDV for delegate calls and also support method-based GDV for vtable calls.
For instrumentation we now support class probes at interface call sites, method probes at delegate call sites and both class probes and method probes at vtable call sites. For vtable calls, when turned on, instrumentation produces both histograms as PGO data so that the JIT can later choose the best form of guard to use at that site.
For guarding, there are some things to take into account. Delegate calls currently (practically) always point to precode, so this prototype is just guarding on
For vtable calls the runtime will backpatch the slots when tiering, so the JIT guards the address retrieved from the vtable against an indirection of the slot, which is slightly more expensive than a class-based guard.
Currently the instrumentation is enabled conditionally with
Simple microbenchmark:
using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;

public class Program
{
    public static void Main()
    {
        long[] nums = Enumerable.Range(0, 100000).Select(i => (long)i).ToArray();
        long finalSum = 0;
        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++)
        {
            // The same lambda is passed every time, so the delegate call site in Sum is monomorphic.
            finalSum += Sum(nums, i => i + i);
        }
        Console.WriteLine("{0} in {1}ms", finalSum, timer.ElapsedMilliseconds);
    }

    // NoInlining keeps the delegate invocation inside Sum as a separate call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static long Sum(long[] arr, Func<long, long> t)
    {
        long sum = 0;
        foreach (long l in arr)
        {
            sum += t(l);
        }
        return sum;
    }
}
With
Code diff:
; Assembly listing for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
-; partially interruptible
-; with PGO: edge weights are valid, and fgCalledCount is 133
+; fully interruptible
+; with PGO: edge weights are valid, and fgCalledCount is 2
+; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T05] ( 3, 3 ) ref -> rcx class-hnd single-def
-; V01 arg1 [V01,T01] ( 4,200408 ) ref -> rsi class-hnd single-def
-; V02 loc0 [V02,T02] ( 4,200408 ) long -> rdi
-; V03 loc1 [V03,T03] ( 3,100205 ) ref -> rbx class-hnd single-def
-; V04 loc2 [V04,T00] ( 5,400813 ) int -> rbp
-;* V05 loc3 [V05 ] ( 0, 0 ) long -> zero-ref
+; V00 arg0 [V00,T07] ( 3, 3 ) ref -> rcx class-hnd single-def
+; V01 arg1 [V01,T01] ( 6,200332 ) ref -> rsi class-hnd single-def
+; V02 loc0 [V02,T02] ( 4,200332 ) long -> rdi
+; V03 loc1 [V03,T05] ( 3,100167 ) ref -> rbx class-hnd single-def
+; V04 loc2 [V04,T00] ( 5,400661 ) int -> rbp
+; V05 loc3 [V05,T03] ( 3,200330 ) long -> rdx
; V06 OutArgs [V06 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
-; V07 cse0 [V07,T04] ( 3,100205 ) int -> r14 "CSE - aggressive"
+; V07 tmp1 [V07,T04] ( 3,200330 ) long -> rdx "guarded devirt return temp"
+;* V08 tmp2 [V08 ] ( 0, 0 ) ref -> zero-ref class-hnd "guarded devirt this exact temp"
+; V09 cse0 [V09,T06] ( 3,100167 ) int -> r14 "CSE - aggressive"
;
; Lcl frame size = 32
G_M53331_IG01: ;; offset=0000H
4156 push r14
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC20 sub rsp, 32
488BF2 mov rsi, rdx
- ;; size=13 bbWeight=1 PerfScore 5.50
+ ;; size=13 bbWeight=0 PerfScore 0.00
G_M53331_IG02: ;; offset=000DH
33FF xor edi, edi
488BD9 mov rbx, rcx
33ED xor ebp, ebp
448B7308 mov r14d, dword ptr [rbx+8]
4585F6 test r14d, r14d
- 7E18 jle SHORT G_M53331_IG04
+ 7E2A jle SHORT G_M53331_IG05
;; size=16 bbWeight=1 PerfScore 4.00
G_M53331_IG03: ;; offset=001DH
- 8BD5 mov edx, ebp
- 488B54D310 mov rdx, qword ptr [rbx+8*rdx+16]
- 488B4E08 mov rcx, gword ptr [rsi+8]
- FF5618 call [rsi+24]System.Func`2[Int64,Int64][System.Int64,System.Int64]:Invoke(long):long:this
- 4803F8 add rdi, rax
+ 8BC5 mov eax, ebp
+ 488B54C310 mov rdx, qword ptr [rbx+8*rax+16]
+ 48B8E8E23905FC7F0000 mov rax, 0x7FFC0539E2E8
+ 48394618 cmp qword ptr [rsi+24], rax
+ 7521 jne SHORT G_M53331_IG07
+ 488B4608 mov rax, gword ptr [rsi+8]
+ 3800 cmp byte ptr [rax], al
+ 4803D2 add rdx, rdx
+ ;; size=32 bbWeight=100165 PerfScore 1176938.75
+G_M53331_IG04: ;; offset=003DH
+ 4803FA add rdi, rdx
FFC5 inc ebp
443BF5 cmp r14d, ebp
- 7FE8 jg SHORT G_M53331_IG03
- ;; size=24 bbWeight=100203 PerfScore 901827.00
-G_M53331_IG04: ;; offset=0035H
+ 7FD6 jg SHORT G_M53331_IG03
+ ;; size=10 bbWeight=100165 PerfScore 175288.75
+G_M53331_IG05: ;; offset=0047H
488BC7 mov rax, rdi
;; size=3 bbWeight=1 PerfScore 0.25
-G_M53331_IG05: ;; offset=0038H
+G_M53331_IG06: ;; offset=004AH
4883C420 add rsp, 32
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
C3 ret
;; size=11 bbWeight=1 PerfScore 3.75
+G_M53331_IG07: ;; offset=0055H
+ 488B4E08 mov rcx, gword ptr [rsi+8]
+ FF5618 call [rsi+24]System.Func`2[Int64,Int64][System.Int64,System.Int64]:Invoke(long):long:this
+ 488BD0 mov rdx, rax
+ EBDC jmp SHORT G_M53331_IG04
+ ;; size=12 bbWeight=0 PerfScore 0.00
-; Total bytes of code 67, prolog size 10, PerfScore 901847.20, instruction count 29, allocated bytes for code 67 (MethodHash=59ea2fac) for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
+; Total bytes of code 97, prolog size 13, PerfScore 1352245.20, instruction count 37, allocated bytes for code 97 (MethodHash=59ea2fac) for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
; ============================================================
There are some unexpected layout issues due to chained GDV, for example in simple examples
TODO:
This is currently based on top of #67919
cc @dotnet/jit-contrib
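The effect of the guard in the listing can be approximated at source level. The following is a minimal C++ sketch, not the JIT's actual output: a plain function pointer stands in for the delegate target, and the names `add_self`, `negate`, `sum_indirect` and `sum_guarded` are hypothetical. The guarded loop compares the target against the one the profile predicted, runs the inlined body on a match, and otherwise falls back to the indirect call.

```cpp
#include <cstdint>
#include <vector>

using Fn = std::int64_t (*)(std::int64_t);

// Stand-in for the lambda `i => i + i` from the microbenchmark.
std::int64_t add_self(std::int64_t x) { return x + x; }

// A second possible target, used only to exercise the fallback path.
std::int64_t negate(std::int64_t x) { return -x; }

// Baseline: every iteration performs an indirect call.
std::int64_t sum_indirect(const std::vector<std::int64_t>& arr, Fn f)
{
    std::int64_t sum = 0;
    for (std::int64_t v : arr)
        sum += f(v);
    return sum;
}

// Guarded version: test the target on every iteration.
std::int64_t sum_guarded(const std::vector<std::int64_t>& arr, Fn f)
{
    std::int64_t sum = 0;
    for (std::int64_t v : arr)
    {
        std::int64_t r;
        if (f == &add_self) // guard: corresponds to the cmp/jne pair in the listing
            r = v + v;      // inlined body of the predicted target
        else
            r = f(v);       // cold fallback: the original indirect call
        sum += r;
    }
    return sum;
}
```

Both functions return the same result for any target; only the hot path differs.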
This is amazing. In theory could it hoist the check, clone the loop and then have a loop with a simple add, and a loop with the fallback?
Yes, we're working on that too. See #65206 which includes a link to a prototype that does this (at least for type tests -- should not be hard to make it work for delegate tests as well).
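The idea discussed in this exchange can be sketched at source level as follows. This is an illustrative C++ approximation with a function pointer standing in for the delegate; `add_self`, `square` and `sum_cloned` are hypothetical names, and the sketch says nothing about how the prototype in #65206 is implemented. Because the target does not change inside the loop, the guard can be tested once and the loop duplicated into a fast copy and a fallback copy.

```cpp
#include <cstdint>
#include <vector>

using Fn = std::int64_t (*)(std::int64_t);

// Predicted target (stand-in for `i => i + i`).
std::int64_t add_self(std::int64_t x) { return x + x; }

// Some other target that takes the fallback loop.
std::int64_t square(std::int64_t x) { return x * x; }

std::int64_t sum_cloned(const std::vector<std::int64_t>& arr, Fn f)
{
    std::int64_t sum = 0;
    if (f == &add_self)
    {
        // Fast loop: the guard was hoisted, so the body is a simple add.
        for (std::int64_t v : arr)
            sum += v + v;
    }
    else
    {
        // Fallback loop: unchanged indirect call.
        for (std::int64_t v : arr)
            sum += f(v);
    }
    return sum;
}
```

The hoist is only valid when the compared value is loop-invariant, which holds here because `f` is never reassigned in the loop.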
Very cool to see this! I just skimmed the changes; will take a closer look -- may not get to it until next week. We should push to get the prereq #67919 merged.
Not sure what kind of messes you're seeing, but GDV even without chaining can cause problems, e.g. #67318.
It would be interesting to see how many sites are type-polymorphic but method-monomorphic.
We should probably work on streamlining the cost of the class/method probes. A simple thing would be to hoist up the check where we decide to update the table; if we're not going to do a table update we don't need to do all the work to figure out what values we'd record. That should help somewhat, especially for the class/method version where those computations are more complex.
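The probe streamlining suggested here can be sketched as below. This is a hypothetical C++ model, not the runtime's helper: it assumes a small fixed-size table that is filled first and then updated by random replacement (reservoir sampling), and the names `Histogram`, `pick_slot`, `probe` and the table size are all made up for illustration. The point is the ordering: decide whether a slot will be written before doing the work of resolving the value to record.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Assumed table size, chosen only for this sketch.
constexpr std::size_t kTableSize = 8;

// Hypothetical histogram: a hit counter plus a fixed-size sample table.
struct Histogram
{
    std::uint32_t  count = 0;
    std::uintptr_t table[kTableSize] = {};
};

// Counts the hit and returns the slot to overwrite, or -1 if this hit is not sampled.
template <typename Rng>
int pick_slot(Histogram& h, Rng& rng)
{
    std::uint32_t n = h.count++;
    if (n < kTableSize)
        return static_cast<int>(n); // table not full yet: always record
    // Reservoir sampling: keep the new sample with probability kTableSize / (n + 1).
    std::uniform_int_distribution<std::uint32_t> dist(0, n);
    std::uint32_t r = dist(rng);
    return r < kTableSize ? static_cast<int>(r) : -1;
}

// The update decision comes first; `resolve` (the costly part) runs only when needed.
template <typename Rng, typename Resolve>
void probe(Histogram& h, Rng& rng, Resolve resolve)
{
    int slot = pick_slot(h, rng);
    if (slot < 0)
        return; // no table update, so skip resolving the handle
    h.table[slot] = resolve();
}
```

With this ordering the cost of `resolve` is paid only on the small fraction of hits that actually write to the table once it is full.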
That optimization makes sense. On another note I'm not sure if the combined profile helper is worth much, since from the call site and class we should be able to get the resolved method anyway. So it might just go away in the final version.
Impact for select TechEmpower benchmarks:
We don't gain that much, which is not that unexpected since we do not use delegates very much in those benchmarks.
Here's the data in terms of how many times a probe at a specific IL offset was hit:
The same data for type GDV is https://gist.github.com/jakobbotsch/0903b426eddb69fd830662b228d5ca32
This makes us support generating and consuming the method handle histograms in the same way as the type handle histograms by compressing them before they are put in the R2R format. Also finish some SPMI support.
This should be ready. I am enabling delegate GDV by default in this PR, while keeping the vtable profiling disabled. The impact on size of PGO data and SPC.dll with delegate GDV is:
For TechEmpower impact on PGO, see the comment above.
cc @dotnet/jit-contrib PTAL @AndyAyersMS @EgorBo
  timeoutInMinutes: 390
- ${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter') }}:
+ ${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter', 'pgo') }}:
  timeoutInMinutes: 510
I am thinking of keeping these timeout changes until we get around to splitting up the jobs. They fix the timeouts we are seeing, and I have not seen any issues with making the timeout this high.
sounds good.
    // Possibly instrument. Note for OSR+PGO we will instrument when
    // optimizing and (currently) won't devirtualize. We may want
    // to revisit -- if we can devirtualize we should be able to
    // suppress the probe.
    //
    // We strip BBINSTR from inlinees currently, so we'll only
    // do this for the root method calls.
    //
    if (opts.jitFlags->IsSet(JitFlags::JIT_FLAG_BBINSTR))
    {
        assert(opts.OptimizationDisabled() || opts.IsOSR());
        assert(!compIsForInlining());

        // During importation, optionally flag this block as one that
        // contains calls requiring class profiling. Ideally perhaps
        // we'd just keep track of the calls themselves, so we don't
        // have to search for them later.
        //
        if ((call->gtCallType != CT_INDIRECT) && opts.jitFlags->IsSet(JitFlags::JIT_FLAG_BBINSTR) &&
            !opts.jitFlags->IsSet(JitFlags::JIT_FLAG_PREJIT) && (JitConfig.JitClassProfiling() > 0) &&
            !isLateDevirtualization)
        {
            JITDUMP("\n ... marking [%06u] in " FMT_BB " for class profile instrumentation\n", dspTreeID(call),
                    compCurBB->bbNum);
            ClassProfileCandidateInfo* pInfo = new (this, CMK_Inlining) ClassProfileCandidateInfo;

            // Record some info needed for the class profiling probe.
            //
            pInfo->ilOffset                   = ilOffset;
            pInfo->probeIndex                 = info.compClassProbeCount++;
            call->gtClassProfileCandidateInfo = pInfo;

            // Flag block as needing scrutiny
            //
            compCurBB->bbFlags |= BBF_HAS_CLASS_PROFILE;
        }
        return;
    }

    // Bail if optimizations are disabled.
    if (opts.OptimizationDisabled())
    {
        return;
    }
This code was extracted into impConsiderCallProbe which is called in impImportCall instead. Unlike impDevirtualizeCall, impConsiderCallProbe is called for delegates too.
    // TODO-GDV: This can be simplified to just use likelyClasses and
    // likelyMethods now that we have multiple candidates here.
I'll do this in a follow-up change.
    uint32_t likelyClassAttribs = 0;
    if (likelyClass != NO_CLASS_HANDLE)
    {
        likelyClassAttribs = info.compCompHnd->getClassAttribs(likelyClass);
This stuff is easier to review with whitespace ignored.
Test failure is #68376. superpmi failures are expected due to JIT-EE GUID change.
Overall this looks great.
    // Reusing the call target for delegates is more
    // complicated. Essentially we need to do the
    // transformation done in LowerDelegateInvoke by converting
    // the call to CT_INDIRECT and reusing the target address.
At one point I convinced myself we should do this lowering much earlier anyways, so we had a shot at propagating the method of locally created delegate to the call site.
Not sure if that would help here or not.
From a code-base standpoint I also think that would be nice. We have many places where we need to special-case IsDelegateInvoke(), since a delegate invoke looks just like a normal managed function call but isn't one.
It's possible that would help here, assuming CSE would then handle the shared call target for the check and the cold case. In that case we perhaps wouldn't even need the special handling here. Maybe something to experiment with in a future PR.
Also PTAL @davidwrighton for the non-JIT/SPMI changes. I have bumped the R2R minor version again since we now may see
One thing I would like to draw attention to is that old versions of dotnet-pgo (from before #67919) fail with an "Unknown PGO type" error if they see this new data. So one needs to use a version of dotnet-pgo that matches the version of the runtime used to produce the data.
@AndyAyersMS I believe I addressed your feedback if you can sign off. I hope to merge this tonight to have it included in the weekend jit-pgo runs.
I missed this change. This is a big deal 😀!

Add support for instrumenting delegate calls and vtable calls into method handle histograms. Use these histograms to do GDV for delegate calls and also support method-based GDV for vtable calls.
For instrumentation we now support class probes at interface call sites, method probes at delegate call sites and both class probes and method probes at vtable call sites. For vtable calls, when turned on, instrumentation produces both histograms as PGO data so that the JIT can later make the choice about what is the best form of guard to use at that site.
For guarding, there are some things to take into account. Delegate calls currently (practically) always point to precode, so this PR is just guarding on getFunctionFixedEntryPoint, which returns the precode address, and this is generally quite cheap (same cost as class-based GDV). That's the case for delegates pointing to instance methods anyway; this PR does not support static methods yet -- those will be more expensive.
For vtable calls the runtime will backpatch the slots when tiering, so the JIT guards the address retrieved from the vtable against an indirection of the slot, which is slightly more expensive than a class-based guard.
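The difference between a class-based and a method-based guard at a vtable call site can be illustrated with a hand-rolled method table. This C++ sketch is only a model of the idea, not the runtime's object layout: `MethodTable`, `Object`, `kClassA`/`kClassB`/`kClassC`, `kExpectedSlot` and both `call_*` functions are hypothetical. The method-based guard reads the expected target through a slot rather than comparing against a constant, which is one reading of "an indirection of the slot" above and accounts for the extra load.

```cpp
#include <cstdint>

struct Object;
using Method = std::int64_t (*)(const Object*);

// Hand-rolled method table: identifies the class and holds one virtual slot.
struct MethodTable
{
    Method slot0;
};

struct Object
{
    const MethodTable* mt;
    std::int64_t       value;
};

std::int64_t base_get(const Object* o) { return o->value; }
std::int64_t other_get(const Object* o) { return -o->value; }

// A and B are different classes sharing one implementation; C overrides it.
const MethodTable kClassA{&base_get};
const MethodTable kClassB{&base_get};
const MethodTable kClassC{&other_get};

// Slot holding the expected target; read at guard time because slots may be rewritten.
const Method* const kExpectedSlot = &kClassA.slot0;

// Class-based guard: one compare on the method table pointer.
std::int64_t call_class_guard(const Object* o, bool& fast)
{
    fast = (o->mt == &kClassA);
    return fast ? o->value : o->mt->slot0(o);
}

// Method-based guard: load the object's slot and compare with the expected slot's contents.
std::int64_t call_method_guard(const Object* o, bool& fast)
{
    Method target = o->mt->slot0;
    fast = (target == *kExpectedSlot);
    return fast ? o->value : target(o);
}
```

For an object of class B the class-based guard misses while the method-based guard hits, which is the type-polymorphic but method-monomorphic case.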
Currently the instrumentation is enabled conditionally with COMPlus_JitDelegateProfiling=1 (for delegates) and COMPlus_JitVTableProfiling=1 (for vtable calls). Currently delegate profiling is turned on by default while vtable profiling is off by default.
Simple microbenchmark:
TODO:
- Support delegates pointing to value type instance methods (Follow-up)
- Support delegates pointing to static functions (Follow-up)
- R2R support (Follow-up)

Contributes to #44610
cc @dotnet/jit-contrib