Skip to content

Conversation

@jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Apr 29, 2022

Add support for instrumenting delegate calls and vtable calls into method handle histograms. Use these histograms to do GDV for delegate calls and also support method-based GDV for vtable calls.

For instrumentation we now support class probes at interface call sites, method probes at delegate call sites and both class probes and method probes at vtable call sites. For vtable calls, when turned on, instrumentation produces both histograms as PGO data so that the JIT can later make the choice about what is the best form of guard to use at that site.

For guarding, there are some things to take into account. Delegate calls currently (practically) always point to precode, so this PR is just guarding on getFunctionFixedEntryPoint which returns the precode address, and this is generally quite cheap (same cost as class-based GDV). That's the case for delegates pointing to instance methods anyway, this PR does not support static methods yet -- those will be more expensive.

For vtable calls the runtime will backpatch the slots when tiering, so the JIT guards the address retrieved from the vtable against an indirection of the slot, which is slightly more expensive than a class-based guard.

Currently the instrumentation is enabled conditionally with COMPlus_JitDelegateProfiling=1 (for delegates) and COMPlus_JitVTableProfiling=1 (for vtable calls). Currently delegate profiling is turned on by default while vtable profiling is off by default.
Simple microbenchmark:

public class Benchmark
{
    private readonly long[] _nums;
    public Benchmark()
    {
        _nums = Enumerable.Range(0, 100000).Select(i => (long)i).ToArray();
    }

    [Benchmark]
    public long Sum() => _nums.Sum(l => l * l);
}
Method Job Toolchain Mean Error StdDev Ratio
Sum Job-QWNDLL \nopgo\corerun.exe 406.65 us 0.718 us 0.560 us 1.00
Sum Job-PNPEDU \tieredpgo_no_delegate_gdv\corerun.exe 172.77 us 0.819 us 0.766 us 0.42
Sum Job-KFFWQK \tieredpgo_delegate_gdv\corerun.exe 91.38 us 0.263 us 0.219 us 0.22

TODO:

  • Work out why chained GDV regresses block layout sometimes (resolved with JIT: defer some flow graph reordering until after loop recognition #69878)
  • Add to PGO CI pipeline
  • Support randomized GDVs for method handle histograms
  • Investigate some OSR interactions
  • Evaluate proper likelihood heuristics for delegates
  • Optimize cold path for delegate GDV -- we should reuse the target
  • Devirtualized delegate calls should not null check
  • Optimize profiling helpers
  • Add signature compatibility check
  • Add method address handle comment in disasm (Add disasm comments for field data addresses and code addresses #70437)
  • Separate ability to do method profiling for delegates and vtable calls
  • Support delegates pointing to value type instance methods (Follow-up)
  • Support delegates pointing to static functions (Follow-up)
  • R2R support (Follow-up)

Contributes to #44610

cc @dotnet/jit-contrib

@ghost ghost assigned jakobbotsch Apr 29, 2022
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 29, 2022
@ghost
Copy link

ghost commented Apr 29, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Add support for instrumenting delegate calls and vtable calls into method handle histograms. Use these histograms to do GDV for delegate calls and also support method-based GDV for vtable calls.

For instrumentation we now support class probes at interface call sites, method probes at delegate call sites and both class probes and method probes at vtable call sites. For vtable calls, when turned on, instrumentation produces both histograms as PGO data so that the JIT can later make the choice about what is the best form of guard to use at that site.

For guarding, there are some things to take into account. Delegate calls currently (practically) always point to precode, so this prototype is just guarding on getFunctionFixedEntryPoint which returns the precode address, and this is generally quite cheap (same cost as class-based GDV). That's the case for delegates pointing to instance methods anyway, this prototype does not support static methods yet -- those will be more expensive.

For vtable calls the runtime will backpatch the slots when tiering, so the JIT guards the address retrieved from the vtable against an indirection of the slot, which is slightly more expensive than a class-based guard.

Currently the instrumentation is enabled conditionally with COMPlus_JitMethodProfiling=1 and is off by default.

Simple microbenchmark:

public static void Main()
{
    long[] nums = Enumerable.Range(0, 100000).Select(i => (long)i).ToArray();
    long finalSum = 0;
    Stopwatch timer = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        finalSum += Sum(nums, i => i + i);
    }

    Console.WriteLine("{0} in {1}ms", finalSum, timer.ElapsedMilliseconds);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static long Sum(long[] arr, Func<long, long> t)
{
    long sum = 0;
    foreach (long l in arr)
    {
        sum += t(l);
    }

    return sum;
}

With COMPlus_JitMethodProfiling=1: 713 ms
Without COMPlus_JitMethodProfiling=0: 1305 ms

Code diff:

 ; Assembly listing for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
 ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
 ; Tier-1 compilation
 ; optimized code
 ; optimized using profile data
 ; rsp based frame
-; partially interruptible
-; with PGO: edge weights are valid, and fgCalledCount is 133
+; fully interruptible
+; with PGO: edge weights are valid, and fgCalledCount is 2
+; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T05] (  3,     3   )     ref  ->  rcx         class-hnd single-def
-;  V01 arg1         [V01,T01] (  4,200408   )     ref  ->  rsi         class-hnd single-def
-;  V02 loc0         [V02,T02] (  4,200408   )    long  ->  rdi        
-;  V03 loc1         [V03,T03] (  3,100205   )     ref  ->  rbx         class-hnd single-def
-;  V04 loc2         [V04,T00] (  5,400813   )     int  ->  rbp        
-;* V05 loc3         [V05    ] (  0,     0   )    long  ->  zero-ref   
+;  V00 arg0         [V00,T07] (  3,     3   )     ref  ->  rcx         class-hnd single-def
+;  V01 arg1         [V01,T01] (  6,200332   )     ref  ->  rsi         class-hnd single-def
+;  V02 loc0         [V02,T02] (  4,200332   )    long  ->  rdi        
+;  V03 loc1         [V03,T05] (  3,100167   )     ref  ->  rbx         class-hnd single-def
+;  V04 loc2         [V04,T00] (  5,400661   )     int  ->  rbp        
+;  V05 loc3         [V05,T03] (  3,200330   )    long  ->  rdx        
 ;  V06 OutArgs      [V06    ] (  1,     1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
-;  V07 cse0         [V07,T04] (  3,100205   )     int  ->  r14         "CSE - aggressive"
+;  V07 tmp1         [V07,T04] (  3,200330   )    long  ->  rdx         "guarded devirt return temp"
+;* V08 tmp2         [V08    ] (  0,     0   )     ref  ->  zero-ref    class-hnd "guarded devirt this exact temp"
+;  V09 cse0         [V09,T06] (  3,100167   )     int  ->  r14         "CSE - aggressive"
 ;
 ; Lcl frame size = 32
 
 G_M53331_IG01:              ;; offset=0000H
        4156                 push     r14
        57                   push     rdi
        56                   push     rsi
        55                   push     rbp
        53                   push     rbx
        4883EC20             sub      rsp, 32
        488BF2               mov      rsi, rdx
-						;; size=13 bbWeight=1    PerfScore 5.50
+						;; size=13 bbWeight=0    PerfScore 0.00
 G_M53331_IG02:              ;; offset=000DH
        33FF                 xor      edi, edi
        488BD9               mov      rbx, rcx
        33ED                 xor      ebp, ebp
        448B7308             mov      r14d, dword ptr [rbx+8]
        4585F6               test     r14d, r14d
-       7E18                 jle      SHORT G_M53331_IG04
+       7E2A                 jle      SHORT G_M53331_IG05
 						;; size=16 bbWeight=1    PerfScore 4.00
 G_M53331_IG03:              ;; offset=001DH
-       8BD5                 mov      edx, ebp
-       488B54D310           mov      rdx, qword ptr [rbx+8*rdx+16]
-       488B4E08             mov      rcx, gword ptr [rsi+8]
-       FF5618               call     [rsi+24]System.Func`2[Int64,Int64][System.Int64,System.Int64]:Invoke(long):long:this
-       4803F8               add      rdi, rax
+       8BC5                 mov      eax, ebp
+       488B54C310           mov      rdx, qword ptr [rbx+8*rax+16]
+       48B8E8E23905FC7F0000 mov      rax, 0x7FFC0539E2E8
+       48394618             cmp      qword ptr [rsi+24], rax
+       7521                 jne      SHORT G_M53331_IG07
+       488B4608             mov      rax, gword ptr [rsi+8]
+       3800                 cmp      byte  ptr [rax], al
+       4803D2               add      rdx, rdx
+						;; size=32 bbWeight=100165    PerfScore 1176938.75
+G_M53331_IG04:              ;; offset=003DH
+       4803FA               add      rdi, rdx
        FFC5                 inc      ebp
        443BF5               cmp      r14d, ebp
-       7FE8                 jg       SHORT G_M53331_IG03
-						;; size=24 bbWeight=100203    PerfScore 901827.00
-G_M53331_IG04:              ;; offset=0035H
+       7FD6                 jg       SHORT G_M53331_IG03
+						;; size=10 bbWeight=100165    PerfScore 175288.75
+G_M53331_IG05:              ;; offset=0047H
        488BC7               mov      rax, rdi
 						;; size=3 bbWeight=1    PerfScore 0.25
-G_M53331_IG05:              ;; offset=0038H
+G_M53331_IG06:              ;; offset=004AH
        4883C420             add      rsp, 32
        5B                   pop      rbx
        5D                   pop      rbp
        5E                   pop      rsi
        5F                   pop      rdi
        415E                 pop      r14
        C3                   ret      
 						;; size=11 bbWeight=1    PerfScore 3.75
+G_M53331_IG07:              ;; offset=0055H
+       488B4E08             mov      rcx, gword ptr [rsi+8]
+       FF5618               call     [rsi+24]System.Func`2[Int64,Int64][System.Int64,System.Int64]:Invoke(long):long:this
+       488BD0               mov      rdx, rax
+       EBDC                 jmp      SHORT G_M53331_IG04
+						;; size=12 bbWeight=0    PerfScore 0.00
 
-; Total bytes of code 67, prolog size 10, PerfScore 901847.20, instruction count 29, allocated bytes for code 67 (MethodHash=59ea2fac) for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
+; Total bytes of code 97, prolog size 13, PerfScore 1352245.20, instruction count 37, allocated bytes for code 97 (MethodHash=59ea2fac) for method Program:Sum(System.Int64[],System.Func`2[Int64,Int64]):long
 ; ============================================================

There are some unexpected layout issues due to chained GDV, for example in simple examples Enumerable.Sum regresses when method profiling is on because the JIT sees an extra chained GDV opportunity, and ends up making a mess of the flow graph.

TODO:

  • Work out why chained GDV regresses block layout sometimes
  • Support delegates pointing to static functions
  • Add to PGO CI pipeline
  • Investigate some OSR interactions

This is currently based on top of #67919
Contributes to #44610

cc @dotnet/jit-contrib

Author: jakobbotsch
Assignees: jakobbotsch
Labels:

area-CodeGen-coreclr

Milestone: -

@bencyoung-Fignum
Copy link

This is amazing. In theory could it hoist the check, clone the loop and then have a loop with a simple add, and a loop with the fallback?

@AndyAyersMS
Copy link
Member

This is amazing. In theory could it hoist the check, clone the loop and then have a loop with a simple add, and a loop with the fallback?

Yes, we're working on that too. See #65206 which includes a link to a prototype that does this (at least for type tests -- should not be hard to make it work for delegate tests as well).

@AndyAyersMS
Copy link
Member

Very cool to see this! I just skimmed the changes; will take a closer look -- may not get to it until next week.

We should push to get the prereq #67919 merged.

There are some unexpected layout issues due to chained GDV, for example in simple examples Enumerable.Sum regresses when method profiling is on because the JIT sees an extra chained GDV opportunity and ends up making a mess of the flow graph.

Not sure what kind of messes you're seeing, but GDV even without chaining can cause problems, eg #67318.

so that the JIT can later make the choice about what is the best form of guard to use at that site.

Would be interesting to see how many sites are type-polymorphic but method-monomorphic.

both class probes and method probes at vtable call sites.

We should probably work on streamlining the cost of the class/method probes. Simple thing would be to hoist up the check where we decide to update the table; if we're not going to do a table update we don't need to do all the work to figure out what values we'd record. That should help somewhat, especially for the class/method version where those computations are more complex.

@jakobbotsch
Copy link
Member Author

We should probably work on streamlining the cost of the class/method probes. Simple thing would be to hoist up the check where we decide to update the table; if we're not going to do a table update we don't need to do all the work to figure out what values we'd record. That should help somewhat, especially for the class/method version where those computations are more complex.

That optimization makes sense. On another note I'm not sure if the combined profile helper is worth much, since from the call site and class we should be able to get the resolved method anyway. So it might just go away in the final version.

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Jun 14, 2022

Impact for select TechEmpower benchmarks:
image

We don't gain that much, which is not that unexpected since we do not use delegates very much in those benchmarks. Here's the data in terms of how many times a probe at a specific IL offset was hit:
https://gist.github.com/jakobbotsch/1ef169922db9be5b3740e05a5ea9544f
For example, the data shows that the function in database-fortunes that has the most collected delegate GDV data was Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.IOQueue.System.Threading.IThreadPoolWorkItem.Execute, which had a delegate call site at IL offset 27 that was probed 150243 times (I increased the call counting threshold when gathering the data).

The same data for type GDV is https://gist.github.com/jakobbotsch/0903b426eddb69fd830662b228d5ca32
which has way more data (~12x more lines in that text output) and also the call sites are much hotter.

This makes us support generating and consuming the method handle
histograms in the same way as the type handle histograms by compressing
them before they are put in the R2R format.

Also finish some SPMI support.
@jakobbotsch
Copy link
Member Author

This should be ready. I am enabling delegate GDV by default in this PR, while keeping the vtable profiling disabled. The impact on size of PGO data and SPC.dll with delegate GDV is:

File Before After PDIFF
DotNet_FirstTimeXP.mibc 2538 KB 2570 KB +1.26%
DotNet_HelloWorld.mibc 5239 KB 5263 KB +0.46%
DotNet_OrchardCore.mibc 2339 KB 2383 KB +1.88%
DotNet_TechEmpower.mibc 976 KB 992 KB +1.64%
StandardOptimizationData.mibc 7377 KB 7494 KB +1.59%
System.Private.CoreLib.dll 11168 KB 11168 KB +0.00%

For TechEmpower impact on PGO, see the comment above.

cc @dotnet/jit-contrib PTAL @AndyAyersMS @EgorBo

@jakobbotsch jakobbotsch marked this pull request as ready for review June 15, 2022 15:38
Comment on lines -236 to 237
timeoutInMinutes: 390
${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter') }}:
${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter', 'pgo') }}:
timeoutInMinutes: 510
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking to keep these timeout changes until we get around to splitting up the jobs, it fixes the problem we are seeing with timeouts and I have not seen any issues with making the timeout this high.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

Comment on lines -21392 to -21436

// Possibly instrument. Note for OSR+PGO we will instrument when
// optimizing and (currently) won't devirtualize. We may want
// to revisit -- if we can devirtualize we should be able to
// suppress the probe.
//
// We strip BBINSTR from inlinees currently, so we'll only
// do this for the root method calls.
//
if (opts.jitFlags->IsSet(JitFlags::JIT_FLAG_BBINSTR))
{
assert(opts.OptimizationDisabled() || opts.IsOSR());
assert(!compIsForInlining());

// During importation, optionally flag this block as one that
// contains calls requiring class profiling. Ideally perhaps
// we'd just keep track of the calls themselves, so we don't
// have to search for them later.
//
if ((call->gtCallType != CT_INDIRECT) && opts.jitFlags->IsSet(JitFlags::JIT_FLAG_BBINSTR) &&
!opts.jitFlags->IsSet(JitFlags::JIT_FLAG_PREJIT) && (JitConfig.JitClassProfiling() > 0) &&
!isLateDevirtualization)
{
JITDUMP("\n ... marking [%06u] in " FMT_BB " for class profile instrumentation\n", dspTreeID(call),
compCurBB->bbNum);
ClassProfileCandidateInfo* pInfo = new (this, CMK_Inlining) ClassProfileCandidateInfo;

// Record some info needed for the class profiling probe.
//
pInfo->ilOffset = ilOffset;
pInfo->probeIndex = info.compClassProbeCount++;
call->gtClassProfileCandidateInfo = pInfo;

// Flag block as needing scrutiny
//
compCurBB->bbFlags |= BBF_HAS_CLASS_PROFILE;
}
return;
}

// Bail if optimizations are disabled.
if (opts.OptimizationDisabled())
{
return;
}
Copy link
Member Author

@jakobbotsch jakobbotsch Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was extracted into impConsiderCallProbe which is called in impImportCall instead. Unlike impDevirtualizeCall, impConsiderCallProbe is called for delegates too.

Comment on lines +22426 to +22427
// TODO-GDV: This can be simplified to just use likelyClasses and
// likelyMethods now that we have multiple candidates here.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this in a follow-up change.

Comment on lines +22604 to +22607
uint32_t likelyClassAttribs = 0;
if (likelyClass != NO_CLASS_HANDLE)
{
likelyClassAttribs = info.compCompHnd->getClassAttribs(likelyClass);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff is easier to review with whitespace ignored.

@jakobbotsch
Copy link
Member Author

Test failure is #68376. superpmi failures are expected due to JIT-EE GUID change.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great.

Comment on lines -236 to 237
timeoutInMinutes: 390
${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter') }}:
${{ if in(parameters.testGroup, 'gcstress-extra', 'r2r-extra', 'clrinterpreter', 'pgo') }}:
timeoutInMinutes: 510
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

// Reusing the call target for delegates is more
// complicated. Essentially we need to do the
// transformation done in LowerDelegateInvoke by converting
// the call to CT_INDIRECT and reusing the target address.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point I convinced myself we should do this lowering much earlier anyways, so we had a shot at propagating the method of locally created delegate to the call site.

Not sure if that would help here or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a code-base standpoint I also think that would be nice, we have many places we need to special case IsDelegateInvoke() since that looks just like a normal managed function call but isn't.

It's possible that would help here, assuming CSE would then handle the shared call target for the check and cold case. In that case we perhaps wouldn't even need the special handling here. Maybe something to experiment with in a future PR.

@jakobbotsch
Copy link
Member Author

Also PTAL @davidwrighton for the non-JIT/SPMI changes. I have bumped the R2R minor version again since we now may see GetLikelyMethod entries in the embedded PGO data.

One thing I would like to point attention to is that old versions of dotnet-pgo (from before #67919) fail with an "Unknown PGO type" error if they see this new data. So one needs to use a version of dotnet-pgo that matches the version of the runtime used to produce the data.

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@jakobbotsch
Copy link
Member Author

@AndyAyersMS I believe I addressed your feedback if you can sign off. I hope to merge this tonight to have it included in the weekend jit-pgo runs.

@jakobbotsch jakobbotsch merged commit cae8546 into dotnet:main Jun 17, 2022
@jakobbotsch jakobbotsch deleted the method-handle-instrumentation branch June 17, 2022 23:58
@ghost ghost locked as resolved and limited conversation to collaborators Jul 18, 2022
@davidfowl
Copy link
Member

I missed this change. This is a big deal 😀 !

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants