@hez2010 hez2010 commented Mar 7, 2025

Boost the inlining multiplier for methods that return a more derived type than their signature declares.

Example:

foreach (var number in GetNumbers())
{
    Console.WriteLine(number);
}

static IEnumerable<int> GetNumbers()
{
    for (int i = 0; i < 10; i++)
    {
        yield return i;
    }
}

Before:

*************** In fgFindBasicBlocks() for Program+<GetNumbers>d__1:System.Collections.Generic.IEnumerable<System.Int32>.GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 41 : state  32 [ ldc.i4.s ]
weight= 12 : state  51 [ bne.un.s ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 79 : state  40 [ call ]
weight= 12 : state  51 [ bne.un.s ]
weight= 32 : state 218 [ ldarg.0 -> ldc.i4.0 -> stfld ]
weight= 10 : state   3 [ ldarg.0 ]
weight=  6 : state  11 [ stloc.0 ]
weight= 44 : state  43 [ br.s ]
weight= 15 : state  23 [ ldc.i4.0 ]
weight=227 : state 103 [ newobj ]
weight= 20 : state 199 [ stloc.0 -> ldloc.0 ]
weight= 19 : state  42 [ ret ]

2 ldfld or stfld over arguments which are structs.  Multiplier increased to 1.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.
Inline candidate callsite is boring.  Multiplier increased to 3.3.
calleeNativeSizeEstimate=579
callsiteNativeSizeEstimate=85
benefit multiplier=3.3
threshold=280
Native estimate for function size exceeds threshold for inlining 57.9 > 28 (multiplier = 3.3)


Inline expansion aborted, inline not profitable

Resulting in virtual calls:

G_M27646_IG01:  ;; offset=0x0000
       push     rbp
       push     rbx
       sub      rsp, 56
       lea      rbp, [rsp+0x40]
       mov      qword ptr [rbp-0x20], rsp
                                                ;; size=15 bbWeight=1 PerfScore 3.75
G_M27646_IG02:  ;; offset=0x000F
       mov      rcx, 0x7FFA216DFD28      ; Program+<GetNumbers>d__1
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      dword ptr [rbx+0x08], -2
       call     System.Environment:get_CurrentManagedThreadId():int
       mov      dword ptr [rbx+0x10], eax
       mov      rcx, rbx
       call     [Program+<GetNumbers>d__1:System.Collections.Generic.IEnumerable<System.Int32>.GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this]
       mov      rbx, rax
       mov      gword ptr [rbp-0x10], rbx
                                                ;; size=49 bbWeight=1 PerfScore 9.00
G_M27646_IG03:  ;; offset=0x0040
       jmp      SHORT G_M27646_IG05
                                                ;; size=2 bbWeight=1 PerfScore 2.00
G_M27646_IG04:  ;; offset=0x0042
       mov      rcx, rbx
       mov      r11, 0x7FFA20670378      ; code for System.Collections.Generic.IEnumerator`1[int]:get_Current():int:this
       call     [r11]System.Collections.Generic.IEnumerator`1[int]:get_Current():int:this
       mov      ecx, eax
       call     [System.Console:WriteLine(int)]
                                                ;; size=24 bbWeight=4 PerfScore 27.00
G_M27646_IG05:  ;; offset=0x005A
       mov      rcx, rbx
       mov      r11, 0x7FFA20670370      ; code for System.Collections.IEnumerator:MoveNext():ubyte:this
       call     [r11]System.Collections.IEnumerator:MoveNext():ubyte:this
       test     eax, eax
       jne      SHORT G_M27646_IG04
                                                ;; size=20 bbWeight=8 PerfScore 38.00
G_M27646_IG06:  ;; offset=0x006E
       mov      rcx, rbx
       mov      r11, 0x7FFA20670380      ; code for System.IDisposable:Dispose():this
       call     [r11]System.IDisposable:Dispose():this
       nop
                                                ;; size=17 bbWeight=1 PerfScore 3.75
G_M27646_IG07:  ;; offset=0x007F
       add      rsp, 56
       pop      rbx
       pop      rbp
       ret
                                                ;; size=7 bbWeight=1 PerfScore 2.25
G_M27646_IG08:  ;; offset=0x0086
       push     rbp
       push     rbx
       sub      rsp, 40
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x40]
                                                ;; size=19 bbWeight=0 PerfScore 0.00
G_M27646_IG09:  ;; offset=0x0099
       cmp      gword ptr [rbp-0x10], 0
       je       SHORT G_M27646_IG10
       mov      rcx, gword ptr [rbp-0x10]
       mov      r11, 0x7FFA20670380      ; code for System.IDisposable:Dispose():this
       call     [r11]System.IDisposable:Dispose():this
                                                ;; size=24 bbWeight=0 PerfScore 0.00
G_M27646_IG10:  ;; offset=0x00B1
       nop
                                                ;; size=1 bbWeight=0 PerfScore 0.00
G_M27646_IG11:  ;; offset=0x00B2
       add      rsp, 40
       pop      rbx
       pop      rbp
       ret
                                                ;; size=7 bbWeight=0 PerfScore 0.00

; Total bytes of code 185, prolog size 15, PerfScore 85.75, instruction count 50, allocated bytes for code 185 (MethodHash=cb019401) for method Program:Main() (FullOpts)

After:

*************** In fgFindBasicBlocks() for Program+<GetNumbers>d__1:System.Collections.Generic.IEnumerable<System.Int32>.GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 41 : state  32 [ ldc.i4.s ]
weight= 12 : state  51 [ bne.un.s ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 79 : state  40 [ call ]
weight= 12 : state  51 [ bne.un.s ]
weight= 32 : state 218 [ ldarg.0 -> ldc.i4.0 -> stfld ]
weight= 10 : state   3 [ ldarg.0 ]
weight=  6 : state  11 [ stloc.0 ]
weight= 44 : state  43 [ br.s ]
weight= 15 : state  23 [ ldc.i4.0 ]
weight=227 : state 103 [ newobj ]
weight= 20 : state 199 [ stloc.0 -> ldloc.0 ]
weight= 19 : state  42 [ ret ]

multiplier in methods that return a more derived type increased to 4.
2 ldfld or stfld over arguments which are structs.  Multiplier increased to 5.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 6.
Inline candidate callsite is boring.  Multiplier increased to 7.3.
calleeNativeSizeEstimate=579
callsiteNativeSizeEstimate=85
benefit multiplier=7.3
threshold=620
Native estimate for function size is within threshold for inlining 57.9 <= 62 (multiplier = 7.3)
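
The numbers in the two dumps are consistent with a simple rule: each observation adds a bonus to the multiplier, the threshold is the call-site size estimate scaled by the multiplier, and the inline is rejected when the callee estimate exceeds that threshold (the log prints both sides divided by 10). The C++ sketch below only reproduces that arithmetic from the logged values. The function names and the truncating cast are assumptions made for illustration, not the JIT's actual code.

```cpp
#include <cassert>
#include <initializer_list>

// Sum the per-observation bonuses that the dump reports one by one.
double total_multiplier(std::initializer_list<double> bonuses)
{
    double multiplier = 0.0;
    for (double bonus : bonuses)
    {
        multiplier += bonus;
    }
    return multiplier;
}

// Threshold as implied by the dump: call-site estimate scaled by the multiplier.
// Truncating to int is an assumption that matches the logged 280 and 620.
int inline_threshold(int callsite_estimate, double multiplier)
{
    assert(callsite_estimate >= 0 && multiplier >= 0.0);
    return static_cast<int>(callsite_estimate * multiplier);
}

// The inline counts as profitable when the callee estimate fits the threshold.
bool is_profitable(int callee_estimate, int callsite_estimate, double multiplier)
{
    return callee_estimate <= inline_threshold(callsite_estimate, multiplier);
}
```

With the logged estimates (callee 579, call site 85), bonuses of 1 + 1 + 1.3 give a threshold of 280 and the inline is rejected. Adding the new bonus of 4 raises the threshold to 620 and the inline is accepted.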

All virtual calls get devirtualized thanks to late devirtualization:

G_M27646_IG01:  ;; offset=0x0000
       push     rbp
       push     rsi
       push     rbx
       sub      rsp, 48
       lea      rbp, [rsp+0x40]
       mov      qword ptr [rbp-0x20], rsp
                                                ;; size=16 bbWeight=1 PerfScore 4.75
G_M27646_IG02:  ;; offset=0x0010
       mov      rcx, 0x7FFA216CF020      ; Program+<GetNumbers>d__1
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      dword ptr [rbx+0x08], -2
       call     System.Environment:get_CurrentManagedThreadId():int
       mov      dword ptr [rbx+0x10], eax
       cmp      dword ptr [rbx+0x08], -2
       jne      SHORT G_M27646_IG04
                                                ;; size=39 bbWeight=1 PerfScore 8.50
G_M27646_IG03:  ;; offset=0x0037
       mov      esi, dword ptr [rbx+0x10]
       call     System.Environment:get_CurrentManagedThreadId():int
       cmp      esi, eax
       je       SHORT G_M27646_IG05
                                                ;; size=12 bbWeight=0.50 PerfScore 2.12
G_M27646_IG04:  ;; offset=0x0043
       mov      rcx, 0x7FFA216CF020      ; Program+<GetNumbers>d__1
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       xor      eax, eax
       mov      dword ptr [rsi+0x08], eax
       call     System.Environment:get_CurrentManagedThreadId():int
       mov      dword ptr [rsi+0x10], eax
       jmp      SHORT G_M27646_IG06
                                                ;; size=33 bbWeight=0.50 PerfScore 3.38
G_M27646_IG05:  ;; offset=0x0064
       xor      ecx, ecx
       mov      dword ptr [rbx+0x08], ecx
       mov      rsi, rbx
                                                ;; size=8 bbWeight=0.50 PerfScore 0.75
G_M27646_IG06:  ;; offset=0x006C
       mov      gword ptr [rbp-0x18], rsi
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M27646_IG07:  ;; offset=0x0070
       jmp      SHORT G_M27646_IG09
                                                ;; size=2 bbWeight=1 PerfScore 2.00
G_M27646_IG08:  ;; offset=0x0072
       mov      ecx, dword ptr [rsi+0x0C]
       call     [System.Console:WriteLine(int)]
                                                ;; size=9 bbWeight=4 PerfScore 20.00
G_M27646_IG09:  ;; offset=0x007B
       mov      rcx, rsi
       call     [Program+<GetNumbers>d__1:MoveNext():ubyte:this]
       test     eax, eax
       jne      SHORT G_M27646_IG08
                                                ;; size=13 bbWeight=8 PerfScore 36.00
G_M27646_IG10:  ;; offset=0x0088
       mov      dword ptr [rsi+0x08], -2
                                                ;; size=7 bbWeight=1 PerfScore 1.00
G_M27646_IG11:  ;; offset=0x008F
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rbp
       ret
                                                ;; size=8 bbWeight=1 PerfScore 2.75
G_M27646_IG12:  ;; offset=0x0097
       push     rbp
       push     rsi
       push     rbx
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x40]
                                                ;; size=20 bbWeight=0 PerfScore 0.00
G_M27646_IG13:  ;; offset=0x00AB
       mov      rsi, gword ptr [rbp-0x18]
       mov      dword ptr [rsi+0x08], -2
                                                ;; size=11 bbWeight=0 PerfScore 0.00
G_M27646_IG14:  ;; offset=0x00B6
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rbp
       ret
                                                ;; size=8 bbWeight=0 PerfScore 0.00

; Total bytes of code 190, prolog size 16, PerfScore 82.25, instruction count 57, allocated bytes for code 190 (MethodHash=cb019401) for method Program:Main() (FullOpts)
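
The inlined blocks IG02 to IG05 show why the declared return type hides the exact type: the generated GetEnumerator either hands back the same state-machine object or allocates a fresh one, and in both cases the result is the concrete Program+<GetNumbers>d__1 even though the signature says IEnumerator<int>. A rough C++ model of that shape, reconstructed from the IL opcodes and assembly above, is sketched here. The class and member names are hypothetical, and the MoveNext body is a simplified stand-in for the compiler-generated state machine.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Interface the caller sees; every call through it is virtual.
struct IntEnumerator
{
    virtual ~IntEnumerator() = default;
    virtual bool MoveNext() = 0;
    virtual int Current() const = 0;
};

// Hypothetical stand-in for the compiler-generated iterator class.
class NumbersIterator final : public IntEnumerator,
                              public std::enable_shared_from_this<NumbersIterator>
{
public:
    explicit NumbersIterator(int state)
        : state_(state), initial_thread_(std::this_thread::get_id())
    {
    }

    // Declared to return the interface, but always returns a NumbersIterator.
    std::shared_ptr<IntEnumerator> GetEnumerator()
    {
        // Reuse this object on the first enumeration from the creating thread.
        if (state_ == -2 && initial_thread_ == std::this_thread::get_id())
        {
            state_ = 0;
            return shared_from_this();
        }
        // Otherwise start a fresh state machine.
        return std::make_shared<NumbersIterator>(0);
    }

    bool MoveNext() override
    {
        if (state_ == 0)
        {
            i_ = 0; // first call: start the loop
            state_ = 1;
        }
        else if (state_ == 1)
        {
            ++i_; // resume after the previous yield
        }
        else
        {
            return false; // not started or already finished
        }
        if (i_ < 10)
        {
            current_ = i_;
            return true;
        }
        state_ = -1;
        return false;
    }

    int Current() const override { return current_; }

private:
    int state_;
    int current_ = 0;
    int i_ = 0;
    std::thread::id initial_thread_;
};

// Mirrors the foreach loop; sums instead of printing so the result is checkable.
int sum_numbers()
{
    auto enumerable = std::make_shared<NumbersIterator>(-2);
    std::shared_ptr<IntEnumerator> e = enumerable->GetEnumerator();
    assert(e.get() == enumerable.get()); // the same object was reused
    int sum = 0;
    while (e->MoveNext())
    {
        sum += e->Current();
    }
    return sum;
}
```

Once GetEnumerator is inlined, the compiler can see that both branches produce the concrete iterator type, which is what lets the later MoveNext and Current calls be resolved without interface dispatch, as in IG08 and IG09 above.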

cc @AndyAyersMS

Copilot AI review requested due to automatic review settings March 7, 2025 14:47

Copilot AI left a comment

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

@ghost ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Mar 7, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Mar 7, 2025
@dotnet-policy-service

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

hez2010 commented Mar 7, 2025

@MihuBot

if (sig.retType == CORINFO_TYPE_CLASS)
{
    exactSigRet = info.compCompHnd->isExactType(sig.retTypeClass);
}
Member

We avoid calling VM APIs during inlining. This can potentially be slow or trigger type load events for every inline attempt (and we do many of them).

Member

I meant not this one, but eeGetMethodSig(methodHnd, &methodSig); below

@hez2010 hez2010 Mar 7, 2025

> this potentially can be slow/trigger type load events for every inline attempt (and we do many of them)

We are already resolving method tokens for calls while scanning the inlinee, which loads all the type dependencies, so I don't see this to be a problem.

@EgorBo EgorBo Mar 7, 2025

> > this potentially can be slow/trigger type load events for every inline attempt (and we do many of them)
>
> We are already resolving method tokens for calls while scanning the inlinee, which loads all the type dependencies, so I don't see this to be a problem.

Yes, and as far as I remember the initial implementation of that caused a massive throughput (TP) regression, so it's not an excuse to keep calling other VM APIs. If we want to land this, we need to measure it and make sure it's cheap.

The main bottleneck in application startup is VM calls triggered by the JIT's importer.

hez2010 commented Mar 7, 2025

Retriggering @MihuBot, as void returns were not filtered out before.
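
As a rough illustration of the filtering mentioned here, the check on the signature side could look like the sketch below: the bonus is only worth considering when the method returns a class type whose declared type is not exact, so void and primitive returns are skipped. The enum, struct, and function names are hypothetical, and the exactness flag stands in for the result of the `isExactType` query in the reviewed snippet. This is not the PR's actual code.

```cpp
#include <cassert>

// Hypothetical, simplified view of a callee's return signature.
enum class ReturnKind { Void, Primitive, Class };

struct ReturnSig
{
    ReturnKind kind;
    bool declared_type_is_exact; // stand-in for the isExactType(...) result
};

// A callee can only return something more derived than declared when the
// declared return type is a class type that is not already exact.
bool may_return_more_derived(const ReturnSig& sig)
{
    if (sig.kind != ReturnKind::Class)
    {
        return false; // void and primitive returns are filtered out
    }
    return !sig.declared_type_is_exact;
}

// Usage: an inexact class-typed return qualifies, a void return does not.
void check_examples()
{
    assert(may_return_more_derived({ReturnKind::Class, false}));
    assert(!may_return_more_derived({ReturnKind::Void, false}));
}
```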

EgorBo commented Mar 7, 2025

I'm not entirely sure I understand this heuristic, but the multiplier and the JIT diff are both too large to land as is, e.g. the diff is 10x bigger than in Andy's PR to enable EH inlining.

Can you extract the CALLEE_LOOKS_LIKE_WRAPPER part to a separate PR? I think that one makes total sense.

hez2010 commented Mar 7, 2025

> I'm not entirely sure I understand this heuristic, but the multiplier and the JIT diff are both too large to land as is, e.g. the diff is 10x bigger than in Andy's PR to enable EH inlining.

Some diffs seem unexpected to me as well, will investigate later.

> Can you extract the CALLEE_LOOKS_LIKE_WRAPPER part to a separate PR? I think that one makes total sense.

Sure.

hez2010 commented Mar 10, 2025

I think we should resolve the issue here to enable yield devirt instead of aggressively bumping the multiplier in the JIT.

@JulieLeeMSFT JulieLeeMSFT requested a review from EgorBo April 7, 2025 16:19
EgorBo commented May 19, 2025

> Some diffs seem unexpected to me as well, will investigate later.

Let's convert it to a draft then (it shows up in our system as a stale PR).

@EgorBo EgorBo marked this pull request as draft May 19, 2025 07:48
@dotnet-policy-service

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 19, 2025