-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
Moving discussion from the PR #67049
@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g.
const string Input = """
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
""";
[Benchmark]
public string Replace() => Input.Replace('I', 'U');Method Runtime Mean Ratio
Replace .NET 6.0 108.1 ns 1.00
Replace .NET 7.0 136.0 ns 1.26
Do you see otherwise?
gfoidl commented yesterday
Hm, that is not expected...
When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
Method Mean Error StdDev Median Ratio RatioSD
Default 142.0 ns 3.48 ns 9.98 ns 138.6 ns 1.00 0.00
PR 132.9 ns 2.68 ns 3.40 ns 132.8 ns 0.92 0.07
so a result I'd expect, as after the vectorized loop 6 chars are remaining that the old-code processes in the for-loop whilst the new-code does one vectorized pass.
I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK.
Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7?
How can I investigate this further -- need some guidance on how to check code-layout please.
@stephentoub
stephentoub commented yesterday •
Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise?
@tannergooding
tannergooding commented yesterday
When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see
This could be related to stale PGO data
@danmoseley
danmoseley commented yesterday
Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it.
@danmoseley
danmoseley commented yesterday
Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches?
@stephentoub
stephentoub commented yesterday •
I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7.
@danmoseley
danmoseley commented 21 hours ago •
I ran the example above with
var config = DefaultConfig.Instance
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline())
.AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"));
BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config);
and got
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-rc.2.22426.5
[Host] : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
| Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Replace | COMPlus_JitDisablePGO=0 | .NET 6.0 | 130.5 ns | 6.76 ns | 18.51 ns | 124.0 ns | 0.92 | 0.17 | 0.3269 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=0 | .NET 7.0 | 144.0 ns | 2.95 ns | 5.54 ns | 142.5 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=0 | .NET Core 3.1 | 822.1 ns | 16.09 ns | 23.07 ns | 814.0 ns | 5.69 | 0.31 | 0.3262 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=0 | .NET Framework 4.8 | 750.2 ns | 28.86 ns | 82.82 ns | 730.3 ns | 4.97 | 0.49 | 0.3262 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.1 ns | 2.64 ns | 4.75 ns | 126.4 ns | 0.88 | 0.05 | 0.3269 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 144.5 ns | 2.96 ns | 5.97 ns | 144.1 ns | 1.01 | 0.06 | 0.3271 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET Core 3.1 | 936.2 ns | 17.96 ns | 22.06 ns | 931.9 ns | 6.50 | 0.37 | 0.3262 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET Framework 4.8 | 673.2 ns | 12.41 ns | 23.91 ns | 670.5 ns | 4.68 | 0.23 | 0.3262 | 1.34 KB | 1.00 |
| code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251 |
(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )
@danmoseley
danmoseley commented 21 hours ago
and just for interest
| Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.8 ns | 2.55 ns | 5.91 ns | 125.8 ns | 0.95 | 0.05 | 0.3266 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 141.0 ns | 2.73 ns | 2.42 ns | 141.1 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 6.0 | 163.9 ns | 3.35 ns | 4.81 ns | 163.8 ns | 1.15 | 0.05 | 0.3269 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 7.0 | 184.9 ns | 3.59 ns | 4.79 ns | 183.7 ns | 1.32 | 0.05 | 0.3271 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 6.0 | 176.1 ns | 3.44 ns | 4.09 ns | 175.9 ns | 1.25 | 0.03 | 0.3269 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 7.0 | 192.1 ns | 3.81 ns | 4.53 ns | 190.1 ns | 1.37 | 0.05 | 0.3271 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 6.0 | 1,057.4 ns | 20.95 ns | 40.86 ns | 1,047.2 ns | 7.65 | 0.35 | 0.3262 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 7.0 | 947.1 ns | 13.34 ns | 11.83 ns | 948.3 ns | 6.72 | 0.15 | 0.3262 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 6.0 | 496.0 ns | 51.61 ns | 152.17 ns | 463.3 ns | 3.67 | 1.67 | 0.3269 | 1.34 KB | 1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 7.0 | 395.3 ns | 14.32 ns | 41.10 ns | 388.4 ns | 2.95 | 0.27 | 0.3271 | 1.34 KB | 1.00 |
@gfoidl
gfoidl commented 9 hours ago
Do you see a similar 6 vs 7 difference as I do?
Yes (sorry for slow response, was Sunday...).
@danmoseley thanks for your numbers.
This is the machine code I get (from BDN) when run @danmoseley's benchmark (.NET 7 only). Left there some comments.
; Program.Replace()
mov rcx,1C003C090A0
mov rcx,[rcx]
mov edx,49
mov r8d,55
jmp qword ptr [7FFEFA7430C0]
; Total bytes of code 30
; System.String.Replace(Char, Char)
push r15
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,28
vzeroupper
mov rsi,rcx
mov edi,edx
mov ebx,r8d
movzx ecx,di
movzx r8d,bx
cmp ecx,r8d
je near ptr M01_L09
lea rcx,[rsi+0C]
mov r8d,[rsi+8]
movsx rdx,di
call qword ptr [7FFEFA7433C0]
mov ebp,eax
test ebp,ebp
jge short M01_L00
mov rax,rsi ; uncommon case, could jump to M01_L09 instead
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L00:
mov ecx,[rsi+8]
sub ecx,ebp
mov r14d,ecx
mov ecx,[rsi+8]
call System.String.FastAllocateString(Int32)
mov r15,rax
test ebp,ebp
jg near ptr M01_L10 ; should be common path, I don't expect to jump to the end, then back to here
M01_L01:
mov eax,ebp
lea rax,[rsi+rax*2+0C]
cmp [r15],r15b
mov edx,ebp
lea rdx,[r15+rdx*2+0C]
xor ecx,ecx
cmp dword ptr [rsi+8],10
jl near ptr M01_L07
movzx r8d,di
imul r8d,10001 ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected
vmovd xmm0,r8d
vpbroadcastd ymm0,xmm0 ; should be vpbroadcastb, see comment above
movzx r8d,bx
imul r8d,10001
vmovd xmm1,r8d
vpbroadcastd ymm1,xmm1 ; vpbroadcastb (see above)
cmp r14,10
jbe short M01_L03
add r14,0FFFFFFFFFFFFFFF0
M01_L02:
lea r8,[rax+rcx*2]
vmovupd ymm2,[r8]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm3,ymm1 ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this
vpandn ymm2,ymm3,ymm2 ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but
vpor ymm2,ymm4,ymm2 ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series
lea r8,[rdx+rcx*2]
vmovupd [r8],ymm2
add rcx,10
cmp rcx,r14
jb short M01_L02
M01_L03:
mov ecx,[rsi+8]
add ecx,0FFFFFFF0
add rsi,0C
lea rsi,[rsi+rcx*2]
vmovupd ymm2,[rsi]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm0,ymm3,ymm1
vpandn ymm1,ymm3,ymm2
vpor ymm2,ymm0,ymm1
lea rax,[r15+0C]
lea rax,[rax+rcx*2]
vmovupd [rax],ymm2
jmp short M01_L08
M01_L04:
movzx r8d,word ptr [rax+rcx*2]
lea r9,[rdx+rcx*2]
movzx r10d,di
cmp r8d,r10d
je short M01_L05 ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but
jmp short M01_L06 ; one jump could be avoided?!
M01_L05:
movzx r8d,bx
M01_L06:
mov [r9],r8w
inc rcx
M01_L07:
cmp rcx,r14
jb short M01_L04
M01_L08:
mov rax,r15
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L09: ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to
mov rax,rsi ; get less machine code?
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L10: ; this block should be common enough, so should be on the jump-root (see comment above)
cmp [r15],r15b ; it's the Memmove-call
lea rcx,[r15+0C]
lea rdx,[rsi+0C]
mov r8d,ebp
add r8,r8
call qword ptr [7FFEFA7399F0]
jmp near ptr M01_L01
; Total bytes of code 383So from code-layout one major difference to .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot-path.
But I doubt that this allone is the cause for the regression.
I also wonder why vpblendvb is gone when using string.Replace in the benchmark from .NET-bits.
If I use a string.Replace-duplicated code for the benchmark, then it's emitted which is what I expect as 10d8a36 got merged on 2022-05-25.
But that shouldn't cause the regression either, as for .NET 6 the same series of vector-instruction are emitted.
The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong).
So far the "static analysis", but I doubt this is enough.
With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment.
I hope some JIT experts can shed some light on this (and give some advices on how to investigate, as I'm eager to learn).
Machine code for .NET 6 (for reference)
; System.String.Replace(Char, Char)
push r15
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,28
vzeroupper
mov rsi,rcx
movzx edi,dx
movzx ebx,r8w
cmp edi,ebx
jne short M01_L00
mov rax,rsi
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L00:
lea rbp,[rsi+0C]
mov rcx,rbp
mov r14d,[rsi+8]
mov r8d,r14d
mov edx,edi
call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
mov r15d,eax
test r15d,r15d
jge short M01_L01
mov rax,rsi
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L01:
mov esi,r14d
sub esi,r15d
mov ecx,r14d
call System.String.FastAllocateString(Int32)
mov r14,rax
test r15d,r15d
jle short M01_L02
cmp [r14],r14d
lea rcx,[r14+0C]
mov rdx,rbp
mov r8d,r15d
add r8,r8
call System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr)
M01_L02:
movsxd rax,r15d
add rax,rax
add rbp,rax
cmp [r14],r14d
lea rdx,[r14+0C]
add rdx,rax
cmp esi,10
jl short M01_L04
imul eax,edi,10001
vmovd xmm0,eax
vpbroadcastd ymm0,xmm0
imul eax,ebx,10001
vmovd xmm1,eax
vpbroadcastd ymm1,xmm1
M01_L03:
vmovupd ymm2,[rbp]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm1,ymm3
vpandn ymm2,ymm3,ymm2
vpor ymm2,ymm4,ymm2
vmovupd [rdx],ymm2
add rbp,20
add rdx,20
add esi,0FFFFFFF0
cmp esi,10
jge short M01_L03
M01_L04:
test esi,esi
jle short M01_L08
nop word ptr [rax+rax]
M01_L05:
movzx eax,word ptr [rbp]
mov rcx,rdx
cmp eax,edi
je short M01_L06
jmp short M01_L07
M01_L06:
mov eax,ebx
M01_L07:
mov [rcx],ax
add rbp,2
add rdx,2
dec esi
test esi,esi
jg short M01_L05
M01_L08:
mov rax,r14
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
; Total bytes of code 307AndyAyersMS commented 2 hours ago
(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )
Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes.
For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method).