Port SequenceEqual to crossplat Vectors, optimize vector compare on x64 #67202

EgorBo · 2022-03-27T14:08:19Z

This PR:

Ports SequenceEqual to cross-platform vectors
Optimizes it on arm64 for inputs with fewer than 16 elements (the path that handles them was guarded by Sse2.IsSupported for some reason)
Optimizes vec1 == vec2 to use xor + vptest where available:

bool Test(Vector128<int> v1, Vector128<int> v2) => v1 == v2;
bool Test(Vector256<int> v1, Vector256<int> v2) => v1 == v2;

codegen diff:

; Method Proga:Test
G_M56888_IG01:
       vzeroupper 
G_M56888_IG02:
       vmovupd  xmm0, xmmword ptr [rdx]
-      vpcmpeqd xmm0, xmm0, xmmword ptr [r8]
-      vpmovmskb eax, xmm0
-      cmp      eax, 0xFFFF
+      vpxor    xmm0, xmm0, xmmword ptr [r8]
+      vptest   xmm0, xmm0
       sete     al
       movzx    rax, al
G_M56888_IG03:
       ret      
-; Total bytes of code: 28
+; Total bytes of code: 24


; Method Proga:Test
G_M5176_IG01:
       vzeroupper 
G_M5176_IG02:
       vmovupd  ymm0, ymmword ptr[rdx]
-      vpcmpeqd ymm0, ymm0, ymmword ptr[r8]
-      vpmovmskb eax, ymm0
-      cmp      eax, -1
+      vpxor    ymm0, ymm0, ymmword ptr[r8]
+      vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
G_M5176_IG03:
       vzeroupper 
       ret      
-; Total bytes of code: 29
+; Total bytes of code: 27

However, it seems like in some cases/on some CPUs movmsk is faster 🤔
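
For readers who want to see the two strategies from the diff side by side, here is a minimal C# sketch. It assumes .NET 7 or later; `VectorEqualitySketch`, `EqualsViaMask`, and `EqualsViaXor` are hypothetical names chosen for illustration and are not the JIT's implementation. It uses `int` lanes, where bitwise equality and element-wise equality coincide.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Illustrative only: hypothetical helpers mirroring the two codegen shapes in the diff.
public static class VectorEqualitySketch
{
    // Old shape: lane-wise compare (vpcmpeqd), gather the byte sign bits
    // (vpmovmskb), then check that all 16 mask bits are set (cmp eax, 0xFFFF).
    public static bool EqualsViaMask(Vector128<int> left, Vector128<int> right)
    {
        Vector128<int> laneMask = Vector128.Equals(left, right);
        return laneMask.AsByte().ExtractMostSignificantBits() == 0xFFFFu;
    }

    // New shape: XOR the inputs (vpxor), then test the result for all-zero (vptest).
    // The XOR is zero exactly when every bit of the two inputs matches.
    public static bool EqualsViaXor(Vector128<int> left, Vector128<int> right)
    {
        Vector128<int> difference = left ^ right;
        if (Sse41.IsSupported)
        {
            // TestZ returns true when (difference & difference) has no bits set.
            return Sse41.TestZ(difference, difference);
        }
        // Fallback for hardware without SSE4.1 (an assumption of this sketch).
        return difference == Vector128<int>.Zero;
    }

    public static void Main()
    {
        Vector128<int> a = Vector128.Create(1, 2, 3, 4);
        Vector128<int> same = Vector128.Create(1, 2, 3, 4);
        Vector128<int> different = Vector128.Create(1, 2, 3, 5);

        // Both strategies should agree on equal and on unequal inputs.
        Console.WriteLine(EqualsViaMask(a, same) == EqualsViaXor(a, same));
        Console.WriteLine(EqualsViaMask(a, different) == EqualsViaXor(a, different));
    }
}
```

Which of the two is faster on a given CPU is not something the sketch can show; it only demonstrates that they compute the same answer for integer lanes.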

cc @tannergooding

PS: it seems the main loop in SequenceEqual is not properly aligned and hits the JCC erratum on each iteration

ghost · 2022-03-27T14:08:28Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	EgorBo
Assignees:	EgorBo
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding

LGTM.

EgorBo added 3 commits March 27, 2022 15:24

Use TestZ for vector cmp

ca43472

Clean up

b6e1551

Clean up

285f9df

ghost assigned EgorBo Mar 27, 2022

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Mar 27, 2022

tannergooding approved these changes Mar 27, 2022

EgorBo merged commit c3c0223 into dotnet:main Mar 28, 2022

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022

Port SequenceEqual to crossplat Vectors, optimize vector compare on x64 (dotnet#67202)

624b1ab

This was referenced Apr 5, 2022

Regressions in SequenceEqual #67596

Closed

[Perf] Changes at 3/28/2022 10:41:36 AM dotnet/perf-autofiling-issues#4363

Closed

EgorBo added a commit to EgorBo/runtime-1 that referenced this pull request Apr 12, 2022

Revert dotnet#67202

b04ebef

This was referenced Apr 12, 2022

Revert Vector.Equals optimization #67202 #67902

Merged

[Perf] Changes at 3/28/2022 10:41:36 AM dotnet/perf-autofiling-issues#4301

Closed

[Perf] Changes at 3/28/2022 10:41:36 AM dotnet/perf-autofiling-issues#4288

Closed

EgorBo added a commit that referenced this pull request Apr 14, 2022

Revert #67202 (#67902)

489b034

kunalspathak mentioned this pull request Apr 22, 2022

Regressions in System.Globalization.Tests #68409

Closed

ghost locked as resolved and limited conversation to collaborators Apr 27, 2022
