@EgorBo EgorBo commented Mar 27, 2022

This PR does:

  1. Ports SequenceEqual to the cross-platform vector APIs
  2. Optimizes it on arm64 for inputs of fewer than 16 elements (the path that handles them was guarded with Sse2.IsSupported for some reason)
  3. Optimizes vec1 == vec2 to xor + vptest when available:
bool Test(Vector128<int> v1, Vector128<int> v2) => v1 == v2;
bool Test(Vector256<int> v1, Vector256<int> v2) => v1 == v2;

codegen diff:

; Method Proga:Test
G_M56888_IG01:
       vzeroupper 
G_M56888_IG02:
       vmovupd  xmm0, xmmword ptr [rdx]
-      vpcmpeqd xmm0, xmm0, xmmword ptr [r8]
-      vpmovmskb eax, xmm0
-      cmp      eax, 0xFFFF
+      vpxor    xmm0, xmm0, xmmword ptr [r8]
+      vptest   xmm0, xmm0
       sete     al
       movzx    rax, al
G_M56888_IG03:
       ret      
-; Total bytes of code: 28
+; Total bytes of code: 24


; Method Proga:Test
G_M5176_IG01:
       vzeroupper 
G_M5176_IG02:
       vmovupd  ymm0, ymmword ptr[rdx]
-      vpcmpeqd ymm0, ymm0, ymmword ptr[r8]
-      vpmovmskb eax, ymm0
-      cmp      eax, -1
+      vpxor    ymm0, ymm0, ymmword ptr[r8]
+      vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
G_M5176_IG03:
       vzeroupper 
       ret      
-; Total bytes of code: 29
+; Total bytes of code: 27

However, it seems like in some cases/on some CPUs movmsk is faster 🤔
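
To try both sequences from the diff above on a given CPU, they can be written by hand with the x86 hardware intrinsics. This is only an illustrative sketch: the class and method names are made up for the example, and the JIT is free to emit something different from the listing above.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not part of the PR: spells out the two codegen shapes.
public static class VectorEqualitySketch
{
    // "Before" shape: vpcmpeqd + vpmovmskb + cmp eax, 0xFFFF.
    public static bool EqualsViaMoveMask(Vector128<int> v1, Vector128<int> v2)
    {
        if (!Sse2.IsSupported)
            return v1.Equals(v2); // portable fallback on non-x86 hardware

        Vector128<int> eq = Sse2.CompareEqual(v1, v2);
        // One mask bit per byte: all 16 bits are set only if every lane matched.
        return Sse2.MoveMask(eq.AsByte()) == 0xFFFF;
    }

    // "After" shape: vpxor + vptest (requires SSE4.1).
    public static bool EqualsViaXorTest(Vector128<int> v1, Vector128<int> v2)
    {
        if (!Sse41.IsSupported)
            return v1.Equals(v2); // portable fallback

        Vector128<int> diff = Sse2.Xor(v1, v2);
        // TestZ returns true when (diff & diff) is all zeros, i.e. no bit differs.
        return Sse41.TestZ(diff, diff);
    }
}
```

Both methods should return the same result for integer vectors, so they can be dropped into a micro-benchmark to see which form wins on a particular machine.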

cc @tannergooding

PS: it seems the main loop in SequenceEqual is not properly aligned and hits the JCC erratum on every iteration
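
For readers unfamiliar with items 1 and 2, here is a minimal sketch of what a SequenceEqual-style loop over the cross-platform Vector128 API can look like. It is not the code from this PR: the type and method names are hypothetical, the short-input path is a plain scalar loop chosen for simplicity, and it assumes a runtime where `Vector128.IsHardwareAccelerated`, `Vector128.Create(ReadOnlySpan<T>)`, and the `==` operator on `Vector128<T>` are available (.NET 7 or later).

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical sketch, not the runtime's implementation.
public static class SequenceEqualSketch
{
    public static bool BytesEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != b.Length)
            return false;

        int length = a.Length;
        int width = Vector128<byte>.Count; // 16 bytes per vector

        // Short inputs (or no SIMD support): simple scalar comparison.
        if (!Vector128.IsHardwareAccelerated || length < width)
        {
            for (int i = 0; i < length; i++)
            {
                if (a[i] != b[i])
                    return false;
            }
            return true;
        }

        // Main loop: compare one full vector per iteration. The same source
        // works on x64 and arm64 because no ISA-specific class is referenced.
        int lastStart = length - width;
        int offset = 0;
        while (offset < lastStart)
        {
            if (Vector128.Create(a.Slice(offset, width)) != Vector128.Create(b.Slice(offset, width)))
                return false;
            offset += width;
        }

        // Tail: one final vector that may overlap bytes already compared.
        return Vector128.Create(a.Slice(lastStart, width)) == Vector128.Create(b.Slice(lastStart, width));
    }
}
```

The overlapping final load avoids a scalar remainder loop for lengths that are not a multiple of 16; re-comparing a few bytes is harmless for an equality check.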

@ghost ghost assigned EgorBo Mar 27, 2022
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 27, 2022

@tannergooding tannergooding left a comment

LGTM.

@EgorBo EgorBo merged commit c3c0223 into dotnet:main Mar 28, 2022
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
EgorBo added a commit to EgorBo/runtime-1 that referenced this pull request Apr 12, 2022
EgorBo added a commit that referenced this pull request Apr 14, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 27, 2022