Description
Vector128.ShiftRightLogical has overloads for all primitive numeric types, which suggests that each of them is backed by efficient hardware instructions. This can be a trap: for some types it falls back to a software implementation.
While porting some code from SSE to xplat-intrinsics I wondered why there were movsxd instructions that shouldn't be there, and where a loop came from where none should be, until I remembered that Sse2 has no such instruction...
This table summarizes the CQ for the numeric types (I don't know about AdvSimd, thus left that column out):
| Type | codegen | Sse2-method available |
|---|---|---|
| byte | ❌ | ❌ |
| short | ✔️ | ✔️ |
| int | ✔️ | ✔️ |
| long | ✔️ | ✔️ |
| nint | ✔️ | ❌ |
| nuint | ✔️ | ❌ |
| sbyte | ❌ | ❌ |
| ushort | ✔️ | ✔️ |
| uint | ✔️ | ✔️ |
| ulong | ✔️ | ✔️ |
As can be seen, there's no strict correlation between the codegen for the xplat-ShiftRightLogical and the availability of an Sse2-method (in both directions).
If this is not just a CQ issue in the JIT, then there should be some measure or indicator so that users aren't surprised by the codegen or, more importantly, by the perf.
We now push users towards the xplat-intrinsics (which makes sense IMO), but I think we can't expect every user to inspect the generated machine code.
Generalizing a bit: if there are known methods that aren't really intrinsified (by architecture, by design, ran out of time, etc.), maybe it's possible to provide some kind of hint in IntelliSense, like the one it shows for supported APIs, stating something like:
SSE2 - intrinsified
AdvSimd - software fallback
Would it be enough to put such info in the XML doc comments?
Having an analyzer that issues a suggestion / hint that a software fallback will be used is another option.
To keep this up to date, maybe some kind of metadata would be needed on each Vector128 method that indicates the state of intrinsification. Maybe that's overkill, though.
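To make the metadata idea a bit more concrete, here is a minimal sketch of what such an annotation could look like. Everything in it is hypothetical: `IntrinsicStateAttribute`, `IntrinsicState`, `IntrinsicStateDemo` and `PlaceholderShift` do not exist in the BCL, and the two annotated states only mirror the example text above rather than the real status of any API.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical: possible intrinsification states of a method on one ISA.
public enum IntrinsicState { Intrinsified, SoftwareFallback }

// Hypothetical attribute; nothing like it exists in the BCL today.
[AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]
public sealed class IntrinsicStateAttribute : Attribute
{
    public IntrinsicStateAttribute(string isa, IntrinsicState state)
    {
        Isa = isa;
        State = state;
    }

    public string Isa { get; }
    public IntrinsicState State { get; }
}

public static class IntrinsicStateDemo
{
    /// <remarks>SSE2 - intrinsified; AdvSimd - software fallback (placeholder values).</remarks>
    [IntrinsicState("SSE2", IntrinsicState.Intrinsified)]
    [IntrinsicState("AdvSimd", IntrinsicState.SoftwareFallback)]
    public static void PlaceholderShift() { }

    // What an analyzer or IntelliSense provider could read back via reflection.
    // Sorted by ISA name because attribute order is not guaranteed.
    public static string[] Describe(MethodInfo method) =>
        method.GetCustomAttributes<IntrinsicStateAttribute>()
              .OrderBy(a => a.Isa, StringComparer.Ordinal)
              .Select(a => $"{a.Isa} - {a.State}")
              .ToArray();
}
```

Tooling could then call `IntrinsicStateDemo.Describe(typeof(IntrinsicStateDemo).GetMethod(nameof(IntrinsicStateDemo.PlaceholderShift))!)` to build the hint text. Whether the data lives in attributes, doc comments, or a side file is an open design question.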
Repro
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoSse(Vector128<sbyte> vec)
{
    return Sse2.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplatNaive(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec, 4);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplat(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

produces on .NET 7 RC 2
; Assembly listing for method Program:<<Main>$>g__DoSse|0_0(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C5F972D004 vpsrld xmm0, xmm0, 4
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0013H
C3 ret
; Total bytes of code 20
; Assembly listing for method Program:<<Main>$>g__DoXplatNaive|0_1(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
57 push rdi
56 push rsi
4883EC48 sub rsp, 72
C5F877 vzeroupper
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=000CH
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C5F929442420 vmovapd xmmword ptr [rsp+20H], xmm0
33FF xor edi, edi
G_M000_IG03: ;; offset=0018H
488D4C2420 lea rcx, bword ptr [rsp+20H]
4863D7 movsxd rdx, edi
480FBE0C11 movsx rcx, byte ptr [rcx+rdx]
BA04000000 mov edx, 4
FF1538261A00 call [Scalar`1:ShiftRightLogical(byte,int):byte]
488D542430 lea rdx, bword ptr [rsp+30H]
4863CF movsxd rcx, edi
88040A mov byte ptr [rdx+rcx], al
FFC7 inc edi
83FF10 cmp edi, 16
7CD6 jl SHORT G_M000_IG03
G_M000_IG04: ;; offset=0042H
C5F928442430 vmovapd xmm0, xmmword ptr [rsp+30H]
C5F91106 vmovupd xmmword ptr [rsi], xmm0
488BC6 mov rax, rsi
G_M000_IG05: ;; offset=004FH
4883C448 add rsp, 72
5E pop rsi
5F pop rdi
C3 ret
; Total bytes of code 86
; Assembly listing for method Program:<<Main>$>g__DoXplat|0_2(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C5F972D004 vpsrld xmm0, xmm0, 4
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0013H
C3 ret
; Total bytes of code 20

PS: didn't check Vector256 and the other shift-variants.
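Note that `DoSse` and `DoXplat` in the repro shift 32-bit lanes, so bits cross byte boundaries and the result is not the same as the per-byte shift in `DoXplatNaive`. If a true per-element byte shift is needed without the scalar loop, one common approach is to shift wider lanes and mask off the bits that leaked in from the neighbouring byte. The following is a minimal sketch of that idea; `ByteShiftSketch` and `ShiftRightLogicalPerByte` are hypothetical names, the helper only accepts counts 0 to 7 as a simplification, and the generated code should still be inspected on the target runtime.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ByteShiftSketch
{
    // Hypothetical helper (not a BCL API): per-byte logical right shift
    // built from a 16-bit lane shift followed by a mask.
    public static Vector128<byte> ShiftRightLogicalPerByte(Vector128<byte> vec, int count)
    {
        // Simplification for this sketch: only shift counts 0..7 are accepted.
        if ((uint)count > 7)
            throw new ArgumentOutOfRangeException(nameof(count));

        // Shift 16-bit lanes; the upper byte of each lane leaks bits
        // into the top of the lower byte.
        Vector128<ushort> shifted = Vector128.ShiftRightLogical(vec.AsUInt16(), count);

        // Keep only the low (8 - count) bits of every byte, which clears the leaked bits.
        byte keep = (byte)(0xFF >> count);
        return shifted.AsByte() & Vector128.Create(keep);
    }
}
```

For `Vector128<sbyte>` the same helper can be used as `ByteShiftSketch.ShiftRightLogicalPerByte(vec.AsByte(), 4).AsSByte()`, since a logical shift does not depend on the sign.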
category:cq
theme:vector-codegen
skill-level:beginner
cost:small
impact:small