Vector128.ShiftRightLogical doesn't use machine-instructions for some types (`byte` / `sbyte`) #75770

`Vector128.ShiftRightLogical` has overloads for all primitive numeric types, which suggests that these operations are backed by efficient hardware instructions. That can be a trap for some types, as they fall back to a software implementation.

While porting some code from SSE to the xplat-intrinsics, the generated code made me wonder (why is there a `movsxd` that shouldn't be there, and where does a loop come from when there should be none?) until I remembered that Sse2 has no such instruction for 8-bit elements...

This table summarizes the codegen quality (CQ) for the numeric types (I don't know about AdvSimd, so I left that column out):
| Type | codegen | Sse2-method available |
| -- | -- | -- |
| `byte` | ❌ | ❌ |
| `short` | ✔️ | ✔️ |
| `int` | ✔️ | ✔️ |
| `long` | ✔️ | ✔️ |
| `nint` | ✔️ | ❌ |
| `nuint` | ✔️ | ❌ |
| `sbyte` | ❌ | ❌ |
| `ushort` | ✔️ | ✔️ |
| `uint` | ✔️ | ✔️ |
| `ulong` | ✔️ | ✔️ |

As can be seen there's no strict correlation between codegen for the xplat-ShiftRightLogical and the availability of a Sse2-method (in both directions).

If this is not just a JIT CQ issue, then there should be some measure / indicator so that users aren't surprised by the codegen or, more importantly, by the perf.

We now push users towards the xplat-intrinsics (which makes sense IMO), but I think we can't expect every user to inspect the generated machine code.

Generalizing a bit: if there are known methods that aren't really intrinsified (by architecture, by design, ran out of time, etc.), maybe it's possible to provide some kind of hint in IntelliSense, like it does for supported APIs:
![grafik](https://user-images.githubusercontent.com/5755834/190703302-fdaa91e0-05fb-49c1-91a3-bd5825eb7733.png)

Where it states something like:
```
SSE2 - intrinsified
AdvSimd - software fallback
```
Would it be enough to put such info in the XML doc comments?

Having an analyzer that issues a suggestion / hint that a software fallback will be used is another option.
To keep this up to date, maybe some kind of metadata would be needed on each Vector128 method that indicates the state of intrinsification. Maybe that's overkill, though.

#### Repro

```c#
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoSse(Vector128<sbyte> vec)
{
    return Sse2.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplatNaive(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec, 4);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplat(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}
```
This produces the following on .NET 7 RC 2:

```asm
; Assembly listing for method Program:<<Main>$>g__DoSse|0_0(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20

; Assembly listing for method Program:<<Main>$>g__DoXplatNaive|0_1(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000CH
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F929442420         vmovapd  xmmword ptr [rsp+20H], xmm0
       33FF                 xor      edi, edi

G_M000_IG03:                ;; offset=0018H
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       4863D7               movsxd   rdx, edi
       480FBE0C11           movsx    rcx, byte  ptr [rcx+rdx]
       BA04000000           mov      edx, 4
       FF1538261A00         call     [Scalar`1:ShiftRightLogical(byte,int):byte]
       488D542430           lea      rdx, bword ptr [rsp+30H]
       4863CF               movsxd   rcx, edi
       88040A               mov      byte  ptr [rdx+rcx], al
       FFC7                 inc      edi
       83FF10               cmp      edi, 16
       7CD6                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0042H
       C5F928442430         vmovapd  xmm0, xmmword ptr [rsp+30H]
       C5F91106             vmovupd  xmmword ptr [rsi], xmm0
       488BC6               mov      rax, rsi

G_M000_IG05:                ;; offset=004FH
       4883C448             add      rsp, 72
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

; Total bytes of code 86

; Assembly listing for method Program:<<Main>$>g__DoXplat|0_2(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20
```
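
Note that `DoSse` and `DoXplat` shift 32-bit lanes, so bits cross byte boundaries and the result only matches a per-byte shift once the shifted-in bits are masked off. A minimal sketch of such a workaround follows, assuming a shift count in the range 0..7. The `ByteShift` class and its method are hypothetical names made up for illustration; this is not a BCL API and not necessarily how the JIT would or should handle the `byte` / `sbyte` overloads. For `sbyte` the same helper could be used via `AsByte()` / `AsSByte()`.

```c#
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration only; not part of the BCL.
static class ByteShift
{
    // Per-byte logical right shift. Assumes 0 <= count <= 7.
    public static Vector128<byte> ShiftRightLogical(Vector128<byte> vec, int count)
    {
        // Shift as 16-bit lanes, which is intrinsified (see the `ushort` row in the table).
        Vector128<byte> shifted = Vector128.ShiftRightLogical(vec.AsUInt16(), count).AsByte();

        // Clear the upper `count` bits of every byte; in the low byte of each
        // 16-bit lane they now hold bits shifted in from the neighboring byte.
        Vector128<byte> mask = Vector128.Create((byte)(0xFF >> count));
        return shifted & mask;
    }
}
```

The mask is the only extra work compared to `DoXplat`, so on x64 this should stay within a couple of vector instructions instead of the scalar loop, though I haven't verified the codegen for it.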

PS: I didn't check `Vector256` or the other shift variants.

category:cq
theme:vector-codegen
skill-level:beginner
cost:small
impact:small
