
Improve codegen for Vector128.Shift* operations where a direct intrinsic is not available #82564

@MihaZupan


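(This applies to Vector256 as well.)

Consider `Vector128.ShiftRightLogical` on a `Vector128<byte>`. x86 has no logical right-shift instruction that operates on 8-bit lanes; SSE2 and AVX2 only provide `psrlw`, `psrld` and `psrlq` for 16-, 32- and 64-bit lanes: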

```csharp
// Foo: shift written directly on byte lanes
Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0, 4);
```

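This currently emits a scalar fallback: the vector is spilled to the stack and each of the 16 bytes is shifted by a separate helper call inside a loop: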

```asm
TestClass.Foo(Byte ByRef)
    L0000: push rsi
    L0001: sub rsp, 0x40
    L0005: vzeroupper
    L0008: vmovdqu xmm0, [rcx]
    L000c: vmovapd [rsp+0x20], xmm0
    L0012: xor esi, esi
    L0014: lea rcx, [rsp+0x20]
    L0019: movsxd rdx, esi
    L001c: movzx ecx, byte ptr [rcx+rdx]
    L0020: mov edx, 4
    L0025: mov rax, 0x7ffa0845bc60
    L002f: call qword ptr [rax]
    L0031: lea rdx, [rsp+0x30]
    L0036: movsxd rcx, esi
    L0039: mov [rdx+rcx], al
    L003c: inc esi
    L003e: cmp esi, 0x10
    L0041: jl short L0014
    L0043: vmovapd xmm0, [rsp+0x30]
    L0049: vpmovmskb eax, xmm0
    L004d: add rsp, 0x40
    L0051: pop rsi
    L0052: ret
```

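It could instead emit a single 32-bit shift (`vpsrld`) followed by an AND that clears the bits shifted in from the neighboring byte. For a shift by 4 those are the top 4 bits of each byte, hence the mask `0x0F` (in general, `0xFF >> shift`):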

```csharp
// Bar: shift 32-bit lanes, then clear the top 4 bits of each byte,
// which were filled with bits from the next-higher byte
Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
```

```asm
TestClass.Bar(Byte ByRef)
    L0000: vzeroupper
    L0003: vmovdqu xmm0, [rcx]
    L0007: vpsrld xmm0, xmm0, 4
    L000c: vpand xmm0, xmm0, [0x7ffa087600d0]
    L0014: vpmovmskb eax, xmm0
    L0018: ret
```
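A minimal sketch (C#, assuming .NET 7 or later) of the same workaround generalized to any shift count in 0..7 and to the other byte-lane shifts covered by `Vector128.Shift*`. The helper names `ShiftRightLogicalBytes`, `ShiftLeftBytes` and `ShiftRightArithmeticBytes` are hypothetical, not runtime APIs. The arithmetic variant is an assumption about one possible lowering: it does the logical shift, then sign-extends with the common `(y ^ m) - m` trick where `m = 0x80 >> shift`; it is not necessarily what the JIT would emit.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helpers (not runtime APIs): emulate 8-bit lane shifts with a
// 32-bit lane shift plus a mask. Assumes 0 <= shift <= 7.
static Vector128<byte> ShiftRightLogicalBytes(Vector128<byte> value, int shift)
{
    // The top `shift` bits of each byte receive bits from the next-higher byte;
    // the mask 0xFF >> shift clears them.
    Vector128<byte> shifted = Vector128.ShiftRightLogical(value.AsUInt32(), shift).AsByte();
    return shifted & Vector128.Create((byte)(0xFF >> shift));
}

static Vector128<byte> ShiftLeftBytes(Vector128<byte> value, int shift)
{
    // The bottom `shift` bits of each byte receive bits from the next-lower byte;
    // the mask (0xFF << shift) & 0xFF clears them.
    Vector128<byte> shifted = Vector128.ShiftLeft(value.AsUInt32(), shift).AsByte();
    return shifted & Vector128.Create((byte)((0xFF << shift) & 0xFF));
}

static Vector128<sbyte> ShiftRightArithmeticBytes(Vector128<sbyte> value, int shift)
{
    // Logical shift first, then sign-extend from bit (7 - shift):
    // result = (y ^ m) - m with m = 0x80 >> shift, using wrapping byte subtraction.
    Vector128<byte> y = ShiftRightLogicalBytes(value.AsByte(), shift);
    Vector128<byte> m = Vector128.Create((byte)(0x80 >> shift));
    return ((y ^ m) - m).AsSByte();
}

// Brute-force check against scalar byte math for every byte value and shift count.
// Consecutive byte values in each vector make cross-lane pollution visible.
static bool CheckAll()
{
    for (int shift = 0; shift < 8; shift++)
    {
        for (int start = 0; start < 256; start += 16)
        {
            var bytes = new byte[16];
            for (int i = 0; i < 16; i++) bytes[i] = (byte)(start + i);
            Vector128<byte> v = Vector128.Create(bytes);

            Vector128<byte> srl = ShiftRightLogicalBytes(v, shift);
            Vector128<byte> shl = ShiftLeftBytes(v, shift);
            Vector128<sbyte> sra = ShiftRightArithmeticBytes(v.AsSByte(), shift);

            for (int i = 0; i < 16; i++)
            {
                byte b = bytes[i];
                if (srl.GetElement(i) != (byte)(b >> shift)) return false;
                if (shl.GetElement(i) != (byte)(b << shift)) return false;
                if (sra.GetElement(i) != (sbyte)((sbyte)b >> shift)) return false;
            }
        }
    }
    return true;
}

Console.WriteLine(CheckAll()); // True
```

Each logical helper is one 32-bit shift plus one AND, so on x86 it should map to a `vpsrld`/`vpslld` and a `vpand` rather than 16 helper calls, though the exact instructions depend on the JIT version and on whether the shift count is a constant.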

We have a few places in the runtime that are aware of this issue and employ workarounds, e.g.:


Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
