Open
Labels
A-autovectorization (Area: Autovectorization, which can impact perf or code size)
A-codegen (Area: Code generation)
C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
S-has-mcve (Status: A Minimal Complete and Verifiable Example has been found for this issue)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
Description
Originally reported in rust-lang/rust-clippy#12826.
Related PR spawned from that issue: #125455
cc @blyxyas
Clamping an i32 and casting it to u8 with clamp(0, 255) as u8 produces unnecessary instructions compared to .max(0).min(255) as u8 when the loop is autovectorized.
Clippy's manual_clamp lint in the beta toolchain warns on the max/min pattern and suggests clamp instead, so following the lint can regress performance.
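Until the codegen difference is resolved, one possible workaround is to keep the max/min form and silence the lint locally. The following is a minimal sketch, not something prescribed by this issue; the function name `clamp_to_u8` is hypothetical.

```rust
/// Hypothetical helper: keeps the `max`/`min` form that vectorizes well
/// and allows the Clippy lint for this function only.
#[allow(clippy::manual_clamp)]
pub fn clamp_to_u8(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        // Same expression as the manual-clamp variant reported in this issue.
        *o = i.max(0).min(255) as u8;
    }
}
```

Scoping the `allow` to a single function keeps the lint active everywhere else in the crate.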
Minimal example
#[inline(never)]
pub fn manual_clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.max(0).min(255) as u8;
    }
}
#[inline(never)]
pub fn clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.clamp(0, 255) as u8;
    }
}
https://rust.godbolt.org/z/zf73jsqjq
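Both functions are expected to compute the same result for every i32, so the difference is purely in code generation. A small sketch of a check for that on boundary values, using hypothetical per-element helpers (`manual_clamp_one`, `ord_clamp_one`, `first_mismatch`) that are not part of the original report:

```rust
/// Per-element version of the manual form (hypothetical helper).
#[allow(clippy::manual_clamp)]
pub fn manual_clamp_one(i: i32) -> u8 {
    i.max(0).min(255) as u8
}

/// Per-element version of the `Ord::clamp` form (hypothetical helper).
pub fn ord_clamp_one(i: i32) -> u8 {
    i.clamp(0, 255) as u8
}

/// Returns the first sample on which the two forms disagree, if any.
pub fn first_mismatch(samples: &[i32]) -> Option<i32> {
    samples
        .iter()
        .copied()
        .find(|&i| manual_clamp_one(i) != ord_clamp_one(i))
}
```

Calling `first_mismatch` on values around 0, 255, and the i32 extremes should return `None`.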
Manual clamp
.LBB0_4:
        movdqu  xmm0, xmmword ptr [rdi + 4*r8]
        packssdw        xmm0, xmm0
        packuswb        xmm0, xmm0
        movdqu  xmm1, xmmword ptr [rdi + 4*r8 + 16]
        packssdw        xmm1, xmm1
        packuswb        xmm1, xmm1
        movd    dword ptr [rdx + r8], xmm0
        movd    dword ptr [rdx + r8 + 4], xmm1
        add     r8, 8
        cmp     rsi, r8
        jne     .LBB0_4
`Ord::clamp`
.LBB0_4:
        movdqu  xmm6, xmmword ptr [rdi + 4*r8]
        movdqu  xmm5, xmmword ptr [rdi + 4*r8 + 16]
        pxor    xmm3, xmm3
        pcmpgtd xmm3, xmm6
        packssdw        xmm3, xmm3
        packsswb        xmm3, xmm3
        pxor    xmm4, xmm4
        pcmpgtd xmm4, xmm5
        packssdw        xmm4, xmm4
        packsswb        xmm4, xmm4
        movdqa  xmm7, xmm6
        pxor    xmm7, xmm0
        movdqa  xmm8, xmm1
        pcmpgtd xmm8, xmm7
        pand    xmm6, xmm8
        pandn   xmm8, xmm2
        por     xmm8, xmm6
        packuswb        xmm8, xmm8
        packuswb        xmm8, xmm8
        pandn   xmm3, xmm8
        movdqa  xmm6, xmm5
        pxor    xmm6, xmm0
        movdqa  xmm7, xmm1
        pcmpgtd xmm7, xmm6
        pand    xmm5, xmm7
        pandn   xmm7, xmm2
        por     xmm7, xmm5
        packuswb        xmm7, xmm7
        packuswb        xmm7, xmm7
        pandn   xmm4, xmm7
        movd    dword ptr [rdx + r8], xmm3
        movd    dword ptr [rdx + r8 + 4], xmm4
        add     r8, 8
        cmp     rsi, r8
        jne     .LBB0_4
Real code examples from functions in the image-webp crate
https://rust.godbolt.org/z/3rnY8d94v
https://rust.godbolt.org/z/53T7n9PGx
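For context on why the manual-clamp loop needs only packssdw and packuswb: these are saturating pack instructions, and chaining them clamps each i32 lane to 0..=255. Below is a scalar sketch of that per-lane behavior. The function names are hypothetical models of a single lane, not real intrinsics.

```rust
/// Scalar model of one `packssdw` lane: i32 -> i16 with signed saturation.
fn packssdw_lane(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Scalar model of one `packuswb` lane: i16 -> u8 with unsigned saturation.
fn packuswb_lane(x: i16) -> u8 {
    x.clamp(0, 255) as u8
}

/// Chaining the two saturating packs, as the manual-clamp loop does per lane.
pub fn pack_i32_to_u8(x: i32) -> u8 {
    packuswb_lane(packssdw_lane(x))
}
```

Under this model, `pack_i32_to_u8(x)` matches `x.clamp(0, 255) as u8` for any i32, which suggests the `Ord::clamp` loop could in principle lower to the same two-instruction sequence.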