Closed
Description
The current arm64 implementation of `find_first_matching_suffix_simd` in the .NET runtime isn't optimized. Since arm64 lacks a direct intrinsic equivalent to `_mm_movemask_epi8`, the bitmask is emulated, which negatively impacts performance:
runtime/src/native/containers/dn-simdhash-arch.h
Lines 93 to 124 in 367cf39
```c
// returns an index in range 0-13 on match, 14-32 if no match
static DN_FORCEINLINE(uint32_t)
find_first_matching_suffix_simd (
	dn_simdhash_search_vector needle,
	// Only used by the vectorized implementations; discarded by scalar.
	dn_simdhash_suffixes haystack
) {
#if defined(__wasm_simd128__)
	return ctz(wasm_i8x16_bitmask(wasm_i8x16_eq(needle.vec, haystack.vec)));
#elif defined(_M_AMD64) || defined(_M_X64) || (_M_IX86_FP == 2) || defined(__SSE2__)
	return ctz(_mm_movemask_epi8(_mm_cmpeq_epi8(needle.m128, haystack.m128)));
#elif defined(__ARM_NEON)
	dn_simdhash_suffixes match_vector;
	// Completely untested.
	static const dn_simdhash_suffixes byte_mask = {
		{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 }
	};
	union {
		uint8_t b[4];
		uint32_t u;
	} msb;
	match_vector.vec = vceqq_u8(needle.vec, haystack.vec);
	dn_simdhash_suffixes masked;
	masked.vec = vandq_u8(match_vector.vec, byte_mask.vec);
	msb.b[0] = vaddv_u8(vget_low_u8(masked.vec));
	msb.b[1] = vaddv_u8(vget_high_u8(masked.vec));
	return ctz(msb.u);
#else
	dn_simdhash_assert(!"Scalar fallback should be in use here");
	return 32;
#endif
}
```
Optimizing this path can reduce the AOT compilation (build) time of the MAUI template app in the Debug configuration on a macOS-arm64 host by ~80% (247,423 ms down to 47,692 ms in the measurements below):
- SIMD emulation implementation:
  - AOT compilation of the dedup assembly: 247,423 ms
  - Isolated lookup for 1,000,000 iterations: 172 ms
- Software lookup implementation (`g_hash_table_lookup`):
  - AOT compilation of the dedup assembly: 47,692 ms
  - Isolated lookup for 1,000,000 iterations: 66 ms
Alternative implementations
- https://github.com/f4exb/cm256cc/blob/master/sse2neon.h
- https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
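As a rough illustration of the kind of alternative the Arm blog post describes, here is a minimal sketch of the shift-right-and-narrow approach: instead of building a 16-bit mask with one bit per byte, it builds a 64-bit mask with one nibble per byte and divides the trailing-zero count by 4. The names `nibble_match_mask`, `first_match_index`, and `self_check` are hypothetical and not part of dn-simdhash. The plain 16-byte array inputs, the `32` no-match sentinel, and the use of `__builtin_ctzll` (GCC/Clang only) are assumptions made for this sketch. The non-NEON branch is only a portable model so the example builds on other hosts.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns a 64-bit mask with one 4-bit group per byte
 * lane, 0xF where needle[i] == haystack[i] and 0x0 elsewhere. */
static uint64_t
nibble_match_mask (const uint8_t needle[16], const uint8_t haystack[16])
{
#if defined(__aarch64__) && defined(__ARM_NEON)
	uint8x16_t eq = vceqq_u8(vld1q_u8(needle), vld1q_u8(haystack));
	/* Shift each 16-bit lane right by 4 and narrow it to 8 bits, so every
	 * 0xFF/0x00 input byte contributes exactly one nibble to the result. */
	uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
	return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
#else
	/* Portable model of the same mask (assumption: only here so the sketch
	 * compiles and runs on non-arm64 hosts). */
	uint64_t mask = 0;
	for (int i = 0; i < 16; i++)
		if (needle[i] == haystack[i])
			mask |= UINT64_C(0xF) << (4 * i);
	return mask;
#endif
}

/* Returns the index (0-15) of the first matching byte, or 32 if none match. */
static uint32_t
first_match_index (const uint8_t needle[16], const uint8_t haystack[16])
{
	uint64_t mask = nibble_match_mask(needle, haystack);
	if (mask == 0)
		return 32; /* ctz of 0 is undefined for the builtin, so handle it first */
	/* Each byte lane occupies 4 bits, so divide the bit position by 4. */
	return (uint32_t)__builtin_ctzll(mask) >> 2;
}

/* Usage example: search a 16-byte haystack for a broadcast needle byte. */
static void
self_check (void)
{
	uint8_t haystack[16], hit[16], miss[16];
	for (int i = 0; i < 16; i++) {
		haystack[i] = (uint8_t)i; /* 0, 1, ..., 15 */
		hit[i] = 5;               /* present at index 5 */
		miss[i] = 200;            /* not present */
	}
	assert(first_match_index(hit, haystack) == 5);
	assert(first_match_index(miss, haystack) == 32);
}
```

In this sketch the match mask comes from a single narrowing shift plus a lane extract, rather than an AND with a constant vector followed by two horizontal adds. Whether that is actually faster inside dn-simdhash would need to be measured against the numbers above.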