Optimize gf N-vect dot product SVE functions #367

Conversation
This is intended to replace #349. Performance data is attached. This change significantly improves performance on Graviton4 (the Neoverse-V2), with most benchmarks gaining 28-32% over my previous proposal in #349. I tested GCC 10-15; while there is some variation between compiler versions, the benefit outweighs the variability. (I didn't include that data to avoid posting too much detail.) The second set of plots shows the performance gain over the baseline (before any of my changes): Graviton4 gains 40-97%. The final set of plots shows the impact on Graviton3 (the Neoverse-V1). There are some regressions in some of the tests, but I would argue the simpler implementation and the gains in other tests make it worth it.
Force-pushed from cdde062 to 2859288
aarch64/erasure_code: SVE intrinsics implementation

Replace hand-written SVE assembly implementations of n-vector dot product functions with optimized SVE intrinsics that provide better performance through 4x loop unrolling.

Key improvements over the original SVE assembly:
- 4x unrolled loops processing 64 bytes per iteration (vs. a single vector)
- Unified implementation supports 1-7 vector operations by using the compiler to generate each version.
- The compiler also generates SVE2 versions of the same functions, which make use of the EOR3 instruction.

The implementation maintains the existing nibble-based Galois Field multiplication with 32-byte lookup tables while adding significant performance optimizations.

reverts: aedcd37

This change also reverts the above commit, which configured systems with an SVE width of 128 bits to use NEON instead. NEON was faster since it had more unrolling, but now the SVE path has the same level of unrolling, and the availability of SVE2 makes that path faster still on systems which support it.

Signed-off-by: Jonathan Swinney <[email protected]>
aarch64: Optimize SVE encode functions to use peak-performance vector combinations

Update both ec_encode_data_sve() and ec_encode_data_sve2() to use optimal 4- and 5-vector combinations based on benchmark results showing these achieve the highest performance.

Key optimizations:
- Loop over 4-vector operations when rows > 7 (peak performance)
- Use a 4+3 combination for 7 vectors instead of a single 7-vector call
- Use a 4+2 combination for 6 vectors instead of a single 6-vector call
- Keep 5-vector for 5 vectors (second-best performance)
- Applies to both SVE and SVE2 variants for consistent optimization

This leverages the benchmark findings that 4- and 5-vector operations achieve 40+ GB/s, significantly better than 6-7 vector operations, which drop to 30-36 GB/s.

Signed-off-by: Jonathan Swinney <[email protected]>
Force-pushed from 2859288 to b01e834
I haven't been able to reproduce the failure on macOS, but I pushed a change which I'm hoping will do the trick.
So #349 can be closed? Thanks
Yes. I just closed it!
@liuqinfei could you review this PR?
Of course.