Conversation

@AWSjswinney (Contributor)
aarch64/erasure_code: SVE intrinsics implementation

Replace hand-written SVE assembly implementations of n-vector dot
product functions with optimized SVE intrinsics that provide better
performance through 4x loop unrolling.

Key improvements over the original SVE assembly:

  • 4x unrolled loops processing 64 bytes per iteration (vs single vector)
  • Unified implementation supports 1-7 vector operations by using the
    compiler to generate each version.
  • The compiler also generates SVE2 versions of the same functions which
    make use of the EOR3 instruction.

The implementation maintains the existing nibble-based Galois Field
multiplication with 32-byte lookup tables while adding significant
performance optimizations.

reverts: aedcd37
This change also reverts the above commit, which configured systems with
an SVE width of 128 bits to use the NEON path instead. NEON was faster
because it had more unrolling, but the SVE path now has the same level of
unrolling, and the availability of SVE2 makes that path faster still on
systems which support it.

aarch64: Optimize SVE encode functions to use peak-performance vector combinations

Update both ec_encode_data_sve() and ec_encode_data_sve2() to use optimal
4 and 5 vector combinations based on benchmark results showing these
achieve the highest performance.

Key optimizations:

  • Loop over 4-vector operations when rows > 7 (peak performance)
  • Use 4+3 combination for 7 vectors instead of single 7-vector call
  • Use 4+2 combination for 6 vectors instead of single 6-vector call
  • Keep 5-vector for 5 vectors (second-best performance)
  • Applies to both SVE and SVE2 variants for consistent optimization

This leverages the benchmark findings that 4 and 5 vector operations
achieve 40+ GB/s performance, significantly better than 6-7 vector
operations which drop to 30-36 GB/s.

@AWSjswinney (Contributor, Author)

This is intended to replace #349.

Performance data is attached. This change significantly improves performance on Graviton4 (the Neoverse-V2), with most benchmarks gaining +28-32% over my previous proposal in #349. I tested GCC 10-15; there is some variation between compiler versions, but the benefit outweighs the variability. (I didn't include that data to avoid posting too much detail.)

The second set of plots shows the performance gain from the baseline (before any of my changes). Graviton4 gets +40-97%.

The final set of plots shows the impact on Graviton3 (the Neoverse-V1). There are regressions in some of the tests, but I would argue the simpler implementation and the gains in other tests make it worth it.

11b9d9f7-db9e-4acf-a508-f31717aa7f27.pdf

Commit: aarch64/erasure_code: SVE intrinsics implementation (full message above)

Signed-off-by: Jonathan Swinney <[email protected]>
Commit: aarch64: Optimize SVE encode functions to use peak-performance vector combinations (full message above)

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney force-pushed the jswinney/2025-10-14-sve2-pr branch from 2859288 to b01e834 on October 16, 2025, 18:51
@AWSjswinney (Contributor, Author)

I haven't been able to reproduce the failure on macOS, but I pushed a change which I'm hoping will do the trick.

@pablodelara (Contributor)

> This is intended to replace #349. Performance data is attached. [...]

So #349 can be closed? Thanks

@AWSjswinney (Contributor, Author)

> So #349 can be closed? Thanks

Yes. I just closed it!

@pablodelara (Contributor)

@liuqinfei could you review this PR?

@liuqinfei (Contributor)

> @liuqinfei could you review this PR?

Of course.
