Conversation

@AWSjswinney (Contributor)
aarch64/erasure_code: SVE intrinsics implementation

Replace hand-written SVE assembly implementations of n-vector dot
product functions with optimized SVE intrinsics that provide better
performance through 4x loop unrolling.

Key improvements over the original SVE assembly:

  • 4x unrolled loops processing 64 bytes per iteration (vs single vector)
  • Unified implementation supports 1-7 vector operations by using the
    compiler to generate each version.
  • The compiler also generates SVE2 versions of the same functions which
    make use of the EOR3 instruction.

The implementation maintains the existing nibble-based Galois Field
multiplication with 32-byte lookup tables while adding significant
performance optimizations.

reverts: aedcd37
This change also reverts the above commit, which configured systems with
an SVE width of 128 bits to use the NEON path instead. NEON was faster
because it had more unrolling, but the SVE path now has the same level of
unrolling, and the availability of SVE2 makes that path faster still on
systems which support it.

aarch64: Optimize SVE encode functions to use peak-performance vector combinations

Update both ec_encode_data_sve() and ec_encode_data_sve2() to use optimal
4 and 5 vector combinations based on benchmark results showing these
achieve the highest performance.

Key optimizations:

  • Loop over 4-vector operations when rows > 7 (peak performance)
  • Use 4+3 combination for 7 vectors instead of single 7-vector call
  • Use 4+2 combination for 6 vectors instead of single 6-vector call
  • Keep 5-vector for 5 vectors (second-best performance)
  • Applies to both SVE and SVE2 variants for consistent optimization

This leverages the benchmark findings that 4 and 5 vector operations
achieve 40+ GB/s performance, significantly better than 6-7 vector
operations which drop to 30-36 GB/s.

@AWSjswinney (Contributor, Author)

This is intended to replace #349.

Performance data is attached. This change significantly improves performance on Graviton4 (the Neoverse-V2), with most benchmarks gaining +28-32% over my previous proposal in #349. I tested GCC 10-15; there is some variation between compiler versions, but the benefit outweighs the variability. (I didn't include that data to avoid posting too much detail.)

The second set of plots shows the performance gain from the baseline (before any of my changes). Graviton4 gets +40-97%.

The final set of plots shows the impact on Graviton3 (the Neoverse-V1). There are regressions in some of the tests, but I would argue the simpler implementation and the gains in other tests make it worth it.

11b9d9f7-db9e-4acf-a508-f31717aa7f27.pdf

Commit: aarch64/erasure_code: SVE intrinsics implementation (full message above)

Signed-off-by: Jonathan Swinney <[email protected]>
Commit: aarch64: Optimize SVE encode functions to use peak-performance vector combinations (full message above)

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney force-pushed the jswinney/2025-10-14-sve2-pr branch from 2859288 to b01e834 on October 16, 2025, 18:51
@AWSjswinney (Contributor, Author)

I haven't been able to reproduce the failure on macOS, but I pushed a change which I'm hoping will do the trick.

@pablodelara (Contributor)

> This is intended to replace #349. Performance data is attached. [...]

So #349 can be closed? Thanks

@AWSjswinney (Contributor, Author)

> So #349 can be closed? Thanks

Yes. I just closed it!

@pablodelara (Contributor)

@liuqinfei could you review this PR?

@liuqinfei (Contributor)

> @liuqinfei could you review this PR?

Of course.
