Hello, would you consider adopting AMX (Intel Advanced Matrix Extensions) to accelerate inner-product computation? For FP32 inputs, AMX is estimated to give roughly a 10% performance improvement over AVX-512; if BF16 (Brain Floating Point 16) inputs are also supported, the speedup over AVX-512 could reach roughly 1.8x.
If so, I will prepare a pull request for this optimization; a rough sketch of the BF16 kernel is included below for reference. Thanks very much!
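
For context, here is a minimal sketch of what an AMX/BF16 inner-product kernel could look like. It is not taken from this repository and is not the proposed PR: the fixed dimension (32), batch size (16), tile numbering, and all helper names (`amx_ip_1xN`, `request_amx_permission`, `fp32_to_bf16`) are my own assumptions for illustration. It scores one BF16 query against 16 BF16 database vectors with a single `TDPBF16PS`, accumulating in FP32, and assumes a Sapphire Rapids or newer CPU, Linux 5.16+, and GCC 11+/Clang 12+ built with `-mamx-tile -mamx-bf16`.

```c++
// Illustrative AMX BF16 inner-product sketch (assumed build: g++ -O2 -mamx-tile -mamx-bf16).
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr int DIM = 32;   // vector dimension: 32 BF16 values fill one 64-byte tile row
constexpr int NB  = 16;   // database vectors scored per TDPBF16PS

// 64-byte tile configuration block consumed by LDTILECFG / _tile_loadconfig.
struct alignas(64) TileCfg {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per tile row
    uint8_t  rows[16];    // rows per tile
};

// Linux: ask the kernel for permission to use the AMX tile-data state (kernel 5.16+).
static bool request_amx_permission() {
    constexpr long ARCH_REQ_XCOMP_PERM = 0x1023;
    constexpr long XFEATURE_XTILEDATA  = 18;
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
}

// FP32 -> BF16 conversion with round-to-nearest-even.
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFF + ((bits >> 16) & 1);
    return static_cast<uint16_t>(bits >> 16);
}

// Inner product of one BF16 query against NB BF16 database vectors.
// The database is re-packed into VNNI layout so that row k of tile B holds the
// element pairs (db[n][2k], db[n][2k+1]) for n = 0..NB-1.
void amx_ip_1xN(const uint16_t* query, const uint16_t db[NB][DIM], float out[NB]) {
    alignas(64) uint16_t b_vnni[DIM / 2][2 * NB];
    for (int n = 0; n < NB; ++n)
        for (int k = 0; k < DIM / 2; ++k) {
            b_vnni[k][2 * n + 0] = db[n][2 * k + 0];
            b_vnni[k][2 * n + 1] = db[n][2 * k + 1];
        }

    TileCfg cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 1;        cfg.colsb[0] = NB * 4;   // tile 0: C, 1 x 16 FP32 results
    cfg.rows[1] = 1;        cfg.colsb[1] = DIM * 2;  // tile 1: A, 1 x 32 BF16 (query)
    cfg.rows[2] = DIM / 2;  cfg.colsb[2] = NB * 4;   // tile 2: B, 16 rows of 16 BF16 pairs
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, query, DIM * 2);                   // row stride in bytes (single row)
    _tile_loadd(2, b_vnni, 2 * NB * sizeof(uint16_t));
    _tile_dpbf16ps(0, 1, 2);                          // C += A * B with FP32 accumulation
    _tile_stored(0, out, NB * sizeof(float));
    _tile_release();
}

int main() {
    if (!request_amx_permission()) {
        std::fprintf(stderr, "AMX tile data not available\n");
        return 1;
    }
    alignas(64) uint16_t q[DIM], db[NB][DIM];
    for (int d = 0; d < DIM; ++d) q[d] = fp32_to_bf16(1.0f);
    for (int n = 0; n < NB; ++n)
        for (int d = 0; d < DIM; ++d) db[n][d] = fp32_to_bf16(float(n));
    float out[NB];
    amx_ip_1xN(q, db, out);
    std::printf("ip(query, db[3]) = %f (expected %d)\n", out[3], 3 * DIM);
    return 0;
}
```

In a real kernel the database vectors would be re-packed into the VNNI layout once at index-build time rather than per query, larger dimensions would loop over 32-element chunks while accumulating in the same FP32 tile, and the tile configuration would be loaded once per thread instead of per call.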