Hello there!
I have been trying to follow your instructions to get KleidiAI int4 kernels working on a Scaleway ARM instance (4x16), but I'm still encountering issues.
I've done the following:
- Built and installed KleidiAI (the library is installed at /usr/local/lib/libkleidiai.a)
- Built torchao with the flags you mentioned:
  USE_CPP=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 pip install .
However, when I try to run code that uses the KleidiAI kernels, I get this error:
AttributeError: '_OpNamespace' 'torchao' object has no attribute '_pack_8bit_act_4bit_weight'
Exception: TorchAO experimental kernels are not loaded. To install the kernels, run `USE_CPP=1 pip install .` from ao on a machine with an ARM CPU. You can also set target to 'aten' if you are using ARM CPU.
My CPU definitely has the required ARM features (verified with /proc/cpuinfo):
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Including: asimd (NEON), asimddp (Dot Product), etc.
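For anyone who wants to repeat the feature check programmatically, here is a minimal sketch. It assumes a Linux aarch64 machine where /proc/cpuinfo has a "Features" line like the one above. The helper names (`cpu_features`, `report`) and the `WANTED` tuple are made up for illustration and are not part of torchao or KleidiAI.

```python
from pathlib import Path

# Flags named in this report: asimd (NEON) and asimddp (dot product).
# Hypothetical list for illustration; adjust to the kernels you care about.
WANTED = ("asimd", "asimddp")


def cpu_features(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def report(cpuinfo_text):
    """Map each wanted flag to True/False depending on whether it is present."""
    feats = cpu_features(cpuinfo_text)
    return {flag: flag in feats for flag in WANTED}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print(report(path.read_text()))
    else:
        print("/proc/cpuinfo not found (not a Linux machine?)")
```

This only confirms what the hardware advertises; it says nothing about whether the torchao build actually compiled or loaded the kernels.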
I'm particularly interested in the optimizations from the recently merged PR #2000, which added the new KleidiAI kernels for ARM NEON dotprod.
When I run the KleidiAI benchmark, I can see:
kai_matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod/m:64/n:64/k:64/bl:32 SKIPPED: 'GEMV optimized for m=1 only'
kai_matmul_clamp_f32_qai8dxp4x4_qsi4c32p8x4_4x8_neon_dotprod/m:64/n:64/k:64/bl:32 6305 ns 6302 ns 111140
So it seems like the KleidiAI kernels themselves are working, but for some reason the _pack_8bit_act_4bit_weight operator isn't being registered properly in torchao.
Is there something specific I need to do to get the _pack_8bit_act_4bit_weight operator registered? Are there any diagnostic steps I can take to debug this further?
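In case it helps narrow this down, here is a rough diagnostic sketch of what I can run on my side. It checks whether the operator name from the error is visible under `torch.ops.torchao`, and lists any shared libraries found inside the installed torchao package. The helper names (`missing_ops`, `shared_libs`) are hypothetical, and I am assuming the compiled kernels would ship as `.so` files somewhere under the package directory, which may not match how torchao actually lays them out.

```python
import importlib.util
from pathlib import Path


def missing_ops(namespace, op_names):
    """Return the names in op_names that are not attributes of namespace."""
    return [name for name in op_names if not hasattr(namespace, name)]


def shared_libs(package_name):
    """List *.so files under an installed package, without importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return []
    found = []
    for root in spec.submodule_search_locations:
        found.extend(sorted(Path(root).rglob("*.so")))
    return found


if __name__ == "__main__":
    import torch
    # Assumption: importing torchao gives any op libraries it loads
    # a chance to register before we look them up.
    import torchao  # noqa: F401

    ops = ["_pack_8bit_act_4bit_weight"]
    print("missing ops:", missing_ops(torch.ops.torchao, ops))
    libs = shared_libs("torchao")
    print("shared libraries under torchao:", [str(p) for p in libs] or "none found")
```

If no library shows up at all, that would suggest the C++ build step was skipped during `pip install .`. If a library is present but the op is still missing, the problem is more likely in loading or registration.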