Some questions about general compute on NPU #44

zlaazlaa · 2025-06-09T08:41:55Z

zlaazlaa
Jun 9, 2025

Hi, I'm currently exploring general-purpose computation on the NPU, such as matrix multiplication. I tried generating a minimal ONNX model containing only a MatMul operation, then converting it into a QNN model to run on a Snapdragon 8 Elite. However, the computed results are too imprecise to be acceptable. Additionally, using this model-based approach for such basic operations feels counterintuitive and inefficient.

I'm looking for a way to perform matrix multiplication directly using the QNN APIs or just Hexagon SDK, bypassing the ONNX model path. Unfortunately, the QNN documentation only describes how to run complete models and doesn’t seem to support standalone function-level invocations. Many search results on this topic pointed me to your repository, where I saw that you’ve done extensive work on Snapdragon NPU—very impressive and inspiring.

So far, I’ve found the official QNN documentation for the MatMul op here: QNN MatMul op definition. But from what I understand, using this op still requires building a model, and doesn’t allow direct usage in a standalone C/C++ program.

Also, as mentioned in another discussion, it seems Qualcomm doesn’t publicly provide HMX, but HVX can potentially be used to implement matrix multiplication indirectly. I noticed that your repo includes such an implementation: ggml-hexagon/mulmat.c. This appears to be built on the Hexagon SDK, not QNN SDK (just in my understand), and seems to implement matrix multiplication at the DSP layer.

However, I didn't see any .idl files in your codebase, which is used in the Hexagon SDK example calculator_c++_apk, though that example itself is a bit unclear to me—it doesn’t seem to involve HVX or explain how to utilize HVX extensions. The official documentation also lacks clarity on this topic.

Lastly, regarding your code in ggml/src/ggml-hexagon: when running it, is it still necessary to link against libraries like libQnnHtp.so, libQnnHtpPrepare.so, or libQnnHtpV79Stub.so? In other words, if compiled independently, is it possible to expose NPU-based matrix compute as a standalone C++ library—just like a typical external computation library?

I’d greatly appreciate any advice you could share on how to directly invoke the computing power of the NPU.

jeffzhou2000 · 2025-06-09T09:35:56Z

jeffzhou2000
Jun 9, 2025
Maintainer

thanks for your interest in this project.

So far, I’ve found the official QNN documentation for the MatMul op here: QNN MatMul op definition. But from what I understand, using this op still requires building a model, and doesn’t allow direct usage in a standalone C/C++ program.

pls refer to ggmlqnn_compute_mul_mat in this project: https://github.com/zhouwg/ggml-hexagon/blob/self-build/ggml/src/ggml-hexagon/ggml-hexagon.cpp#L4437-L4639

you will thoroughly understand how to using C API provided in QNN SDK after you fully understand this function. one more thing, you will understand the core principle of QNN SDK, especially the key-point principle of Qualcomm's highly-designed/very complicated AI-Hub tech stack.

Also, as mentioned in another discussion, it seems Qualcomm doesn’t publicly provide HMX, but HVX can potentially be used to implement matrix multiplication indirectly. I noticed that your repo includes such an implementation: ggml-hexagon/mulmat.c. This appears to be built on the Hexagon SDK, not QNN SDK (just in my understand), and seems to implement matrix multiplication at the DSP layer.

pls refer to high-level data path of ggml-hexagon #33, then you will understand what I did in this project(provide two different approaches in this project: HWACCEL_QNN(offload ggmlop to QNN SDK), HWACCEL_CDSP(offload ggmlop to Hexagon cDSP(aka NPU) directly)
Qualcomm didn't provide public tech docs and instructions of HMX(Hexagon Matrix eXtension) technology currently.
I personally think HMX enabled matrix mulmat is much faster than HVX enabled matrix mulmat.
the ggml-hexagon/mulmat.c is a reference implementation via HWACCEL_CDSP and the performance of that reference implementation is really bad, the fully implementation of mulmat via HWACCEL_CDSP is not open-sourced at the moment. pls refer to [WIP]big matrix mulmat on Hexagon cDSP #32

However, I didn't see any .idl files in your codebase, which is used in the Hexagon SDK example calculator_c++_apk, though that example itself is a bit unclear to me—it doesn’t seem to involve HVX or explain how to utilize HVX extensions. The official documentation also lacks clarity on this topic.

the stub.c on ARM-AP side and the skel.c on cDSP side are exactly produced by a specified .idl file and Qualcomm's dedicated binary tools.
the approach here(you didn't see any .idl in this project) is to simplify the workflow to avoid introduce complicated/redundant build script to this project, because that's not the key-point. one more thing, I had suggested that the top talent software engineers in Qualcomm's Hexagon SDK team can refer to some design principles in TEE tech stack and provide a flexible way of exchange necessary data between ARM-AP and cDSP to developers, because I personally don't think the idl mechanism in the existing Hexagon SDK is the fastest RPC approach.
pls refer to high-level data path of ggml-hexagon #33 again. then you will find that the approach here is similar to what Qualcomm did in their excellent highly-designed and highly-uniform QNN SDK.

Lastly, regarding your code in ggml/src/ggml-hexagon: when running it, is it still necessary to link against libraries like libQnnHtp.so, libQnnHtpPrepare.so, or libQnnHtpV79Stub.so? In other words, if compiled independently, is it possible to expose NPU-based matrix compute as a standalone C++ library—just like a typical external computation library?

for HWACCEL_QNN(offload ggmlop to QNN SDK), link against QNN runtime libs is not required because Qualcomm's QNN SDK is a highly-designed SDK(uniform C API on Linux/Android/Windows/QNX), the authors of QNN SDK provides dlopen/dlsym mechanism to avoid link against QNN runtime libs in user's QNN-based project.
for HWACCEL_CDSP(offload ggmlop to Hexagon cDSP directly), apparently all QNN runtime libs are not needed(that's one of the key-reasons why I suspended R&D activities of this project since 04/24/2025, although I re-launched the R&D activities of this project after 05/27/2025). as I said in my previous post, this is another unexpected big advantage in HWACCEL_CDSP approach.
YES, you can see this scenario in this post:https://github.com/zhouwg/ggml-hexagon/discussions/43, they are all prebuilt binary libs which running on cDSP/NPU directly(this is exactly similar to what Qualcomm did in their excellent QNN SDK, pls refer to high-level data path of ggml-hexagon #33 again).

I’d greatly appreciate any advice you could share on how to directly invoke the computing power of the NPU.

the following source code is a good and simple example to help you to understand how to directly invoke the computing power of the Hexagon NPU:
https://github.com/zhouwg/ggml-hexagon/blob/self-build/ggml/src/ggml-hexagon/kernels/add.c

3 replies

zlaazlaa Jun 9, 2025
Author

Thanks for your detailed responses and answers!

pls refer to ggmlqnn_compute_mul_mat in this project: https://github.com/zhouwg/ggml-hexagon/blob/self-build/ggml/src/ggml-hexagon/ggml-hexagon.cpp#L4437-L4639

The way you've integrated various QNN graph operations further confirms that QNN is fundamentally designed to serve higher-level model workflows.

pls refer to #33, then you will understand what I did in this project(provide two different approaches in this project: HWACCEL_QNN(offload ggmlop to QNN SDK), HWACCEL_CDSP(offload ggmlop to Hexagon cDSP(aka NPU) directly)

I also agree that, in the context of ggml and my specific use case, directly using the Hexagon SDK is more appropriate than relying on QNN. QNN seems more oriented toward higher-level model execution rather than low-level operations like these. Using the Hexagon SDK not only feels more intuitive but also appears to offer greater flexibility. Sure, if directly use QNN, we will get professional performance optimization from Qualcomm. But I'm very interested in your mention that HWACCEL_CDSP is even faster than HWACCEL_QNN.

the ggml-hexagon/mulmat.c is a reference implementation via HWACCEL_CDSP and the performance of that reference implementation is really bad, the fully implementation of mulmat via HWACCEL_CDSP is not open-sourced at the moment. pls refer to #32

As for HVX-accelerated matrix multiplication, that’s indeed a major topic. I’m genuinely interested in that area, but as a beginner to Hexagon, my immediate priority is just getting the code to run. Programming for cDSP is proving to be more challenging than I initially expected — XD.

By the way, when you implemented matrix multiplication on the DSP, did you run into any issues with numerical precision? In a prior experiment, I performed matrix multiplication indirectly using a quantized ONNX model running on the NPU, and the results were significantly off. The root cause turned out to be that QNN only supports quantized models on the NPU.
While this might not be relevant when using the Hexagon SDK directly—since it doesn't involve quantization—I still want to know about it in advance.

Lastly, your implementation of HWACCEL_CDSP is a great reference. I’ll definitely look into it further.

jeffzhou2000 Jun 9, 2025
Maintainer

Thanks for your detailed responses and answers!

pls refer to ggmlqnn_compute_mul_mat in this project: https://github.com/zhouwg/ggml-hexagon/blob/self-build/ggml/src/ggml-hexagon/ggml-hexagon.cpp#L4437-L4639

The way you've integrated various QNN graph operations further confirms that QNN is fundamentally designed to serve higher-level model workflows.

YES, the highly-designed QNN C API is one of the fundamental techs of Qualcomm's AI-Hub tech stack(from QNNSample to Genie in QNN SDK, to various binary tools in QNN SDK, to high-level APIs in AI-Hub tech stack).

One more thing,

I personally don't think Qualcomm will deprecate/abandon the highly-designed QNN SDK in the future although it's not easy to use. Qualcomm is a commercial IT giant, the logic of a commercial company is obviously not exactly the same as the logic of an open source project.
QNN solution is a highly-designed/highly-uniform commercial End-to-End solution, llama.cpp is a highly-designed/highly-uniform open-source on-device inference solution, they are both excellent and can be used in different scenarios.
Let's see whether Qualcomm will open their HMX tech docs and instructions in the future, I personally think this will be a significant signal of big change in NPU backend for llama.cpp.
I personally hope Qualcomm can release a new Hexagon SDK although they decide to not open their HMX tech in the future, because I found the latest QNN SDK(2.35.0.250530) can support hexagon-v81(I guess that's the 8Elite2 which might-be released on 09/2025 or 10/2025).

pls refer to #33, then you will understand what I did in this project(provide two different approaches in this project: HWACCEL_QNN(offload ggmlop to QNN SDK), HWACCEL_CDSP(offload ggmlop to Hexagon cDSP(aka NPU) directly)

I also agree that, in the context of ggml and my specific use case, directly using the Hexagon SDK is more appropriate than relying on QNN. QNN seems more oriented toward higher-level model execution rather than low-level operations like these. Using the Hexagon SDK not only feels more intuitive but also appears to offer greater flexibility. Sure, if directly use QNN, we will get professional performance optimization from Qualcomm. But I'm very interested in your mention that HWACCEL_CDSP is even faster than HWACCEL_QNN.

Why the HWACCEL_CDSP is faster than HWACCEL_QNN, because:

HWACCEL_CDSP(offload some performance-sensitive ggml op to Hexagon cDSP directly) is exactly similar to Intel's ggml-sycl and ggml-cuda and they both claimed 4x-10x performance gains than pure CPU inference.
tensor data must be transferred or exchanged between ARM-CPU and cDSP regardless of tech approaches. FastRPC must be utilized regardless of tech approach. the overhead of FastRPC should be exist in various tech approaches, I personally think the overhead through cDSP directly might-be minimum:

datapath through QNN:

user code(ggml-hexagon backend through QNN)  <------> QNN API <------> QNN SDK <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> Hexagon nn libs on cDSP

datapath through cDSP directly:

user code(ggml-hexagon backend through cDSP, similar to  TEE CA) <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> user code on cDSP(hexagon kernels, similar to TEE TA, or opencl kernels in ggml-opencl, or cuda kernels in other backends)

the ggml-hexagon/mulmat.c is a reference implementation via HWACCEL_CDSP and the performance of that reference implementation is really bad, the fully implementation of mulmat via HWACCEL_CDSP is not open-sourced at the moment. pls refer to #32

As for HVX-accelerated matrix multiplication, that’s indeed a major topic. I’m genuinely interested in that area, but as a beginner to Hexagon, my immediate priority is just getting the code to run. Programming for cDSP is proving to be more challenging than I initially expected — XD.

General programming on cDSP is not very hard(background knowledge is important and hard, e.g. AI related knowledge in this topic) and this project is a good example because there are no complicated/redundant encapsulation in this project, and there are many examples in Qualcomm Hexagon SDK. in the fact, the concepts of HVX programming are exactly similar to x86 MMX/SSE(I'm not familiar with AVX or other advanced SIMD programming) or ARM neon. of course, there are some tricks in how to use VTCM memory.

By the way, when you implemented matrix multiplication on the DSP, did you run into any issues with numerical precision? In a prior experiment, I performed matrix multiplication indirectly using a quantized ONNX model running on the NPU, and the results were significantly off. The root cause turned out to be that QNN only supports quantized models on the NPU. While this might not be relevant when using the Hexagon SDK directly—since it doesn't involve quantization—I still want to know about it in advance.

Currently the codes on cDSP side only support fp32 matrix mulmat(my second goal is that the performance of fp32 mulmat is faster than the default ggml backend, I personally think this is a basis progress of so-called ggml-hexagon backend for llama.cpp), so there are no such issues at the moment.

Lastly, your implementation of HWACCEL_CDSP is a great reference. I’ll definitely look into it further.

Thanks and best wishes for you.

zlaazlaa Jun 10, 2025
Author

Thanks for your detailed responses and answers!

pls refer to ggmlqnn_compute_mul_mat in this project: https://github.com/zhouwg/ggml-hexagon/blob/self-build/ggml/src/ggml-hexagon/ggml-hexagon.cpp#L4437-L4639

The way you've integrated various QNN graph operations further confirms that QNN is fundamentally designed to serve higher-level model workflows.

YES, the highly-designed QNN C API is one of the fundamental techs of Qualcomm's AI-Hub tech stack(from QNNSample to Genie in QNN SDK, to various binary tools in QNN SDK, to high-level APIs in AI-Hub tech stack).

One more thing,

I personally don't think Qualcomm will deprecate/abandon the highly-designed QNN SDK in the future although it's not easy to use. Qualcomm is a commercial IT giant, the logic of a commercial company is obviously not exactly the same as the logic of an open source project.

QNN solution is a highly-designed/highly-uniform commercial End-to-End solution, llama.cpp is a highly-designed/highly-uniform open-source on-device inference solution, they are both excellent and can be used in different scenarios.

Let's see whether Qualcomm will open their HMX tech docs and instructions in the future, I personally think this will be a significant signal of big change in NPU backend for llama.cpp.

I personally hope Qualcomm can release a new Hexagon SDK although they decide to not open their HMX tech in the future, because I found the latest QNN SDK(2.35.0.250530) can support hexagon-v81(I guess that's the 8Elite2 which might-be released on 09/2025 or 10/2025).

pls refer to #33, then you will understand what I did in this project(provide two different approaches in this project: HWACCEL_QNN(offload ggmlop to QNN SDK), HWACCEL_CDSP(offload ggmlop to Hexagon cDSP(aka NPU) directly)

I also agree that, in the context of ggml and my specific use case, directly using the Hexagon SDK is more appropriate than relying on QNN. QNN seems more oriented toward higher-level model execution rather than low-level operations like these. Using the Hexagon SDK not only feels more intuitive but also appears to offer greater flexibility. Sure, if directly use QNN, we will get professional performance optimization from Qualcomm. But I'm very interested in your mention that HWACCEL_CDSP is even faster than HWACCEL_QNN.

Why the HWACCEL_CDSP is faster than HWACCEL_QNN, because:

HWACCEL_CDSP(offload some performance-sensitive ggml op to Hexagon cDSP directly) is exactly similar to Intel's ggml-sycl and ggml-cuda and they both claimed 4x-10x performance gains than pure CPU inference.

tensor data must be transferred or exchanged between ARM-CPU and cDSP regardless of tech approaches. FastRPC must be utilized regardless of tech approach. the overhead of FastRPC should be exist in various tech approaches, I personally think the overhead through cDSP directly might-be minimum:

datapath through QNN:
user code(ggml-hexagon backend through QNN)  <------> QNN API <------> QNN SDK <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> Hexagon nn libs on cDSP
datapath through cDSP directly:
user code(ggml-hexagon backend through cDSP, similar to  TEE CA) <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> user code on cDSP(hexagon kernels, similar to TEE TA, or opencl kernels in ggml-opencl, or cuda kernels in other backends)
the ggml-hexagon/mulmat.c is a reference implementation via HWACCEL_CDSP and the performance of that reference implementation is really bad, the fully implementation of mulmat via HWACCEL_CDSP is not open-sourced at the moment. pls refer to #32

As for HVX-accelerated matrix multiplication, that’s indeed a major topic. I’m genuinely interested in that area, but as a beginner to Hexagon, my immediate priority is just getting the code to run. Programming for cDSP is proving to be more challenging than I initially expected — XD.

General programming on cDSP is not very hard(background knowledge is important and hard, e.g. AI related knowledge in this topic) and this project is a good example because there are no complicated/redundant encapsulation in this project, and there are many examples in Qualcomm Hexagon SDK. in the fact, the concepts of HVX programming are exactly similar to x86 MMX/SSE(I'm not familiar with AVX or other advanced SIMD programming) or ARM neon. of course, there are some tricks in how to use VTCM memory.

By the way, when you implemented matrix multiplication on the DSP, did you run into any issues with numerical precision? In a prior experiment, I performed matrix multiplication indirectly using a quantized ONNX model running on the NPU, and the results were significantly off. The root cause turned out to be that QNN only supports quantized models on the NPU. While this might not be relevant when using the Hexagon SDK directly—since it doesn't involve quantization—I still want to know about it in advance.

Currently the codes on cDSP side only support fp32 matrix mulmat(my second goal is that the performance of fp32 mulmat is faster than the default ggml backend, I personally think this is a basis progress of so-called ggml-hexagon backend for llama.cpp), so there are no such issues at the moment.

Lastly, your implementation of HWACCEL_CDSP is a great reference. I’ll definitely look into it further.

Thanks and best wishes for you.

Yes, I'm also looking forward to Qualcomm opening up HMX. Of course, I'm even more looking forward to Hexagon-v81 bringing even greater performance improvements - the recent generations of NPU have all seen very obvious enhancements

Thank you for providing the excellent Hexagon example, which made it easier for me to get started.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Some questions about general compute on NPU #44

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Some questions about general compute on NPU #44

Uh oh!

zlaazlaa Jun 9, 2025

Replies: 1 comment · 3 replies

Uh oh!

Uh oh!

jeffzhou2000 Jun 9, 2025 Maintainer

Uh oh!

zlaazlaa Jun 9, 2025 Author

Uh oh!

Uh oh!

jeffzhou2000 Jun 9, 2025 Maintainer

Uh oh!

zlaazlaa Jun 10, 2025 Author

zlaazlaa
Jun 9, 2025

Replies: 1 comment 3 replies

jeffzhou2000
Jun 9, 2025
Maintainer

zlaazlaa Jun 9, 2025
Author

jeffzhou2000 Jun 9, 2025
Maintainer

zlaazlaa Jun 10, 2025
Author