Why HWACCEL_CDSP in ggml-hexagon is a correct/reference implementation rather than a product-level implementation at the moment #28
jeffzhou2000 started this conversation in Ideas
This tech article is closely related to my third formal PR to the upstream llama.cpp project: ggml-org#12326. It is intended to help developers and AI experts understand why offloading ggml ops to the Hexagon cDSP directly is the correct technical direction for the llama.cpp community.
Please support that PR and help get it approved in the master branch of the upstream llama.cpp project if you agree with the opinions in this article. Thanks so much!
Firstly
There are three technical approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU (a hedged sketch of the corresponding configuration enum follows this list):
- HWACCEL_QNN: offload ggml ops to the Hexagon NPU one by one through the QNN SDK.
- HWACCEL_QNN_SINGLEGRAPH: map the entire ggml cgraph to a single QNN graph.
- HWACCEL_CDSP: offload ggml ops to the Hexagon cDSP directly through the Hexagon SDK (FastRPC).
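For orientation, here is a hypothetical C sketch of how the hwaccel_approach knob in scripts/ggml-hexagon.cfg (used later in this article, where 0 selects the QNN approach) could map to these three approaches; the numeric values other than 0 are assumptions, not taken from the PR.

```c
/* Hypothetical mapping of the hwaccel_approach value in scripts/ggml-hexagon.cfg
 * to the three approaches discussed in this article.  Only the value 0 (QNN)
 * is mentioned later in the article; the other values are assumptions. */
typedef enum {
    HWACCEL_QNN             = 0,  /* offload ggml ops one by one through the QNN SDK  */
    HWACCEL_QNN_SINGLEGRAPH = 1,  /* map the entire ggml cgraph to a single QNN graph */
    HWACCEL_CDSP            = 2,  /* offload ggml ops to the Hexagon cDSP directly    */
} ggmlhexagon_hwaccel_approach;
```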
Secondly
After doing some research on Qualcomm's QNN SDK and its other AI software stacks, I personally think that "mapping the entire computational graph to a single QNN graph" is likely the key point, the secret sauce, of "utilizing the Hexagon NPU maximally through the QNN SDK" across all of Qualcomm's AI software stacks:
Fig-1: the key point of why the performance of Qualcomm's official AI Hub or official QNN solution is good enough
Everyone should ask an interesting question: why is the performance of Qualcomm's official AI Hub or official QNN solution good enough?
The QNN SDK internally (and indirectly) calls Qualcomm's Hexagon nn libraries on the cDSP, which are presumably highly optimized with HVX SIMD instructions and HVX multithreading.
QNN internally performs so-called "comprehensive graph optimization" on that single QNN graph (I guess this is where Qualcomm's AI research, for example its quantization techniques, comes in).
As is well known, fixed-point arithmetic is faster than floating-point arithmetic on a DSP (a small sketch follows after this list).
AI experts can clearly see that Qualcomm's official QNN solution is an end-to-end solution (this is the key reason why Qualcomm's senior staff tech expert max-krasnyansky said on 03/18/2025 that "QNN is not the right solution here").
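To make the fixed-point point concrete, here is a minimal, hypothetical C sketch (not taken from QNN or the PR) that contrasts a float multiply with a Q15 fixed-point one; on a DSP the integer version maps to a cheap multiply-and-shift, while the float version may need a slower or emulated floating-point unit.

```c
#include <stdint.h>

/* Q15 fixed point: a value x in [-1, 1) is stored as x * 32768 in an int16_t. */
static inline int16_t float_to_q15(float x)   { return (int16_t)(x * 32768.0f); }
static inline float   q15_to_float(int16_t q) { return (float)q / 32768.0f; }

/* Q15 * Q15 -> Q15: a 16x16 -> 32-bit integer multiply plus a shift,
 * which is a single cheap MAC-style operation on most DSPs. */
static inline int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

int main(void)
{
    float   fa = 0.5f, fb = 0.25f;
    int16_t qa = float_to_q15(fa), qb = float_to_q15(fb);

    float f_result = fa * fb;                       /* floating-point path */
    float q_result = q15_to_float(q15_mul(qa, qb)); /* fixed-point path    */

    return (f_result - q_result) < 0.001f ? 0 : 1;  /* both are ~0.125     */
}
```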
Thirdly
The following are some of the reasons why the HWACCEL_CDSP approach is the correct direction for the llama.cpp community:
This approach (offloading some performance-sensitive ggml ops to the Hexagon cDSP directly) is very similar to Intel's ggml-sycl and Qualcomm's ggml-opencl, both of which claim 4x-10x performance gains over pure CPU inference.
Tensor data must be transferred or exchanged between the ARM CPU and the cDSP regardless of the technical approach; this is an undeniable fact.
FastRPC must be utilized regardless of the technical approach; this is another undeniable fact. At the same time, we can clearly see that the so-called FastRPC mechanism/framework in the QNN SDK and the Hexagon SDK is very similar to the mechanism used in a TEE.
The overhead of FastRPC exists in every technical approach, but I personally think the overhead of going through the cDSP directly is likely the minimum (a host-side sketch appears later in this section):
- datapath through QNN: ggml op → QNN graph construction and QNN runtime on the AP (CPU) side → FastRPC → Hexagon nn libraries on the cDSP
- datapath through cDSP directly: ggml op → FastRPC → hand-written Hexagon kernel on the cDSP
I think this is another key reason why @max-krasnyansky said on 03/18/2025 that "QNN is not the right solution here".
We (the llama.cpp community) cannot re-create Qualcomm's entire dedicated AI stack inside ggml/llama.cpp: mapping all ggml ops and the entire ggml cgraph to a single QNN graph would not be easy even for Qualcomm's world-class engineering team. If I were a regular employee of Qualcomm's AI team, I would prefer to spend the adaptation effort on Intel's SYCL stack (the HWACCEL_SYCL approach); that would be a more practical and desirable direction. Accordingly, offloading some performance-sensitive ggml ops to the cDSP directly is the practical way forward: we only need to focus on the Hexagon kernels, using well-designed algorithms together with HVX instructions and HVX multithreading on the cDSP (a minimal HVX kernel sketch appears later in this section). In fact, the NPU performance through QNN is really bad here (in ggml/llama.cpp), because we cannot use the dedicated binary tools that Qualcomm provides for its own stack.
We (the llama.cpp community) cannot re-create Qualcomm's entire dedicated AI stack in ggml/llama.cpp, and I think the HWACCEL_QNN_SINGLEGRAPH approach is also not a practical direction, because Qualcomm already has a complete QNN solution for its customers (without llama.cpp), and effort in that direction would probably become an endless, money-burning project. I strongly agree with the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible. But I also think we should have a timeline, because such things have happened in the US and in China many times.
AI experts can also offload all ggml ops, or a single entire ggml cgraph, to the cDSP directly, without the overhead of composing a single QNN graph from the entire ggml cgraph on the AP (CPU) side.
Everyone has probably heard the DeepSeek R1 story from 01/2025: one of the highlights is that their excellent engineering team used NVIDIA's low-level APIs directly rather than NVIDIA's high-level CUDA API.
There is a big advantage to the cDSP solution: the QNN API has many limitations; for example, some matrix multiplications cannot be offloaded to QNN at all. These QNN API limitations can be completely avoided in the cDSP solution (offloading ggml ops to the cDSP directly), because we can do anything on the cDSP directly through Qualcomm's lightweight, low-level Hexagon SDK. This big advantage has been verified with test-backend-ops in my local development environment.
There is another, unexpected big advantage to the cDSP solution: as is well known, the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK, so domain experts and AI experts can operate the cDSP hardware directly with it, and there is no QNN version/runtime library conflict issue in the HWACCEL_CDSP approach. This big advantage has been verified with test-backend-ops and llama-cli on a Snapdragon 8 Elite (aka 8 Gen 4) Android phone. It will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's mobile/desktop SoCs.
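To make the "datapath through cDSP directly" concrete (see the FastRPC overhead discussion above), here is a minimal host-side sketch in C. Only the rpcmem allocator comes from the Hexagon SDK; ggmlop.h and ggmlop_dsp_add() are hypothetical names standing in for whatever IDL-generated FastRPC stub the real PR uses.

```c
#include <string.h>
#include "rpcmem.h"   /* rpcmem_alloc()/rpcmem_free() from the Hexagon SDK            */
#include "ggmlop.h"   /* hypothetical header for an IDL-generated FastRPC stub;
                         ggmlop_dsp_add() below is an assumed name, not the PR's API  */

/* Host (AP) side of the direct-cDSP datapath:
 * 1. allocate ION/rpcmem buffers visible to both the CPU and the cDSP,
 * 2. copy the ggml tensor data in,
 * 3. make one FastRPC call into the hand-written Hexagon kernel. */
static int add_on_cdsp(const float *src0, const float *src1, float *dst, int n)
{
    int    nbytes = n * (int)sizeof(float);
    float *buf0 = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    float *buf1 = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    float *bufd = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    if (!buf0 || !buf1 || !bufd) {
        if (buf0) rpcmem_free(buf0);
        if (buf1) rpcmem_free(buf1);
        if (bufd) rpcmem_free(bufd);
        return -1;
    }

    memcpy(buf0, src0, nbytes);
    memcpy(buf1, src1, nbytes);

    /* one FastRPC round trip: CPU -> cDSP kernel -> CPU */
    int err = ggmlop_dsp_add(buf0, nbytes, buf1, nbytes, bufd, nbytes);
    if (err == 0) memcpy(dst, bufd, nbytes);

    rpcmem_free(buf0);
    rpcmem_free(buf1);
    rpcmem_free(bufd);
    return err;
}
```

Because the buffers come from rpcmem/ION, the FastRPC layer can map them into the cDSP address space rather than copying them across the boundary, which is consistent with the article's claim that the direct datapath keeps the FastRPC overhead close to the minimum.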
[updated on 04/12/2025] GGML_OP_ADD through HWACCEL_CDSP is sometimes faster than the default ggml CPU backend on a Snapdragon 8 Elite phone.
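As an illustration of what "focus on the Hexagon kernels through HVX instructions" means in practice, below is a minimal, hypothetical HVX sketch of an element-wise add in the spirit of GGML_OP_ADD. It is not the kernel from the PR: it assumes 128-byte HVX vectors, 128-byte-aligned int32 buffers, and an element count that is a multiple of 32.

```c
#include <stdint.h>
#include "hexagon_types.h"    /* HVX_Vector                   */
#include "hexagon_protos.h"   /* Q6_Vw_vadd_VwVw and friends  */

/* Element-wise int32 add on the cDSP using HVX: one 128-byte HVX vector
 * holds 32 int32 lanes, so every iteration adds 32 elements at once.
 * A production kernel would also handle tails, unaligned data, fp32 and
 * the quantized ggml types, and would split the work across HVX threads. */
static void hvx_add_i32(const int32_t *src0, const int32_t *src1,
                        int32_t *dst, int n)
{
    const HVX_Vector *v0 = (const HVX_Vector *)src0;
    const HVX_Vector *v1 = (const HVX_Vector *)src1;
    HVX_Vector       *vd = (HVX_Vector *)dst;

    for (int i = 0; i < n / 32; ++i) {
        vd[i] = Q6_Vw_vadd_VwVw(v0[i], v1[i]);   /* 32 adds per instruction */
    }
}
```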
[updated on 05/09/2025 (the correct date should be 04/24/2025)] mulmat with 8 threads + HVX SIMD + other optimizations works fine on the cDSP side, but its performance is still slower than the default ggml CPU backend (a row-partitioning sketch of that threading scheme follows below).
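For reference, here is a minimal sketch of the row-partitioned threading scheme implied by the 8-thread mulmat experiment above; it is not the kernel from the PR. For portability the sketch uses pthreads and scalar C, whereas the real cDSP kernel would use the QuRT/worker-pool threading facilities from the Hexagon SDK and HVX in the inner loop.

```c
#include <pthread.h>

/* Row-partitioned f32 matmul sketch: C[M x N] = A[M x K] * B[K x N].
 * Each of the NTHREADS workers computes a contiguous slice of output rows. */
#define NTHREADS 8

typedef struct {
    const float *a, *b;
    float *c;
    int M, N, K;
    int row_begin, row_end;   /* half-open range of output rows for this worker */
} mm_job;

static void *mm_worker(void *arg)
{
    mm_job *j = (mm_job *)arg;
    for (int m = j->row_begin; m < j->row_end; ++m) {
        for (int n = 0; n < j->N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < j->K; ++k) {
                acc += j->a[m * j->K + k] * j->b[k * j->N + n];
            }
            j->c[m * j->N + n] = acc;
        }
    }
    return NULL;
}

static void mm_threaded(const float *a, const float *b, float *c,
                        int M, int N, int K)
{
    pthread_t th[NTHREADS];
    mm_job    job[NTHREADS];
    int rows_per_thread = (M + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        int begin = t * rows_per_thread;
        int end   = (t + 1) * rows_per_thread > M ? M : (t + 1) * rows_per_thread;
        job[t] = (mm_job){ a, b, c, M, N, K, begin, end };
        pthread_create(&th[t], NULL, mm_worker, &job[t]);
    }
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(th[t], NULL);
    }
}
```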
Hexagon NPU Performance
The test phones are a Snapdragon 8 Gen 3 Android phone and a Snapdragon 8 Elite (aka 8 Gen 4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.32.0.250228 and the Hexagon SDK is v6.2.0.1.
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
LLM inference through HWACCEL_CDSP (offload GGML_OP_ADD to the cDSP directly).
LLM inference through HWACCEL_QNN (offload GGML_OP_ADD to the QNN NPU): set hwaccel_approach to 0 (the hwaccel approach through QNN) in scripts/ggml-hexagon.cfg and then run.
We can clearly see (from adb logcat | grep ggml-hexagon) that the NPU performance in real LLM inference is really good, and faster than the QNN solution, when the cDSP RPC ION memory pool is disabled.
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly).
mulmat through HWACCEL_QNN (offload mulmat to the QNN NPU): set hwaccel_approach to 0 (the hwaccel approach through QNN) in scripts/ggml-hexagon.cfg and then run.
We can clearly see (from adb logcat | grep ggml-hexagon) the performance difference for mulmat between HWACCEL_QNN and HWACCEL_CDSP; the NPU performance is really good, and much faster than the QNN solution, when the cDSP RPC ION memory pool is disabled.
[updated on 04/09/2025, 09:19] I suddenly found that QNN NPU performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.

The test phones are a Snapdragon 8 Gen 3 Android phone and a Snapdragon 8 Elite (aka 8 Gen 4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.33.0.250327 and the Hexagon SDK is v6.2.0.1.
I clearly understand that Qualcomm's QNN SDK team is a world-class engineering team: they provide a well-designed and highly uniform AI SDK across Windows, Linux, and Android, and both the high-level QNN SDK and the low-level Hexagon SDK come from Qualcomm. I personally hope the Hexagon SDK team can release a new version that refines the FastRPC framework and removes qidl accordingly: the top software engineers on that team could refer to some of the design principles used in TEEs and give developers a flexible way to exchange the necessary data between the ARM AP and the cDSP. Such a new version of the Hexagon SDK would help NPU performance through the HWACCEL_CDSP approach, and I think a refined FastRPC framework might also help the QNN SDK.
[updated on 05/01/2025, 20:49] Project KanTV is a very good example of why the HWACCEL_CDSP approach is a good direction for the llama.cpp community: the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK, so there is no QNN version/runtime library conflict issue in the HWACCEL_CDSP approach. This big advantage has been verified with Project KanTV on a Snapdragon 8 Elite based Android phone, and it will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's mobile/desktop SoCs.
How to reproduce the above results
A computation-visualization approach is provided in that PR to help other developers and AI experts reproduce the above results easily.
Modify enable_profiler = 0 to enable_profiler = 1 in scripts/ggml-hexagon.cfg.
Generate profiler data with the HWACCEL_CDSP approach.
Then modify hwaccel_approach to 0 (the HWACCEL_QNN approach) in ./scripts/ggml-hexagon.cfg and run again to generate the QNN profiler data for comparison.
Why HWACCEL_CDSP in ggml-hexagon is a correct/reference implementation rather than a product-level implementation at the moment
https://discuss.tvm.apache.org/t/hmx-support-on-htp-using-tvm-runtime/17457
https://github.com/apache/tvm
PowerInfer-2: Fast Large Language Model Inference on a Smartphone: https://arxiv.org/abs/2406.06282, https://github.com/SJTU-IPADS/PowerInfer
LLM prefilling with mllm-NPU: https://arxiv.org/abs/2407.05858v1, https://github.com/UbiquitousLearning/mllm
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: https://arxiv.org/abs/2407.00088v1, https://github.com/microsoft/T-MAC
https://docs.qualcomm.com/bundle/publicresource/topics/80-70014-15/architecture.html
https://github.com/quic/aimet
How to get qualcomm_qnn: PaddlePaddle/Paddle-Lite#10539
We can clearly see that PowerInfer-2 from SJTU, mllm-NPU from BUPT, and T-MAC from MSRA (Microsoft Research Asia) are all end-to-end edge-inference solutions based on llama.cpp. One more thing: Qualcomm's QNN / AI Hub solution is also a typical end-to-end, state-of-the-art inference solution (because Qualcomm created the Hexagon NPU and knows everything about it). These four solutions are all outstanding, and none of them is suitable for llama.cpp at the moment.
I can't say more in this section. Domain experts or tech experts from Qualcomm should understand what I want to say here.
Conclusion
All in all, I am highly optimistic about and confident in "QNN is not the right solution here" (which a senior staff tech expert from Qualcomm headquarters told me on 03/18/2025).