Why HWACCEL_CDSP in ggml-hexagon is a correct/reference implementation rather than a product-level implementation at the moment #28
jeffzhou2000 started this conversation in Ideas
This tech article is closely related to my third formal PR to the upstream llama.cpp project: ggml-org#12326. It is intended to help developers and AI experts understand why offloading ggml ops to the Hexagon cDSP directly is the correct technical direction for the llama.cpp community.
Please support that PR and help get it approved in the master branch of the upstream llama.cpp project if you agree with the opinions in this article. Thanks so much!
Firstly
There are three technical approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU (a hedged sketch of the corresponding configuration enum follows this list):
- HWACCEL_QNN: offload ggml ops to the Hexagon NPU one by one through the QNN SDK.
- HWACCEL_QNN_SINGLEGRAPH: map the entire ggml cgraph to a single QNN graph.
- HWACCEL_CDSP: offload ggml ops to the Hexagon cDSP directly through the Hexagon SDK (FastRPC).
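For orientation, here is a hypothetical C sketch of how the hwaccel_approach knob in scripts/ggml-hexagon.cfg (used later in this article, where 0 selects the QNN approach) could map to these three approaches; the numeric values other than 0 are assumptions, not taken from the PR.

```c
/* Hypothetical mapping of the hwaccel_approach value in scripts/ggml-hexagon.cfg
 * to the three approaches discussed in this article.  Only the value 0 (QNN)
 * is mentioned later in the article; the other values are assumptions. */
typedef enum {
    HWACCEL_QNN             = 0,  /* offload ggml ops one by one through the QNN SDK  */
    HWACCEL_QNN_SINGLEGRAPH = 1,  /* map the entire ggml cgraph to a single QNN graph */
    HWACCEL_CDSP            = 2,  /* offload ggml ops to the Hexagon cDSP directly    */
} ggmlhexagon_hwaccel_approach;
```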
Secondly
After doing some research on Qualcomm's QNN SDK and its other AI software stacks, I personally think that "mapping the entire computational graph to a single QNN graph" is likely the key point, the secret sauce, of "utilizing the Hexagon NPU maximally through the QNN SDK" across all of Qualcomm's AI software stacks:
Fig-1: the key point of why the performance of Qualcomm's official AI Hub or official QNN solution is good enough
Everyone should ask an interesting question: why is the performance of Qualcomm's official AI Hub or official QNN solution good enough?
The QNN SDK internally (and indirectly) calls Qualcomm's Hexagon nn libraries on the cDSP, which are presumably highly optimized with HVX SIMD instructions and HVX multithreading.
QNN internally performs so-called "comprehensive graph optimization" on that single QNN graph (I guess this is where Qualcomm's AI research, for example its quantization techniques, comes in).
As is well known, fixed-point arithmetic is faster than floating-point arithmetic on a DSP (a small sketch follows after this list).
AI experts can clearly see that Qualcomm's official QNN solution is an end-to-end solution (this is the key reason why Qualcomm's senior staff tech expert max-krasnyansky said on 03/18/2025 that "QNN is not the right solution here").
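To make the fixed-point point concrete, here is a minimal, hypothetical C sketch (not taken from QNN or the PR) that contrasts a float multiply with a Q15 fixed-point one; on a DSP the integer version maps to a cheap multiply-and-shift, while the float version may need a slower or emulated floating-point unit.

```c
#include <stdint.h>

/* Q15 fixed point: a value x in [-1, 1) is stored as x * 32768 in an int16_t. */
static inline int16_t float_to_q15(float x)   { return (int16_t)(x * 32768.0f); }
static inline float   q15_to_float(int16_t q) { return (float)q / 32768.0f; }

/* Q15 * Q15 -> Q15: a 16x16 -> 32-bit integer multiply plus a shift,
 * which is a single cheap MAC-style operation on most DSPs. */
static inline int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

int main(void)
{
    float   fa = 0.5f, fb = 0.25f;
    int16_t qa = float_to_q15(fa), qb = float_to_q15(fb);

    float f_result = fa * fb;                       /* floating-point path */
    float q_result = q15_to_float(q15_mul(qa, qb)); /* fixed-point path    */

    return (f_result - q_result) < 0.001f ? 0 : 1;  /* both are ~0.125     */
}
```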
Thirdly
The following are some of the reasons why the HWACCEL_CDSP approach is the correct direction for the llama.cpp community:
This approach (offloading some performance-sensitive ggml ops to the Hexagon cDSP directly) is very similar to Intel's ggml-sycl and Qualcomm's ggml-opencl, both of which claim 4x-10x performance gains over pure CPU inference.
Tensor data must be transferred or exchanged between the ARM CPU and the cDSP regardless of the technical approach; this is an undeniable fact.
FastRPC must be utilized regardless of the technical approach; this is another undeniable fact. At the same time, we can clearly see that the so-called FastRPC mechanism/framework in the QNN SDK and the Hexagon SDK is very similar to the mechanism used in a TEE.
The overhead of FastRPC exists in every technical approach, but I personally think the overhead of going through the cDSP directly is likely the minimum (a host-side sketch appears later in this section):
- datapath through QNN: ggml op → QNN graph construction and QNN runtime on the AP (CPU) side → FastRPC → Hexagon nn libraries on the cDSP
- datapath through cDSP directly: ggml op → FastRPC → hand-written Hexagon kernel on the cDSP
I think this is another key reason why @max-krasnyansky said on 03/18/2025 that "QNN is not the right solution here".
We (the llama.cpp community) cannot re-create Qualcomm's entire dedicated AI stack inside ggml/llama.cpp: mapping all ggml ops and the entire ggml cgraph to a single QNN graph would not be easy even for Qualcomm's world-class engineering team. If I were a regular employee of Qualcomm's AI team, I would prefer to spend the adaptation effort on Intel's SYCL stack (the HWACCEL_SYCL approach); that would be a more practical and desirable direction. Accordingly, offloading some performance-sensitive ggml ops to the cDSP directly is the practical way forward: we only need to focus on the Hexagon kernels, using well-designed algorithms together with HVX instructions and HVX multithreading on the cDSP (a minimal HVX kernel sketch appears later in this section). In fact, the NPU performance through QNN is really bad here (in ggml/llama.cpp), because we cannot use the dedicated binary tools that Qualcomm provides for its own stack.
We (the llama.cpp community) cannot re-create Qualcomm's entire dedicated AI stack in ggml/llama.cpp, and I think the HWACCEL_QNN_SINGLEGRAPH approach is also not a practical direction, because Qualcomm already has a complete QNN solution for its customers (without llama.cpp), and effort in that direction would probably become an endless, money-burning project. I strongly agree with the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible. But I also think we should have a timeline, because such things have happened in the US and in China many times.
AI experts can also offload all ggml ops, or a single entire ggml cgraph, to the cDSP directly, without the overhead of composing a single QNN graph from the entire ggml cgraph on the AP (CPU) side.
Everyone has probably heard the DeepSeek R1 story from 01/2025: one of the highlights is that their excellent engineering team used NVIDIA's low-level APIs directly rather than NVIDIA's high-level CUDA API.
There is a big advantage to the cDSP solution: the QNN API has many limitations; for example, some matrix multiplications cannot be offloaded to QNN at all. These QNN API limitations can be completely avoided in the cDSP solution (offloading ggml ops to the cDSP directly), because we can do anything on the cDSP directly through Qualcomm's lightweight, low-level Hexagon SDK. This big advantage has been verified with test-backend-ops in my local development environment.
There is another, unexpected big advantage to the cDSP solution: as is well known, the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK, so domain experts and AI experts can operate the cDSP hardware directly with it, and there is no QNN version/runtime library conflict issue in the HWACCEL_CDSP approach. This big advantage has been verified with test-backend-ops and llama-cli on a Snapdragon 8 Elite (aka 8 Gen 4) Android phone. It will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's mobile/desktop SoCs.
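To make the "datapath through cDSP directly" concrete (see the FastRPC overhead discussion above), here is a minimal host-side sketch in C. Only the rpcmem allocator comes from the Hexagon SDK; ggmlop.h and ggmlop_dsp_add() are hypothetical names standing in for whatever IDL-generated FastRPC stub the real PR uses.

```c
#include <string.h>
#include "rpcmem.h"   /* rpcmem_alloc()/rpcmem_free() from the Hexagon SDK            */
#include "ggmlop.h"   /* hypothetical header for an IDL-generated FastRPC stub;
                         ggmlop_dsp_add() below is an assumed name, not the PR's API  */

/* Host (AP) side of the direct-cDSP datapath:
 * 1. allocate ION/rpcmem buffers visible to both the CPU and the cDSP,
 * 2. copy the ggml tensor data in,
 * 3. make one FastRPC call into the hand-written Hexagon kernel. */
static int add_on_cdsp(const float *src0, const float *src1, float *dst, int n)
{
    int    nbytes = n * (int)sizeof(float);
    float *buf0 = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    float *buf1 = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    float *bufd = (float *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    if (!buf0 || !buf1 || !bufd) {
        if (buf0) rpcmem_free(buf0);
        if (buf1) rpcmem_free(buf1);
        if (bufd) rpcmem_free(bufd);
        return -1;
    }

    memcpy(buf0, src0, nbytes);
    memcpy(buf1, src1, nbytes);

    /* one FastRPC round trip: CPU -> cDSP kernel -> CPU */
    int err = ggmlop_dsp_add(buf0, nbytes, buf1, nbytes, bufd, nbytes);
    if (err == 0) memcpy(dst, bufd, nbytes);

    rpcmem_free(buf0);
    rpcmem_free(buf1);
    rpcmem_free(bufd);
    return err;
}
```

Because the buffers come from rpcmem/ION, the FastRPC layer can map them into the cDSP address space rather than copying them across the boundary, which is consistent with the article's claim that the direct datapath keeps the FastRPC overhead close to the minimum.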
[updated on 04/12/2025] GGML_OP_ADD through HWACCEL_CDSP is sometimes faster than the default ggml CPU backend on a Snapdragon 8 Elite phone.
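As an illustration of what "focus on the Hexagon kernels through HVX instructions" means in practice, below is a minimal, hypothetical HVX sketch of an element-wise add in the spirit of GGML_OP_ADD. It is not the kernel from the PR: it assumes 128-byte HVX vectors, 128-byte-aligned int32 buffers, and an element count that is a multiple of 32.

```c
#include <stdint.h>
#include "hexagon_types.h"    /* HVX_Vector                   */
#include "hexagon_protos.h"   /* Q6_Vw_vadd_VwVw and friends  */

/* Element-wise int32 add on the cDSP using HVX: one 128-byte HVX vector
 * holds 32 int32 lanes, so every iteration adds 32 elements at once.
 * A production kernel would also handle tails, unaligned data, fp32 and
 * the quantized ggml types, and would split the work across HVX threads. */
static void hvx_add_i32(const int32_t *src0, const int32_t *src1,
                        int32_t *dst, int n)
{
    const HVX_Vector *v0 = (const HVX_Vector *)src0;
    const HVX_Vector *v1 = (const HVX_Vector *)src1;
    HVX_Vector       *vd = (HVX_Vector *)dst;

    for (int i = 0; i < n / 32; ++i) {
        vd[i] = Q6_Vw_vadd_VwVw(v0[i], v1[i]);   /* 32 adds per instruction */
    }
}
```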
[updated on 05/09/2025 (the correct date should be 04/24/2025)] mulmat with 8 threads + HVX SIMD + other optimizations works fine on the cDSP side, but its performance is still slower than the default ggml CPU backend (a row-partitioning sketch of that threading scheme follows below).
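For reference, here is a minimal sketch of the row-partitioned threading scheme implied by the 8-thread mulmat experiment above; it is not the kernel from the PR. For portability the sketch uses pthreads and scalar C, whereas the real cDSP kernel would use the QuRT/worker-pool threading facilities from the Hexagon SDK and HVX in the inner loop.

```c
#include <pthread.h>

/* Row-partitioned f32 matmul sketch: C[M x N] = A[M x K] * B[K x N].
 * Each of the NTHREADS workers computes a contiguous slice of output rows. */
#define NTHREADS 8

typedef struct {
    const float *a, *b;
    float *c;
    int M, N, K;
    int row_begin, row_end;   /* half-open range of output rows for this worker */
} mm_job;

static void *mm_worker(void *arg)
{
    mm_job *j = (mm_job *)arg;
    for (int m = j->row_begin; m < j->row_end; ++m) {
        for (int n = 0; n < j->N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < j->K; ++k) {
                acc += j->a[m * j->K + k] * j->b[k * j->N + n];
            }
            j->c[m * j->N + n] = acc;
        }
    }
    return NULL;
}

static void mm_threaded(const float *a, const float *b, float *c,
                        int M, int N, int K)
{
    pthread_t th[NTHREADS];
    mm_job    job[NTHREADS];
    int rows_per_thread = (M + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        int begin = t * rows_per_thread;
        int end   = (t + 1) * rows_per_thread > M ? M : (t + 1) * rows_per_thread;
        job[t] = (mm_job){ a, b, c, M, N, K, begin, end };
        pthread_create(&th[t], NULL, mm_worker, &job[t]);
    }
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(th[t], NULL);
    }
}
```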
Hexagon NPU Performance
The test phones are a Snapdragon 8 Gen 3 Android phone and a Snapdragon 8 Elite (aka 8 Gen 4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.32.0.250228 and the Hexagon SDK is v6.2.0.1.
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
LLM inference through HWACCEL_CDSP (offload GGML_OP_ADD to the cDSP directly).
LLM inference through HWACCEL_QNN (offload GGML_OP_ADD to the QNN NPU): set hwaccel_approach to 0 (the hwaccel approach through QNN) in scripts/ggml-hexagon.cfg and then run.
We can clearly see (from adb logcat | grep ggml-hexagon) that the NPU performance in real LLM inference is really good, and faster than the QNN solution, when the cDSP RPC ION memory pool is disabled.
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly).
mulmat through HWACCEL_QNN (offload mulmat to the QNN NPU): set hwaccel_approach to 0 (the hwaccel approach through QNN) in scripts/ggml-hexagon.cfg and then run.
We can clearly see (from adb logcat | grep ggml-hexagon) the performance difference for mulmat between HWACCEL_QNN and HWACCEL_CDSP; the NPU performance is really good, and much faster than the QNN solution, when the cDSP RPC ION memory pool is disabled.
[updated on 04/09/2025, 09:19] I suddenly found that QNN NPU performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.

The test phones are a Snapdragon 8 Gen 3 Android phone and a Snapdragon 8 Elite (aka 8 Gen 4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.33.0.250327 and the Hexagon SDK is v6.2.0.1.
I clearly understand that Qualcomm's QNN SDK team is a world-class engineering team: they provide a well-designed and highly uniform AI SDK across Windows, Linux, and Android, and both the high-level QNN SDK and the low-level Hexagon SDK come from Qualcomm. I personally hope the Hexagon SDK team can release a new version that refines the FastRPC framework and removes qidl accordingly: the top software engineers on that team could refer to some of the design principles used in TEEs and give developers a flexible way to exchange the necessary data between the ARM AP and the cDSP. Such a new version of the Hexagon SDK would help NPU performance through the HWACCEL_CDSP approach, and I think a refined FastRPC framework might also help the QNN SDK.
[updated on 05/01/2025, 20:49] Project KanTV is a very good example of why the HWACCEL_CDSP approach is a good direction for the llama.cpp community: the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK, so there is no QNN version/runtime library conflict issue in the HWACCEL_CDSP approach. This big advantage has been verified with Project KanTV on a Snapdragon 8 Elite based Android phone, and it will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's mobile/desktop SoCs.
How to reproduce the above results
A computation-visualization approach is provided in that PR to help other developers and AI experts reproduce the above results easily.
Modify enable_profiler = 0 to enable_profiler = 1 in scripts/ggml-hexagon.cfg.
Generate profiler data with the HWACCEL_CDSP approach.
Then modify hwaccel_approach to 0 (the HWACCEL_QNN approach) in ./scripts/ggml-hexagon.cfg and run again to generate the QNN profiler data for comparison.
Why HWACCEL_CDSP in ggml-hexagon is a correct/reference implementation rather than a product-level implementation at the moment
https://discuss.tvm.apache.org/t/hmx-support-on-htp-using-tvm-runtime/17457
https://github.com/apache/tvm
PowerInfer-2: Fast Large Language Model Inference on a Smartphone: https://arxiv.org/abs/2406.06282, https://github.com/SJTU-IPADS/PowerInfer
LLM prefilling with mllm-NPU: https://arxiv.org/abs/2407.05858v1, https://github.com/UbiquitousLearning/mllm
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: https://arxiv.org/abs/2407.00088v1, https://github.com/microsoft/T-MAC
https://docs.qualcomm.com/bundle/publicresource/topics/80-70014-15/architecture.html
https://github.com/quic/aimet
How to get qualcomm_qnn: PaddlePaddle/Paddle-Lite#10539
We can clearly see that PowerInfer-2 from SJTU, mllm-NPU from BUPT, and T-MAC from MSRA (Microsoft Research Asia) are all end-to-end edge-inference solutions based on llama.cpp. One more thing: Qualcomm's QNN / AI Hub solution is also a typical end-to-end, state-of-the-art inference solution (because Qualcomm created the Hexagon NPU and knows everything about it). These four solutions are all outstanding, and none of them is suitable for llama.cpp at the moment.
I can't say more in this section. Domain experts or tech experts from Qualcomm should understand what I want to say here.
Conclusion
All in all, I am highly optimistic about and confident in "QNN is not the right solution here" (which a senior staff tech expert from Qualcomm headquarters told me on 03/18/2025).