Full version of PR-12326 (https://github.com/ggml-org/llama.cpp/pull/12326) #30
this post is the full version of my third formal PR in the upstream llama.cpp project: ggml-org#12326.
* [ ] Low
* [x] Medium (complexity of the code on the ARM-AP side is medium; complexity of the code on the cDSP side (hexagon-kernels) is high)
* [ ] High
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_QNN on Qualcomm Snapdragon 8 Gen3 & 8 Elite equipped Android phones
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_CDSP on Qualcomm Snapdragon 8 Gen3 & 8 Elite equipped Android phones

PR Description
this PR is a continued effort of my original PR ggml-org#6869 from 04/2024, focused on the final mission:
this is a concise ggml-hexagon (the previous name was ggml-qnn, but that wasn't accurate) implementation:
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops), this implementation puts the main logic in one single source file (ggml-hexagon.cpp), because that will be helpful for other highly-skilled or highly-experienced developers, domain tech experts, and AI experts. The other reason for this coding style is that I think it will make the developers' workflow easier:
Features
the data path between the QNN SDK and ggml/llama.cpp works well; it was achieved through reverse engineering of executorch (the QNN implementation in executorch comes from Qualcomm) in my first PR in 04/2024
a simple and effective QNN graph cache mechanism, already implemented in 04/2024
simple STL containers are used to manage QNN resources in this PR, rather than complex C++ encapsulation, because the well-designed QNN SDK already manages its internal hardware and software resources very carefully
a simple skeleton in function ggmlqnn_compute_elementwise: offload GGML_OP_ADD, GGML_OP_MUL, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_LOG, and GGML_OP_SQRT to the QNN backend. We can see this function is a very concise implementation rather than a complex C++ encapsulation that hides many tech details.
a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT (2D & 3D mulmat) to the QNN backend. This skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". We can see this function is also a concise implementation rather than a complex C++ encapsulation that hides many tech details.
a more complex skeleton in function ggml_qnn_mulmat_4d: offload 4D mulmat to the QNN backend. This skeleton can also be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally", and it too is a concise implementation rather than a complex C++ encapsulation that hides many tech details (unit tests passed, but there are some unknown bugs with test-backend-ops).
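as an illustration of that "concise skeleton" style, here is a minimal, hypothetical sketch; the real ggmlqnn_compute_elementwise in ggml-hexagon.cpp has a different signature and body, and the QNN-specific steps are only outlined in comments:

```c
#include "ggml.h"

/* hypothetical sketch, NOT the PR's actual code: one concise dispatch
 * function per op family, with no extra C++ abstraction layers */
static void ggmlqnn_compute_elementwise_sketch(struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_SUB:
        case GGML_OP_DIV:
        case GGML_OP_LOG:
        case GGML_OP_SQRT:
            /* 1. look up (or build and cache) the QNN graph for this op
             * 2. bind the src0/src1/dst buffers to the QNN graph's tensors
             * 3. execute the QNN graph on the Hexagon NPU */
            break;
        default:
            /* op not offloaded; the backend scheduler falls back to CPU */
            break;
    }
}
```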
the QNN NPU RPC feature was already implemented in 04/2024.
a special approach through the Qualcomm QNN SDK: mapping the entire ggml cgraph to a single QNN graph. This technical approach of "mapping the entire ggml computational graph to a single QNN graph" was already discovered in 04/2024.
dynamic running-parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).

probe/detect Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSP versions (a probing sketch follows the list):


#v68 --- Snapdragon 888
#v69 --- Snapdragon 8 Gen1
#v73 --- Snapdragon 8 Gen2
#v75 --- Snapdragon 8 Gen3 (verified)
#v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)
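a minimal sketch of what such a runtime probe could look like on Android is shown below. This is an illustration under assumptions, not the PR's actual detection code: the helper name is made up, and the PR may obtain the SoC information through a different channel than the ro.soc.model system property:

```c
#include <string.h>
#include <sys/system_properties.h>  /* Android NDK system-property API */

/* hypothetical helper (not the PR's actual code): map the SoC model string
 * to the Hexagon DSP architecture version listed above */
static const char * ggmlhexagon_probe_dsp_arch_sketch(void) {
    char soc[PROP_VALUE_MAX] = {0};
    __system_property_get("ro.soc.model", soc);        /* e.g. "SM8650" */
    if (strncmp(soc, "SM8450", 6) == 0) return "v69";  /* Snapdragon 8 Gen1  */
    if (strncmp(soc, "SM8550", 6) == 0) return "v73";  /* Snapdragon 8 Gen2  */
    if (strncmp(soc, "SM8650", 6) == 0) return "v75";  /* Snapdragon 8 Gen3  */
    if (strncmp(soc, "SM8750", 6) == 0) return "v79";  /* Snapdragon 8 Elite */
    return "unknown";
}
```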
offload quantized data types with 2D & 3D mulmat to the QNN backend in the HWACCEL_QNN approach.
provide the big picture of the ggml-hexagon backend in this PR for further or other related dev activity in this great pure-tech community.
provide a very fast approach which is exactly similar to Intel's ggml-sycl or Qualcomm's ggml-opencl: offload ggml ops to the Hexagon cDSP directly. As is well known, the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK; developers and AI experts can operate the cDSP hardware directly with the Hexagon SDK, so there is no QNN version/runtime-libs conflict issue in the HWACCEL_CDSP approach. This feature will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's world-class mobile/desktop SoCs.
the Hexagon NPU performance of the HWACCEL_QNN approach and the HWACCEL_CDSP approach can be easily compared: this PR provides a computation-visualization approach to help other developers and AI experts visualize the comparison between the cDSP approach and the QNN approach.
a cDSP RPC ION memory pool (a single big memory pool for tensors, intended to achieve ideal zero-copy between the ARM-AP side and the cDSP side) can be utilized in the HWACCEL_CDSP approach and has been verified with test-backend-ops and llama-cli; a minimal allocation sketch follows. At the same time, there are some unknown issues with the RPC DMA memory pool, but I personally think that's not the key point at the moment.
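for readers unfamiliar with FastRPC shared memory, here is a minimal sketch of how such a pool's backing buffer is typically allocated with the Hexagon SDK's rpcmem API; the wrapper name is made up, and the PR's actual pool management is more involved:

```c
#include <stddef.h>
#include "rpcmem.h"  /* FastRPC shared-memory API from the Hexagon SDK */

/* illustrative sketch (not the PR's actual pool code): rpcmem buffers are
 * ION/DMA-BUF backed and shareable between the ARM-AP and the cDSP,
 * which is what enables the zero-copy data path described above */
static void * ggmlhexagon_pool_alloc_sketch(size_t size) {
    /* pair every allocation with rpcmem_free() at pool teardown */
    return rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) size);
}
```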
the code in ggml-hexagon.cpp is well organized in this self-contained single source file, so domain developers and tech experts can understand the code quickly, without complex encapsulation that hides tech details, because layered abstraction and loose coupling would make code tracking and troubleshooting difficult.
special clarification in this section:
How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions might also be OK). The dev activity in this PR can be done purely on the command line, without any IDE:
utilize build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.
we will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can verify that this backend works as expected from the log output of "adb logcat | grep ggml-hexagon"; for programmers, the same command also helps with troubleshooting (a minimal walkthrough follows).
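a minimal command-line walkthrough is sketched below. The script path and its argument-free invocation are assumptions based on the description above; the adb command is the one quoted in this post:

```shell
# assumed path/invocation: the script fetches the Android NDK and QNN SDK
# automatically; the Hexagon SDK must be installed manually beforehand
./scripts/build-run-android.sh

# verify the ggml-hexagon backend on the adb-connected phone
adb logcat | grep ggml-hexagon
```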
How to build ggml-hexagon source code for a Snapdragon-based WoA (Windows on ARM) device
the good news for the WoA port is:
Hexagon NPU Performance
the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.32.0.250228, Hexagon SDK is v6.2.0.1.
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
LLM inference through HWACCEL_CDSP (offload GGML_OP_ADD to the cDSP directly)
LLM inference through HWACCEL_QNN (offload GGML_OP_ADD to the QNN NPU) (modify hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg, then run)
we can clearly see (from "adb logcat | grep ggml-hexagon") that the NPU performance in real LLM inference is really good, and faster than the QNN solution when the cDSP RPC ION memory pool is disabled.
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly):
mulmat through HWACCEL_QNN (offload mulmat to the QNN NPU) (modify hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg, then run)
we can clearly see (from "adb logcat | grep ggml-hexagon") the performance difference in mulmat between HWACCEL_QNN and HWACCEL_CDSP; the NPU performance is really good, and much faster than the QNN solution when the cDSP RPC ION memory pool is disabled.
case-3: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP in real LLM inference
TBD
[updated on 04/09/2025, 09:19] I suddenly found that the QNN NPU's performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.
the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.33.0.250327, Hexagon SDK is v6.2.0.1.

I clearly understand that Qualcomm's QNN SDK team is a world-class engineering team (they provide a highly-designed and highly-uniform AI SDK on Windows/Linux/Android); the high-level QNN SDK and the low-level Hexagon SDK are both provided by Qualcomm. I personally hope Qualcomm's Hexagon SDK team can release a new version that refines the FastRPC framework and removes qidl accordingly: the top-talent software engineers in Qualcomm's Hexagon SDK team could refer to some design principles in TEE and provide developers a flexible way of exchanging the necessary data between the ARM-AP and the cDSP. A new version of the Hexagon SDK would be helpful for NPU performance through the HWACCEL_CDSP approach; of course, I think the refined FastRPC framework might also be helpful for the QNN SDK.
How to reproduce the above results
a computation-visualization approach was provided in that PR to help other developers and AI experts reproduce the above results easily.
modify hwaccel_approach to 0 in ./scripts/ggml-hexagon.cfg (an illustrative excerpt follows) and then run
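for reference, a hypothetical excerpt of that configuration file; only the hwaccel_approach key and its value 0 are taken from this post, and the real file contains more keys:

```
# illustrative excerpt of ./scripts/ggml-hexagon.cfg
# 0 --- hwaccel approach through QNN (HWACCEL_QNN)
hwaccel_approach = 0
```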
Big picture of ggml-hexagon backend
there are three tech approaches to implement the ggml-hexagon backend for Qualcomm's Hexagon NPU:
the general approach through the QNN SDK (HWACCEL_QNN) or the Hexagon SDK (HWACCEL_CDSP) can be seen in this PR; the special approach through QNN (HWACCEL_QNN_SINGLEGRAPH) will be seen in another standalone PR, because:
key points about ggml-hexagon's performance in the general approach:
key points about ggml-hexagon's performance in the special approach:
[updated on 03/19/2025] the technical approach of "mapping the entire ggml computational graph to a QNN graph" will be seen in another standalone PR: it provides a concise implementation (without complex/complicated encapsulation that hides tech details, for example in 4D mulmat) of the technical approach of "mapping the entire ggml cgraph to a single QNN graph".
[updated on 03/20/2025] I thought deeply for many hours after a senior staff technical expert from Qualcomm told me, very valuably, on 03/18/2025 that "QNN is not the right solution here". Today I think I know there is another tech approach to "utilize the Hexagon NPU maximally". I'll try to implement this third tech approach based on this PR (in other words, most of the code in this PR will be reused in the third tech approach, and the efforts on the first and second tech approaches are also meaningful, because these are all necessary exploratory steps before completing the final mission) if my guess can be confirmed by the senior staff technical expert at Qualcomm: if so, I think I know how to do that so-called third approach, and I think I completely understand why there is so much performance difference between ggml-hexagon and Intel's ggml-sycl or Huawei's ggml-cann at the moment.
[updated on 03/22/2025] the general approach through the Hexagon cDSP, which is exactly similar to Qualcomm's ggml-opencl or Intel's ggml-sycl, can be seen in this PR.
[updated on 03/23/2025] I'm not an AI expert, so I'd like to port a tiny customized ggml-dsp to the Hexagon cDSP and then optimize this tiny ggml-dsp with Hexagon SIMD instructions (a minimal sketch follows).
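a minimal sketch of what one tiny ggml-dsp kernel might look like before any Hexagon HVX/SIMD optimization; the function name is made up for illustration and is not the actual API of ggml-dsp.c:

```c
/* hypothetical scalar reference kernel for GGML_OP_ADD on the cDSP side;
 * an optimized hexagon-kernel would replace this loop with HVX vector
 * intrinsics processing many floats per instruction */
static void ggmldsp_add_f32_sketch(const float * src0, const float * src1,
                                   float * dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src0[i] + src1[i];
    }
}
```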
[updated on 03/31/2025] there is another big advantage of the cDSP solution: there are many limitations in the QNN API (for example, some matrix multiplications cannot be offloaded to QNN), and these QNN API limitations can be completely avoided in the cDSP solution (offload ggml ops to the cDSP directly). This big advantage has been verified with test-backend-ops in my local dev env.
[updated on 03/31/2025, 22:20] release ggml-hexagon v1.00. I hope this PR can be seen in the master branch of llama.cpp so other domain tech experts and AI experts can help improve the hexagon kernels (which are similar to OpenCL kernels, CUDA kernels, Metal kernels, ...): implement a highly-optimized q6_k mulmat on the cDSP side, and add rms_norm/norm/softmax/... on the cDSP side.
Todo tasks
Acknowledgement
Conclusion
after spending so much effort on the ggml-hexagon backend, I personally think:
some work in the hexagon-kernels seems beyond my skill set at the moment, so AI experts must be involved in the rest of the hexagon-kernels: AI experts only need to focus on the hexagon-kernels, and AI experts and other domain tech experts around the world can help improve the hexagon-kernels (the various mulmats and norm/rmsnorm/softmax/...) on the cDSP side.
some design tricks from FFmpeg or GStreamer might be, or already are, used in GGML's backend subsystem: there can be more than one backend implementation for the same hardware accelerator --- an open-source version from the llama.cpp community and a commercial version from Qualcomm.
this PR's style is exactly similar to the original ggml/llama.cpp: the code in ggml/llama.cpp is clear & concise and without complex/complicated encapsulation, although the core maintainers are both genius programmers and modern C++ masters.
[updated on 04/09/2025, 17:25] I hope developers and experts can understand my policy, an Intel-style toothpaste-squeezing release approach, with ggml-dsp.c in this PR since 04/04/2025, because:
there are only two core source files in this topic: ggml-hexagon.cpp and ggml-dsp.c. The ggml-hexagon.cpp in this PR is exactly the same as in my local dev env.
we can clearly see that Qualcomm's OpenCL backend also has many TODOs and bug fixes.
I'm facing a not-easy situation with this candidate PR at the moment, although I have always thought of myself as an open-minded programmer.
of course, all related source code in ggml-dsp.c will be opened to this great tech community (one reason is that all the fundamental techniques in ggml-dsp.c come exactly from the original authors of ggml); I also hope that day comes ASAP.