Full version of PR-12326 (https://github.com/ggml-org/llama.cpp/pull/12326) #30
this post is the full version of my third formal PR in the upstream llama.cpp project: ggml-org#12326.
* [ ] Low
* [x] Medium (complexity of the code on the ARM-AP side is medium; complexity of the code on the cDSP side (hexagon-kernels) is high)
* [ ] High
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_QNN on Qualcomm Snapdragon 8 Gen3 & 8 Elite equipped Android phones
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_CDSP on Qualcomm Snapdragon 8 Gen3 & 8 Elite equipped Android phones

PR Description
this PR is a continued effort of my original PR ggml-org#6869 from 04/2024, focused on the final mission:
this is a concise ggml-hexagon (the previous name was ggml-qnn, but that wasn't accurate) implementation:
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops), this implementation puts the main logic in one single source file (ggml-hexagon.cpp), because that will be helpful for other highly-skilled or highly-experienced developers, domain tech experts, and AI experts. The other reason for this coding style is that I think it will make the developers' workflow easier:
Features
the data path between the QNN SDK and ggml/llama.cpp works well; it was achieved through reverse engineering of executorch (the QNN implementation in executorch comes from Qualcomm) in my first PR in 04/2024
a simple and effective QNN graph cache mechanism, already implemented in 04/2024
simple STL containers are used to manage QNN resources in this PR, rather than complex C++ encapsulation, because the well-designed QNN SDK already manages its internal hardware and software resources very carefully
a simple skeleton in function ggmlqnn_compute_elementwise: offload GGML_OP_ADD, GGML_OP_MUL, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_LOG, and GGML_OP_SQRT to the QNN backend. We can see this function is a very concise implementation rather than a complex C++ encapsulation that hides many tech details.
a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT (2D & 3D mulmat) to the QNN backend. This skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". We can see this function is also a concise implementation rather than a complex C++ encapsulation that hides many tech details.
a more complex skeleton in function ggml_qnn_mulmat_4d: offload 4D mulmat to the QNN backend. This skeleton can also be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally", and it too is a concise implementation rather than a complex C++ encapsulation that hides many tech details (unit tests passed, but there are some unknown bugs with test-backend-ops).
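as an illustration of that "concise skeleton" style, here is a minimal, hypothetical sketch; the real ggmlqnn_compute_elementwise in ggml-hexagon.cpp has a different signature and body, and the QNN-specific steps are only outlined in comments:

```c
#include "ggml.h"

/* hypothetical sketch, NOT the PR's actual code: one concise dispatch
 * function per op family, with no extra C++ abstraction layers */
static void ggmlqnn_compute_elementwise_sketch(struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_SUB:
        case GGML_OP_DIV:
        case GGML_OP_LOG:
        case GGML_OP_SQRT:
            /* 1. look up (or build and cache) the QNN graph for this op
             * 2. bind the src0/src1/dst buffers to the QNN graph's tensors
             * 3. execute the QNN graph on the Hexagon NPU */
            break;
        default:
            /* op not offloaded; the backend scheduler falls back to CPU */
            break;
    }
}
```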
the QNN NPU RPC feature was already implemented in 04/2024.
a special approach through the Qualcomm QNN SDK: mapping the entire ggml cgraph to a single QNN graph. This technical approach of "mapping the entire ggml computational graph to a single QNN graph" was already discovered in 04/2024.
dynamic running-parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).

probe/detect Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSP versions (a probing sketch follows the list):


#v68 --- Snapdragon 888
#v69 --- Snapdragon 8 Gen1
#v73 --- Snapdragon 8 Gen2
#v75 --- Snapdragon 8 Gen3 (verified)
#v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)
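a minimal sketch of what such a runtime probe could look like on Android is shown below. This is an illustration under assumptions, not the PR's actual detection code: the helper name is made up, and the PR may obtain the SoC information through a different channel than the ro.soc.model system property:

```c
#include <string.h>
#include <sys/system_properties.h>  /* Android NDK system-property API */

/* hypothetical helper (not the PR's actual code): map the SoC model string
 * to the Hexagon DSP architecture version listed above */
static const char * ggmlhexagon_probe_dsp_arch_sketch(void) {
    char soc[PROP_VALUE_MAX] = {0};
    __system_property_get("ro.soc.model", soc);        /* e.g. "SM8650" */
    if (strncmp(soc, "SM8450", 6) == 0) return "v69";  /* Snapdragon 8 Gen1  */
    if (strncmp(soc, "SM8550", 6) == 0) return "v73";  /* Snapdragon 8 Gen2  */
    if (strncmp(soc, "SM8650", 6) == 0) return "v75";  /* Snapdragon 8 Gen3  */
    if (strncmp(soc, "SM8750", 6) == 0) return "v79";  /* Snapdragon 8 Elite */
    return "unknown";
}
```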
offload quantized data types with 2D & 3D mulmat to the QNN backend in the HWACCEL_QNN approach.
provide the big picture of the ggml-hexagon backend in this PR for further or other related dev activity in this great pure-tech community.
provide a very fast approach which is exactly similar to Intel's ggml-sycl or Qualcomm's ggml-opencl: offload ggml ops to the Hexagon cDSP directly. As is well known, the Qualcomm Hexagon SDK is a lightweight, low-level, thin SDK; developers and AI experts can operate the cDSP hardware directly with the Hexagon SDK, so there is no QNN version/runtime-libs conflict issue in the HWACCEL_CDSP approach. This feature will be very helpful for deploying a llama.cpp + ggml-hexagon on-device AI solution on Qualcomm's world-class mobile/desktop SoCs.
the Hexagon NPU performance of the HWACCEL_QNN approach and the HWACCEL_CDSP approach can be easily compared: this PR provides a computation-visualization approach to help other developers and AI experts visualize the comparison between the cDSP approach and the QNN approach.
a cDSP RPC ION memory pool (a single big memory pool for tensors, intended to achieve ideal zero-copy between the ARM-AP side and the cDSP side) can be utilized in the HWACCEL_CDSP approach and has been verified with test-backend-ops and llama-cli; a minimal allocation sketch follows. At the same time, there are some unknown issues with the RPC DMA memory pool, but I personally think that's not the key point at the moment.
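for readers unfamiliar with FastRPC shared memory, here is a minimal sketch of how such a pool's backing buffer is typically allocated with the Hexagon SDK's rpcmem API; the wrapper name is made up, and the PR's actual pool management is more involved:

```c
#include <stddef.h>
#include "rpcmem.h"  /* FastRPC shared-memory API from the Hexagon SDK */

/* illustrative sketch (not the PR's actual pool code): rpcmem buffers are
 * ION/DMA-BUF backed and shareable between the ARM-AP and the cDSP,
 * which is what enables the zero-copy data path described above */
static void * ggmlhexagon_pool_alloc_sketch(size_t size) {
    /* pair every allocation with rpcmem_free() at pool teardown */
    return rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) size);
}
```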
the code in ggml-hexagon.cpp is well organized in this self-contained single source file, so domain developers and tech experts can understand the code quickly, without complex encapsulation that hides tech details, because layered abstraction and loose coupling would make code tracking and troubleshooting difficult.
special clarification in this section:
How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions might also be OK). The dev activity in this PR can be done purely on the command line, without any IDE:
utilize build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.
we will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can verify that this backend works as expected from the log output of "adb logcat | grep ggml-hexagon"; for programmers, the same command also helps with troubleshooting (a minimal walkthrough follows).
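a minimal command-line walkthrough is sketched below. The script path and its argument-free invocation are assumptions based on the description above; the adb command is the one quoted in this post:

```shell
# assumed path/invocation: the script fetches the Android NDK and QNN SDK
# automatically; the Hexagon SDK must be installed manually beforehand
./scripts/build-run-android.sh

# verify the ggml-hexagon backend on the adb-connected phone
adb logcat | grep ggml-hexagon
```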
How to build ggml-hexagon source code for a Snapdragon-based WoA (Windows on ARM) device
the good news for the WoA port is:
Hexagon NPU Performance
the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.32.0.250228, Hexagon SDK is v6.2.0.1.
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
LLM inference through HWACCEL_CDSP (offload GGML_OP_ADD to the cDSP directly)
LLM inference through HWACCEL_QNN (offload GGML_OP_ADD to the QNN NPU) (modify hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg, then run)
we can clearly see (from "adb logcat | grep ggml-hexagon") that the NPU performance in real LLM inference is really good, and faster than the QNN solution when the cDSP RPC ION memory pool is disabled.
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly):
mulmat through HWACCEL_QNN (offload mulmat to the QNN NPU) (modify hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg, then run)
we can clearly see (from "adb logcat | grep ggml-hexagon") the performance difference in mulmat between HWACCEL_QNN and HWACCEL_CDSP; the NPU performance is really good, and much faster than the QNN solution when the cDSP RPC ION memory pool is disabled.
case-3: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP in real LLM inference
TBD
[updated on 04/09/2025, 09:19] I suddenly found that the QNN NPU's performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.
the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.33.0.250327, Hexagon SDK is v6.2.0.1.

I clearly understand that Qualcomm's QNN SDK team is a world-class engineering team (they provide a highly-designed and highly-uniform AI SDK on Windows/Linux/Android); the high-level QNN SDK and the low-level Hexagon SDK are both provided by Qualcomm. I personally hope Qualcomm's Hexagon SDK team can release a new version that refines the FastRPC framework and removes qidl accordingly: the top-talent software engineers in Qualcomm's Hexagon SDK team could refer to some design principles in TEE and provide developers a flexible way of exchanging the necessary data between the ARM-AP and the cDSP. A new version of the Hexagon SDK would be helpful for NPU performance through the HWACCEL_CDSP approach; of course, I think the refined FastRPC framework might also be helpful for the QNN SDK.
How to reproduce the above results
a computation-visualization approach was provided in that PR to help other developers and AI experts reproduce the above results easily.
modify hwaccel_approach to 0 in ./scripts/ggml-hexagon.cfg (an illustrative excerpt follows) and then run
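for reference, a hypothetical excerpt of that configuration file; only the hwaccel_approach key and its value 0 are taken from this post, and the real file contains more keys:

```
# illustrative excerpt of ./scripts/ggml-hexagon.cfg
# 0 --- hwaccel approach through QNN (HWACCEL_QNN)
hwaccel_approach = 0
```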
Big picture of ggml-hexagon backend
there are three tech approaches to implement the ggml-hexagon backend for Qualcomm's Hexagon NPU:
the general approach through the QNN SDK (HWACCEL_QNN) or the Hexagon SDK (HWACCEL_CDSP) can be seen in this PR; the special approach through QNN (HWACCEL_QNN_SINGLEGRAPH) will be seen in another standalone PR, because:
key points about ggml-hexagon's performance in the general approach:
key points about ggml-hexagon's performance in the special approach:
[updated on 03/19/2025] the technical approach of "mapping the entire ggml computational graph to a QNN graph" will be seen in another standalone PR: it provides a concise implementation (without complex/complicated encapsulation that hides tech details, for example in 4D mulmat) of the technical approach of "mapping the entire ggml cgraph to a single QNN graph".
[updated on 03/20/2025] I thought deeply for many hours after a senior staff technical expert from Qualcomm told me, very valuably, on 03/18/2025 that "QNN is not the right solution here". Today I think I know there is another tech approach to "utilize the Hexagon NPU maximally". I'll try to implement this third tech approach based on this PR (in other words, most of the code in this PR will be reused in the third tech approach, and the efforts on the first and second tech approaches are also meaningful, because these are all necessary exploratory steps before completing the final mission) if my guess can be confirmed by the senior staff technical expert at Qualcomm: if so, I think I know how to do that so-called third approach, and I think I completely understand why there is so much performance difference between ggml-hexagon and Intel's ggml-sycl or Huawei's ggml-cann at the moment.
[updated on 03/22/2025] the general approach through the Hexagon cDSP, which is exactly similar to Qualcomm's ggml-opencl or Intel's ggml-sycl, can be seen in this PR.
[updated on 03/23/2025] I'm not an AI expert, so I'd like to port a tiny customized ggml-dsp to the Hexagon cDSP and then optimize this tiny ggml-dsp with Hexagon SIMD instructions (a minimal sketch follows).
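a minimal sketch of what one tiny ggml-dsp kernel might look like before any Hexagon HVX/SIMD optimization; the function name is made up for illustration and is not the actual API of ggml-dsp.c:

```c
/* hypothetical scalar reference kernel for GGML_OP_ADD on the cDSP side;
 * an optimized hexagon-kernel would replace this loop with HVX vector
 * intrinsics processing many floats per instruction */
static void ggmldsp_add_f32_sketch(const float * src0, const float * src1,
                                   float * dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src0[i] + src1[i];
    }
}
```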
[updated on 03/31/2025] there is another big advantage of the cDSP solution: there are many limitations in the QNN API (for example, some matrix multiplications cannot be offloaded to QNN), and these QNN API limitations can be completely avoided in the cDSP solution (offload ggml ops to the cDSP directly). This big advantage has been verified with test-backend-ops in my local dev env.
[updated on 03/31/2025, 22:20] release ggml-hexagon v1.00. I hope this PR can be seen in the master branch of llama.cpp so other domain tech experts and AI experts can help improve the hexagon kernels (which are similar to OpenCL kernels, CUDA kernels, Metal kernels, ...): implement a highly-optimized q6_k mulmat on the cDSP side, and add rms_norm/norm/softmax/... on the cDSP side.
Todo tasks
Acknowledgement
Conclusion
after spending so much effort on the ggml-hexagon backend, I personally think:
some work in the hexagon-kernels seems beyond my skill set at the moment, so AI experts must be involved in the rest of the hexagon-kernels: AI experts only need to focus on the hexagon-kernels, and AI experts and other domain tech experts around the world can help improve the hexagon-kernels (the various mulmats and norm/rmsnorm/softmax/...) on the cDSP side.
some design tricks from FFmpeg or GStreamer might be, or already are, used in GGML's backend subsystem: there can be more than one backend implementation for the same hardware accelerator --- an open-source version from the llama.cpp community and a commercial version from Qualcomm.
this PR's style is exactly similar to the original ggml/llama.cpp: the code in ggml/llama.cpp is clear & concise and without complex/complicated encapsulation, although the core maintainers are both genius programmers and modern C++ masters.
[updated on 04/09/2025, 17:25] I hope developers and experts can understand my policy, an Intel-style toothpaste-squeezing release approach, with ggml-dsp.c in this PR since 04/04/2025, because:
there are only two core source files in this topic: ggml-hexagon.cpp and ggml-dsp.c. The ggml-hexagon.cpp in this PR is exactly the same as in my local dev env.
we can clearly see that Qualcomm's OpenCL backend also has many TODOs and bug fixes.
I'm facing a not-easy situation with this candidate PR at the moment, although I have always thought of myself as an open-minded programmer.
of course, all related source code in ggml-dsp.c will be opened to this great tech community (one reason is that all the fundamental techniques in ggml-dsp.c come exactly from the original authors of ggml); I also hope that day comes ASAP.