about ggml-hexagon #18
jeffzhou2000 started this conversation in General
Replies: 1 comment

- how about generation speed of llama-3-8b? (0 replies)
llama.cpp for Qualcomm Hexagon NPU (aka ggml-hexagon)
Background
Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023, with a market share of 70.1 percent. Qualcomm is currently the No. 1 mobile SoC semiconductor company in the world.
About QNN SDK
The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified by Qualcomm to work with specific versions of the mainstream ML frameworks.
The Qualcomm® AI Engine Direct architecture is designed to be modular and allows for clean separation in the software for different hardware cores/accelerators such as the CPU, GPU and DSP that are designated as backends. Learn more about Qualcomm® AI Engine Direct backends here.
The Qualcomm® AI Engine Direct backends for different hardware cores/accelerators are compiled into individual core-specific libraries that come packaged with the SDK.
About Hexagon SDK
Each Qualcomm chip includes multiple Hexagon DSPs, such as the compute DSP (cDSP), audio DSP (aDSP), and sensor DSP (SLPI -- Sensor Low Power Island). Each of these DSPs implements a specific Instruction Set Architecture (ISA) version. The compute DSP, which is intended for compute-intensive tasks such as image processing, computer vision, and camera streaming, also includes an instruction-set extension for fixed-point vector operations called Hexagon Vector eXtensions (HVX). The following diagram provides an overview of the processing units within the cDSP and how they connect to the memory cache.

Compared to the host CPU, the DSP typically runs at a lower clock speed but provides more parallelism opportunities at the instruction level. This often makes the DSP a better alternative in terms of throughput and/or power consumption. As a result, it is preferable to offload as many large compute-intensive tasks as possible onto the DSP to reduce power consumption of the device and free up cycles on the CPU for additional features.
The Hexagon SDK is a lightweight, low-level SDK provided by Qualcomm. With it, developers and AI experts can program the cDSP hardware directly.
Llama.cpp + Hexagon NPU
The llama.cpp Hexagon NPU backend (aka the ggml-hexagon backend) initially targets the Qualcomm Hexagon NPU. Supported chipsets:
```mermaid
block-beta
  columns 1
  block:llamacpp
    llamacpp["llama_cpp"]
    style llamacpp fill:#3c3,color:#000,stroke:#000
  end
  block:ggml_backend
    ggml_backend["GGML backend subsystem"]
    style ggml_backend fill:#3c3,color:#000,stroke:#000
    block:ggmlbackends
      ggml_cpu["ggml-cpu"]
      ggml_metal["ggml-metal"]
      ggml_sycl["ggml-sycl"]
      ggml_cuda["ggml-cuda"]
      ggml_hip["ggml-hip"]
      ggml_vulkan["ggml-vulkan"]
      ggml_cann["ggml-cann"]
      ggml_opencl["ggml-opencl"]
      ggml_hexagon["ggml-hexagon"]
      ggml_nnpa["ggml-nnpa"]
      ggml_ane["ggml-ane"]
      style ggml_cpu fill:#888,color:#000,stroke:#000
      style ggml_metal fill:#888,color:#000,stroke:#000
      style ggml_sycl fill:#888,color:#000,stroke:#000
      style ggml_cuda fill:#888,color:#000,stroke:#000
      style ggml_hip fill:#888,color:#000,stroke:#000
      style ggml_vulkan fill:#888,color:#000,stroke:#000
      style ggml_cann fill:#888,color:#000,stroke:#000
      style ggml_opencl fill:#cc3,color:#000,stroke:#000
      style ggml_hexagon fill:#cc3,color:#000,stroke:#000
      style ggml_ane fill:#fff,color:#000,stroke:#f00,stroke-width:2,stroke-dasharray:5
      style ggml_nnpa fill:#cc3,color:#000,stroke:#000
    end
  end
  block:ggml_backendsubsystem
    ggml_backendsubsystem["GGML backend subsystem"]
    style ggml_backendsubsystem fill:#3c3,color:#000,stroke:#000
  end
  block:group1:2
    columns 2
    block:ggml_tensor
      ggml_tensor["GGML tensor"]
      style ggml_tensor fill:#3c3,color:#000,stroke:#000
    end
    block:ggml_cgraph
      ggml_cgraph["GGML cgraph"]
      style ggml_cgraph fill:#3c3,color:#000,stroke:#000
    end
  end
  block:OS
    Windows
    Linux
    Android
    QNX
  end
  block:hardware_vendors
    Intel
    AMD
    Apple
    Nvidia
    Huawei
    Loongson
    Qualcomm
    IBM
    ggml_metal --> Apple
    ggml_cuda --> Nvidia
    ggml_hip --> AMD
    ggml_cann --> Huawei
    ggml_sycl --> Intel
    ggml_opencl --> Qualcomm
    ggml_hexagon --> Qualcomm
    ggml_ane --> Apple
    ggml_nnpa --> IBM
  end
  block:hardware_types
    CPU
    GPU
    NPU
    DSP
  end
  block:hardware_archs
    x86
    arm
    risc
    loongson
  end
```

```mermaid
%%{init: {"flowchart": {"htmlLabels": false, 'nodeSpacing': 30, 'rankSpacing': 30}} }%%
flowchart LR
  classDef EXIST fill:#888,color:#000,stroke:#000
  classDef DONE fill:#3c3,color:#000,stroke:#000
  classDef WIP fill:#cc3,color:#000,stroke:#000
  classDef NEW fill:#fff,color:#000,stroke:#f00,stroke-width:2,stroke-dasharray:5
  subgraph Legend
    direction LR
    EXIST:::EXIST ~~~ WIP:::WIP ~~~ DONE:::DONE ~~~ NEW:::NEW
  end
```

News
06/27/2025
performance of fp32 4096x4096 mulmat on cDSP:
before 05/27/2025: about 28 seconds
relaunched the dev activity of project ggml-hexagon since 05/27/2025
06/09/2025: about 7-8 seconds
06/25/2025: about 6-8 seconds
06/27/2025: about 3.4-4.2 seconds
06/25/2025
06/09/2025
06/03/2025
05/10/2025
04/24/2025
04/17/2025
04/12/2025
04/09/2025
04/08/2025
04/07/2025
04/06/2025
04/05/2025
04/02/2025
03/31/2025
03/29/2025
03/25/2025-03/27/2025
03/19/2025---03/24/2025
03/12/2025---03/19/2025
implemented a concise version of the special approach: "mapping the entire ggml cgraph to a single QNN graph"
01/29/2025---03/11/2025
05/28/2024---06/15/2024
04/26/2024
04/24/2024
03/29/2024---04/24/2024
03/25/2024
03/05/2024---03/16/2024
OS
Hardware
Qualcomm Hexagon NPU
Verified devices
DataType Supports
Windows on ARM(Qualcomm desktop SoC)
A WoA (Windows on ARM) device equipped with a Snapdragon desktop SoC is required to verify build results or to do further development for WoA. Unfortunately, I have no such device, so there might be some minor issues on WoA. The good news for the WoA port is:
Android
How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone
Ubuntu 20.04 or 22.04 is validated and recommended as the host machine (other Linux distributions might also work).
Utilize build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically. The Qualcomm Hexagon SDK must be obtained with a Qualcomm Developer Account and cannot be downloaded automatically by this script.
You will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite)(aka Snapdragon 8 Gen 4)
The log output of "adb logcat | grep ggml-qnn" shows that this backend works as expected; programmers can use the same command to help with troubleshooting.
ggml-hexagon for WoA(Windows on ARM)
Before building, we need to modify the file <llama.cpp_src_path>/cmake/arm64-windows-llvm.cmake manually (this modification has side effects on other builds, so it is not applied automatically):

open a Windows command line prompt
Known Issues
TODO
Q&A
Please file issue reports at https://github.com/zhouwg/ggml-hexagon/discussions
GitHub contribution:
Please add the [ggml-hexagon] prefix/tag to discussion/PR titles to help me check and address them without delay.