about ggml-hexagon #18
jeffzhou2000 started this conversation in General
Replies: 1 comment

- how about generation speed of llama-3-8b? (0 replies)
llama.cpp for Qualcomm Hexagon NPU (aka ggml-hexagon)
Background
Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023, with a market share of 70.1 percent. Qualcomm is currently the No. 1 mobile SoC semiconductor company in the world.
About QNN SDK
The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified by Qualcomm to work with specific versions of the mainstream ML frameworks.
The Qualcomm® AI Engine Direct architecture is designed to be modular and allows for clean separation in the software for different hardware cores/accelerators such as the CPU, GPU and DSP that are designated as backends. Learn more about Qualcomm® AI Engine Direct backends here.
The Qualcomm® AI Engine Direct backends for different hardware cores/accelerators are compiled into individual core-specific libraries that come packaged with the SDK.
About Hexagon SDK
Each Qualcomm chip includes multiple Hexagon DSPs, such as the compute DSP (cDSP), audio DSP (aDSP), and sensor DSP (SLPI -- Sensor Low Power Island). Each of these DSPs implements a specific Instruction Set Architecture (ISA) version. The compute DSP, which is intended for compute-intensive tasks such as image processing, computer vision, and camera streaming, also includes an instruction-set extension for fixed-point vector operations called Hexagon Vector eXtensions (HVX). The following diagram provides an overview of the processing units within the cDSP and how they connect to the memory cache.

Compared to the host CPU, the DSP typically runs at a lower clock speed but provides more parallelism opportunities at the instruction level. This often makes the DSP a better alternative in terms of throughput and/or power consumption. As a result, it is preferable to offload as many large compute-intensive tasks as possible onto the DSP to reduce power consumption of the device and free up cycles on the CPU for additional features.
The Hexagon SDK is a lightweight, low-level SDK provided by Qualcomm. With it, developers and AI experts can program the cDSP hardware directly.
Llama.cpp + Hexagon NPU
The llama.cpp Hexagon NPU backend (aka the ggml-hexagon backend) initially targets the Qualcomm Hexagon NPU. Supported chipsets:
```mermaid
block-beta
  columns 1
  block:llamacpp
    llamacpp["llama_cpp"]
    style llamacpp fill:#3c3,color:#000,stroke:#000
  end
  block:ggml_backend
    ggml_backend["GGML backend subsystem"]
    style ggml_backend fill:#3c3,color:#000,stroke:#000
    block:ggmlbackends
      ggml_cpu["ggml-cpu"]
      ggml_metal["ggml-metal"]
      ggml_sycl["ggml-sycl"]
      ggml_cuda["ggml-cuda"]
      ggml_hip["ggml-hip"]
      ggml_vulkan["ggml-vulkan"]
      ggml_cann["ggml-cann"]
      ggml_opencl["ggml-opencl"]
      ggml_hexagon["ggml-hexagon"]
      ggml_nnpa["ggml-nnpa"]
      ggml_ane["ggml-ane"]
      style ggml_cpu fill:#888,color:#000,stroke:#000
      style ggml_metal fill:#888,color:#000,stroke:#000
      style ggml_sycl fill:#888,color:#000,stroke:#000
      style ggml_cuda fill:#888,color:#000,stroke:#000
      style ggml_hip fill:#888,color:#000,stroke:#000
      style ggml_vulkan fill:#888,color:#000,stroke:#000
      style ggml_cann fill:#888,color:#000,stroke:#000
      style ggml_opencl fill:#cc3,color:#000,stroke:#000
      style ggml_hexagon fill:#cc3,color:#000,stroke:#000
      style ggml_ane fill:#fff,color:#000,stroke:#f00,stroke-width:2,stroke-dasharray:5
      style ggml_nnpa fill:#cc3,color:#000,stroke:#000
    end
  end
  block:ggml_backendsubsystem
    ggml_backendsubsystem["GGML backend subsystem"]
    style ggml_backendsubsystem fill:#3c3,color:#000,stroke:#000
  end
  block:group1:2
    columns 2
    block:ggml_tensor
      ggml_tensor["GGML tensor"]
      style ggml_tensor fill:#3c3,color:#000,stroke:#000
    end
    block:ggml_cgraph
      ggml_cgraph["GGML cgraph"]
      style ggml_cgraph fill:#3c3,color:#000,stroke:#000
    end
  end
  block:OS
    Windows
    Linux
    Android
    QNX
  end
  block:hardware_vendors
    Intel
    AMD
    Apple
    Nvidia
    Huawei
    Loongson
    Qualcomm
    IBM
    ggml_metal --> Apple
    ggml_cuda --> Nvidia
    ggml_hip --> AMD
    ggml_cann --> Huawei
    ggml_sycl --> Intel
    ggml_opencl --> Qualcomm
    ggml_hexagon --> Qualcomm
    ggml_ane --> Apple
    ggml_nnpa --> IBM
  end
  block:hardware_types
    CPU
    GPU
    NPU
    DSP
  end
  block:hardware_archs
    x86
    arm
    risc
    loongson
  end
```

```mermaid
%%{init: {"flowchart": {"htmlLabels": false, 'nodeSpacing': 30, 'rankSpacing': 30}} }%%
flowchart LR
  classDef EXIST fill:#888,color:#000,stroke:#000
  classDef DONE fill:#3c3,color:#000,stroke:#000
  classDef WIP fill:#cc3,color:#000,stroke:#000
  classDef NEW fill:#fff,color:#000,stroke:#f00,stroke-width:2,stroke-dasharray:5
  subgraph Legend
    direction LR
    EXIST:::EXIST ~~~ WIP:::WIP ~~~ DONE:::DONE ~~~ NEW:::NEW
  end
```

News
06/27/2025
performance of fp32 4096x4096 mulmat on cDSP:
before 05/27/2025: about 28 seconds
relaunched the dev activity of project ggml-hexagon since 05/27/2025
06/09/2025: about 7-8 seconds
06/25/2025: about 6-8 seconds
06/27/2025: about 3.4-4.2 seconds
06/25/2025
06/09/2025
06/03/2025
05/10/2025
04/24/2025
04/17/2025
04/12/2025
04/09/2025
04/08/2025
04/07/2025
04/06/2025
04/05/2025
04/02/2025
03/31/2025
03/29/2025
03/25/2025-03/27/2025
03/19/2025---03/24/2025
03/12/2025---03/19/2025
implemented a concise version of the special approach: "mapping the entire ggml cgraph to a single QNN graph"
01/29/2025---03/11/2025
05/28/2024---06/15/2024
04/26/2024
04/24/2024
03/29/2024---04/24/2024
03/25/2024
03/05/2024---03/16/2024
OS
Hardware
Qualcomm Hexagon NPU
Verified devices
DataType Supports
Windows on ARM(Qualcomm desktop SoC)
A WoA (Windows on ARM) device equipped with a Snapdragon desktop SoC is required to verify build results or to do further development for WoA. Unfortunately, I have no such device, so there might be some minor issues on WoA. The good news for the WoA port is:
Android
How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone
Ubuntu 20.04 or 22.04 is validated and recommended as the host machine (other Linux distributions might also work).
Utilize build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically. The Qualcomm Hexagon SDK must be obtained with a Qualcomm Developer Account and cannot be downloaded automatically by this script.
You will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite)(aka Snapdragon 8 Gen 4)
The log output of "adb logcat | grep ggml-qnn" shows that this backend works as expected; programmers can use the same command to help with troubleshooting.
ggml-hexagon for WoA(Windows on ARM)
Before building, we need to modify the file <llama.cpp_src_path>/cmake/arm64-windows-llvm.cmake manually (this modification has side effects on other builds, so it is not applied automatically):

open a Windows command line prompt
Known Issues
TODO
Q&A
Please file issue reports at https://github.com/zhouwg/ggml-hexagon/discussions
GitHub contribution:
Please add the [ggml-hexagon] prefix/tag to discussion/PR titles to help me check and address them without delay.