Add support for QRWKV6 hybrid models & slight optimization for RWKV6 #11001

MollySophia · 2024-12-28T10:22:55Z

QRWKV6-32B is a new model from Recursal that combines the Qwen2.5 architecture with RWKV6.
It 'converts' the QKV attention of a Qwen2.5-32B-Instruct model into RWKV6 linear attention, retaining the knowledge of the original Qwen model while gaining the advantages of linear models (constant VRAM usage and FLOPs per token, independent of context length).
More info/model for testing: https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1
Some converted GGUF for testing: https://huggingface.co/mollysama/QRWKV6-32B-Instruct-Preview-GGUF

Changes in this PR:

Add a gated linear attention (GLA) op with CPU and CUDA implementations; it looks like a simplified version of the RWKV6 wkv attention.
Model conversion and inference for QRWKV6-32B.
RWKV6 optimizations: graph simplification, and concatenated lerp weights to reduce CPU overhead during inference (credit to @compilade).
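For readers unfamiliar with gated linear attention, here is a minimal single-head reference sketch of the general recurrence in plain Python. It assumes the common formulation S_t = diag(g_t) S_{t-1} + k_t v_t^T and o_t = scale * q_t^T S_t; the function name and argument layout are hypothetical, and this is not the kernel added in this PR.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def gated_linear_attention(
    q: Matrix,
    k: Matrix,
    v: Matrix,
    g: Matrix,
    state: Optional[Matrix] = None,
    scale: float = 1.0,
) -> Tuple[Matrix, Matrix]:
    """Single-head gated linear attention (illustrative sketch only).

    q, k, g: [n_tokens][d_k]; v: [n_tokens][d_v]; state: [d_k][d_v].
    Returns (outputs [n_tokens][d_v], final state).
    """
    d_k, d_v = len(k[0]), len(v[0])
    # Copy the incoming state so the caller's object is not mutated.
    if state is None:
        state = [[0.0] * d_v for _ in range(d_k)]
    else:
        state = [row[:] for row in state]

    outputs: Matrix = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        # Decay each state row by its gate, then add the outer product k_t v_t^T.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g_t[i] * state[i][j] + k_t[i] * v_t[j]
        # Read out with the query: o_t[j] = scale * sum_i q_t[i] * S[i][j].
        outputs.append(
            [scale * sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state
```

The state has a fixed shape of d_k x d_v regardless of how many tokens have been processed, which is where the constant memory footprint comes from. Passing the returned state back in lets a long sequence be processed in chunks.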

Testing details:

32B Q4_0/Q4_K quantized model running on a single 4090 with decent speed:

$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1-Q4_0.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6qwen2 32B Q4_0            |  19.34 GiB |    34.74 B | CUDA       |  99 |         pp512 |        819.60 ± 1.01 |
| rwkv6qwen2 32B Q4_0            |  19.34 GiB |    34.74 B | CUDA       |  99 |         tg128 |         32.72 ± 0.01 |

build: 5a73dbcb (4397)

wikitext2 PPLs:

| Quant type | PPL                |
| ---------- | -----------------: |
| f32        | 5.6987 +/- 0.03365 |
| q8_0       | 5.7005 +/- 0.03370 |
| q6_k       | 5.7126 +/- 0.03376 |
| q5_k_s     | 5.7339 +/- 0.03393 |
| q4_k_m     | 5.7921 +/- 0.03428 |
| q4_0       | 5.8568 +/- 0.03481 |
| q3_k_m     | 6.0677 +/- 0.03635 |
| q2_k       | 7.4547 +/- 0.04597 |
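To put the perplexity numbers in perspective, the following small sketch computes each quantization's relative increase over the f32 baseline. The PPL values are copied from the results above; the helper name `relative_increase_percent` is made up for this example, and the error bars are ignored.

```python
# Wikitext2 perplexities reported above (mean values only).
PPL = {
    "f32": 5.6987,
    "q8_0": 5.7005,
    "q6_k": 5.7126,
    "q5_k_s": 5.7339,
    "q4_k_m": 5.7921,
    "q4_0": 5.8568,
    "q3_k_m": 6.0677,
    "q2_k": 7.4547,
}


def relative_increase_percent(value: float, baseline: float) -> float:
    """Percentage by which `value` exceeds `baseline`."""
    return (value / baseline - 1.0) * 100.0


if __name__ == "__main__":
    baseline = PPL["f32"]
    for quant, ppl in PPL.items():
        delta = relative_increase_percent(ppl, baseline)
        print(f"{quant:8s} {ppl:.4f}  {delta:+.2f}%")
```

By this measure q4_0 is roughly 2.8% above f32 and q2_k roughly 31% above it.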

Performance of QRWKV6-32B before/after concatenating the lerp weights:

(Sry for the image attachment)

before:
$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1/QRWKV6-32B-Instruct-Preview-v0.1-F16.gguf -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H800, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         pp512 |        697.64 ± 0.59 |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         tg128 |         21.91 ± 0.00 |

build: b7b45753 (4397)

after:
$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1/QRWKV6-32B-Instruct-Preview-v0.1-F16-fused-lerp.gguf -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H800, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         pp512 |        731.32 ± 1.10 |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         tg128 |         26.51 ± 0.01 |

build: b7b45753 (4397)
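The gain from the two runs above can be computed directly. This sketch uses only the mean t/s values reported by llama-bench in the before/after tables; the function name `speedup_percent` is an assumption made for this example.

```python
def speedup_percent(before: float, after: float) -> float:
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# (before, after) tokens per second, F16 model on a single H800.
RESULTS = {
    "pp512": (697.64, 731.32),
    "tg128": (21.91, 26.51),
}

if __name__ == "__main__":
    for test, (before, after) in RESULTS.items():
        print(f"{test}: {before} -> {after} t/s ({speedup_percent(before, after):+.1f}%)")
```

That works out to about a 4.8% gain in prompt processing and about a 21% gain in token generation, consistent with the change mainly reducing per-token overhead.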

Signed-off-by: Molly Sophia <[email protected]>

MollySophia · 2025-01-07T00:31:36Z

Hi! @ggerganov
May I request a review? :3

ggerganov

I haven't tested the models. ggml-ci is passing on my CUDA machine.


MollySophia · 2025-01-10T00:44:53Z

Self-reviewed again and fixed a merge issue after the refactor that made RWKV6 fail to run with fused lerp weights.
I think this PR is ready to go.
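As a rough illustration of what fusing the lerp weights means, the sketch below compares five separate token-shift interpolations with one pass over a stacked weight. It is a simplified model under several assumptions: the `w, k, v, r, g` ordering and all function names are hypothetical, the data-dependent part of the RWKV6 lerp is omitted, and it is not the code from this PR.

```python
from typing import Dict, List

Vector = List[float]

# Assumed ordering of the five mixing weights inside the fused tensor.
LERP_ORDER = ("w", "k", "v", "r", "g")


def lerp_separate(x: Vector, x_prev: Vector, mu: Dict[str, Vector]) -> Dict[str, Vector]:
    """Five independent interpolations, one per weight (the unfused layout)."""
    return {
        name: [xi + (pi - xi) * m for xi, pi, m in zip(x, x_prev, mu[name])]
        for name in LERP_ORDER
    }


def fuse_lerp_weights(mu: Dict[str, Vector]) -> List[Vector]:
    """Stack the five weights into one [5][n_embd] tensor, done once at conversion time."""
    return [list(mu[name]) for name in LERP_ORDER]


def lerp_fused(x: Vector, x_prev: Vector, fused: List[Vector]) -> List[Vector]:
    """One pass over the stacked weight; the shifted difference is computed once."""
    diff = [pi - xi for xi, pi in zip(x, x_prev)]
    return [[xi + d * m for xi, d, m in zip(x, diff, row)] for row in fused]
```

In a tensor library the fused form becomes a single broadcasted operation instead of five small ones, which is the kind of per-token graph overhead the fused weights are meant to reduce.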


llama: add support for QRWKV6 model architecture (ggml-org#11001)

* WIP: Add support for RWKV6Qwen2
  Signed-off-by: Molly Sophia <[email protected]>
* RWKV: Some graph simplification
  Signed-off-by: Molly Sophia <[email protected]>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
  Signed-off-by: Molly Sophia <[email protected]>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
  Signed-off-by: Molly Sophia <[email protected]>
* Fix some typos
  Signed-off-by: Molly Sophia <[email protected]>
* code format changes
  Signed-off-by: Molly Sophia <[email protected]>
* Fix wkv test & add gla test
  Signed-off-by: Molly Sophia <[email protected]>
* Fix cuda warning
  Signed-off-by: Molly Sophia <[email protected]>
* Update README.md
  Signed-off-by: Molly Sophia <[email protected]>
* Update ggml/src/ggml-cuda/gla.cu
  Co-authored-by: Georgi Gerganov <[email protected]>
* Fix fused lerp weights loading with RWKV6
  Signed-off-by: Molly Sophia <[email protected]>
* better sanity check skipping for QRWKV6 in llama-quant
  thanks @compilade
  Signed-off-by: Molly Sophia <[email protected]>
  Co-authored-by: compilade <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>


MollySophia added 9 commits January 3, 2025 16:56

WIP: Add support for RWKV6Qwen2

f298f03

Signed-off-by: Molly Sophia <[email protected]>

RWKV: Some graph simplification

385b611

Signed-off-by: Molly Sophia <[email protected]>

Add support for RWKV6Qwen2 with cpu and cuda GLA

fab0aa7

Signed-off-by: Molly Sophia <[email protected]>

RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

bc930cd

Signed-off-by: Molly Sophia <[email protected]>

Fix some typos

f2c1a5c

Signed-off-by: Molly Sophia <[email protected]>

code format changes

aaa870e

Signed-off-by: Molly Sophia <[email protected]>

Fix wkv test & add gla test

00930e6

Signed-off-by: Molly Sophia <[email protected]>

Fix cuda warning

08cf560

Signed-off-by: Molly Sophia <[email protected]>

Update README.md

331581b

Signed-off-by: Molly Sophia <[email protected]>

MollySophia force-pushed the rwkv6qwen2 branch from 69148cf to 331581b on January 3, 2025 09:21

ggerganov approved these changes Jan 7, 2025

Review comment on ggml/src/ggml-cuda/gla.cu (outdated, resolved)

ggerganov requested a review from compilade January 7, 2025 08:58

MollySophia and others added 2 commits January 7, 2025 17:00

Update ggml/src/ggml-cuda/gla.cu

aed0afb

Co-authored-by: Georgi Gerganov <[email protected]>

Fix fused lerp weights loading with RWKV6

d8a304c

Signed-off-by: Molly Sophia <[email protected]>

compilade reviewed Jan 10, 2025

Review comment on src/llama-quant.cpp (outdated, resolved)

better sanity check skipping for QRWKV6 in llama-quant

324afba

thanks @compilade Signed-off-by: Molly Sophia <[email protected]>

compilade approved these changes Jan 10, 2025

MollySophia merged commit ee7136c into ggml-org:master Jan 10, 2025
50 of 51 checks passed

qnixsynapse mentioned this pull request Jan 10, 2025

SYCL: Add gated linear attention kernel #11175

Merged
