Conversation

@simondanielsson (Contributor) commented Oct 14, 2025

Purpose

Part of #26201.

Adds Automatic Prefix Caching (APC) for GDN, following the approach introduced for Mamba2 APC in #25752.

Specifically:

  • Extends the gated-delta chunk kernel to optionally return per-chunk intermediate states (flattening them into a contiguous stream so callers can repopulate prefix cache blocks).
  • Updates Qwen3NextGatedDeltaNet to recycle cached states during decode by copying the last computed block into the newly scheduled slot, and during prefill to replay the returned chunk history into persistent SSM cache blocks so later tokens can hit the prefix cache (see the sketch below).

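To make the mechanism concrete, here is a minimal sketch of the two code paths (a sketch only: the tensor shapes, names, and free functions are illustrative assumptions, not the actual vLLM or kernel API):

import torch

def replay_prefill_states(
    chunk_states: torch.Tensor,  # [num_chunks, num_heads, head_dim, state_dim];
                                 # per-chunk intermediate states returned by the
                                 # chunk kernel, flattened into a contiguous stream
    ssm_cache: torch.Tensor,     # [num_blocks, num_heads, head_dim, state_dim]
    block_ids: list[int],        # cache blocks backing this sequence's prefix
) -> None:
    # Scatter each chunk's intermediate state into the SSM cache block that
    # backs the corresponding prefix block, so a later request sharing the
    # prefix can skip recomputation.
    for chunk_idx, block_id in enumerate(block_ids):
        ssm_cache[block_id].copy_(chunk_states[chunk_idx])

def recycle_decode_state(
    ssm_cache: torch.Tensor,
    last_block_id: int,  # block holding the most recently computed state
    new_block_id: int,   # slot newly scheduled for this decode step
) -> None:
    # During decode, carry the running recurrent state forward by copying
    # the last computed block into the newly scheduled slot.
    ssm_cache[new_block_id].copy_(ssm_cache[last_block_id])
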
Latency benchmark, APC ("default") vs. no-APC ("default-noapc"):
[latency comparison plot]

TODOs:

  • Add better logic for deciding when the kernel should return intermediate states, rather than relying on GDN_RECOMPUTE_SUPPRESS_LEVEL=4.
  • Make it work with full cudagraphs (decode).
  • Extend the APC test suite to also run on qwen3-next (tiny random).
  • Run latency benchmarks on a small model.
  • Benchmark on 80B-A3B (I will need help from someone here).

Outstanding tasks, not addressed in this PR:

  • Support speculative decoding (specdec)

Test Plan

Note: this was tested only with the tiny tiny-random/qwen3-next-moe model, as I only have an L4 with 20 GB of VRAM. It would be great if someone could also try Qwen3-Next-80B-A3B.

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory
import time

if __name__ == "__main__":
    # Note: should be tested with Qwen/Qwen3-Next-80B-A3B-Instruct
    MODEL = "tiny-random/qwen3-next-moe"
    PROMPT_MULTIPLE = 310
    sampling_params = SamplingParams(temperature=0.0)
    prefix = (  # examples/offline_inference/prefix_caching.py
        "You are an expert school principal, skilled in effectively managing "
        "faculty and staff. Draft 10-15 questions for a potential first grade "
        "Head Teacher for my K-12, all-girls', independent school that emphasizes "
        "community, joyful discovery, and life-long learning. The candidate is "
        "coming in for a first-round panel interview for a 8th grade Math "
        "teaching role. They have 5 years of previous teaching experience "
        "as an assistant teacher at a co-ed, public school with experience "
        "in middle school math teaching. "
    )
    prefix2 = "Based on these information, fulfill the following paragraph: "
    prompt = PROMPT_MULTIPLE * prefix + prefix2 + "Hello, my name is"
    print("Prompt length:", len(prompt))
    for APC in [True, False]:
        engine = LLM(
            model=MODEL,
            enable_prefix_caching=APC,
            gpu_memory_utilization=0.3,
            disable_log_stats=False,
        )
        for i in range(3):
            if i == 0:
                print("Warm-up")
            if i == 1:
                print("Measuring")
                start_time = time.time()
            outputs = engine.generate(prompt, sampling_params)
            print("APC:", APC, i, f"Generated text: {outputs[0].outputs[0].text!r}")
            for m in engine.llm_engine.get_metrics():
                if "vllm:prefix_cache_hits" in m.name:
                    print(m.name, m.value)
        print("APC:", APC, "loop took --- %s seconds ---" % (time.time() - start_time))
        del engine
        cleanup_dist_env_and_memory()

Test Result

Note: the generated text is gibberish because the model has random weights.

No cudagraphs (enforce_eager=True):

Warm-up
APC: True 0 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
Measuring
APC: True 1 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 31680
APC: True 2 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 63360
APC: True loop took --- 0.7412824630737305 seconds ---

Warm-up
APC: False 0 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
Measuring
APC: False 1 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
APC: False 2 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
APC: False loop took --- 0.9228880405426025 seconds ---

With cudagraphs (enforce_eager=False):

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 24.18it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00,  8.98it/s]
INFO 10-14 13:44:50 [gpu_model_runner.py:3821] Graph capturing finished in 7 secs, took 0.34 GiB
INFO 10-14 13:44:50 [core.py:242] init engine (profile, create kv cache, warmup model) took 25.02 seconds
INFO 10-14 13:44:51 [loggers.py:191] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 10969
INFO 10-14 13:44:51 [llm.py:335] Supported tasks: ('generate',)
Warm-up
APC: True 0 Generated text: ' estado Bernie阿拉 remotelySr春晚 ứngibelENCYcancel scientificallyResidentsnah Stout__))荁'
vllm:prefix_cache_hits 0
Measuring
APC: True 1 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 31680
APC: True 2 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 63360
APC: True loop took --- 0.3312194347381592 seconds ---

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:00<00:00, 72.41it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 17.99it/s]
INFO 10-14 13:54:26 [gpu_model_runner.py:3821] Graph capturing finished in 3 secs, took 0.20 GiB
INFO 10-14 13:54:26 [core.py:242] init engine (profile, create kv cache, warmup model) took 8.07 seconds
INFO 10-14 13:54:27 [loggers.py:191] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 11615
INFO 10-14 13:54:27 [llm.py:335] Supported tasks: ('generate',)
Warm-up
APC: False 0 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
Measuring
APC: False 1 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
APC: False 2 Generated text: ' estado Bernieatial oggi_five뉼หน้าที่wordpressหน้าที่ibelENCY荁=x Color Gh [],\r\n'
vllm:prefix_cache_hits 0
APC: False loop took --- 0.5677089691162109 seconds ---

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot added the qwen (Related to Qwen models) and v1 labels Oct 14, 2025

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @simondanielsson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Oct 14, 2025
Signed-off-by: simondanielsson <[email protected]>
mergify bot removed the needs-rebase label Oct 14, 2025
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
simondanielsson changed the title [Feature] GatedDeltaNet Automatic Prefix Caching → [V1][Hybrid] GatedDeltaNet Automatic Prefix Caching Oct 16, 2025
simondanielsson marked this pull request as ready for review October 16, 2025 15:40
@simondanielsson (Contributor, Author) commented:

@codex review

chatgpt-codex-connector bot left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

@simondanielsson (Contributor, Author) commented:

@codex review

chatgpt-codex-connector bot commented:

Codex Review: Didn't find any major issues. Another round soon, please!

Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Comment on lines +1508 to +1512
if chunk_size is None and self.hf_text_config.model_type == "qwen3_next":
    # Fallback for Qwen3-Next. 64 is a hardcoded value in the GDN kernel.
    # https://github.com/fla-org/flash-linear-attention/blob/2e7336262c11f8bc6cd6a94b1eb5ee353ae8b4cd/fla/ops/common/chunk_delta_h.py#L439
    return 64

A Member commented:

Would it be possible to put this model-specific special case in the model code, i.e. in vllm/model_executor/models/qwen3_next.py?

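For illustration only, a minimal sketch of what that could look like (the class shape and method name are assumptions, not the actual vLLM interface):

# Hypothetical sketch for vllm/model_executor/models/qwen3_next.py:
# expose the kernel's fixed chunk size from the model code instead of
# special-casing "qwen3_next" in the generic config logic.

# 64 is hardcoded in the upstream GDN chunk kernel (chunk_delta_h.py).
GDN_CHUNK_SIZE = 64

class Qwen3NextGatedDeltaNet:  # simplified stand-in for the real module
    @classmethod
    def get_mamba_chunk_size(cls) -> int:
        return GDN_CHUNK_SIZE
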