[V1][Hybrid] GatedDeltaNet Automatic Prefix Caching #26807
base: main
Signed-off-by: simondanielsson <[email protected]>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <[email protected]> Signed-off-by: FENP <[email protected]> Signed-off-by: Jaya Yuan <[email protected]>
0e64636 to 1d3afe0
…make sure prefill block-history indexing captures decode chunks Signed-off-by: simondanielsson <[email protected]>
…ng GDN_RECOMPUTE_SUPPRESS_LEVEL Signed-off-by: simondanielsson <[email protected]>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
@codex review
Codex Review: Didn't find any major issues. Another round soon, please!
    if chunk_size is None and self.hf_text_config.model_type == "qwen3_next":
        # Fallback for Qwen3-Next. 64 is a hardcoded value in the GDN kernel.
        # https://github.com/fla-org/flash-linear-attention/blob/2e7336262c11f8bc6cd6a94b1eb5ee353ae8b4cd/fla/ops/common/chunk_delta_h.py#L439
        return 64
Would it be possible to put this model-specific special case in the model code, i.e. in vllm/model_executor/models/qwen3_next.py?
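One possible shape for this, as a rough sketch only: the generic config code stays free of model names, and each model file registers its own chunk-size fallback. The registry, the decorator, and `resolve_chunk_size` are all hypothetical names made up for illustration; none of them are existing vLLM APIs.

```python
from typing import Callable, Optional

# Hypothetical registry (not a vLLM API): model_type -> fallback provider.
_CHUNK_SIZE_FALLBACKS: dict[str, Callable[[], int]] = {}


def register_chunk_size_fallback(model_type: str):
    """Decorator a model module could use to declare its own fallback."""
    def decorator(fn: Callable[[], int]) -> Callable[[], int]:
        _CHUNK_SIZE_FALLBACKS[model_type] = fn
        return fn
    return decorator


# This part would live next to the model code, e.g. in qwen3_next.py.
@register_chunk_size_fallback("qwen3_next")
def _qwen3_next_chunk_size() -> int:
    # 64 is a hardcoded value in the GDN kernel (see the diff comment).
    return 64


def resolve_chunk_size(chunk_size: Optional[int],
                       model_type: str) -> Optional[int]:
    """Generic code path: prefer the configured value, else ask the model."""
    if chunk_size is not None:
        return chunk_size
    fallback = _CHUNK_SIZE_FALLBACKS.get(model_type)
    return fallback() if fallback is not None else None
```

With this layout the generic resolver never mentions "qwen3_next"; a new hybrid model would only add its own registration.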
Purpose
Part of #26201.
Adds Automatic Prefix Caching for GDN. Tries to be similar to APC for Mamba2 as introduced in #25752.
Specifically:
- Qwen3NextGatedDeltaNet: recycles cached states during decode by copying the last computed block into the newly scheduled slot, and during prefill replays the returned chunk history into persistent SSM cache blocks so later tokens can hit the prefix cache.

Latency benchmark (APC ("default") vs no-APC ("default-noapc")):
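To illustrate the prefill side of the idea (persisting the recurrent state at block boundaries so a later request with the same prefix can resume from it), here is a toy sketch. It is not the PR's implementation: `step` is a stand-in for the GDN recurrence, and `BLOCK_SIZE`, `ToyStateCache`, and the integer "state" are assumptions chosen only to keep the example small.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def step(state: int, token: int) -> int:
    """Stand-in for the recurrent state update (NOT the real GDN kernel)."""
    return (state * 31 + token) % 1_000_003


@dataclass
class ToyStateCache:
    # Maps a block-aligned token prefix to the state reached after it.
    blocks: dict[tuple[int, ...], int] = field(default_factory=dict)

    def longest_hit(self, tokens: list[int]) -> tuple[int, int]:
        """Return (cached_token_count, state) for the longest cached prefix."""
        n = (len(tokens) // BLOCK_SIZE) * BLOCK_SIZE
        while n > 0:
            key = tuple(tokens[:n])
            if key in self.blocks:
                return n, self.blocks[key]
            n -= BLOCK_SIZE
        return 0, 0  # no hit: start from the initial state

    def prefill(self, tokens: list[int]) -> tuple[int, int]:
        """Run the recurrence from the best cached prefix.

        Stores the state at every block boundary that is crossed and
        returns (final_state, number_of_tokens_actually_computed).
        """
        start, state = self.longest_hit(tokens)
        computed = 0
        for i in range(start, len(tokens)):
            state = step(state, tokens[i])
            computed += 1
            if (i + 1) % BLOCK_SIZE == 0:
                self.blocks[tuple(tokens[: i + 1])] = state
        return state, computed
```

A second prompt that shares the first eight tokens with an earlier one only pays for the tokens after the last cached block boundary, and ends in the same state as a full recomputation.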

TODOs:
GDN_RECOMPUTE_SUPPRESS_LEVEL=4.
Outstanding tasks, not captured here:
Test Plan
Note: this runs only with the tiny tiny-random/qwen3-next-moe model, as I only have an L4 with 20GB VRAM. It would be great if someone could also try it with Qwen3-Next-80B-A3B.
Test Result
Note: the output is gibberish because the model has random weights.
No cudagraphs (enforce_eager=True):
With cudagraphs (enforce_eager=False):