fix(v1/kv_cache): resolve async KV transfer bug in cascade attention #23485

ayushsatyam146 · 2025-08-24T08:57:36Z

Purpose

Solves #23130. This change fixes a critical bug in vLLM's cascade attention optimization in the V1 arch. The bug is in get_num_common_prefix_blocks(), which determines how many KV cache blocks are shared among all currently running requests to enable cascade attention optimizations.

Changes made

Replace ref_cnt-based common prefix detection with running request tracking
Update get_num_common_prefix_blocks() to accept running_request_ids set
Fix FullAttentionManager to count actual references from running requests
Prevent incorrect cascade attention when async KV offloading delays cleanup
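The failure mode behind these changes can be illustrated with a small, self-contained sketch. It is not the vLLM code: Block, count_by_ref_cnt, count_by_running_requests and build_scenario are hypothetical names, and the ref_cnt check is an assumed simplification of the old heuristic. Request "C_finished" has completed, but its blocks are still referenced because an async KV transfer is pending.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block with a reference count."""
    block_id: int
    ref_cnt: int = 0


def count_by_ref_cnt(blocks, num_running):
    """Assumed old-style heuristic: a block counts as common when its
    ref_cnt equals the number of running requests."""
    count = 0
    for block in blocks:
        if block.ref_cnt != num_running:
            break
        count += 1
    return count


def count_by_running_requests(req_to_blocks, running_request_ids):
    """Count leading blocks that every *running* request actually holds."""
    running = [req_to_blocks[req_id] for req_id in running_request_ids]
    if not running:
        return 0
    count = 0
    for position, block in enumerate(running[0]):
        if all(len(bl) > position and bl[position] is block for bl in running):
            count += 1
        else:
            break
    return count


def build_scenario():
    """A and B are running; C_finished is done but still holds a block
    while its async KV transfer is pending."""
    shared = Block(0)
    other = Block(1)
    req_to_blocks = {"A": [shared], "B": [other], "C_finished": [shared]}
    for blocks in req_to_blocks.values():
        for block in blocks:
            block.ref_cnt += 1
    return req_to_blocks
```

In this scenario the shared block has ref_cnt 2, which matches the two running requests, so the ref_cnt heuristic reports one common block even though B does not hold it. Counting against the running requests only reports zero.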

gemini-code-assist

Code Review

This pull request correctly addresses a critical bug in cascade attention related to asynchronous KV transfer by replacing the unreliable ref_cnt-based logic with explicit tracking of running requests. The changes are well-contained and logically sound. My review includes one suggestion to optimize the performance of the new common prefix block calculation, which could be a bottleneck in scenarios with many concurrent requests.

heheda12345

Even after block in self.req_to_blocks[req_id] is fixed, I'm still concerned about performance when all requests share a very long prefix. The time complexity is num_requests x num_blocks_per_request. What about passing in the requests that are not running but are still in KV transfer?

vllm/v1/core/single_type_kv_cache_manager.py

ayushsatyam146 · 2025-08-29T07:03:33Z

Hi @heheda12345 @njhill, the time complexity of the new code is now O(RxB), down from O(RxB²) in the previous iteration. I also have a caching-based implementation in mind that would bring the complexity down to O(1) in the best case and O(RxB) in the worst case. However, it makes the code a little complex for this module, so I did not want to push that version without someone's approval. PTAL and let me know if this is fine or if we need to improve it further. Thanks!

heheda12345 · 2025-08-29T07:49:03Z

My example code is O((num_transfering_request+1) * num_common_blocks). It should be much faster than num_running_request * num_common_blocks for short requests.
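The example code mentioned here is not reproduced in this thread. The sketch below is a hypothetical reconstruction of an approach with the stated cost, not the reviewer's actual code: it walks one running request's blocks and, at each position, subtracts the references held by requests that are not running but still in KV transfer. All names (Block, num_common_prefix_blocks, transferring_request_ids) are assumptions for illustration, and it assumes every block holder is either running or transferring.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block with a reference count."""
    block_id: int
    ref_cnt: int = 0


def num_common_prefix_blocks(req_to_blocks, running_request_id,
                             num_running, transferring_request_ids):
    """Count leading blocks shared by all running requests.

    Cost is roughly (num_transferring + 1) * num_common_blocks: one pass
    over the common prefix, with one check per transferring request.
    """
    transferring = [req_to_blocks[req_id] for req_id in transferring_request_ids]
    count = 0
    for position, block in enumerate(req_to_blocks[running_request_id]):
        # References held by finished-but-transferring requests must not
        # be mistaken for references from running requests.
        held_by_transferring = sum(
            1 for blocks in transferring
            if position < len(blocks) and blocks[position] is block
        )
        if block.ref_cnt - held_by_transferring != num_running:
            break
        count += 1
    return count
```

The loop stops at the first block that is not common, so requests with a short shared prefix exit early regardless of how many requests are running.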

ayushsatyam146 · 2025-08-29T18:11:26Z

Hi @heheda12345 I did the changes your way this time and have pushed it as well. Please take a look, Thanks!

ayushsatyam146 · 2025-09-02T06:18:22Z

Hi @heheda12345 just a gentle reminder to please take a look and approve if everything is right. Thanks!

vllm/v1/core/kv_cache_coordinator.py

vllm/v1/core/single_type_kv_cache_manager.py

heheda12345 · 2025-09-10T05:30:57Z

@ayushsatyam146 Hi, can you help to update this PR?

ayushsatyam146 · 2025-09-10T05:50:09Z

Hi @heheda12345 sorry I got sick this week and couldn't work on this. But I am good now and will update this soon, Thanks for the patience.

ayushsatyam146 · 2025-09-27T02:36:31Z

@heheda12345, I tried to address all your concerns. Can you please take a look now? Thanks!

vllm/v1/core/single_type_kv_cache_manager.py

mergify · 2025-10-06T09:31:31Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ayushsatyam146 · 2025-10-06T19:29:12Z

Hi @heheda12345 I resolved the merge conflicts on this and also included the changes suggested by you. Please take a look, Thanks!

heheda12345

This implementation looks great!

vllm/v1/core/single_type_kv_cache_manager.py

heheda12345 · 2025-10-07T04:35:42Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


vllm/v1/core/single_type_kv_cache_manager.py

vllm/v1/core/kv_cache_manager.py

* Replace ref_cnt-based common prefix detection with running request tracking
* Update get_num_common_prefix_blocks() to accept running_request_ids set
* Fix FullAttentionManager to count actual references from running requests
* Prevent incorrect cascade attention when async KV offloading delays cleanup

This resolves a bug where completed requests with pending async transfers still contributed to ref_cnt, causing incorrect cascade attention decisions.

Signed-off-by: Ayush Satyam <[email protected]>

ayushsatyam146 · 2025-10-07T17:33:33Z

Hi @heheda12345, I went through this approach, and apart from some occasional conservative handling of cascade attention, it looks good overall. I’ve implemented it as well — please take a look when you get a chance. Thanks!

heheda12345

LGTM! I think this solution is clean.

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]>

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>

### What this PR does / why we need it?

This is step 1 of refactoring the code to adapt to vllm main. This PR is aligned with vllm-project/vllm@17c540a.

1. Refactor deepseek to the latest code arch as of vllm-project/vllm@17c540a
2. Bunches of fixes due to vllm changes
   - Fix `AscendScheduler` `__post_init__`, caused by vllm-project/vllm#25075
   - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by vllm-project/vllm#26296
   - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by vllm-project/vllm#23485
   - Fix `MLAAttention` import, caused by vllm-project/vllm#25103
   - Fix `SharedFusedMoE` import, caused by vllm-project/vllm#26145
   - Fix `LazyLoader` import, caused by vllm-project/vllm#27022
   - Fix `vllm.utils.swap_dict_values` import, caused by vllm-project/vllm#26990
   - Fix `Backend` enum import, caused by vllm-project/vllm#25893
   - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by vllm-project/vllm#26355
   - Fix fused_moe ops, caused by vllm-project/vllm#24097
   - Fix bert model because of `inputs_embeds`, caused by vllm-project/vllm#25922
   - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by vllm-project/vllm#24172
   - Fix `splitting_ops` changes introduced by vllm-project/vllm#25845
   - Fix multi-modality changes introduced by vllm-project/vllm#16229
   - Fix lora bias dropping issue introduced by vllm-project/vllm#25807
   - Fix structured output break introduced by vllm-project/vllm#26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

CI passed with existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: Icey <[email protected]>
Co-authored-by: Icey <[email protected]>

ayushsatyam146 requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners August 24, 2025 08:57

mergify bot added the v1 label Aug 24, 2025

gemini-code-assist bot reviewed Aug 24, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch from f3f7105 to 91e3694 Compare August 24, 2025 10:11

heheda12345 reviewed Aug 26, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch 4 times, most recently from 542d108 to 8adac43 Compare August 29, 2025 04:48

ayushsatyam146 force-pushed the kv-cache-fix branch from 8adac43 to 4d368d4 Compare August 29, 2025 18:09

ayushsatyam146 force-pushed the kv-cache-fix branch 4 times, most recently from 64ed09d to 0d66b57 Compare September 2, 2025 03:53

heheda12345 reviewed Sep 2, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch from 0d66b57 to b11852c Compare September 26, 2025 16:37

ayushsatyam146 requested a review from ApostaC as a code owner September 26, 2025 16:37

ayushsatyam146 force-pushed the kv-cache-fix branch from b11852c to 606c471 Compare September 26, 2025 16:37

ayushsatyam146 force-pushed the kv-cache-fix branch from 4d031f2 to fd1c710 Compare October 4, 2025 05:05

heheda12345 reviewed Oct 6, 2025

mergify bot added the needs-rebase label Oct 6, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch from fd1c710 to bd0e85c Compare October 6, 2025 19:23

mergify bot removed the needs-rebase label Oct 6, 2025

heheda12345 reviewed Oct 7, 2025

chatgpt-codex-connector bot reviewed Oct 7, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch from bd0e85c to e105a65 Compare October 7, 2025 04:46

heheda12345 reviewed Oct 7, 2025

ayushsatyam146 force-pushed the kv-cache-fix branch from e105a65 to 912667d Compare October 7, 2025 07:15

ayushsatyam146 force-pushed the kv-cache-fix branch from 912667d to d3405fd Compare October 7, 2025 17:02

heheda12345 approved these changes Oct 8, 2025

heheda12345 enabled auto-merge (squash) October 8, 2025 02:55

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 8, 2025

heheda12345 merged commit cd98905 into vllm-project:main Oct 8, 2025
46 checks passed

mrasquinha-g pushed a commit to mrasquinha-g/vllm that referenced this pull request Oct 9, 2025

fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (v…

8bb6fd8

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]>

This was referenced Oct 16, 2025

[CI] Upgrade vllm to newest commit vllm-project/vllm-ascend#3423

Closed

[CI] Upgrade vllm to 0.11.1 vllm-project/vllm-ascend#3499

Closed

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (v…

9967b47

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]>

MengqingCao mentioned this pull request Oct 22, 2025

[1/N][Refactor] Refactor code to adapt with vllm main vllm-project/vllm-ascend#3612

Merged

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (v…

b1f174f

…llm-project#23485) Signed-off-by: Ayush Satyam <[email protected]>
