Add utility functions to enable pass-kv prefill and allgather decode. #26059
base: main
Conversation
Code Review
This pull request introduces utility functions and configuration changes to support context parallelism in vLLM, specifically the "pass-kv prefill" and "allgather decode" strategies. The changes span configuration, distributed state management, worker components, and new attention utilities. While the core logic for context parallelism seems well thought out, I've identified a critical issue in the creation of expert parallel groups when context parallelism is enabled, which would produce incorrect groups. There is also a bug in one of the new tests.
vllm/distributed/parallel_state.py
Outdated
```python
group_ranks = (all_ranks.transpose(1, 2).reshape(
    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
```
The logic for creating expert parallel (EP) groups is incorrect when context parallelism (CP) is enabled (context_parallel_size > 1). The all_ranks tensor has dimensions (ext_dp, dp, pp, cp, tp); after transpose(1, 2) its shape becomes (ext_dp, pp, dp, cp, tp). Reshaping that layout to (-1, dp * tp) packs ranks from different cp indices into the same group whenever cp > 1, so the resulting EP groups are wrong.
To correctly form EP groups of size dp * tp for each (ext_dp, pp, cp) combination, permute the dimensions so that dp and tp are the trailing, contiguous dimensions before flattening.
```diff
-group_ranks = (all_ranks.transpose(1, 2).reshape(
-    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
+group_ranks = (all_ranks.permute(0, 2, 3, 1, 4).contiguous().view(
+    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
```
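To see the difference concretely, here is a toy sanity check (hypothetical small sizes chosen for illustration only; the real values come from the parallel configuration):

```python
import torch

# Toy layout with hypothetical sizes: (ext_dp, dp, pp, cp, tp) = (1, 2, 1, 2, 2).
ext_dp, dp, pp, cp, tp = 1, 2, 1, 2, 2
all_ranks = torch.arange(ext_dp * dp * pp * cp * tp).reshape(ext_dp, dp, pp, cp, tp)

# Current code: cp ranks end up mixed inside each EP group.
wrong = all_ranks.transpose(1, 2).reshape(-1, dp * tp).unbind(0)
# wrong == (tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]))  -> each group spans both cp indices

# Suggested code: dp and tp become the trailing dims for every (ext_dp, pp, cp).
right = all_ranks.permute(0, 2, 3, 1, 4).contiguous().view(-1, dp * tp).unbind(0)
# right == (tensor([0, 1, 4, 5]), tensor([2, 3, 6, 7]))  -> each group stays within one cp index
```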
```python
assert num_comp_local == [
    num_computed_tokens[0][-1] // 2, [num_computed_tokens[1][-1] // 2]
]
```
The assertion for num_comp_local is incorrect. The second element in the expected list is [num_computed_tokens[1][-1] // 2], which is a list containing an integer. However, num_comp_local is a flat list of integers. This type mismatch will cause the test to fail.
```diff
-assert num_comp_local == [
-    num_computed_tokens[0][-1] // 2, [num_computed_tokens[1][-1] // 2]
-]
+assert num_comp_local == [
+    num_computed_tokens[0][-1] // 2, num_computed_tokens[1][-1] // 2
+]
```
Signed-off-by: Qirui Yang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged. These conflicts are caused by our migration to ...
Add utility functions to enable pass-kv prefill and allgather decode. For a more in-depth understanding of context parallelism in LLM inference, including partial attention, read the MLSys paper available at https://arxiv.org/pdf/2411.01783.
Purpose
Within the model, attention is the only component that depends on the sequence dimension, since each token must attend to all previous tokens in the same sequence. In contrast, FFN and element-wise operations are performed independently for each token. To implement efficient context parallelism in vLLM, the design needs to be aware of these dependencies to minimize synchronization overhead.
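To make the token-independence concrete, here is a small illustrative PyTorch check (not vLLM code): a token-wise layer applied shard-by-shard matches the unsharded result, whereas attention would need Q or KV from other shards.

```python
import torch

torch.manual_seed(0)
seq = torch.randn(8, 16)           # (num_tokens, hidden_size), toy sizes
ffn = torch.nn.Linear(16, 16)

# FFN / element-wise ops act per token: computing each CP shard separately
# and concatenating reproduces the full-sequence result.
full = ffn(seq)
sharded = torch.cat([ffn(seq[:4]), ffn(seq[4:])], dim=0)
assert torch.allclose(full, sharded)

# Attention cannot be split this way: token i needs the K/V of all earlier
# tokens, which is why CP must exchange KV (pass-KV) or Q shards.
```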

During the prefill phase, both the query (Q) and key-value (KV) tensors are sharded across GPUs. To ensure that each Q token can attend to all preceding KV tokens, it is necessary to exchange the relevant Q or KV shards among GPUs. To reduce synchronization overhead, data transfers are overlapped with partial attention computations, with the goal of fully hiding data transfer latency.
The choice between passing KV or Q shards depends on the relative sizes of the Q and KV tensors. For full prefill, passing KV shards is generally preferred, as the number of queries per KV head typically exceeds two in most models. Conversely, for chunked prefill, passing Q shards may be more efficient if the KV cache length is significantly greater than the number of Q tokens. The following figure shows an example of prefill with CP=2.
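As a rough way to think about this trade-off, here is an illustrative heuristic (the function name and the simple byte-count model are assumptions for illustration, not the PR's actual selection logic):

```python
def prefer_pass_kv(num_q_tokens: int, num_kv_tokens: int,
                   num_q_heads: int, num_kv_heads: int) -> bool:
    """Illustrative only: pass whichever shard is cheaper to send.

    Pass-KV moves K and V (two tensors) for the KV tokens; pass-Q moves Q
    for the query tokens. With GQA, num_q_heads / num_kv_heads > 2 makes the
    KV side comparatively small for full prefill (q_tokens == kv_tokens),
    while chunked prefill against a long KV cache can favor pass-Q.
    """
    kv_volume = 2 * num_kv_tokens * num_kv_heads   # per unit of head_dim
    q_volume = num_q_tokens * num_q_heads
    return kv_volume <= q_volume

# Full prefill, 32 Q heads and 8 KV heads: pass-KV is cheaper.
assert prefer_pass_kv(4096, 4096, 32, 8)
# Chunked prefill, 512 new Q tokens attending to a 64K-token KV cache: pass-Q wins.
assert not prefer_pass_kv(512, 65536, 32, 8)
```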
For decode, KV is stored across CP ranks in a round-robin manner. The figure below illustrates how decode works with CP=2.

Test Plan
Unit tests and e2e tests will be added in follow-up PRs.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.