[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT
#22760
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request improves performance by replacing a blocking tolist() call on a CUDA tensor with a non-blocking copy and an explicit CUDA event synchronization. This avoids unnecessary device-wide stream synchronization. My review focuses on further optimizing this change by suggesting the reuse of allocated resources to minimize overhead in this critical performance path.
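For context, a minimal sketch of the before/after pattern described above, assuming PyTorch; the tensor, buffer, and event names here are illustrative rather than the actual vLLM identifiers:

```python
import torch

# Illustrative stand-in for the sampled token ids produced by the sampler.
sampled_token_ids = torch.randint(0, 32000, (8,), device="cuda")

# Before: tolist() on a device tensor performs a synchronous D2H copy,
# which (per #22754) was observed to block across CUDA streams.
ids = sampled_token_ids.tolist()

# After (sketch): stage through a pinned host buffer with a non-blocking copy,
# then wait on a CUDA event so only this copy is waited for.
pinned = torch.empty(
    sampled_token_ids.shape, dtype=sampled_token_ids.dtype, pin_memory=True
)
transfer_event = torch.cuda.Event()

pinned.copy_(sampled_token_ids, non_blocking=True)
transfer_event.record()       # records on the current stream
transfer_event.synchronize()  # host waits for this copy to finish
ids = pinned.tolist()         # now a cheap CPU-side conversion
```

In the actual hot path, the pinned buffer and the event would typically be allocated once and reused across steps, which is the resource-reuse point the review raises.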
Force-pushed from cbf3070 to 284eba1
njhill
left a comment
Thanks @liuzijing2014!
It might be cleaner to put this in a separate to_list(self, tensor) method?
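For reference, a rough sketch of what such a helper could look like, assuming the runner pre-allocates a pinned staging buffer and a CUDA event; `_RunnerSketch`, `self._pinned_tokens`, and `self.transfer_event` are hypothetical names, not the merged implementation:

```python
import torch


class _RunnerSketch:
    """Toy stand-in for the GPU model runner, for illustration only."""

    def __init__(self, max_tokens: int = 1024):
        # Allocated once so the hot path only pays for the copy and event sync.
        # Assumes int64 token ids; other dtypes would need their own buffer.
        self._pinned_tokens = torch.empty(
            max_tokens, dtype=torch.int64, pin_memory=True
        )
        self.transfer_event = torch.cuda.Event()

    def to_list(self, tensor: torch.Tensor) -> list:
        """Copy a tensor to a Python list without a device-wide stream sync."""
        if not tensor.is_cuda:
            return tensor.tolist()
        pinned = self._pinned_tokens[: tensor.numel()].view(tensor.shape)
        pinned.copy_(tensor, non_blocking=True)
        self.transfer_event.record()
        # Blocks only this host thread until the recorded copy completes.
        self.transfer_event.synchronize()
        return pinned.tolist()
```

Call sites would then reduce to something like `ids = self.to_list(sampled_token_ids)`.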
vllm/v1/worker/gpu_model_runner.py
Outdated
Would this work instead of using an event?
```diff
 pinned.copy_(sampled_token_ids, non_blocking=True)
-self.transfer_event.record()
-self.transfer_event.synchronize()
+torch.cuda.current_stream().synchronize()
```
From our observation, I think the CUDA stream synchronization is what caused the issue for us. I added a few more details in the issue ticket: #22754
OK I was just curious whether the problem was that all streams were being synchronized.
If my suggestion doesn't work or is inferior in some way then the event synchronization looks good to me!
> OK I was just curious whether the problem was that all streams were being synchronized.
Yes, I think this is what we have observed.
Force-pushed from 284eba1 to fda5aa5
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill @WoosukKwon do you mind taking a quick look and seeing if this looks good to you? I would appreciate getting this PR merged sooner to unblock a potential production launch.
njhill
left a comment
Thanks @liuzijing2014! LGTM, but I would like to get blessing from @WoosukKwon too on this one.
Please also merge in latest main and resolve the conflicts.
We may also want to consider doing the same for other tensors that are transferred in the non-default case, such as logprobs and spec decode tokens.
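For illustration, the same staging pattern applied to a hypothetical logprobs tensor might look roughly like this (names and shapes are made up; this is not vLLM code):

```python
import torch

# Hypothetical logprobs tensor on the GPU (batch of 8, top-16 logprobs each).
logprobs = torch.randn(8, 16, device="cuda")

# Reusable pinned host buffer and event, analogous to the token-id path.
pinned_logprobs = torch.empty(8, 16, pin_memory=True)
copy_done = torch.cuda.Event()

pinned_logprobs.copy_(logprobs, non_blocking=True)
copy_done.record()
copy_done.synchronize()  # waits for this copy only, not the whole device
logprobs_list = pinned_logprobs.tolist()
```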
WoosukKwon
left a comment
Looks ok to me. Sorry for the delay!
@liuzijing2014 but wait until #23118 is merged before rebasing!
Thanks for the help! Will rebase and merge soon.
@liuzijing2014 let me know if you'd like me to rebase to get this over the line.

@njhill sorry for the delay; merging today!
Force-pushed from fda5aa5 to 7b6a613
Force-pushed from 7b6a613 to a9bff6c
@liuzijing2014 sorry, I didn't get to it in time and there are more conflicts that need resolving now :-/
@njhill I don't think the failing CI jobs are related to this PR; do you think we can still merge it directly?
@liuzijing2014 @njhill |
@tomasruizt it still blocks, it's just that it can now happen concurrently with transfers that kv connectors might make in different cuda streams. We are working on eliminating this sync point altogether, including for the spec decoding case: #23569 |
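To make that distinction concrete, here is a small, self-contained toy (not vLLM code); the side stream stands in for a hypothetical KV-connector transfer stream, and the event sync on the main stream does not wait for it:

```python
import torch

main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()  # stand-in for a KV-connector transfer stream

ids_gpu = torch.randint(0, 32000, (8,), device="cuda")
pinned = torch.empty(8, dtype=torch.int64, pin_memory=True)
event = torch.cuda.Event()

x = torch.randn(4096, 4096, device="cuda")
side_stream.wait_stream(main_stream)  # make x safely visible to side_stream
with torch.cuda.stream(side_stream):
    # Long-running work on the other stream, e.g. a KV-cache transfer.
    for _ in range(10):
        x = x @ x

# Main stream: non-blocking D2H copy, then wait on an event recorded here.
pinned.copy_(ids_gpu, non_blocking=True)
event.record()       # records on the main (current) stream only
event.synchronize()  # host blocks for the copy, while side_stream keeps running
print(pinned.tolist())
```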
I am seeing crashes in Instance 1 and Instance 2 (crash logs omitted here). Couldn't find a way to reproduce them yet, but they are likely caused by this change?! There is also this related fix: #28025. But that wrong buffer does not seem to be the cause of the crash. @liuzijing2014, @njhill wdyt?
Is the issue that the copy operation is happening while the next iteration runs? I am not familiar with the details here, but to me the error looks like there are concurrent copy operations, and this change could make them overlap. Can iteration N-1 still be copying the output while iteration N also gets to that point? Would both then access the same pinned buffer?
@dr75 I expect the root cause of the issue you are seeing is unrelated to this change. CUDA operates asynchronously, and errors can manifest in arbitrary places where there are sync points. You can try with …
Current Issue
Mitigation to avoid copy operations blocking across different CUDA streams. Details can be found in #22754.
Change
When we copy the sampled valid token ids from device to host, avoid using tolist, which would trigger a device-wide CUDA stream sync when the source tensor is on the device. We change it to use a non-blocking copy into pinned memory followed by an explicit CUDA event sync.

Test for Non Disagg
Bring up vLLM server
Load test for non-disagg setup and observe no perf difference:
Before
After
Test for Disagg
Runs with a vendor-internal disagg implementation (based on the kv connector interface); we observe TTIT wins.
Before
After
Also confirmed from the GPU trace that there is no blocking behavior between the vLLM model-forward CUDA stream and other CUDA streams.

Accuracy
gsm8K 8shot (disagg)
mmlu_pro (non-disagg)
cc @WoosukKwon @lucas-tucker @simon-mo @houseroad