
Ronald1995 (Contributor) commented Aug 28, 2025

Purpose

In execute_model, there are update_states and prepare_inputs operations. This PR aims to use multithreading to overlap update_states and prepare_inputs with the GPU-to-CPU copy of sample_token_ids.
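A minimal sketch of the intended overlap, assuming hypothetical helper names (update_states, prepare_inputs, and copy_sample_token_ids_to_cpu stand in for the PR's internals and are not exact vLLM signatures):

from concurrent.futures import ThreadPoolExecutor

copy_pool = ThreadPoolExecutor(max_workers=1)

def execute_model_step(scheduler_output):
    # Start the GPU-to-CPU copy of the sampled token ids on a worker thread
    # so it runs concurrently with the CPU-side bookkeeping below.
    copy_future = copy_pool.submit(copy_sample_token_ids_to_cpu)

    # CPU-side work for the next step proceeds while the copy is in flight.
    update_states(scheduler_output)
    prepare_inputs(scheduler_output)

    # Block only when the copied token ids are actually needed.
    return copy_future.result()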

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Aug 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Aug 28, 2025

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces an experimental feature for asynchronous model execution to improve performance by overlapping CPU and GPU work. The changes involve adding a new configuration flag async_execute_model, modifying the scheduling and execution logic to handle asynchronous operations, and introducing threading with events for synchronization. While the overall direction is promising for performance, I've identified two critical issues in the implementation that could lead to deadlocks or livelocks. One is in the engine's core stepping logic, and the other is in the multiprocessing executor's handling of the asynchronous execution pipeline. These issues need to be addressed for the feature to function correctly.

Comment on lines +348 to +350
if (self.async_execute_model
        and scheduler_output.total_num_scheduled_tokens == 0):
    return engine_core_outputs, scheduled_batch

critical

This new conditional block introduces two critical issues:

  1. AttributeError Bug: scheduler_output can be None when this block is reached. This occurs if self.batch_queue is full, because scheduler.schedule() is not called, and scheduler_output remains None. Accessing scheduler_output.total_num_scheduled_tokens will then raise an AttributeError.

  2. Potential Livelock: Even if the AttributeError is fixed (e.g., by checking scheduler_output is not None), a logical flaw remains. If this condition is met, the function returns without processing items from self.batch_queue. Since the state that led to this condition might not change, subsequent calls to step_with_batch_queue could repeatedly hit the same condition, causing items in the queue to be starved and leading to a livelock.

The logic for when to process items from the queue versus returning early needs to be reconsidered to avoid these problems.
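One possible shape for a fix, sketched only to illustrate the two missing checks (it assumes self.batch_queue exposes an empty() check; the real queue-draining policy is still up to the author):

if (self.async_execute_model
        and scheduler_output is not None
        and scheduler_output.total_num_scheduled_tokens == 0
        and self.batch_queue.empty()):
    # Short-circuit only when nothing was scheduled AND nothing is pending
    # in the batch queue, so queued batches cannot be starved.
    return engine_core_outputs, scheduled_batch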

Comment on lines +639 to +648
def execute_model_with_queue(self, func, *args, **kwargs):
    """Execute model with a queue for async execution."""
    output = None
    if not self.exe_queue.full():
        output_future = self.exe_thread_pool.submit(func, *args, **kwargs)
        self.exe_queue.put_nowait(output_future)
    if self.exe_queue.full():
        output = self.exe_queue.get().result()
        self.exe_queue.task_done()
    return output

critical

The current implementation of execute_model_with_queue will lead to a deadlock. Here's why:

  1. On the first call to execute_model_with_queue, self.exe_queue is empty, so it's not full. A future is submitted and added to the queue. The function then returns None.
  2. In worker_busy_loop, because the output from execute_model_with_queue is None, the loop continues to the next iteration without sending a response back to the main process via self.worker_response_mq.enqueue() (due to the if not output: continue check).
  3. The MultiprocExecutor in the main process, which made the collective_rpc call, will hang indefinitely waiting for a response that will never arrive.

To prevent this deadlock, execute_model_with_queue must ensure that a response is sent for every execute_model RPC call. The pipelining logic needs to be revised to guarantee a reply, even for the first call that primes the pipeline.
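One sketch of a revision that keeps the one-step pipeline but guarantees a reply for every RPC (EMPTY_OUTPUT is a placeholder sentinel, not an existing vLLM symbol, and worker_busy_loop's "if not output: continue" check would also need to forward it):

def execute_model_with_queue(self, func, *args, **kwargs):
    """Sketch only: always return something so worker_busy_loop replies."""
    if self.exe_queue.full():
        # Pipeline is primed: drain the oldest in-flight step and return it.
        previous_future = self.exe_queue.get()
        output = previous_future.result()
        self.exe_queue.task_done()
    else:
        # Priming call(s): reply with an explicit empty output instead of
        # None so the main process's collective_rpc call does not hang.
        output = EMPTY_OUTPUT
    self.exe_queue.put_nowait(self.exe_thread_pool.submit(func, *args, **kwargs))
    return output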

Ronald1995 changed the title from "[[WIP] implement overlap of prepare_input during execute_model" to "[WIP] implement overlap of prepare_input during execute_model" on Aug 28, 2025
njhill (Member) commented Aug 28, 2025

@Ronald1995 have you looked at #23569

Ronald1995 (Contributor, Author) replied:

> @Ronald1995 have you looked at #23569

I hadn't looked at it before you mentioned it, but I read the code of #23569 today, and it seems we are doing the same thing.

njhill (Member) commented Aug 31, 2025

Thanks @Ronald1995... It's good that you had the same idea, and we appreciate you submitting the contribution!
