[WIP] implement overlap of prepare_input during execute_model #23811
Conversation
Signed-off-by: Ronald1995 <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces an experimental feature for asynchronous model execution to improve performance by overlapping CPU and GPU work. The changes involve adding a new configuration flag async_execute_model, modifying the scheduling and execution logic to handle asynchronous operations, and introducing threading with events for synchronization. While the overall direction is promising for performance, I've identified two critical issues in the implementation that could lead to deadlocks or livelocks. One is in the engine's core stepping logic, and the other is in the multiprocessing executor's handling of the asynchronous execution pipeline. These issues need to be addressed for the feature to function correctly.
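For context, below is a minimal sketch of the general pattern described above: preparing the next batch's inputs on a background thread while the main thread performs the GPU-to-CPU copy of the previous step's sampled token ids, with an event used to synchronize before the forward pass. All names (`OverlappedRunner`, `prepare_inputs`, `forward`, `prev_gpu_sampled_ids`) are hypothetical illustrations and are not taken from the PR; the sampled ids are assumed to be a PyTorch tensor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class OverlappedRunner:
    """Sketch: overlap CPU-side input preparation with the device-to-host
    copy of the previous step's sampled token ids."""

    def __init__(self, model):
        self.model = model
        self.pool = ThreadPoolExecutor(max_workers=1)

    def execute_step(self, scheduler_output, prev_gpu_sampled_ids):
        inputs_ready = threading.Event()
        prepared = {}

        def _prepare():
            # CPU-bound work: update request states and build the next batch.
            prepared["inputs"] = self.model.prepare_inputs(scheduler_output)
            inputs_ready.set()

        self.pool.submit(_prepare)

        # Meanwhile the main thread performs the GPU->CPU copy of the
        # previous step's sampled token ids; this overlaps with _prepare().
        sampled_ids_cpu = prev_gpu_sampled_ids.to("cpu")

        inputs_ready.wait()  # synchronize before launching the forward pass
        return self.model.forward(prepared["inputs"]), sampled_ids_cpu
```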
```python
if (self.async_execute_model
        and scheduler_output.total_num_scheduled_tokens == 0):
    return engine_core_outputs, scheduled_batch
```
This new conditional block introduces two critical issues:
- **AttributeError bug:** `scheduler_output` can be `None` when this block is reached. This occurs if `self.batch_queue` is full, because `scheduler.schedule()` is not called and `scheduler_output` remains `None`. Accessing `scheduler_output.total_num_scheduled_tokens` will then raise an `AttributeError`.
- **Potential livelock:** Even if the `AttributeError` is fixed (e.g., by checking `scheduler_output is not None`), a logical flaw remains. If this condition is met, the function returns without processing items from `self.batch_queue`. Since the state that led to this condition might not change, subsequent calls to `step_with_batch_queue` could repeatedly hit the same condition, causing items in the queue to be starved and leading to a livelock.
The logic for when to process items from the queue versus returning early needs to be reconsidered to avoid these problems.
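Purely as an illustrative sketch (not a proposed final fix), the guard could be tightened along these lines; the `scheduler_output is not None` check addresses the first issue, while the `self.batch_queue.empty()` check is an assumption about the intended semantics for avoiding starvation of queued batches:

```python
# Sketch only; the exact condition depends on the surrounding
# step_with_batch_queue logic.
if (self.async_execute_model
        and scheduler_output is not None
        and scheduler_output.total_num_scheduled_tokens == 0
        and self.batch_queue.empty()):
    return engine_core_outputs, scheduled_batch
```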
```python
def execute_model_with_queue(self, func, *args, **kwargs):
    """Execute model with a queue for async execution."""
    output = None
    if not self.exe_queue.full():
        output_future = self.exe_thread_pool.submit(func, *args, **kwargs)
        self.exe_queue.put_nowait(output_future)
    if self.exe_queue.full():
        output = self.exe_queue.get().result()
        self.exe_queue.task_done()
    return output
```
The current implementation of execute_model_with_queue will lead to a deadlock. Here's why:
- On the first call to `execute_model_with_queue`, `self.exe_queue` is empty, so it's not full. A future is submitted and added to the queue. The function then returns `None`.
- In `worker_busy_loop`, because the `output` from `execute_model_with_queue` is `None`, the loop continues to the next iteration without sending a response back to the main process via `self.worker_response_mq.enqueue()` (due to the `if not output: continue` check).
- The `MultiprocExecutor` in the main process, which made the `collective_rpc` call, will hang indefinitely waiting for a response that will never arrive.
To prevent this deadlock, execute_model_with_queue must ensure that a response is sent for every execute_model RPC call. The pipelining logic needs to be revised to guarantee a reply, even for the first call that primes the pipeline.
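One possible shape of such a revision, shown only as a hedged sketch: the `PIPELINE_PRIMED` sentinel and the queue-size check are invented here for illustration, and the real change would also need the `MultiprocExecutor` / `worker_busy_loop` side to recognize and skip the sentinel.

```python
# Illustrative sketch, not the PR's actual code.
PIPELINE_PRIMED = object()  # placeholder acknowledgement for priming calls

def execute_model_with_queue(self, func, *args, **kwargs):
    """Submit the new step and return the previous step's result, or a
    sentinel acknowledgement while the pipeline is still being primed."""
    new_future = self.exe_thread_pool.submit(func, *args, **kwargs)
    self.exe_queue.put_nowait(new_future)
    if self.exe_queue.qsize() < self.exe_queue.maxsize:
        # Priming call(s): nothing older to hand back yet, but still return a
        # value so the caller's collective_rpc receives a reply.
        return PIPELINE_PRIMED
    # Steady state: return the oldest pending result while the new one runs.
    oldest = self.exe_queue.get()
    result = oldest.result()
    self.exe_queue.task_done()
    return result
```

With this shape, the `if not output: continue` check in `worker_busy_loop` would also need to change, since the sentinel must be forwarded (or explicitly acknowledged) rather than silently dropped.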
@Ronald1995 have you looked at #23569?
I hadn't looked at it before you mentioned this, but I read the code of #23569 today; it seems like we are doing the same thing.
Thanks @Ronald1995... It's good that you had the same idea; we appreciate you submitting the contribution!
Purpose
In `execute_model`, there are `update_states` and `prepare_inputs` operations. This PR aims to use multithreading to overlap `update_states` and `prepare_inputs` with the GPU-to-CPU copy of `sample_token_ids`.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.