[Core] Async scheduling + structured outputs compatibility #26866
Conversation
WoosukKwon left a comment:
LGTM! Thanks for the effort.
ys950902 commented:
Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. Do you have any ideas about it? Thanks in advance for your support. Below is the command to launch the server: [command not captured]. The command for sending the request: [command not captured]. The perf dropped from 617.26 tok/s to 384.40 tok/s.
njhill replied:
Thanks @ys950902. Which commit exactly were you testing? There was a known perf regression from this PR, which was subsequently fixed in #28012. Unfortunately, that PR was just reverted due to a compatibility bug, but its re-apply, #28319, should be merged to main soon. It would be great if you could check whether the degraded performance still shows up with that PR included (if it wasn't already in your test). If so, could you open a new issue with the above details so we can investigate further.
```python
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
# Block-wait for execute to return (continues running async on the GPU).
with self.log_error_detail(scheduler_output):
    exec_result = exec_future.result()
```
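For context, the pattern under review overlaps CPU-side grammar-bitmask work with model execution that is already running asynchronously on the GPU. Below is a minimal standalone sketch of that overlap; `run_model`, `compute_bitmask`, and the timings are illustrative stand-ins, not vLLM's API:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_model(batch):
    """Stand-in for the asynchronous forward pass (pretend GPU work)."""
    time.sleep(0.05)
    return [f"logits[{i}]" for i in batch]

def compute_bitmask(batch):
    """Stand-in for advancing grammar FSMs to build a token bitmask."""
    time.sleep(0.02)
    return {i: 0b1010 for i in batch}

executor = ThreadPoolExecutor(max_workers=1)
batch = [0, 1, 2]

# Kick off model execution asynchronously...
exec_future = executor.submit(run_model, batch)
# ...overlap the CPU-side bitmask computation with it...
grammar_output = compute_bitmask(batch)
# ...then block-wait for execution, mirroring exec_future.result() above.
exec_result = exec_future.result()
print(grammar_output, exec_result)
```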
Why do we block before the batch queue is full? Won't this break the batch-queue behavior?
Could you have a look? PP mode will block here, so no parallelism will happen. Even though the intent here is to wait for model execution, the previous sample_tokens task should also be in the queue. See the sketch below.
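To make the concern concrete, here is a toy sketch (not vLLM code; `run_stage` and the timings are made up) of how waiting on each batch's future immediately collapses a depth-N batch queue into serial execution:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import time

def run_stage(batch):
    """Stand-in for one pipeline stage's work on a batch."""
    time.sleep(0.05)
    return batch

executor = ThreadPoolExecutor(max_workers=2)

def pipelined(batches, depth=2):
    """Keep up to `depth` batches in flight before waiting on the oldest."""
    in_flight = deque()
    start = time.perf_counter()
    for b in batches:
        in_flight.append(executor.submit(run_stage, b))
        if len(in_flight) >= depth:
            in_flight.popleft().result()
    while in_flight:
        in_flight.popleft().result()
    return time.perf_counter() - start

def serialized(batches):
    """Block on each batch immediately, as in the snippet above."""
    start = time.perf_counter()
    for b in batches:
        executor.submit(run_stage, b).result()
    return time.perf_counter() - start

print(pipelined(range(8)), serialized(range(8)))
```

With depth 2, consecutive batches overlap and total time is roughly halved; the serialized variant matches the blocking pattern questioned above.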
@ys950902 Is your PP perf issue solved? Is it also related to the blocking here?
@njhill Could you help answer this question? Thanks!
draft fix: #28286
Following a similar approach to #23391.
Throughput benchmarks using the same JSON schema as #23224, run with --async-scheduling: [benchmark table not recoverable from the page scrape].
This is a breaking change for the model runner and scheduler interfaces.
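As a rough illustration of the kind of interface split implied by the breaking change, the sketch below separates launching the forward pass from sampling. The method names follow the discussion above (execute_model, sample_tokens), but this is an assumption for illustration; the actual vLLM signatures may differ:

```python
from concurrent.futures import Future
from typing import Any, Protocol

class AsyncModelRunner(Protocol):
    """Hypothetical interface sketch, not vLLM's actual model-runner API."""

    def execute_model(self, scheduler_output: Any) -> Future:
        """Launch the forward pass and return immediately; the Future
        resolves once the (possibly GPU-async) execution completes."""
        ...

    def sample_tokens(
        self, exec_result: Any, grammar_bitmask: Any | None = None
    ) -> list[int]:
        """Sample next tokens from a finished forward pass, optionally
        applying a structured-output grammar bitmask to mask invalid
        tokens."""
        ...
```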