
Conversation

@njhill
Member

@njhill njhill commented Oct 15, 2025

Following a similar approach to #23391.

Throughput benchmarks using the same json schema as #23224:

vllm serve Qwen/Qwen3-1.7B --uvicorn-log-level=error  --no-enable-prefix-caching

python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio $ratio --request-rate 200 --max-concurrency 800 --num-prompts 4000 --json-schema-path ./test3.json  --output-len 128
Throughput by fraction of structured-output requests (pct struct reqs):

| Test | Executor | 0.0 | 0.2 | 0.8 | 1.0 |
|---|---|---|---|---|---|
| main | uniproc | 103.16 | 92.57 | 70.68 | 69.36 |
| This PR | uniproc | 103.19 | 99.67 | 87.90 | 85.28 |
| This PR + --async-scheduling | uniproc | 132.72 | 106.08 | 93.59 | 90.34 |
| This PR + --async-scheduling | multiproc | 133.31 | 114.67 | 96.08 | 93.42 |

This is a breaking change for the model runner and scheduler interfaces.
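For context on the approach, here is a minimal sketch of the overlap pattern (a hypothetical `step` helper with a thread pool standing in for the async executor; not the actual engine-core code): the structured-output grammar bitmask is built on the CPU while the forward pass is already running asynchronously, and the engine only blocks once the execution result is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def step(scheduler, executor, scheduler_output):
    """Sketch of one engine step with grammar-bitmask / GPU-execution overlap."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Dispatch model execution; it keeps running asynchronously (on the GPU
        # in the real engine) while we do CPU-side work below.
        exec_future = pool.submit(executor.execute_model, scheduler_output)

        # Overlap: build the structured-output grammar bitmask on the CPU
        # while the forward pass is in flight.
        grammar_output = scheduler.get_grammar_bitmask(scheduler_output)

        # Block-wait only once both results are needed.
        exec_result = exec_future.result()

    return grammar_output, exec_result
```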

@mergify

mergify bot commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 17, 2025
@njhill njhill force-pushed the async-sched-struct-output branch from 829ef60 to 8cba549 on October 17, 2025 01:40
@mergify mergify bot removed the needs-rebase label Oct 17, 2025
…tput

Signed-off-by: Nick Hill <[email protected]>

# Conflicts:
#	vllm/v1/engine/core.py
#	vllm/v1/executor/abstract.py
#	vllm/v1/executor/ray_distributed_executor.py
@njhill njhill requested a review from chaunceyjiang as a code owner October 30, 2025 19:24
@mergify mergify bot added frontend and removed needs-rebase labels Oct 30, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM! Thanks for the effort.

@njhill njhill enabled auto-merge (squash) October 31, 2025 23:14
@njhill njhill merged commit 0cdbe7b into vllm-project:main Nov 1, 2025
54 checks passed
@njhill njhill deleted the async-sched-struct-output branch November 1, 2025 03:25
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025
@ys950902
Contributor

ys950902 commented Nov 7, 2025

Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. Do you have any ideas about what might cause it? Thanks in advance for your great support.

Below is the command used to launch the server:

VLLM_USE_V1=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --enforce-eager --port 8000 --host 0.0.0.0 -pp 2 --distributed_executor_backend=mp --trust-remote-code --gpu-memory-util=0.9 --no-enable-prefix-caching --max-num-batched-tokens=8192 --disable-log-requests --max-model-len=8192 --block-size 64 --quantization fp8    --dtype=float16   -tp=2

And the command used to send the requests:

python3 -m vllm.entrypoints.cli.main bench serve  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --ready-check-timeout-sec 1 --dataset-name random --random-input-len=1024 --random-output-len=512 --ignore-eos --port=8000 --host 0.0.0.0 --num-prompt 30 --request-rate inf --backend vllm --trust-remote-code

The throughput dropped from 617.26 tok/s to 384.40 tok/s.

@njhill
Member Author

njhill commented Nov 7, 2025

Thanks @ys950902. Which commit exactly were you testing? There was a known perf regression from this PR, which was subsequently fixed in #28012. Unfortunately, that PR was just reverted due to a compatibility bug, but its re-apply (#28319) should be merged to main soon.

It would be great if you could check whether the degraded performance still shows up with that PR included (if it wasn't already in your test). If so, could you open a new issue with the above details so we can investigate further?

grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
# Block-wait for execute to return (continues running async on the GPU).
with self.log_error_detail(scheduler_output):
    exec_result = exec_future.result()
Contributor

Why do we block before the batch queue is full? Won't this break the batch-queue behavior?

Contributor

@weireweire weireweire Nov 11, 2025

Could you have a look? In PP mode execution blocks here, so no pipeline parallelism actually happens. Even though the intent here is to wait for the model execute, the previous sample_tokens task should also be in the queue.
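To make the concern concrete, here is a rough sketch (hypothetical `schedule`/`execute_async` helpers and a plain deque, not vLLM's batch-queue code) of how a PP batch queue is expected to keep `pp_size` batches in flight; blocking on each future right after submission collapses this back to a single batch in flight and serializes the stages:

```python
from collections import deque


def pipelined_steps(schedule, execute_async, pp_size: int, num_steps: int):
    """Keep up to pp_size batches in flight so pipeline stages can overlap."""
    in_flight = deque()
    results = []
    for _ in range(num_steps):
        scheduler_output = schedule()
        in_flight.append(execute_async(scheduler_output))
        # Only block once the queue is full; waiting on every future
        # immediately after submit would leave a single batch in flight.
        if len(in_flight) == pp_size:
            results.append(in_flight.popleft().result())
    # Drain the remaining in-flight batches at the end.
    while in_flight:
        results.append(in_flight.popleft().result())
    return results
```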

Contributor

@ys950902 Is your PP perf issue solved? Is it also related to the blocking here?

Contributor

@njhill Could you help answer this question? Thanks!

Contributor

draft fix: #28286


Labels

frontend, kv-connector, ready, structured-output, suppress-bc-linter, tpu, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants