
Conversation


@wangyxbh wangyxbh commented Sep 5, 2025

This PR refactors the execution of structured output generation to improve performance by hiding the overhead of grammar bitmask calculations, as proposed in the RFC.

Currently, all backend generation for StructuredRequests is managed by the StructuredOutputManager, which runs in the same process as the scheduler. The call to the grammar_bitmask function within this manager introduces blocking overhead that can impact overall latency.

This change moves the execution of the grammar_bitmask function into the gpu_runner, scheduling it to run after the main model execution is launched. The CPU-bound work of calculating the grammar bitmask is then overlapped with (hidden by) the GPU computation, leading to better performance for requests that use constrained grammar sampling.
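The overlap pattern described above can be sketched as follows. All names here are illustrative stand-ins (none of them are vLLM APIs), and the sleeps merely simulate the CPU-bound bitmask work and the asynchronous model forward pass:

```python
# Hypothetical sketch of hiding CPU-bound grammar bitmask work behind model
# execution. compute_grammar_bitmask and run_model_forward are placeholders,
# not vLLM functions.
import time
from concurrent.futures import ThreadPoolExecutor

def compute_grammar_bitmask(vocab_size: int) -> list[int]:
    # Stand-in for the CPU-bound grammar bitmask calculation.
    time.sleep(0.05)
    return [1] * vocab_size  # pretend every token is allowed

def run_model_forward() -> list[float]:
    # Stand-in for the (asynchronous) GPU forward pass.
    time.sleep(0.05)
    return [0.5, 1.5, -2.0, 3.0]

def step_overlapped(vocab_size: int) -> list[float]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the bitmask computation, then run the model: the two
        # ~50 ms costs overlap instead of adding up to ~100 ms.
        bitmask_future = pool.submit(compute_grammar_bitmask, vocab_size)
        raw_logits = run_model_forward()
        bitmask = bitmask_future.result()
    # Mask out disallowed tokens before sampling.
    return [x if m else float("-inf") for x, m in zip(raw_logits, bitmask)]

start = time.perf_counter()
logits = step_overlapped(4)
elapsed = time.perf_counter() - start
```

In the real change the "other side" of the overlap is GPU kernel execution rather than a second CPU thread, but the latency argument is the same: the bitmask cost disappears as long as it is shorter than the forward pass.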

To achieve this, the implementation includes:

Performing grammar state initialization when the CacheRequest is created in the worker_runner.

Moving all related logic for speculative decoding and "reason thinking" for structured output requests into the gpu_runner.

Test Plan

Command: python /path/vllm/benchmarks/benchmark_serving_structured_output.py --model /home/models/Qwen3-8B --dataset xgrammar_bench --num-prompts 100

Request throughput increased by ~31.5% (from 18.47 to 24.29 req/s).

Output token throughput improved by ~33.1% (from 1132.30 to 1507.32 tok/s).

Furthermore, the latency for generating subsequent tokens was reduced, with the mean Time per Output Token (TPOT) decreasing by ~21% and the mean Inter-token Latency (ITL) decreasing by ~19%.
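As a sanity check, the headline percentages can be re-derived from the values in the two benchmark tables (pure arithmetic, no vLLM code involved):

```python
# Re-deriving the reported improvements from the benchmark tables.
baseline = {"req_s": 18.47, "tok_s": 1132.30, "tpot_ms": 43.58, "itl_ms": 40.60}
opt      = {"req_s": 24.29, "tok_s": 1507.32, "tpot_ms": 34.47, "itl_ms": 32.78}

req_gain  = (opt["req_s"] / baseline["req_s"] - 1) * 100      # throughput up
tok_gain  = (opt["tok_s"] / baseline["tok_s"] - 1) * 100      # token rate up
tpot_drop = (1 - opt["tpot_ms"] / baseline["tpot_ms"]) * 100  # TPOT down
itl_drop  = (1 - opt["itl_ms"] / baseline["itl_ms"]) * 100    # ITL down
```

These come out to ~31.5%, ~33.1%, ~20.9%, and ~19.3% respectively, matching the claims above.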

baseline

Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.47it/s]
============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  5.41      
Total input tokens:                      27748     
Total generated tokens:                  6129      
Request throughput (req/s):              18.47     
Output token throughput (tok/s):         1132.30   
Total Token throughput (tok/s):          6258.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          968.93    
Median TTFT (ms):                        1122.44   
P99 TTFT (ms):                           1401.09   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.58     
Median TPOT (ms):                        39.42     
P99 TPOT (ms):                           77.37     
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.60     
Median ITL (ms):                         37.08     
P99 ITL (ms):                            156.27    
==================================================
correct_rate(%) 91.0 

opt

Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:04<00:00, 24.29it/s]
============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  4.12      
Total input tokens:                      27748     
Total generated tokens:                  6205      
Request throughput (req/s):              24.29     
Output token throughput (tok/s):         1507.32   
Total Token throughput (tok/s):          8247.88   
---------------Time to First Token----------------
Mean TTFT (ms):                          226.07    
Median TTFT (ms):                        227.83    
P99 TTFT (ms):                           264.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.47     
Median TPOT (ms):                        34.67     
P99 TPOT (ms):                           39.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.78     
Median ITL (ms):                         33.62     
P99 ITL (ms):                            44.94     
==================================================
correct_rate(%) 91.0 

@mergify mergify bot added performance Performance-related issues structured-output v1 labels Sep 5, 2025
@mergify mergify bot added the tpu Related to Google TPUs label Sep 5, 2025
@gemini-code-assist (Contributor)

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.


github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests intended to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@njhill
Member

njhill commented Sep 5, 2025

See also: #23224

@mergify

mergify bot commented Sep 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangyxbh.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 7, 2025
@benchislett
Collaborator

At a glance, I prefer the implementation in #23224 to this one, though they're very similar. I've messaged the author of that PR and am hoping we can make progress there.

@russellb
Member

Thank you for the PR!

There have been a few different efforts in this area. The one we're moving forward with right now is #26866. Any feedback you have on that approach is welcome!
