Xgrammar fixed #24300
Conversation
See also: #23224
This pull request has merge conflicts that must be resolved before it can be merged.
At a glance, I prefer the implementation in #23224 to this one, though they're very similar. I've messaged the author of that PR and am hoping we can make progress there. |
Thank you for the PR! There have been a few different efforts in this area. The one we're moving forward with right now is #26866. Any feedback you have on that approach is welcome!
This PR refactors the execution of structured output generation to improve performance by hiding the overhead of grammar bitmask calculations, as proposed in the RFC.
Currently, all backend generation for StructuredRequests is managed by the StructuredOutputManager, which runs in the same process as the scheduler. The call to the grammar_bitmask function within this manager introduces blocking overhead that can impact overall latency.
This change moves the execution of the grammar_bitmask function into the gpu_runner, scheduling it to run after the main model execution is launched. This allows the CPU-bound grammar bitmask calculation to be overlapped with (and hidden by) the GPU computation, improving performance for requests that use constrained grammar sampling.
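For context on what the bitmask does once computed: it marks, per vocabulary token, whether the grammar allows that token next, and disallowed tokens are masked out of the logits before sampling. The sketch below is illustrative only (the `apply_bitmask` helper and the 32-bit packed layout are assumptions mirroring xgrammar's packed bitmask format, not code from this PR):

```python
import math

NEG_INF = float("-inf")

def apply_bitmask(logits, bitmask):
    """Mask out grammar-disallowed tokens by setting their logits to -inf.

    `bitmask` is a list of 32-bit integers, one bit per vocab token
    (bit set = token allowed), mirroring a packed bitmask layout.
    """
    out = list(logits)
    for tok in range(len(logits)):
        word, bit = divmod(tok, 32)
        if not (bitmask[word] >> bit) & 1:
            out[tok] = NEG_INF  # token can never be sampled
    return out

# Toy vocab of 4 tokens; the grammar permits only tokens 0 and 2.
masked = apply_bitmask([1.0, 2.0, 3.0, 4.0], [0b0101])
# masked -> [1.0, -inf, 3.0, -inf]
```

Because computing this mask walks the grammar's state machine on the CPU for every structured request, doing it synchronously in the scheduler process adds latency on the critical path, which is the overhead this PR hides.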
To achieve this, the implementation includes:
Performing grammar state initialization when the CacheRequest is generated in the worker_runner.
Moving all related logic for speculative decoding and "reason thinking" handling for structured output requests into the gpu_runner.
Test Plan
Command: python /path/vllm/benchmarks/benchmark_serving_structured_output.py --model /home/models/Qwen3-8B --dataset xgrammar_bench --num-prompts 100
Request throughput increased by ~31.5% (from 18.47 to 24.29 req/s).
Output token throughput improved by ~33.1% (from 1132.30 to 1507.32 tok/s).
Furthermore, the latency for generating subsequent tokens was reduced, with the mean Time per Output Token (TPOT) decreasing by ~21% and the mean Inter-token Latency (ITL) decreasing by ~19%.
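The headline percentages can be verified directly from the reported before/after numbers:

```python
def pct_gain(before, after):
    """Relative improvement, as a percentage rounded to one decimal."""
    return round((after - before) / before * 100, 1)

req_gain = pct_gain(18.47, 24.29)      # request throughput, req/s
tok_gain = pct_gain(1132.30, 1507.32)  # output token throughput, tok/s
# req_gain -> 31.5, tok_gain -> 33.1
```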
(Full benchmark outputs for the baseline and optimized runs were attached as images: "baseline" and "opt".)