Xgrammar fixed #24300
Conversation
See also: #23224
This pull request has merge conflicts that must be resolved before it can be merged.
At a glance, I prefer the implementation in #23224 to this one, though they're very similar. I've messaged the author of that PR and am hoping we can make progress there. |
Thank you for the PR! There have been a few different efforts in this area. The one we're moving forward with right now is #26866. Any feedback you have on that approach is welcome!
This PR refactors the execution of structured output generation to improve performance by hiding the overhead of grammar bitmask calculations, as proposed in the RFC.
Currently, all backend generation for StructuredRequests is managed by the StructuredOutputManager, which runs in the same process as the scheduler. The call to the grammar_bitmask function within this manager introduces blocking overhead that can impact overall latency.
This change moves the execution of the grammar_bitmask function into the gpu_runner, scheduling it to run after the main model execution is launched. This allows the CPU-bound grammar bitmask calculation to be overlapped with (and hidden by) the GPU computation, improving performance for requests that use constrained grammar sampling.
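For context on what the bitmask does once computed: it marks, per vocabulary token, whether the grammar allows that token next, and disallowed tokens are masked out of the logits before sampling. The sketch below is illustrative only (the `apply_bitmask` helper and the 32-bit packed layout are assumptions mirroring xgrammar's packed bitmask format, not code from this PR):

```python
import math

NEG_INF = float("-inf")

def apply_bitmask(logits, bitmask):
    """Mask out grammar-disallowed tokens by setting their logits to -inf.

    `bitmask` is a list of 32-bit integers, one bit per vocab token
    (bit set = token allowed), mirroring a packed bitmask layout.
    """
    out = list(logits)
    for tok in range(len(logits)):
        word, bit = divmod(tok, 32)
        if not (bitmask[word] >> bit) & 1:
            out[tok] = NEG_INF  # token can never be sampled
    return out

# Toy vocab of 4 tokens; the grammar permits only tokens 0 and 2.
masked = apply_bitmask([1.0, 2.0, 3.0, 4.0], [0b0101])
# masked -> [1.0, -inf, 3.0, -inf]
```

Because computing this mask walks the grammar's state machine on the CPU for every structured request, doing it synchronously in the scheduler process adds latency on the critical path, which is the overhead this PR hides.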
To achieve this, the implementation includes:
Performing grammar state initialization when the CacheRequest is generated in the worker_runner.
Moving all related logic for speculative decoding and "reason thinking" handling for structured output requests into the gpu_runner.
Test Plan
Command: python /path/vllm/benchmarks/benchmark_serving_structured_output.py --model /home/models/Qwen3-8B --dataset xgrammar_bench --num-prompts 100
Request throughput increased by ~31.5% (from 18.47 to 24.29 req/s).
Output token throughput improved by ~33.1% (from 1132.30 to 1507.32 tok/s).
Furthermore, the latency for generating subsequent tokens was reduced, with the mean Time per Output Token (TPOT) decreasing by ~21% and the mean Inter-token Latency (ITL) decreasing by ~19%.
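The headline percentages can be verified directly from the reported before/after numbers:

```python
def pct_gain(before, after):
    """Relative improvement, as a percentage rounded to one decimal."""
    return round((after - before) / before * 100, 1)

req_gain = pct_gain(18.47, 24.29)      # request throughput, req/s
tok_gain = pct_gain(1132.30, 1507.32)  # output token throughput, tok/s
# req_gain -> 31.5, tok_gain -> 33.1
```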
(Full benchmark outputs for the baseline and optimized runs were attached as images: "baseline" and "opt".)