
Conversation

@ZJY0516
Contributor

@ZJY0516 ZJY0516 commented Oct 27, 2025

Purpose

Following #26440, enable concurrent execution of "shared_experts" and "selected_experts" in Qwen3-Next.
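For readers unfamiliar with the pattern, below is a minimal, hypothetical sketch of the multi-stream overlap being enabled here, mirroring what #26440 did for DeepSeek. All names (moe_forward, shared_experts, routed_experts) are illustrative, not the actual vLLM code, which lives in the fused-MoE / Qwen3-Next layers.

```python
import torch

# Side stream on which the dense shared experts will run.
shared_stream = torch.cuda.Stream()

def moe_forward(hidden_states, shared_experts, routed_experts):
    # Fork: the side stream must first see hidden_states as produced
    # on the main stream.
    shared_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(shared_stream):
        # Dense shared-experts MLP, overlapping with the routed-experts
        # work queued on the main stream below.
        shared_out = shared_experts(hidden_states)
    # Top-k routed ("selected") experts meanwhile run on the main stream.
    routed_out = routed_experts(hidden_states)
    # Join: the main stream waits for the side stream before combining.
    torch.cuda.current_stream().wait_stream(shared_stream)
    # A real implementation also has to respect caching-allocator stream
    # semantics (e.g., Tensor.record_stream) for tensors crossing streams.
    return routed_out + shared_out
```

The win comes from overlapping two computations with no data dependency on each other; whether it materializes depends on how much headroom the GPU has, which matches the H20 vs. B200 observations below.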

Test Plan

Multi-stream for shared experts enabled (this PR's behavior):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 8

Disable multi-stream for shared experts:

VLLM_DISABLE_SHARED_EXPERTS_STREAM=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 8

Benchmark command (used for both runs):

vllm bench serve \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --dataset-name random \
  --tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct \
  --num-prompts 32 \
  --random-input-len 2048 \
  --random-output-len 1024
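Note: per the labels above, setting VLLM_DISABLE_SHARED_EXPERTS_STREAM in the environment turns the separate stream off, so the shared experts run inline on the main stream; that run serves as the baseline for the comparison.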

Test Result

H20
This PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  13.93     
Total input tokens:                      65536     
Total generated tokens:                  31118     
Request throughput (req/s):              2.30      
Output token throughput (tok/s):         2234.03   
Peak output token throughput (tok/s):    2573.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          6938.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          867.30    
Median TTFT (ms):                        873.08    
P99 TTFT (ms):                           1438.24   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.76     
Median TPOT (ms):                        12.65     
P99 TPOT (ms):                           13.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.73     
Median ITL (ms):                         12.19     
P99 ITL (ms):                            12.96     
==================================================

main

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  14.00     
Total input tokens:                      65536     
Total generated tokens:                  31271     
Request throughput (req/s):              2.29      
Output token throughput (tok/s):         2233.72   
Peak output token throughput (tok/s):    2573.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          6915.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          871.24    
Median TTFT (ms):                        876.67    
P99 TTFT (ms):                           1442.63   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.80     
Median TPOT (ms):                        12.71     
P99 TPOT (ms):                           13.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.79     
Median ITL (ms):                         12.30     
P99 ITL (ms):                            13.04     
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <[email protected]>
@mergify mergify bot added the qwen Related to Qwen models label Oct 27, 2025
@ZJY0516 ZJY0516 marked this pull request as ready for review October 27, 2025 13:40
@ZJY0516 ZJY0516 requested a review from sighingnow as a code owner October 27, 2025 13:40
@vadiklyutiy
Contributor

CC @alexm-redhat @LucasWilkinson @nvpohanh as participants of #26440

@vadiklyutiy
Contributor

On B200, the speed with and without this PR is the same (within the margin of measurement error).

@ZJY0516
Contributor Author

ZJY0516 commented Oct 28, 2025

Using a single H20, there is now an observable performance improvement.

main

env VLLM_DISABLE_SHARED_EXPERTS_STREAM=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  31.85     
Total input tokens:                      65536     
Total generated tokens:                  31424     
Request throughput (req/s):              1.00      
Output token throughput (tok/s):         986.62    
Peak output token throughput (tok/s):    1178.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          3044.24   
---------------Time to First Token----------------
Mean TTFT (ms):                          2147.91   
Median TTFT (ms):                        2039.91   
P99 TTFT (ms):                           4449.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.00     
Median TPOT (ms):                        28.87     
P99 TPOT (ms):                           31.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.97     
Median ITL (ms):                         26.88     
P99 ITL (ms):                            27.88     
==================================================

This PR

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  29.64     
Total input tokens:                      65536     
Total generated tokens:                  31424     
Request throughput (req/s):              1.08      
Output token throughput (tok/s):         1060.03   
Peak output token throughput (tok/s):    1271.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          3270.76   
---------------Time to First Token----------------
Mean TTFT (ms):                          2105.32   
Median TTFT (ms):                        2030.24   
P99 TTFT (ms):                           4169.15   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.88     
Median TPOT (ms):                        26.73     
P99 TPOT (ms):                           29.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.86     
Median ITL (ms):                         24.92     
P99 ITL (ms):                            26.02     
==================================================

Contributor

@vadiklyutiy vadiklyutiy left a comment

I think the changes are pretty straightforward and mirror what we did for DeepSeek.
I propose to merge this PR.

@ZJY0516 ZJY0516 changed the title [perf] Enable concurrent execution of "shared_experts" and "selected_experts" [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next Oct 28, 2025
Collaborator

@simon-mo simon-mo left a comment

vadim stamped

@simon-mo simon-mo enabled auto-merge (squash) October 28, 2025 22:31
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 28, 2025
@simon-mo simon-mo merged commit 8df98c2 into vllm-project:main Oct 29, 2025
53 checks passed
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025

Labels

qwen: Related to Qwen models
ready: ONLY add when PR is ready to merge/full CI is needed
