
Conversation

@ZJY0516
Contributor

@ZJY0516 ZJY0516 commented Oct 27, 2025

Purpose

Following #26440, enable concurrent execution of "shared_experts" and "selected_experts" in Qwen3-Next.
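For readers unfamiliar with the pattern, below is a minimal, hypothetical sketch of the multi-stream overlap being enabled here, mirroring what #26440 did for DeepSeek. All names (moe_forward, shared_experts, routed_experts) are illustrative, not the actual vLLM code, which lives in the fused-MoE / Qwen3-Next layers.

```python
import torch

# Side stream on which the dense shared experts will run.
shared_stream = torch.cuda.Stream()

def moe_forward(hidden_states, shared_experts, routed_experts):
    # Fork: the side stream must first see hidden_states as produced
    # on the main stream.
    shared_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(shared_stream):
        # Dense shared-experts MLP, overlapping with the routed-experts
        # work queued on the main stream below.
        shared_out = shared_experts(hidden_states)
    # Top-k routed ("selected") experts meanwhile run on the main stream.
    routed_out = routed_experts(hidden_states)
    # Join: the main stream waits for the side stream before combining.
    torch.cuda.current_stream().wait_stream(shared_stream)
    # A real implementation also has to respect caching-allocator stream
    # semantics (e.g., Tensor.record_stream) for tensors crossing streams.
    return routed_out + shared_out
```

The win comes from overlapping two computations with no data dependency on each other; whether it materializes depends on how much headroom the GPU has, which matches the H20 vs. B200 observations below.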

Test Plan

Multi-stream for shared experts enabled (this PR's behavior):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 8

Disable multi-stream for shared experts:

VLLM_DISABLE_SHARED_EXPERTS_STREAM=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 8

Benchmark command (used for both runs):

vllm bench serve \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --dataset-name random \
  --tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct \
  --num-prompts 32 \
  --random-input-len 2048 \
  --random-output-len 1024
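Note: per the labels above, setting VLLM_DISABLE_SHARED_EXPERTS_STREAM in the environment turns the separate stream off, so the shared experts run inline on the main stream; that run serves as the baseline for the comparison.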

Test Result

H20
This PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  13.93     
Total input tokens:                      65536     
Total generated tokens:                  31118     
Request throughput (req/s):              2.30      
Output token throughput (tok/s):         2234.03   
Peak output token throughput (tok/s):    2573.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          6938.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          867.30    
Median TTFT (ms):                        873.08    
P99 TTFT (ms):                           1438.24   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.76     
Median TPOT (ms):                        12.65     
P99 TPOT (ms):                           13.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.73     
Median ITL (ms):                         12.19     
P99 ITL (ms):                            12.96     
==================================================

main

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  14.00     
Total input tokens:                      65536     
Total generated tokens:                  31271     
Request throughput (req/s):              2.29      
Output token throughput (tok/s):         2233.72   
Peak output token throughput (tok/s):    2573.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          6915.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          871.24    
Median TTFT (ms):                        876.67    
P99 TTFT (ms):                           1442.63   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.80     
Median TPOT (ms):                        12.71     
P99 TPOT (ms):                           13.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.79     
Median ITL (ms):                         12.30     
P99 ITL (ms):                            13.04     
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <[email protected]>
@mergify mergify bot added the qwen Related to Qwen models label Oct 27, 2025
@ZJY0516 ZJY0516 marked this pull request as ready for review October 27, 2025 13:40
@ZJY0516 ZJY0516 requested a review from sighingnow as a code owner October 27, 2025 13:40
@vadiklyutiy
Contributor

CC @alexm-redhat @LucasWilkinson @nvpohanh as participants of #26440

@vadiklyutiy
Contributor

On B200, the speed with and without this PR is the same (within the margin of measurement error).

@ZJY0516
Contributor Author

ZJY0516 commented Oct 28, 2025

Using a single H20, there is now an observable performance improvement.

main

env VLLM_DISABLE_SHARED_EXPERTS_STREAM=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  31.85     
Total input tokens:                      65536     
Total generated tokens:                  31424     
Request throughput (req/s):              1.00      
Output token throughput (tok/s):         986.62    
Peak output token throughput (tok/s):    1178.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          3044.24   
---------------Time to First Token----------------
Mean TTFT (ms):                          2147.91   
Median TTFT (ms):                        2039.91   
P99 TTFT (ms):                           4449.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.00     
Median TPOT (ms):                        28.87     
P99 TPOT (ms):                           31.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.97     
Median ITL (ms):                         26.88     
P99 ITL (ms):                            27.88     
==================================================

This PR

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Benchmark duration (s):                  29.64     
Total input tokens:                      65536     
Total generated tokens:                  31424     
Request throughput (req/s):              1.08      
Output token throughput (tok/s):         1060.03   
Peak output token throughput (tok/s):    1271.00   
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          3270.76   
---------------Time to First Token----------------
Mean TTFT (ms):                          2105.32   
Median TTFT (ms):                        2030.24   
P99 TTFT (ms):                           4169.15   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.88     
Median TPOT (ms):                        26.73     
P99 TPOT (ms):                           29.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.86     
Median ITL (ms):                         24.92     
P99 ITL (ms):                            26.02     
==================================================

Contributor

@vadiklyutiy vadiklyutiy left a comment

I think the changes are pretty straightforward and mirror what we did for DeepSeek.
I propose to merge this PR.

@ZJY0516 ZJY0516 changed the title [perf] Enable concurrent execution of "shared_experts" and "selected_experts" [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next Oct 28, 2025
Collaborator

@simon-mo simon-mo left a comment

vadim stamped

@simon-mo simon-mo enabled auto-merge (squash) October 28, 2025 22:31
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 28, 2025
@simon-mo simon-mo merged commit 8df98c2 into vllm-project:main Oct 29, 2025
53 checks passed
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025

Labels

qwen: Related to Qwen models
ready: ONLY add when PR is ready to merge/full CI is needed
