Proposal to improve performance
Qwen3-Next has significant CPU overhead even at pretty big batch sizes (above 512, the default upper bound for cudagraph capture size).
I took batch size 1024 for the demonstration. For that I used
vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos
where most of the time is spent decoding 1024 requests in 1024-size batches.
For the first run,
VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching
it shows around 13,900 generated tokens/s.
If we capture cudagraphs up to batch size 1024 with --cuda-graph-sizes=1024,
VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --cuda-graph-sizes=1024
performance is dramatically better: more than 21,000 generated tokens/s, roughly a 1.5x speedup.
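For reference, here is what I believe is the equivalent of that flag through the offline API, as a minimal sketch. It assumes the `compilation_config` argument and the `cudagraph_capture_sizes` field of vLLM's CompilationConfig; the exact size list is my assumption (on the CLI, `--cuda-graph-sizes` takes just the maximum and vLLM fills in the intermediate sizes):

```python
import os

# Same env toggle as in the serve commands above.
os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    enable_prefix_caching=False,
    # Capture cudagraphs up to batch size 1024 instead of the default 512.
    # The concrete list below is an illustrative assumption.
    compilation_config={
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    },
)

out = llm.generate(["hello"] * 4, SamplingParams(max_tokens=16, ignore_eos=True))
```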
Below is an nsys profile (B200). On the screenshot:
- orange: full attention
- yellow: GDN attention
- red: MoE (in the top row; please ignore the red in the second row)
As we can see, there is a lot of GPU idle time. The main reason is GDN attention: we spend more time on the CPU than on the GPU there. (MoE also spends substantial CPU time, but still 20-25% less.) I don't have a specific way to speed up the CPU side of GDN attention; it requires further investigation.
For now, just fix the problem by capturing cudagraphs up to the expected batch size.
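To illustrate the mechanism outside of vLLM, here is a standalone PyTorch micro-benchmark of a launch-bound op sequence: when a step consists of hundreds of tiny kernels (as in GDN decode), the CPU launch time can exceed the GPU time, and cudagraph replay collapses all of it into a single launch. The shapes and iteration counts are made up purely for illustration:

```python
import time
import torch

assert torch.cuda.is_available()

x = torch.randn(1024, 256, device="cuda")

def many_small_ops(t):
    # Stand-in for a decode step built from many tiny elementwise kernels.
    for _ in range(200):
        t = t * 1.0001 + 0.0001
    return t

# Warm up on a side stream, as recommended before cudagraph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        many_small_ops(x)
torch.cuda.current_stream().wait_stream(s)

# Eager: every step pays CPU launch overhead for every tiny kernel.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    many_small_ops(x)
torch.cuda.synchronize()
eager_ms = (time.perf_counter() - t0) / 50 * 1e3

# Captured: the whole sequence becomes one graph, replayed with one call.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = many_small_ops(x)

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    g.replay()
torch.cuda.synchronize()
graph_ms = (time.perf_counter() - t0) / 50 * 1e3

print(f"eager: {eager_ms:.3f} ms/step  graph replay: {graph_ms:.3f} ms/step")
```

On a launch-bound sequence the replay path should be much cheaper per step, which matches the GPU-idle pattern in the profile above: the GPU sits waiting for the CPU to issue the next kernel.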