
[Performance][Qwen3-next] Decrease huge CPU overhead #27222

@vadiklyutiy

Description


Proposal to improve performance

Qwen3-next has significant CPU overhead even with fairly large batch sizes (more than 512, the default upper bound for cudagraph capture).

I used batch size 1024 for demonstration, with the following benchmark command:

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos

With these settings, most of the time is spent decoding 1024 requests in 1024-size batches.

For the first run,

VLLM_USE_FLASHINFER_MOE_FP16=1  vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching 

it reaches around 13,900 generated tokens/s.

If we capture cudagraphs up to size 1024 with --cuda-graph-sizes=1024,

VLLM_USE_FLASHINFER_MOE_FP16=1  vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --cuda-graph-sizes=1024

throughput is dramatically better: more than 21,000 generated tokens/s, roughly a 1.5x improvement.

Below is an nsys profile, taken on a B200.

[Screenshot: nsys timeline of a decode step]

In the screenshot:
orange - full attention
yellow - GDN attention
red - MoE (in the top row; please ignore the red in the second row)

As we can see, there is a lot of GPU idle time. The main contributor is GDN attention: we spend more time on the CPU than on the GPU there.
(MoE also spends significant time on the CPU, but still 20-25% less than GDN.)
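For reference, a timeline like the one above can be captured with Nsight Systems. A minimal sketch, assuming the profiled run is the first configuration and that only a CUDA/NVTX trace is needed (output name and trace options are my choice, not from the original run):

# Sketch: wrap the serving process in nsys to capture a CUDA/NVTX timeline.
# Trace options and output name are assumptions; adjust to your environment.
VLLM_USE_FLASHINFER_MOE_FP16=1 nsys profile -t cuda,nvtx -o qwen3_next_decode \
    vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching

The benchmark command from above is then run against the profiled server, and the timeline is inspected in the Nsight Systems GUI.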

I don't have a specific way to speed up the CPU side of GDN attention yet; that requires further investigation.
For now, the practical mitigation is to capture cudagraphs up to the full batch size (e.g. --cuda-graph-sizes=1024) to hide the CPU overhead.
