Proposal to improve performance
Qwen3-Next has significant CPU overhead even at pretty big batch sizes (above 512, the default upper bound for cudagraph capture size).
I took batch size 1024 for the demonstration. For that I used
vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos
where most of the time is spent decoding 1024 requests in 1024-size batches.
For the first run,
VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching
it shows around 13,900 generated tokens/s.
If we capture cudagraphs up to batch size 1024 with --cuda-graph-sizes=1024,
VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --cuda-graph-sizes=1024
performance is dramatically better: more than 21,000 generated tokens/s, roughly a 1.5x speedup.
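For reference, here is what I believe is the equivalent of that flag through the offline API, as a minimal sketch. It assumes the `compilation_config` argument and the `cudagraph_capture_sizes` field of vLLM's CompilationConfig; the exact size list is my assumption (on the CLI, `--cuda-graph-sizes` takes just the maximum and vLLM fills in the intermediate sizes):

```python
import os

# Same env toggle as in the serve commands above.
os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    enable_prefix_caching=False,
    # Capture cudagraphs up to batch size 1024 instead of the default 512.
    # The concrete list below is an illustrative assumption.
    compilation_config={
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    },
)

out = llm.generate(["hello"] * 4, SamplingParams(max_tokens=16, ignore_eos=True))
```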
Below is an nsys profile (B200). On the screenshot:
- orange: full attention
- yellow: GDN attention
- red: MoE (in the top row; please ignore the red in the second row)
As we can see, there is a lot of GPU idle time. The main reason is GDN attention: we spend more time on the CPU than on the GPU there. (MoE also spends substantial CPU time, but still 20-25% less.) I don't have a specific way to speed up the CPU side of GDN attention; it requires further investigation.
For now, just fix the problem by capturing cudagraphs up to the expected batch size.
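To illustrate the mechanism outside of vLLM, here is a standalone PyTorch micro-benchmark of a launch-bound op sequence: when a step consists of hundreds of tiny kernels (as in GDN decode), the CPU launch time can exceed the GPU time, and cudagraph replay collapses all of it into a single launch. The shapes and iteration counts are made up purely for illustration:

```python
import time
import torch

assert torch.cuda.is_available()

x = torch.randn(1024, 256, device="cuda")

def many_small_ops(t):
    # Stand-in for a decode step built from many tiny elementwise kernels.
    for _ in range(200):
        t = t * 1.0001 + 0.0001
    return t

# Warm up on a side stream, as recommended before cudagraph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        many_small_ops(x)
torch.cuda.current_stream().wait_stream(s)

# Eager: every step pays CPU launch overhead for every tiny kernel.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    many_small_ops(x)
torch.cuda.synchronize()
eager_ms = (time.perf_counter() - t0) / 50 * 1e3

# Captured: the whole sequence becomes one graph, replayed with one call.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = many_small_ops(x)

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    g.replay()
torch.cuda.synchronize()
graph_ms = (time.perf_counter() - t0) / 50 * 1e3

print(f"eager: {eager_ms:.3f} ms/step  graph replay: {graph_ms:.3f} ms/step")
```

On a launch-bound sequence the replay path should be much cheaper per step, which matches the GPU-idle pattern in the profile above: the GPU sits waiting for the CPU to issue the next kernel.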