Labels: performance (Performance-related issues)
Description
Proposal to improve performance
This is a tracking issue for performance optimization for Qwen3-next to keep all necessary things in one place.
- `torch.compile` for GDN attn ([Perf][Qwen3-next]: `torch.compile` GDN attn #27152)
- Non-optimal performance of `linear` for small batch sizes ([Performance]: non-optimal performance of `linear` for small batches #27173)
- GDN attn: decrease CPU overhead ([Performance][Qwen3-next] Decrease huge CPU overhead #27222)
- Full CUDA graph for TRT-LLM Gen attn (for MTP only) ([Perf] Enable full CUDA graphs for spec decoding with FlashInfer #26937)
- Enable TRT-LLM Gen MoE:
  - FP16:
  - FP8: [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next #27492
  - FP4:
DONE:
- Update the routing for TRTLLMGEN to support Kimi K2 and Qwen (flashinfer-ai/flashinfer#1831)
- Move the gate/router op to run after the shared_experts execution ([perf] Enable concurrent execution of "shared_experts" and "selected_experts" #27578; see [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE #26440 for reference)
- Async scheduling + spec decoding (not a Qwen3-next-specific feature, but required) ([Core] Async Scheduling X Spec Decoding Compatibility #24799)
- GDN prefix cache ([V1][Hybrid] GatedDeltaNet Automatic Prefix Caching #26807)
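The shared_experts / selected_experts overlap above (#27578, #26440) comes from issuing the two expert branches on separate CUDA streams so they can execute concurrently. A minimal PyTorch sketch of that pattern, not vLLM's actual FusedMoE code (`moe_forward`, `shared`, and `routed` are hypothetical names, and it falls back to sequential execution when CUDA is unavailable):

```python
import torch

def moe_forward(x, shared, routed):
    """Run the shared-expert and routed-expert branches on separate CUDA
    streams so they can overlap, then sum their outputs (illustrative sketch)."""
    if torch.cuda.is_available():
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())  # ensure x is ready
        with torch.cuda.stream(side):
            shared_out = shared(x)          # shared experts on the side stream
        routed_out = routed(x)              # routed experts on the default stream
        torch.cuda.current_stream().wait_stream(side)  # join before combining
    else:
        # No CUDA: plain sequential execution, same result
        shared_out = shared(x)
        routed_out = routed(x)
    return shared_out + routed_out
```

Moving the gate/router op after the shared_experts launch (the #27578 change) matters for exactly this reason: the side stream must be filled with work before the default stream blocks on routing.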
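The GDN prefix-cache item (#26807) extends vLLM's automatic prefix caching, which keys cached blocks by a hash chain over fixed-size token blocks, so equal hashes imply equal prefixes and a new request can reuse cached state for its longest matching prefix. A plain-Python sketch of the hashing idea only; `BLOCK_SIZE`, `block_hashes`, and `shared_prefix_blocks` are illustrative names, not vLLM's implementation:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real block sizes are larger

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's
    hash so equal hashes imply equal prefixes, not just equal blocks."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def shared_prefix_blocks(cache, token_ids):
    """Count how many leading blocks of token_ids are already cached."""
    n = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        n += 1
    return n
```

For example, caching the blocks of one prompt and then querying a second prompt that shares its first eight tokens reports two reusable blocks; the hard part tracked by #26807 is making the GDN recurrent state (not just KV blocks) restorable at those block boundaries.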
ZJY0516