Labels: performance (Performance-related issues)
Description
Proposal to improve performance
This is a tracking issue for performance optimization for Qwen3-next to keep all necessary things in one place.
- `torch.compile` for GDN attn ([Perf][Qwen3-next]: `torch.compile` GDN attn #27152)
- Non-optimal performance of `linear` for small batch sizes ([Performance]: non-optimal performance of `linear` for small batches #27173)
- GDN attn: decrease CPU overhead ([Performance][Qwen3-next] Decrease huge CPU overhead #27222)
- Full CUDA graph for TRT-LLM Gen attn (for MTP only) ([Perf] Enable full CUDA graphs for spec decoding with FlashInfer #26937)
- Enable TRT-LLM Gen MoE:
  - FP16:
  - FP8: [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next #27492
  - FP4:
DONE:
- Update the routing for TRTLLMGEN to support Kimi K2 and Qwen (flashinfer-ai/flashinfer#1831)
- Move the gate/router op to run after the shared_experts execution ([perf] Enable concurrent execution of "shared_experts" and "selected_experts" #27578; see [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE #26440 for reference)
- Async scheduling + spec decoding (not a Qwen3-next-specific feature, but required) ([Core] Async Scheduling X Spec Decoding Compatibility #24799)
- GDN prefix cache ([V1][Hybrid] GatedDeltaNet Automatic Prefix Caching #26807)
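The shared_experts / selected_experts overlap above (#27578, #26440) comes from issuing the two expert branches on separate CUDA streams so they can execute concurrently. A minimal PyTorch sketch of that pattern, not vLLM's actual FusedMoE code (`moe_forward`, `shared`, and `routed` are hypothetical names, and it falls back to sequential execution when CUDA is unavailable):

```python
import torch

def moe_forward(x, shared, routed):
    """Run the shared-expert and routed-expert branches on separate CUDA
    streams so they can overlap, then sum their outputs (illustrative sketch)."""
    if torch.cuda.is_available():
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())  # ensure x is ready
        with torch.cuda.stream(side):
            shared_out = shared(x)          # shared experts on the side stream
        routed_out = routed(x)              # routed experts on the default stream
        torch.cuda.current_stream().wait_stream(side)  # join before combining
    else:
        # No CUDA: plain sequential execution, same result
        shared_out = shared(x)
        routed_out = routed(x)
    return shared_out + routed_out
```

Moving the gate/router op after the shared_experts launch (the #27578 change) matters for exactly this reason: the side stream must be filled with work before the default stream blocks on routing.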
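The GDN prefix-cache item (#26807) extends vLLM's automatic prefix caching, which keys cached blocks by a hash chain over fixed-size token blocks, so equal hashes imply equal prefixes and a new request can reuse cached state for its longest matching prefix. A plain-Python sketch of the hashing idea only; `BLOCK_SIZE`, `block_hashes`, and `shared_prefix_blocks` are illustrative names, not vLLM's implementation:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real block sizes are larger

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's
    hash so equal hashes imply equal prefixes, not just equal blocks."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def shared_prefix_blocks(cache, token_ids):
    """Count how many leading blocks of token_ids are already cached."""
    n = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        n += 1
    return n
```

For example, caching the blocks of one prompt and then querying a second prompt that shares its first eight tokens reports two reusable blocks; the hard part tracked by #26807 is making the GDN recurrent state (not just KV blocks) restorable at those block boundaries.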
ZJY0516