[FEA] Support SM90 Grouped GEMM

**Is your feature request related to a problem? Please describe.**
Grouped GEMM using CUTLASS is ~30% slower than a for-loop over cuBLAS GEMM calls on SM90 (H100). Both implementations (CUTLASS grouped GEMM and the cuBLAS loop) can be found here: https://github.com/tgale96/grouped_gemm/blob/main/csrc/grouped_gemm.cu
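
For readers unfamiliar with the comparison, here is a minimal CPU-only sketch of what the for-loop baseline computes: a list of independent GEMMs, each with its own shape. It is not the code from the linked repository. `GemmProblem`, `gemm_reference`, and `grouped_gemm_reference` are hypothetical names used only for illustration, and `gemm_reference` merely stands in for a single cuBLAS GEMM call. A grouped GEMM kernel is expected to produce the same outputs for all problems in one launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one independent problem: C = A * B,
// with row-major A (m x k), B (k x n), and C (m x n).
struct GemmProblem {
    std::size_t m, n, k;
    std::vector<float> a, b, c;
};

// Reference GEMM for a single problem (stand-in for one cuBLAS call).
void gemm_reference(GemmProblem& p) {
    assert(p.a.size() == p.m * p.k);
    assert(p.b.size() == p.k * p.n);
    p.c.assign(p.m * p.n, 0.0f);
    for (std::size_t i = 0; i < p.m; ++i) {
        for (std::size_t l = 0; l < p.k; ++l) {
            const float a_il = p.a[i * p.k + l];
            for (std::size_t j = 0; j < p.n; ++j) {
                p.c[i * p.n + j] += a_il * p.b[l * p.n + j];
            }
        }
    }
}

// For-loop baseline: one GEMM per group, and shapes may differ per group.
void grouped_gemm_reference(std::vector<GemmProblem>& problems) {
    for (auto& p : problems) {
        gemm_reference(p);
    }
}
```

Because each group carries its own `m`, `n`, and `k`, the problems cannot in general be fused into a single batched GEMM with uniform shapes, which is why a dedicated grouped kernel is of interest.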

**Describe the solution you'd like**
Consider adding SM90 support to the Grouped GEMM kernel in CUTLASS. The kernel currently targets SM80. Grouped GEMM is important for training MoE models.


[FEA] Support SM90 Grouped GEMM #1280
