Closed
Labels: feature request
Description
Is your feature request related to a problem? Please describe.
Grouped GEMM using CUTLASS is ~30% slower than a for-loop of cuBLAS GEMM calls on SM90 (H100). Implementations of grouped GEMM using both CUTLASS and cuBLAS can be found here: https://github.com/tgale96/grouped_gemm/blob/main/csrc/grouped_gemm.cu
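For context, here is a minimal CPU sketch of what "grouped GEMM" computes: a batch of independent GEMMs whose shapes may differ from one problem to the next (in MoE, for example, each expert typically receives a different number of tokens). This is only a plain reference loop to illustrate the semantics. It is not the CUTLASS kernel, not the cuBLAS path, and not the code in the linked repository. The names `GemmProblem`, `gemm_reference`, and `grouped_gemm_reference` are hypothetical, and row-major float matrices are an assumption made for simplicity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical descriptor for one problem in the group:
// C (m x n) = A (m x k) * B (k x n), all row-major, no transposes.
struct GemmProblem {
    int m, n, k;
    const float* a;  // m * k elements
    const float* b;  // k * n elements
    float* c;        // m * n elements, overwritten
};

// Naive triple-loop GEMM for a single problem.
void gemm_reference(const GemmProblem& p) {
    assert(p.m >= 0 && p.n >= 0 && p.k >= 0);
    for (int i = 0; i < p.m; ++i) {
        for (int j = 0; j < p.n; ++j) {
            float acc = 0.0f;
            for (int l = 0; l < p.k; ++l) {
                acc += p.a[i * p.k + l] * p.b[l * p.n + j];
            }
            p.c[i * p.n + j] = acc;
        }
    }
}

// Grouped GEMM semantics: run every problem independently.
// Each problem may have its own m, n, and k.
void grouped_gemm_reference(const std::vector<GemmProblem>& problems) {
    for (const GemmProblem& p : problems) {
        gemm_reference(p);
    }
}
```

The for-loop baseline mentioned above has the same shape as `grouped_gemm_reference`, with each iteration issuing one GPU GEMM call instead of the naive loop. A grouped kernel instead handles all problems in a single launch.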
Describe the solution you'd like
Consider adding SM90 support to the Grouped GEMM kernel in CUTLASS, which currently targets SM80. Grouped GEMM is important for training MoE (mixture-of-experts) models.