
Conversation

@learning-chip (Contributor) commented on Aug 18, 2024

Follow-up to #39 (comment).

Eventually this will allow the e2e mamba2 example (#39) to run without depending on the original mamba_ssm repo.

This PR adds unit tests to ensure equivalence between chunk_simple_gla / torch_simple_gla / torch_simple_gla_recurrent (under fla.ops.simple_gla in this repository) and mamba_chunk_scan_combined / ssd_minimal_discrete (in the mamba_ssm repository).

Unit test output from this PR:

$ pytest -v ./test_simple_gla_for_mamba2.py
====================================================== test session starts ======================================================
collected 6 items                                                                                                               

test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float32-True] PASSED                                                    [ 16%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float32-False] PASSED                                                   [ 33%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float16-True] PASSED                                                    [ 50%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float16-False] PASSED                                                   [ 66%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[bfloat16-True] PASSED                                                   [ 83%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[bfloat16-False] PASSED                                                  [100%]

Differences between the simple_gla kernel and the "mamba2_ssd" kernels:

  • mamba2_ssd uses the input/output layout [batch, seq, head, hidden], while simple_gla uses [batch, head, seq, hidden].
  • mamba2_ssd does not apply the attention-inspired scaling q * (DK ** -0.5).
  • mamba2_ssd takes an extra dt input for discretization, but this can easily be absorbed into the gating matrix A (and the input x), as is done in the mamba2 example.
  • mamba2_ssd's fused kernel does not take a time-varying A (though the minimal torch version does), probably because the time dependence is expressed through dt rather than A_t. simple_gla supports a time-varying gate g directly.
  • mamba2_ssd uses grouped-query attention, whereas simple_gla (and possibly other kernels in this repo) always uses the same number of heads for Q, K, and V. For now, the tests force the same number of heads (see the sketch below for how the inputs map onto each other).

Ref. Section 7.2 of the Mamba-2 paper:
[figure: group_query]
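
As a concrete illustration of the mapping above, here is a minimal sketch (not the PR's actual test) of how simple_gla inputs could be built from mamba2-style inputs and checked for equivalence. It assumes chunk_simple_gla takes (q, k, v, g) in [batch, head, seq, dim] layout with a scale argument and returns (output, final_state), and that ssd_minimal_discrete follows the [batch, seq, head, dim] layout of the mamba2 minimal example; exact signatures may differ across versions.

```python
# Minimal sketch under the assumptions stated above; not the PR's test code.
import torch
from fla.ops.simple_gla import chunk_simple_gla
# from mamba_ssm.modules.ssd_minimal import ssd_minimal_discrete  # reference impl

torch.manual_seed(0)
B, L, H, D, N = 2, 64, 4, 64, 32   # batch, seq_len, heads, head_dim, state_dim
device = 'cuda'

x  = torch.randn(B, L, H, D, device=device)                    # mamba2 "x" -> simple_gla v
dt = torch.nn.functional.softplus(torch.randn(B, L, H, device=device))
A  = -torch.exp(torch.rand(H, device=device))                  # per-head decay (negative)
Bm = torch.randn(B, L, H, N, device=device)                    # mamba2 "B" -> simple_gla k
Cm = torch.randn(B, L, H, N, device=device)                    # mamba2 "C" -> simple_gla q

# Absorb dt into the gate and the value, as in the mamba2 example:
#   h_t = exp(dt_t * A) * h_{t-1} + B_t (dt_t * x_t)
g = A * dt                         # log-space decay per step, shape [B, L, H]
v = x * dt.unsqueeze(-1)

# simple_gla expects [batch, head, seq, dim]; scale=1.0 disables the 1/sqrt(d) scaling
y_gla, _ = chunk_simple_gla(
    Cm.transpose(1, 2).contiguous(), Bm.transpose(1, 2).contiguous(),
    v.transpose(1, 2).contiguous(), g.transpose(1, 2).contiguous(), scale=1.0,
)
y_gla = y_gla.transpose(1, 2)      # back to [batch, seq, head, dim]

# Compare against the mamba2 minimal reference (same layout as its example):
# y_ref, _ = ssd_minimal_discrete(v, g, Bm, Cm, block_len=64)
# torch.testing.assert_close(y_gla, y_ref, rtol=1e-3, atol=1e-3)
```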

Todo:

FYI @DanFosing @yzhangcs @sustcsonglin

@yzhangcs (Member) commented:

@learning-chip very cool contribution! I think it would be great if you could add some benchmarks comparing the simple_gla and mamba2 kernels, like in https://github.com/sustcsonglin/flash-linear-attention/blob/main/benchmarks/ops/benchmark_gla.py.
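
Not from the PR, but a rough sketch of what such a benchmark could look like, using triton.testing.do_bench in the spirit of benchmark_gla.py. The mamba_chunk_scan_combined argument shapes and the chunk_simple_gla signature are assumptions and may differ across versions:

```python
# Rough benchmark sketch; shapes and signatures are assumptions, not the PR's code.
import torch
import triton
from fla.ops.simple_gla import chunk_simple_gla
from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined

B, H, L, D, N = 8, 8, 2048, 64, 64
device, dtype = 'cuda', torch.bfloat16

# simple_gla inputs: [batch, head, seq, dim]
q = torch.randn(B, H, L, N, device=device, dtype=dtype)
k = torch.randn(B, H, L, N, device=device, dtype=dtype)
v = torch.randn(B, H, L, D, device=device, dtype=dtype)
g = torch.randn(B, H, L, device=device, dtype=torch.float32).sigmoid().log()

# mamba2 inputs: [batch, seq, head, dim], plus per-step dt and per-head A
x  = v.transpose(1, 2).contiguous()
dt = torch.rand(B, L, H, device=device, dtype=torch.float32) + 1e-3
A  = -torch.rand(H, device=device, dtype=torch.float32)
Bm = k.transpose(1, 2).contiguous()
Cm = q.transpose(1, 2).contiguous()

ms_gla = triton.testing.do_bench(lambda: chunk_simple_gla(q, k, v, g, scale=1.0))
ms_ssd = triton.testing.do_bench(
    lambda: mamba_chunk_scan_combined(x, dt, A, Bm, Cm, chunk_size=64))
print(f"chunk_simple_gla: {ms_gla:.3f} ms  mamba_chunk_scan_combined: {ms_ssd:.3f} ms")
```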

@yzhangcs (Member) commented:

I will be working on GQA soon.

@yzhangcs yzhangcs marked this pull request as ready for review August 18, 2024 17:40
@yzhangcs yzhangcs merged commit 9aa2480 into fla-org:main Aug 18, 2024
@learning-chip (Contributor, Author) commented:

> add some benchmarks regarding simple_gla and mamba2 kernels like in https://github.com/sustcsonglin/flash-linear-attention/blob/main/benchmarks/ops/benchmark_gla.py.

Some quick results: #50
