
Conversation

@learning-chip (Contributor) commented on Aug 18, 2024

Follow-up to #49

Amazingly, it seems like chunk_simple_gla is much faster than mamba_chunk_scan_combined:

$ python ./benchmark_simple_gla_vs_mamba2.py

Performance:
         T  chunk_simple_gla  mamba2_ssd
0     64.0          0.084992    0.840208
1    128.0          0.100352    0.847920
2    256.0          0.100368    0.848896
3    512.0          0.174080    0.873472
4   1024.0          0.399360    0.880208
5   2048.0          0.776352    1.596416
6   4096.0          1.526784    3.160064
7   8192.0          3.067904    6.251520
8  16384.0          6.220800   12.452864
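
For context, a minimal sketch of how such a comparison could be set up (this is not the actual benchmark_simple_gla_vs_mamba2.py script; the shapes, dtypes, and the exact chunk_simple_gla / mamba_chunk_scan_combined call signatures below are assumptions):

import torch
import triton

from fla.ops.simple_gla import chunk_simple_gla
from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined

B, H, T, D = 4, 8, 2048, 64  # hypothetical batch, heads, sequence length, head dim

# simple-GLA inputs (assumed layout): q/k/v of shape (B, H, T, D) plus a per-head log decay g of shape (B, H, T)
q = torch.randn(B, H, T, D, dtype=torch.bfloat16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)
g = torch.rand(B, H, T, dtype=torch.float32, device='cuda').log()

# Mamba-2 SSD inputs (assumed layout): x is (B, T, H, D); dt, A, B_mat, C follow the SSD parameterization
x = torch.randn(B, T, H, D, dtype=torch.bfloat16, device='cuda')
dt = torch.rand(B, T, H, dtype=torch.float32, device='cuda')
A = -torch.rand(H, dtype=torch.float32, device='cuda')
B_mat = torch.randn(B, T, 1, D, dtype=torch.bfloat16, device='cuda')
C = torch.randn_like(B_mat)

# median forward latency in ms for each kernel
ms_gla = triton.testing.do_bench(lambda: chunk_simple_gla(q, k, v, g))
ms_ssd = triton.testing.do_bench(lambda: mamba_chunk_scan_combined(x, dt, A, B_mat, C, chunk_size=64))
print(f'T={T}: chunk_simple_gla {ms_gla:.3f} ms, mamba2_ssd {ms_ssd:.3f} ms')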


I left many TODO and NOTE comments in the benchmark scripts, including:

  • testing more input shapes
  • tuning the block size
  • analyzing the impact of input memory layout

More importantly:

  • more detailed profiling to understand exactly why it is faster.

Maybe the Mamba-2 kernel incurs more memory I/O (i.e. it is less "fused")? And why does the short-sequence performance (T < 256) differ by so much?

@yzhangcs (Member) commented

@learning-chip Great job! Appreciate your quick actions.

@yzhangcs merged commit c60ada3 into fla-org:main on Aug 18, 2024
@sustcsonglin (Collaborator) commented

@learning-chip Mamba2’s official kernel involves three main steps: 1) computation of each chunk’s last hidden state, 2) recurrence at the chunk level, and 3) output computation.

For steps 1) and 2), it stores/loads the hidden state in FP32, which incurs significant I/O costs.

FLA’s implementation fuses steps 1) and 2): it avoids materializing the FP32 hidden state after step 1) and stores only the BF16 hidden state after step 2), thus reducing I/O cost.
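
To make the three steps concrete, here is a purely illustrative (non-fused) PyTorch reference of such a chunkwise algorithm for a simple-GLA-style recurrence S_t = exp(g_t) * S_{t-1} + k_t^T v_t with o_t = q_t S_t. The shapes, the scalar per-head decay, and the function name chunkwise_reference are assumptions for illustration, not the actual Triton kernels; the comments mark where a non-fused kernel would round-trip the FP32 chunk states through HBM, which is the traffic the fused kernel avoids.

import torch

def chunkwise_reference(q, k, v, g, chunk_size=64):
    # q, k, v: (B, H, T, D); g: (B, H, T) log decay; T divisible by chunk_size
    B, H, T, D = q.shape
    C = chunk_size
    q, k, v = (x.float().view(B, H, T // C, C, D) for x in (q, k, v))
    g = g.float().view(B, H, T // C, C)
    b = g.cumsum(-1)  # within-chunk cumulative log decay

    # Step 1: each chunk's local contribution to its last hidden state,
    #   S_local = sum_t exp(b_C - b_t) k_t^T v_t   (a (D, D) matrix per chunk).
    decay_k = (b[..., -1:] - b).exp().unsqueeze(-1)
    S_local = torch.einsum('bhntd,bhnte->bhnde', k * decay_k, v)

    # Step 2: chunk-level recurrence over the per-chunk initial states S_{i-1},
    #   S_i = exp(b_C) * S_{i-1} + S_local.
    # A non-fused kernel stores/loads every S_{i-1} in FP32 through HBM here;
    # fusing steps 1) and 2) keeps them on-chip and writes out only BF16 states.
    S = torch.zeros(B, H, D, D, dtype=torch.float32, device=q.device)
    states = []
    for i in range(T // C):
        states.append(S)
        S = b[:, :, i, -1].exp()[..., None, None] * S + S_local[:, :, i]
    states = torch.stack(states, dim=2)  # (B, H, N, D, D): state entering each chunk

    # Step 3: outputs = inter-chunk part (queries against S_{i-1}) + intra-chunk part.
    o_inter = torch.einsum('bhntd,bhnde->bhnte', q * b.exp().unsqueeze(-1), states)
    attn = torch.einsum('bhntd,bhnsd->bhnts', q, k) * (b.unsqueeze(-1) - b.unsqueeze(-2)).exp()
    attn = attn.masked_fill(torch.ones(C, C, dtype=torch.bool, device=q.device).triu(1), 0)
    o_intra = torch.einsum('bhnts,bhnse->bhnte', attn, v)
    return (o_inter + o_intra).view(B, H, T, D)

In this reference the step-2 states amount to B * H * (T/C) * D * D values per pass, so storing/loading them in FP32 versus keeping them on-chip (and writing only BF16) is exactly the I/O difference described above.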
