Flash attention #173

oliverdutton · 2024-04-21T11:27:13Z

Implements FlashAttention similarly to google-deepmind/alphafold#931

For a 759-residue protein with model_5, this improves runtime 2.2x on an L4 GPU (37.3 $\rightarrow$ 16.9 seconds, with the non-flash baseline using minibatching of 256 to avoid OOM).

Here's a colab link showing the runtime improvement and, by visual inspection, no significant change in prediction output. I didn't want to rerun all the input prep, so I've used a colab with AlphaFold input preparation and applied fixes for colabdesign.

Notes

Key variations from a reference flash attention kernel are:

- Attention logit biasing is supported.
- Gating is supported.
- Some heads have only 8 channels; they’re padded up to 16 within the kernel. This is a requirement of pl.dot. We still see a performance improvement relative to non-flash attention, and it keeps overall AlphaFold2 memory requirements linear.
- Masks broadcast in the batch, q and head dimensions are supported (they’re often size 1 and implicitly broadcast in AlphaFold2 einsums).
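To make the biasing and padding points concrete, here is a minimal plain-Python sketch for a single query row. It is not the Pallas kernel from this PR: the function names (`reference_row`, `blockwise_row`, `pad_channels`) are made up for illustration, and gating and mask broadcasting are left out. It only shows that visiting keys in blocks with a running maximum and normaliser gives the same output as ordinary softmax attention with an additive bias, and that zero-padding channels leaves the logits unchanged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pad_channels(vec, width):
    """Zero-pad a vector to `width` entries; zeros add nothing to a dot product."""
    return list(vec) + [0.0] * (width - len(vec))

def reference_row(q, keys, values, bias):
    """Ordinary softmax attention for one query vector, with an additive bias."""
    logits = [dot(q, k) + b for k, b in zip(keys, bias)]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def blockwise_row(q, keys, values, bias, block_k):
    """Same output, but keys are visited block by block (online softmax).

    Assumes every bias entry is finite (e.g. -1e9 rather than -inf).
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_k):
        ks = keys[start:start + block_k]
        vs = values[start:start + block_k]
        bs = bias[start:start + block_k]
        logits = [dot(q, k) + b for k, b in zip(ks, bs)]
        new_max = max(running_max, max(logits))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(x - new_max) for x in logits]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, vs))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because only one block of logits exists at a time, memory per query row does not grow with the number of keys, which is the property the kernel relies on.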

There are guards against the kernel being called for sequence lengths shorter than the block sizes specified for q and k; in that case the call falls back to the reference kernel.
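The guard can be pictured roughly as follows. This is a sketch under assumptions: `choose_attention_path` and the default block sizes of 128 are hypothetical and not taken from the PR; only the `use_flash_attention` flag and the fallback behaviour come from the description above.

```python
def choose_attention_path(q_len, k_len, block_q=128, block_k=128,
                          use_flash_attention=True):
    """Return which implementation a call would take.

    block_q / block_k defaults are illustrative assumptions only.
    """
    if use_flash_attention and q_len >= block_q and k_len >= block_k:
        return "flash"
    # Too short for one full block in q or k, or flash attention disabled.
    return "reference"
```

For example, `choose_attention_path(759, 759)` returns `"flash"`, while `choose_attention_path(64, 759)` returns `"reference"`.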

Comments

- I think the runtime improvement is benefitting from the triangular fusion you've previously implemented, as on an A100 with flash attention and bfloat16 I saw that part start to be significant.
- I haven't done correctness/speed checks with multimer models or models using templates. If you have a test suite that could do that, that'd be wonderful.
- When you said 'fused attention' you meant shifting the mask to a bias so XLA lowers it to a fused kernel, right? I've moved that mask $\rightarrow$ bias conversion into the Attention module itself and kept it in the reference_kernel (so reference_kernel now differs from the one in google-deepmind/alphafold#931). So with use_flash_attention=False I haven't changed behaviour: here's a colab showing the same 37.3s runtime as the main branch.
- Fix for use_dgram, which seemed to access the wrong config keys.
- Fix for models that do not contain a pae head.
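As a rough illustration of the mask $\rightarrow$ bias conversion mentioned above, the sketch below shows that applying a mask before the softmax and adding an equivalent bias give the same weights. The helper names are hypothetical, and the constant 1e9 is an assumption made for this sketch rather than something read from this PR's code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(logits, mask, large=1e9):
    """'Mask' form: overwrite masked-out logits before the softmax."""
    return softmax([x if keep else -large for x, keep in zip(logits, mask)])

def mask_to_bias(mask, large=1e9):
    """Convert a 0/1 mask to an additive bias: 0 where kept, -large where masked."""
    return [large * (float(keep) - 1.0) for keep in mask]

def biased_softmax(logits, bias):
    """'Bias' form: a plain add followed by softmax."""
    return softmax([x + b for x, b in zip(logits, bias)])
```

In the bias form the masking becomes a plain addition on the logits, which is the shape of computation the comment above describes as being lowered to a fused kernel.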

(+) fix: fix access to global config fix: allow lack of predicted_aligned_error head

oliverdutton · 2024-05-05T21:24:13Z

@sokrypton I think this is ready for merging.

It's still strictly opt-in (as Pallas with Triton is only available for Ampere architecture GPUs and up)

You could improve performance a bit more by tuning block sizes and the number of warps in an input-shape-dependent manner. Similarly, the `subbatch_size` global config setting could become a default heuristic based on memory usage that selects subbatch sizes.
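One possible shape for such a memory-based heuristic is sketched below. Everything here is hypothetical: the function names, the assumption that the attention logits dominate memory, and the 4-byte element size are illustrative choices, not part of this PR.

```python
def attention_logit_bytes(num_heads, q_len, k_len, bytes_per_element=4):
    """Memory for one batch item's [heads, q, k] logits in non-flash attention."""
    return num_heads * q_len * k_len * bytes_per_element

def pick_subbatch_size(batch, num_heads, q_len, k_len, memory_budget_bytes,
                       bytes_per_element=4):
    """Largest subbatch whose logits fit in the budget, clamped to [1, batch]."""
    per_item = attention_logit_bytes(num_heads, q_len, k_len, bytes_per_element)
    fit = max(1, memory_budget_bytes // per_item)
    return min(batch, fit)
```

With a batch of 759 rows, 4 heads, q and k lengths of 759 and a 1 GiB budget, `pick_subbatch_size(759, 4, 759, 759, 2**30)` selects 116 under these assumptions.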

oliverdutton mentioned this pull request Apr 21, 2024

Flash attention google-deepmind/alphafold#931

Open

feat: flash attention

c2e8505

(+) fix: fix access to global config fix: allow lack of predicted_aligned_error head

oliverdutton force-pushed the flash_attention branch from afc75f2 to c2e8505 Compare May 5, 2024 21:03

curtisdow1973-sys approved these changes Oct 3, 2025

2 participants