Multi-head Latent Attention in DeepSeek-V2
Paper Link: DeepSeek-V2
Multi-head Latent Attention (MLA) is a variant of multi-head attention introduced in the DeepSeek-V2 paper. Several multi-head attention variants aim primarily to reduce the size of the KV cache, a memory bottleneck that emerges when scaling large models. These methods, which include Grouped-Query Attention and Multi-Query Attention, are primarily performance tradeoffs: quality degrades somewhat, but the reduced memory overhead lets you scale much further. MLA instead compresses keys and values jointly into a low-rank latent vector, shrinking the KV cache while aiming to preserve the quality of standard multi-head attention.
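For a sense of scale, here is a rough back-of-the-envelope comparison of the per-token KV-cache footprint of standard multi-head attention versus MLA. The layer count, head shape, and latent/RoPE dimensions below are illustrative assumptions, not DeepSeek-V2's actual configuration.

```python
# Rough per-token KV-cache size comparison (illustrative shapes, fp16 = 2 bytes).
n_layers, n_heads, d_head = 32, 32, 128   # assumed model shape
d_latent, d_rope = 512, 64                # assumed MLA latent / decoupled-RoPE dims

mha_bytes = n_layers * 2 * n_heads * d_head * 2        # full K and V for every head
mla_bytes = n_layers * (d_latent + d_rope) * 2         # one latent + one shared RoPE key

print(f"MHA KV cache per token: {mha_bytes / 1024:.0f} KiB")  # 512 KiB
print(f"MLA KV cache per token: {mla_bytes / 1024:.0f} KiB")  # 36 KiB
```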
Implementation: Standard Multi-head Attention (see the sketch after this list)
- With Default
- With Standard RoPE
- With Decoupled RoPE
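As a baseline for the MLA variants, here is a minimal sketch of standard multi-head attention with an optional RoPE applied to queries and keys, covering the default and standard-RoPE items above. The `rope` helper, module name, and all shapes are my own assumptions in PyTorch, not the repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (batch, heads, seq, d_head)."""
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # rotate channel pairs (2i, 2i+1)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, use_rope=True):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.use_rope = use_rope
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        if self.use_rope:
            q, k = rope(q), rope(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```

At inference time this module caches the full per-head K and V for every layer, which is exactly the memory cost MLA is designed to cut.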
Implementation: Multi-head Latent Attention (see the sketch after this list)
- With Default
- With Standard RoPE
- With Decoupled RoPE
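Below is a minimal sketch of Multi-head Latent Attention with decoupled RoPE, following the structure described in the DeepSeek-V2 paper: keys and values are reconstructed from a shared low-rank latent, while a small separate RoPE-carrying query/key part supplies position information. It reuses the `rope` helper from the previous sketch; all dimensions and names are illustrative assumptions, and the paper's query-side compression is omitted for brevity.

```python
class MultiHeadLatentAttention(nn.Module):
    """Sketch of MLA with decoupled RoPE. In the paper only the compressed latent
    and the shared RoPE key are cached at inference; this sketch recomputes
    everything from x for clarity."""

    def __init__(self, d_model=512, n_heads=8, d_latent=128, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_model // n_heads, d_rope
        # Queries: a per-head content part plus a per-head RoPE part.
        self.w_q = nn.Linear(d_model, n_heads * (self.d_head + d_rope), bias=False)
        # Keys/values: jointly compressed into one low-rank latent per token.
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)               # down-projection
        self.w_uk = nn.Linear(d_latent, n_heads * self.d_head, bias=False)  # up-project keys
        self.w_uv = nn.Linear(d_latent, n_heads * self.d_head, bias=False)  # up-project values
        # Decoupled RoPE key: one head-shared rotary key per token.
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)
        self.out = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head + self.d_rope).transpose(1, 2)
        q_c, q_r = q.split([self.d_head, self.d_rope], dim=-1)
        q_r = rope(q_r)                                    # RoPE only on the decoupled part

        c_kv = self.w_dkv(x)                               # (b, t, d_latent): the cached latent
        k_c = self.w_uk(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v   = self.w_uv(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k_r = rope(self.w_kr(x).unsqueeze(1))              # (b, 1, t, d_rope), shared across heads
        k_r = k_r.expand(-1, self.n_heads, -1, -1)

        # Concatenate content and RoPE parts so a single softmax covers both score terms.
        q_full = torch.cat([q_c, q_r], dim=-1)
        k_full = torch.cat([k_c, k_r], dim=-1)
        y = F.scaled_dot_product_attention(q_full, k_full, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```

The decoupling is what lets RoPE coexist with the latent compression: the rotary rotation depends on position, so it cannot be folded into the static up-projection of the cached latent, and the paper therefore routes positional information through a separate small key dimension instead.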
Training Script
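A minimal training-loop sketch tying the pieces together. The wrapper model, random-token data, and hyperparameters are placeholders of my own, not the repo's actual training script.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical one-block language model wrapping the MLA module sketched above."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = MultiHeadLatentAttention(d_model=d_model)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):
        h = self.embed(tokens)
        h = h + self.attn(self.norm(h))   # pre-norm residual attention block
        return self.head(h)

model = TinyBlock()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                  # toy loop on random tokens
    tokens = torch.randint(0, 32000, (4, 128))           # (batch, seq)
    logits = model(tokens[:, :-1])                       # predict the next token
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```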
You can find more implementations here: AnythingFromScratchV2