MLA_Pytorch_Implementation

Multi-head Latent Attention (MLA) from DeepSeek-V2, implemented in PyTorch.

Paper link: DeepSeek-V2

Overview

Multi-head Latent Attention (MLA) is a variant of multi-head attention introduced in the DeepSeek-V2 paper. Several attention variants exist whose main purpose is to reduce the size of the KV cache, which becomes a memory bottleneck when scaling large models to long contexts. Methods such as Grouped-Query Attention and Multi-Query Attention are essentially performance trade-offs: quality degrades somewhat, but the reduced memory overhead lets you scale much further. MLA instead compresses keys and values into a small shared latent vector, caches only that latent, and reconstructs per-head keys and values from it at attention time, aiming to keep quality close to standard multi-head attention while shrinking the cache.
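
To make the core idea concrete, here is a minimal sketch of the latent KV compression in PyTorch. The module name SimpleMLA, the dimensions d_model, n_heads, and d_latent, and the cache handling are illustrative choices for this sketch, not the API of this repository; the decoupled RoPE branch from the paper is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLA(nn.Module):
    """Minimal sketch of Multi-head Latent Attention (no RoPE).

    Keys and values are reconstructed from a small shared latent c_kv,
    so only c_kv (d_latent numbers per token) needs to be cached instead
    of full per-head K and V (2 * n_heads * d_head numbers per token).
    """
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)     # queries (kept full-rank in this sketch)
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)  # down-projection -> cached latent
        self.w_uk = nn.Linear(d_latent, d_model, bias=False)   # up-projection to keys
        self.w_uv = nn.Linear(d_latent, d_model, bias=False)   # up-projection to values
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        c_kv = self.w_dkv(x)                                   # (b, t, d_latent) -- the only tensor cached
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_uk(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Causal masking only on the no-cache (prefill) path; single-token
        # decoding with a cache attends to all cached positions.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.w_o(out), c_kv                             # return the latent as the new cache
```

With this layout the cache per token is d_latent values instead of 2 * n_heads * d_head, which is where the memory saving comes from.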

To-do List

  • Implement standard Multi-head Attention

    • With Default
    • With Standard RoPE
    • With Decoupled RoPE
  • Implement Multi-head Latent Attention

    • With Default
    • With Standard RoPE
    • With Decoupled RoPE (see the sketch after this list)
  • Training script
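
For the decoupled RoPE items above, the sketch below illustrates (under assumed names and shapes, not this repository's code) how the rotary branch is kept separate from the latent-derived content branch: a small per-head rotary query and a single rotary key shared across heads carry the position signal, and their dot product is added to the content scores. The helpers rope and decoupled_scores and the dimension d_rope are hypothetical.

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings along the last dim (split-half pairing)."""
    b, h, t, d = x.shape
    half = d // 2
    pos = torch.arange(t, device=x.device, dtype=x.dtype)
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos[:, None] * freqs[None, :]                    # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def decoupled_scores(q_c, k_c, q_r, k_r):
    """Attention logits with a decoupled rotary branch.

    q_c, k_c: content parts reconstructed from the cached latent, (b, n_heads, t, d_head).
    q_r:      per-head rotary queries, (b, n_heads, t, d_rope).
    k_r:      one rotary key shared by all heads, (b, 1, t, d_rope); it carries the
              position signal that cannot be folded into the low-rank latent.
    """
    d_total = q_c.size(-1) + q_r.size(-1)
    scores = q_c @ k_c.transpose(-2, -1) + rope(q_r) @ rope(k_r).transpose(-2, -1)
    return scores / d_total ** 0.5
```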

You can find more implementations here: AnythingFromScratchV2
