[Feature]:  MLA Support

### 🚀 The feature, motivation and pitch

DeepSeek-V2 design **MLA (Multi-head Latent Attention)**, which utilizes low-rank key-value union compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference.

![image](https://github.com/vllm-project/vllm/assets/12186261/8bad7132-ec0b-4abd-9300-efc9caf2c04e)

Can VLLM support MLA for accelerated inference?

@misc{deepseek-v2,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
  year   = {2024},
  note   = {GitHub repository},
  url    = {https://github.com/deepseek-ai/deepseek-v2}
  }

https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf

### Alternatives

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Feature]: MLA Support #4625

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Feature]: MLA Support #4625

Description

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions