Status: Closed as not planned
Labels: feature request (New feature or request), stale (Over 90 days of inactivity)
Description
🚀 The feature, motivation and pitch
DeepSeek-V2 introduces MLA (Multi-head Latent Attention), which uses low-rank joint compression of keys and values to eliminate the inference-time key-value cache bottleneck, enabling efficient inference.
Can vLLM support MLA for accelerated inference?
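For context, the core of MLA is that only a small shared latent vector is cached per token, and the full per-head keys and values are reconstructed from it at attention time. Below is a minimal PyTorch sketch of that low-rank joint KV compression; the module name (`MLAKVCompression`) and default dimensions are illustrative assumptions based on the tech report's reported sizes, not the DeepSeek-V2 or vLLM implementation, and the paper's decoupled RoPE key path is omitted.

```python
# Illustrative sketch of MLA-style low-rank joint KV compression.
# Names and dimensions are assumptions, not the real implementation.
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Cache only a d_latent-dim latent per token instead of full
    per-head keys/values; reconstruct K and V from it on the fly."""

    def __init__(self, d_model=5120, n_heads=128, d_head=128, d_latent=512):
        super().__init__()
        # Down-projection: this latent is all that enters the KV cache.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections recover per-head keys and values from the latent.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def compress(self, hidden):
        # hidden: [batch, seq, d_model] -> latent: [batch, seq, d_latent]
        return self.w_down_kv(hidden)

    def expand(self, latent):
        # latent -> keys/values, each [batch, seq, n_heads, d_head]
        b, s, _ = latent.shape
        k = self.w_up_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.w_up_v(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

mla = MLAKVCompression()
h = torch.randn(1, 16, 5120)   # hidden states for 16 tokens
latent = mla.compress(h)       # cache this: [1, 16, 512]
k, v = mla.expand(latent)      # reconstruct: [1, 16, 128, 128] each
```

With these example sizes, the per-token cache shrinks from 2 × 128 × 128 = 32768 values (keys plus values) to 512, roughly a 64× reduction, which is what makes large-batch inference so much cheaper.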
@misc{deepseek-v2,
author = {DeepSeek-AI},
title = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
year = {2024},
note = {GitHub repository},
url = {https://github.com/deepseek-ai/deepseek-v2}
}
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf
Alternatives
No response
Additional context
No response