Add Q2_0 and Q2_1 quantization support to ggml:
- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement reference scalar quantization and dequantization routines
- I suspect we might have to use `QK == 16` in this case to compensate for further accuracy losses
- Add SIMD support for a specific architecture - investigate the best strategy to perform the `ggml_vec_dot_q2()` computation
- No need to implement `ggml_vec_mad_q2()` - these will be deprecated soon
- Compute perplexity scores
The expected model sizes for 7B and `QK == 16` are:

- `Q2_0` - 3.2 GB

For `QK == 32` we have:

- `Q2_0` - 2.4 GB
- `Q2_1` - 3.2 GB
Before you send me papers showing that 2-bit quantization does not work - no need. I want this supported anyway; I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.