Conversation

@EAddario (Contributor) commented Aug 24, 2025

This PR introduces a new option, --target-bpw, implementing an optimised quant-type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm (a simplified sketch follows the list):

  • builds a candidate set of quant types (K or IQ types)
  • for each layer/tensor, simulates quantise→dequantise for each candidate type and estimates the error using a weighted MSE error function. If the imatrix includes activations, a bias penalty term is added to better reflect forward-pass impact, making the error estimate, and therefore the quant-type selection, more accurate
  • filters the candidates down to the Pareto frontier (lowest error for a given size), then starts from the smallest-bpw mix and upgrades to larger formats, choosing the best error reduction per added bit, until the global bpw budget is reached
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass then uses. If the minimum achievable bpw already exceeds the target, it returns that minimum.
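
For those curious about the mechanics, here is a minimal sketch of the idea under simplified assumptions: plain std containers instead of ggml tensors, the activation bias penalty folded into the error value, and made-up names throughout (candidate, tensor_info, weighted_mse, pareto_frontier, allocate). It is illustrative only, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct candidate {
    std::string type; // e.g. "Q3_K", "Q4_K", "IQ4_NL", ...
    double      bpw;  // bits per weight of this quant type
    double      err;  // estimated error after quantise -> dequantise (incl. any bias penalty)
};

struct tensor_info {
    std::string            name;
    int64_t                n_elem;
    std::vector<candidate> cands; // one entry per evaluated quant type
};

// Error proxy for one tensor: importance-weighted MSE between the original values and
// the quantise->dequantise round trip; w holds per-column importance from the imatrix.
static double weighted_mse(const std::vector<float> & x, const std::vector<float> & x_dq,
                           const std::vector<float> & w) {
    double err = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        const double d = (double) x[i] - (double) x_dq[i];
        err += (double) w[i % w.size()] * d * d;
    }
    return err;
}

// Keep only the Pareto frontier: in order of increasing size, only candidates that
// strictly lower the error survive.
static std::vector<candidate> pareto_frontier(std::vector<candidate> cands) {
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.bpw < b.bpw; });
    std::vector<candidate> front;
    double best = 1e300;
    for (const auto & c : cands) {
        if (c.err < best) {
            front.push_back(c);
            best = c.err;
        }
    }
    return front;
}

// Greedy allocation: start every tensor at its smallest candidate, then repeatedly apply
// the upgrade with the best error reduction per added bit while staying within the global
// bpw budget. Returns tensor name -> chosen type overrides.
static std::vector<std::pair<std::string, std::string>>
allocate(std::vector<tensor_info> tensors, double target_bpw) {
    double total_w = 0.0;
    for (auto & t : tensors) {
        t.cands  = pareto_frontier(t.cands);
        total_w += (double) t.n_elem;
    }
    std::vector<size_t> pick(tensors.size(), 0); // index into each tensor's frontier
    double bits = 0.0;
    for (const auto & t : tensors) {
        bits += t.cands.front().bpw * t.n_elem; // smallest mix; may already exceed the target
    }
    while (true) {
        double best_gain = 0.0;
        size_t best_i    = SIZE_MAX;
        for (size_t i = 0; i < tensors.size(); ++i) {
            if (pick[i] + 1 >= tensors[i].cands.size()) {
                continue; // already at the largest candidate
            }
            const candidate & cur = tensors[i].cands[pick[i]];
            const candidate & nxt = tensors[i].cands[pick[i] + 1];
            const double add = (nxt.bpw - cur.bpw) * tensors[i].n_elem;
            if ((bits + add) / total_w > target_bpw) {
                continue; // this upgrade would blow the budget
            }
            const double gain = (cur.err - nxt.err) / add; // error reduction per added bit
            if (gain > best_gain) {
                best_gain = gain;
                best_i    = i;
            }
        }
        if (best_i == SIZE_MAX) {
            break; // no affordable upgrade left
        }
        bits += (tensors[best_i].cands[pick[best_i] + 1].bpw -
                 tensors[best_i].cands[pick[best_i]].bpw) * tensors[best_i].n_elem;
        pick[best_i]++;
    }
    std::vector<std::pair<std::string, std::string>> overrides;
    for (size_t i = 0; i < tensors.size(); ++i) {
        overrides.emplace_back(tensors[i].name, tensors[i].cands[pick[i]].type);
    }
    return overrides;
}
```

In this sketch an upgrade is only applied if it keeps the running average within the budget, and if even the smallest mix is above the target the loop simply returns that minimum, matching the behaviour described above.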

The target_bpw_type() function considers all quantisable tensors (e.g. embeddings, output, etc.) unless --output-tensor-type, --token-embedding-type, and/or --tensor-type are also used, in which case those options take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only consider the remaining tensors against the total bpw budget.
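
For illustration (the layer indices and the q6_k override are chosen arbitrarily), the options compose in a single run, e.g. llama-quantize --imatrix imatrix-with-activations.gguf --output-tensor-type q6_k --prune-layers 20,21 --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m, where the output tensor keeps the explicit q6_k, layers 20 and 21 are pruned, and the bpw budget is spread over the remaining tensors.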

Important note:

An imatrix that includes activations is required for the algorithm to work. At the time of writing, this is only available by generating the file using #14891 with the --output-format gguf option.
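
For reference, producing such a file with that branch would look roughly like llama-imatrix -m LLM-Model-F16.gguf -f calibration_dataset.txt -o imatrix-with-activations.gguf --output-format gguf (exact flags may differ as #14891 evolves).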

Typical usage: llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m

Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed

@netrunnereve (Collaborator) commented Aug 25, 2025

This is a very interesting idea and makes me think of video compression. In video we can use a variable bitrate algorithm that allocates more bits to scenes with lots of detail and fewer bits for, say, a still image, all while targeting a preset bitrate.

I'm just thinking here but maybe in the future we can consider performance as well and automatically juggle error and speed with some sort of slider like what they have for video.

(screenshot: speed/quality preset slider from a video encoder)

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error possible quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error Aug 25, 2025
@EAddario (Contributor, Author) commented:

Sharing some like-for-like test results showing that, in the majority of cases, this approach produces better quality models than naive quantisation (i.e. simply running standard llama-quantize with no further optimisations).

To reduce the duration of the tests, I have chosen two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (typical Mixture of Experts).

The test protocol for each is:

  1. Generate Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, IQ4_NL, IQ3_M, and IQ3_S naive quantisations (e.g. llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>)
  2. Determine each naive model's bits per weight (bpw). This can be easily done using python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf
  3. Generate equivalent quant types by setting --target-bpw to the corresponding bpw values (e.g. llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>)
  4. Calculate quality scores via llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence

Llama-3.2-1B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IQ3_M | 4.2042 | 4.2058 | 11.21441 | 13.08066 | 97.46% | 94.62% | 0.14661 | 0.29047 |
| IQ3_S | 4.1177 | 4.1191 | 11.41846 | 14.10772 | 97.08% | 93.22% | 0.16744 | 0.36883 |
| IQ4_NL | 4.9535 | 4.9542 | 10.10609 | 9.98096 | 99.19% | 99.41% | 0.04641 | 0.03356 |
| Q3_K_L | 4.6913 | 4.6894 | 10.74840 | 10.30599 | 98.10% | 98.83% | 0.10510 | 0.06738 |
| Q3_K_M | 4.4215 | 4.4184 | 10.97909 | 10.42277 | 97.71% | 98.65% | 0.12602 | 0.07920 |
| Q3_K_S | 4.1033 | 4.1037 | 14.19578 | 12.11165 | 92.80% | 95.92% | 0.37986 | 0.22400 |
| Q4_K_M | 5.1779 | 5.1792 | 10.01618 | 9.88781 | 99.34% | 99.54% | 0.03732 | 0.02654 |
| Q4_K_S | 4.9704 | 4.9762 | 10.06778 | 9.97105 | 99.27% | 99.42% | 0.04243 | 0.03350 |
| Q5_K_M | 5.8499 | 5.8521 | 9.75894 | 9.79049 | 99.80% | 99.73% | 0.01128 | 0.01620 |
| Q5_K_S | 5.7273 | 5.7291 | 9.76039 | 9.79663 | 99.80% | 99.70% | 0.01135 | 0.01757 |
| Q6_K | 6.5639 | 6.5646 | 9.68812 | 9.68277 | 99.91% | 99.94% | 0.00495 | 0.00354 |
| Q8_0 | 8.5013 | 8.486 | 9.65172 | 9.64781 | 99.99% | 99.99% | 0.00050 | 0.00048 |

Huihui-MoE-1.2B-A0.6B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IQ3_M | 3.9173 | 3.9204 | 27.44670 | 30.17950 | 92.93% | 91.42% | 0.53704 | 0.58776 |
| IQ3_S | 3.8207 | 3.8239 | 29.30734 | 32.94412 | 92.20% | 90.00% | 0.52148 | 0.70061 |
| IQ4_NL | 4.5043 | 4.5092 | 19.55229 | 19.71237 | 98.62% | 98.09% | 0.08709 | 0.13948 |
| Q3_K_L | 4.3883 | 4.3923 | 21.48216 | 20.62434 | 96.80% | 98.20% | 0.20565 | 0.12301 |
| Q3_K_M | 4.1221 | 5.0412 | 21.94908 | 18.87276 | 96.43% | 99.20% | 0.23232 | 0.04863 |
| Q3_K_S | 3.8207 | 3.8519 | 26.05622 | 23.81005 | 93.87% | 95.60% | 0.41128 | 0.30752 |
| Q4_K_M | 4.9904 | 5.0412 | 18.91957 | 18.87276 | 99.02% | 99.20% | 0.05888 | 0.04863 |
| Q4_K_S | 4.7793 | 4.7826 | 19.12118 | 19.25212 | 98.89% | 99.02% | 0.06898 | 0.06238 |
| Q5_K_M | 5.7541 | 5.7950 | 18.28129 | 18.31989 | 99.66% | 99.70% | 0.01778 | 0.01531 |
| Q5_K_S | 5.6323 | 5.6342 | 18.38359 | 18.37216 | 99.63% | 99.65% | 0.02013 | 0.01884 |
| Q6_K | 6.5655 | 6.5693 | 18.20380 | 18.19202 | 99.80% | 99.81% | 0.00776 | 0.00725 |
| Q8_0 | 8.5028 | 8.5071 | 18.09292 | 18.08959 | 99.90% | 99.90% | 0.00094 | 0.00090 |

PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better; In bold: best quality

Note

Although these are very encouraging results, more testing with different model architectures and sizes will be required before categorically concluding this functionality consistently yields higher quality models.

Comments, feedback and, in particular, bug reports are very much welcome

@ThiloteE (Contributor) commented Aug 30, 2025

This has an effect on the file size of the quants too, right?

@EAddario (Contributor, Author) replied:

That's correct @ThiloteE, file size is directly and proportionally influenced by the chosen bpw
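
As a rough rule of thumb (ignoring metadata and any tensors kept at higher precision), file size ≈ parameter count × bpw / 8 bytes; e.g. an 8B-parameter model at 5.18 bpw works out to about 8e9 × 5.18 / 8 ≈ 5.2 GB.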


auto name_tn = LLM_TN(model.arch);
auto can_quantize = [&](const ggml_tensor * t) -> bool {
// This list should be kept in sync with llama_tensor_quantize_impl() to avoid drift
A Collaborator commented on the snippet above:

A common function could be used by this and llama_tensor_quantize_impl to avoid drift.

@EAddario (Contributor, Author) replied:

I usually try not to modify existing logic unless it's absolutely necessary, and prefer to implement changes as self-contained functions, but agree this would be better handled by having a single function. I'll change accordingly
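
For what it's worth, a minimal sketch of what that shared predicate could look like (the helper name is hypothetical and the eligibility checks are heavily simplified compared to the real ones in llama_tensor_quantize_impl()):

```cpp
#include "ggml.h"

// Hypothetical shared helper so target_bpw_type() and llama_tensor_quantize_impl()
// apply the same eligibility rules and cannot drift apart (checks simplified here).
static bool llama_tensor_is_quantizable(const struct ggml_tensor * t) {
    // e.g. only 2D, contiguous weight matrices are candidates for the k-/i-quant types
    return ggml_n_dims(t) == 2 && ggml_is_contiguous(t);
}
```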

@EAddario (Contributor, Author) commented:

The latest version introduces a new candidate-selection algorithm, producing models that land much closer to the bpw target and, in the large majority of the models tested, with quality metrics that are better than (or very close to) the equivalent naive version:

ERNIE-4.5-21B-A3B-PT

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-naive | 4.8543 | 6.358778 | 99.48% | 0.020945 | 93.72% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-mse | 4.8600 | 6.358717 | 99.57% | 0.017668 | 94.04% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-fast | 4.8600 | 6.361663 | 99.57% | 0.017631 | 94.05% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-precise | 4.8600 | 6.361663 | 99.57% | 0.017631 | 94.05% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-naive | 4.5325 | 6.373562 | 99.40% | 0.025309 | 93.14% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-mse | 4.5382 | 6.350721 | 99.44% | 0.023833 | 93.18% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-fast | 4.5382 | 6.360253 | 99.45% | 0.023487 | 93.29% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-precise | 4.5382 | 6.360253 | 99.45% | 0.023487 | 93.29% |

Huihui-MoE-1.2B-A0.6B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-naive | 4.9904 | 18.919566 | 99.02% | 0.058880 | 87.35% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-mse | 4.9934 | 18.790209 | 99.19% | 0.047507 | 88.43% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-fast | 4.9934 | 19.042245 | 99.11% | 0.056029 | 87.12% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-precise | 4.9928 | 18.850755 | 99.20% | 0.047156 | 88.47% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-naive | 4.7539 | 19.547184 | 98.62% | 0.087197 | 84.89% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-mse | 4.7568 | 19.164889 | 98.99% | 0.064329 | 86.78% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-fast | 4.7563 | 19.123570 | 98.88% | 0.069254 | 86.30% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-precise | 4.7568 | 19.203988 | 98.87% | 0.070200 | 86.04% |

Huihui-MoE-5B-A1.7B-abliterated

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-naive | 5.1797 | 14.744046 | 99.21% | 0.031757 | 92.07% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-mse | 5.1815 | 14.833509 | 99.20% | 0.032450 | 92.01% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-fast | 5.1812 | 14.879478 | 99.19% | 0.032689 | 91.96% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-precise | 5.1813 | 14.851407 | 99.20% | 0.032571 | 92.00% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-naive | 4.6252 | 14.911541 | 98.58% | 0.065290 | 89.03% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-mse | 4.6270 | 14.917422 | 98.86% | 0.050793 | 90.09% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-fast | 4.6270 | 14.787866 | 98.80% | 0.051434 | 90.12% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-precise | 4.6270 | 14.825255 | 98.79% | 0.051980 | 90.09% |

Llama-3.1-8B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Q4_K_M-naive | 5.1777 | 6.291397 | 99.47% | 0.024019 | 93.18% |
| Llama-3.1-8B-Q4_K_M-mse | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-Q4_K_M-fast | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-Q4_K_M-precise | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-IQ4_NL-naive | 4.6526 | 6.321343 | 99.33% | 0.030539 | 92.22% |
| Llama-3.1-8B-IQ4_NL-mse | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |
| Llama-3.1-8B-IQ4_NL-fast | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |
| Llama-3.1-8B-IQ4_NL-precise | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |

Llama-3.2-1B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Q4_K_M-naive | 5.1779 | 10.022728 | 99.34% | 0.037496 | 90.20% |
| Llama-3.2-1B-Q4_K_M-mse | 5.1790 | 9.875390 | 99.54% | 0.026770 | 91.43% |
| Llama-3.2-1B-Q4_K_M-fast | 5.1790 | 9.864875 | 99.55% | 0.026093 | 91.55% |
| Llama-3.2-1B-Q4_K_M-precise | 5.1790 | 9.864875 | 99.55% | 0.026093 | 91.55% |
| Llama-3.2-1B-IQ4_NL-naive | 4.9535 | 10.094483 | 99.18% | 0.046429 | 88.96% |
| Llama-3.2-1B-IQ4_NL-mse | 4.9542 | 9.967031 | 99.41% | 0.033563 | 90.52% |
| Llama-3.2-1B-IQ4_NL-fast | 4.9542 | 9.961903 | 99.42% | 0.033207 | 90.54% |
| Llama-3.2-1B-IQ4_NL-precise | 4.9542 | 9.961903 | 99.42% | 0.033207 | 90.54% |

NVIDIA-Nemotron-Nano-9B-v2

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-naive | 5.8664 | 7.812578 | 99.83% | 0.008216 | 95.49% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-mse | 5.8722 | 7.828592 | 99.79% | 0.011882 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-fast | 5.8722 | 7.830588 | 99.79% | 0.011959 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-precise | 5.8722 | 7.829907 | 99.79% | 0.012017 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-naive | 4.7711 | 7.884677 | 99.68% | 0.016280 | 93.65% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-mse | 4.7778 | 7.904368 | 99.61% | 0.021356 | 91.92% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-fast | 4.7772 | 7.905187 | 99.61% | 0.020880 | 92.06% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-precise | 4.7771 | 7.903981 | 99.61% | 0.021088 | 92.00% |

NVIDIA-Nemotron-Nano-12B-v2

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-naive | 4.8654 | 6.514963 | 99.70% | 0.012792 | 94.79% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-mse | 4.8702 | 6.542849 | 99.64% | 0.016636 | 93.85% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-fast | 4.8705 | 6.536551 | 99.67% | 0.014806 | 94.11% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-precise | 4.8705 | 6.535902 | 99.68% | 0.014691 | 94.15% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-naive | 4.6177 | 6.544873 | 99.60% | 0.018025 | 93.87% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-mse | 4.7778 | 7.904368 | 99.60% | 0.021356 | 91.92% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-fast | 4.6228 | 6.520408 | 99.65% | 0.015356 | 94.29% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-precise | 4.6228 | 6.520723 | 99.65% | 0.015515 | 94.23% |

Qwen3-14B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B-Q4_K_M-naive | 4.8730 | 8.451932 | 99.47% | 0.016716 | 94.68% |
| Qwen3-14B-Q4_K_M-mse | 4.8735 | 8.432478 | 99.52% | 0.016641 | 94.40% |
| Qwen3-14B-Q4_K_M-fast | 4.8735 | 8.432038 | 99.53% | 0.016474 | 94.39% |
| Qwen3-14B-Q4_K_M-precise | 4.8735 | 8.432038 | 99.53% | 0.016474 | 94.39% |
| Qwen3-14B-IQ4_NL-naive | 4.6236 | 8.381749 | 99.35% | 0.022481 | 93.73% |
| Qwen3-14B-IQ4_NL-mse | 4.6243 | 8.371418 | 99.42% | 0.021300 | 93.71% |
| Qwen3-14B-IQ4_NL-fast | 4.6243 | 8.364448 | 99.43% | 0.020910 | 93.77% |
| Qwen3-14B-IQ4_NL-precise | 4.6243 | 8.364448 | 99.43% | 0.020910 | 93.77% |

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest error Sep 16, 2025