Conversation

@EAddario (Contributor) commented Aug 24, 2025

This PR introduces a new option, --target-bpw, implementing an optimised quant-type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm (a simplified sketch follows the list):

  • builds a candidate set of quant types (K or IQ types)
  • for each layer/tensor, simulates quantise→dequantise for each candidate type and estimates the error using a weighted MSE error function. If the imatrix includes activations, a bias penalty term is added to better reflect forward-pass impact, making the error estimate, and therefore the quant-type selection, more accurate
  • filters the candidates down to the Pareto frontier (lowest error for a given size), then starts from the smallest-bpw mix and upgrades to larger formats, choosing the best error reduction per added bit, until the global bpw budget is reached
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass then uses. If the minimum achievable bpw already exceeds the target, it returns that minimum.
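
For those curious about the mechanics, here is a minimal sketch of the idea under simplified assumptions: plain std containers instead of ggml tensors, the activation bias penalty folded into the error value, and made-up names throughout (candidate, tensor_info, weighted_mse, pareto_frontier, allocate). It is illustrative only, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct candidate {
    std::string type; // e.g. "Q3_K", "Q4_K", "IQ4_NL", ...
    double      bpw;  // bits per weight of this quant type
    double      err;  // estimated error after quantise -> dequantise (incl. any bias penalty)
};

struct tensor_info {
    std::string            name;
    int64_t                n_elem;
    std::vector<candidate> cands; // one entry per evaluated quant type
};

// Error proxy for one tensor: importance-weighted MSE between the original values and
// the quantise->dequantise round trip; w holds per-column importance from the imatrix.
static double weighted_mse(const std::vector<float> & x, const std::vector<float> & x_dq,
                           const std::vector<float> & w) {
    double err = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        const double d = (double) x[i] - (double) x_dq[i];
        err += (double) w[i % w.size()] * d * d;
    }
    return err;
}

// Keep only the Pareto frontier: in order of increasing size, only candidates that
// strictly lower the error survive.
static std::vector<candidate> pareto_frontier(std::vector<candidate> cands) {
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.bpw < b.bpw; });
    std::vector<candidate> front;
    double best = 1e300;
    for (const auto & c : cands) {
        if (c.err < best) {
            front.push_back(c);
            best = c.err;
        }
    }
    return front;
}

// Greedy allocation: start every tensor at its smallest candidate, then repeatedly apply
// the upgrade with the best error reduction per added bit while staying within the global
// bpw budget. Returns tensor name -> chosen type overrides.
static std::vector<std::pair<std::string, std::string>>
allocate(std::vector<tensor_info> tensors, double target_bpw) {
    double total_w = 0.0;
    for (auto & t : tensors) {
        t.cands  = pareto_frontier(t.cands);
        total_w += (double) t.n_elem;
    }
    std::vector<size_t> pick(tensors.size(), 0); // index into each tensor's frontier
    double bits = 0.0;
    for (const auto & t : tensors) {
        bits += t.cands.front().bpw * t.n_elem; // smallest mix; may already exceed the target
    }
    while (true) {
        double best_gain = 0.0;
        size_t best_i    = SIZE_MAX;
        for (size_t i = 0; i < tensors.size(); ++i) {
            if (pick[i] + 1 >= tensors[i].cands.size()) {
                continue; // already at the largest candidate
            }
            const candidate & cur = tensors[i].cands[pick[i]];
            const candidate & nxt = tensors[i].cands[pick[i] + 1];
            const double add = (nxt.bpw - cur.bpw) * tensors[i].n_elem;
            if ((bits + add) / total_w > target_bpw) {
                continue; // this upgrade would blow the budget
            }
            const double gain = (cur.err - nxt.err) / add; // error reduction per added bit
            if (gain > best_gain) {
                best_gain = gain;
                best_i    = i;
            }
        }
        if (best_i == SIZE_MAX) {
            break; // no affordable upgrade left
        }
        bits += (tensors[best_i].cands[pick[best_i] + 1].bpw -
                 tensors[best_i].cands[pick[best_i]].bpw) * tensors[best_i].n_elem;
        pick[best_i]++;
    }
    std::vector<std::pair<std::string, std::string>> overrides;
    for (size_t i = 0; i < tensors.size(); ++i) {
        overrides.emplace_back(tensors[i].name, tensors[i].cands[pick[i]].type);
    }
    return overrides;
}
```

In this sketch an upgrade is only applied if it keeps the running average within the budget, and if even the smallest mix is above the target the loop simply returns that minimum, matching the behaviour described above.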

The target_bpw_type() function considers all quantisable tensors (e.g. embeddings, output, etc.) unless --output-tensor-type, --token-embedding-type, and/or --tensor-type are also used, in which case those options take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only consider the remaining tensors against the total bpw budget.
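
For illustration (the layer indices and the q6_k override are chosen arbitrarily), the options compose in a single run, e.g. llama-quantize --imatrix imatrix-with-activations.gguf --output-tensor-type q6_k --prune-layers 20,21 --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m, where the output tensor keeps the explicit q6_k, layers 20 and 21 are pruned, and the bpw budget is spread over the remaining tensors.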

Important note:

An imatrix that includes activations is required for the algorithm to work. At the time of writing, this is only available by generating the file using #14891 with the --output-format gguf option.
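
For reference, producing such a file with that branch would look roughly like llama-imatrix -m LLM-Model-F16.gguf -f calibration_dataset.txt -o imatrix-with-activations.gguf --output-format gguf (exact flags may differ as #14891 evolves).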

Typical usage: llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m

Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed

@netrunnereve (Collaborator) commented Aug 25, 2025

This is a very interesting idea and makes me think of video compression. In video we can use a variable bitrate algorithm that allocates more bits to scenes with lots of detail and fewer bits for, say, a still image, all while targeting a preset bitrate.

I'm just thinking here but maybe in the future we can consider performance as well and automatically juggle error and speed with some sort of slider like what they have for video.

(screenshot: speed/quality preset slider from a video encoder)

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error possible quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error Aug 25, 2025
@EAddario (Contributor, Author) commented:

Sharing some like-for-like test results showing that, in the majority of cases, this approach produces better quality models than naive quantisation (i.e. simply running standard llama-quantize with no further optimisations).

To reduce the duration of the tests, I have chosen two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (typical Mixture of Experts).

The test protocol for each is:

  1. Generate Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, IQ4_NL, IQ3_M, and IQ3_S naive quantisations (e.g. llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>)
  2. Determine each naive model's bits per weight (bpw). This can be easily done using python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf
  3. Generate equivalent quant types by setting --target-bpw to the corresponding bpw values (e.g. llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>)
  4. Calculate quality scores via llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence

Llama-3.2-1B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IQ3_M | 4.2042 | 4.2058 | 11.21441 | 13.08066 | 97.46% | 94.62% | 0.14661 | 0.29047 |
| IQ3_S | 4.1177 | 4.1191 | 11.41846 | 14.10772 | 97.08% | 93.22% | 0.16744 | 0.36883 |
| IQ4_NL | 4.9535 | 4.9542 | 10.10609 | 9.98096 | 99.19% | 99.41% | 0.04641 | 0.03356 |
| Q3_K_L | 4.6913 | 4.6894 | 10.74840 | 10.30599 | 98.10% | 98.83% | 0.10510 | 0.06738 |
| Q3_K_M | 4.4215 | 4.4184 | 10.97909 | 10.42277 | 97.71% | 98.65% | 0.12602 | 0.07920 |
| Q3_K_S | 4.1033 | 4.1037 | 14.19578 | 12.11165 | 92.80% | 95.92% | 0.37986 | 0.22400 |
| Q4_K_M | 5.1779 | 5.1792 | 10.01618 | 9.88781 | 99.34% | 99.54% | 0.03732 | 0.02654 |
| Q4_K_S | 4.9704 | 4.9762 | 10.06778 | 9.97105 | 99.27% | 99.42% | 0.04243 | 0.03350 |
| Q5_K_M | 5.8499 | 5.8521 | 9.75894 | 9.79049 | 99.80% | 99.73% | 0.01128 | 0.01620 |
| Q5_K_S | 5.7273 | 5.7291 | 9.76039 | 9.79663 | 99.80% | 99.70% | 0.01135 | 0.01757 |
| Q6_K | 6.5639 | 6.5646 | 9.68812 | 9.68277 | 99.91% | 99.94% | 0.00495 | 0.00354 |
| Q8_0 | 8.5013 | 8.486 | 9.65172 | 9.64781 | 99.99% | 99.99% | 0.00050 | 0.00048 |

Huihui-MoE-1.2B-A0.6B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IQ3_M | 3.9173 | 3.9204 | 27.44670 | 30.17950 | 92.93% | 91.42% | 0.53704 | 0.58776 |
| IQ3_S | 3.8207 | 3.8239 | 29.30734 | 32.94412 | 92.20% | 90.00% | 0.52148 | 0.70061 |
| IQ4_NL | 4.5043 | 4.5092 | 19.55229 | 19.71237 | 98.62% | 98.09% | 0.08709 | 0.13948 |
| Q3_K_L | 4.3883 | 4.3923 | 21.48216 | 20.62434 | 96.80% | 98.20% | 0.20565 | 0.12301 |
| Q3_K_M | 4.1221 | 5.0412 | 21.94908 | 18.87276 | 96.43% | 99.20% | 0.23232 | 0.04863 |
| Q3_K_S | 3.8207 | 3.8519 | 26.05622 | 23.81005 | 93.87% | 95.60% | 0.41128 | 0.30752 |
| Q4_K_M | 4.9904 | 5.0412 | 18.91957 | 18.87276 | 99.02% | 99.20% | 0.05888 | 0.04863 |
| Q4_K_S | 4.7793 | 4.7826 | 19.12118 | 19.25212 | 98.89% | 99.02% | 0.06898 | 0.06238 |
| Q5_K_M | 5.7541 | 5.7950 | 18.28129 | 18.31989 | 99.66% | 99.70% | 0.01778 | 0.01531 |
| Q5_K_S | 5.6323 | 5.6342 | 18.38359 | 18.37216 | 99.63% | 99.65% | 0.02013 | 0.01884 |
| Q6_K | 6.5655 | 6.5693 | 18.20380 | 18.19202 | 99.80% | 99.81% | 0.00776 | 0.00725 |
| Q8_0 | 8.5028 | 8.5071 | 18.09292 | 18.08959 | 99.90% | 99.90% | 0.00094 | 0.00090 |

PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better; In bold: best quality

Note

Although these are very encouraging results, more testing with different model architectures and sizes will be required before categorically concluding this functionality consistently yields higher quality models.

Comments, feedback and, in particular, bug reports are very much welcome

@ThiloteE (Contributor) commented Aug 30, 2025

This has an effect on the file size of the quants too, right?

@EAddario (Contributor, Author) replied:

That's correct @ThiloteE, file size is directly and proportionally influenced by the chosen bpw
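
As a rough rule of thumb (ignoring metadata and any tensors kept at higher precision), file size ≈ parameter count × bpw / 8 bytes; e.g. an 8B-parameter model at 5.18 bpw works out to about 8e9 × 5.18 / 8 ≈ 5.2 GB.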


auto name_tn = LLM_TN(model.arch);
auto can_quantize = [&](const ggml_tensor * t) -> bool {
// This list should be kept in sync with llama_tensor_quantize_impl() to avoid drift
A Collaborator commented on the snippet above:

A common function could be used by this and llama_tensor_quantize_impl to avoid drift.

@EAddario (Contributor, Author) replied:

I usually try not to modify existing logic unless it's absolutely necessary, and prefer to implement changes as self-contained functions, but agree this would be better handled by having a single function. I'll change accordingly
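
For what it's worth, a minimal sketch of what that shared predicate could look like (the helper name is hypothetical and the eligibility checks are heavily simplified compared to the real ones in llama_tensor_quantize_impl()):

```cpp
#include "ggml.h"

// Hypothetical shared helper so target_bpw_type() and llama_tensor_quantize_impl()
// apply the same eligibility rules and cannot drift apart (checks simplified here).
static bool llama_tensor_is_quantizable(const struct ggml_tensor * t) {
    // e.g. only 2D, contiguous weight matrices are candidates for the k-/i-quant types
    return ggml_n_dims(t) == 2 && ggml_is_contiguous(t);
}
```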

@EAddario (Contributor, Author) commented:

The latest version introduces a new candidate-selection algorithm, producing models that land much closer to the bpw target and, in the large majority of the models tested, with quality metrics that are better than (or very close to) the equivalent naive version:

ERNIE-4.5-21B-A3B-PT

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-naive | 4.8543 | 6.358778 | 99.48% | 0.020945 | 93.72% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-mse | 4.8600 | 6.358717 | 99.57% | 0.017668 | 94.04% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-fast | 4.8600 | 6.361663 | 99.57% | 0.017631 | 94.05% |
| ERNIE-4.5-21B-A3B-PT-Q4_K_M-precise | 4.8600 | 6.361663 | 99.57% | 0.017631 | 94.05% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-naive | 4.5325 | 6.373562 | 99.40% | 0.025309 | 93.14% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-mse | 4.5382 | 6.350721 | 99.44% | 0.023833 | 93.18% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-fast | 4.5382 | 6.360253 | 99.45% | 0.023487 | 93.29% |
| ERNIE-4.5-21B-A3B-PT-IQ4_NL-precise | 4.5382 | 6.360253 | 99.45% | 0.023487 | 93.29% |

Huihui-MoE-1.2B-A0.6B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-naive | 4.9904 | 18.919566 | 99.02% | 0.058880 | 87.35% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-mse | 4.9934 | 18.790209 | 99.19% | 0.047507 | 88.43% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-fast | 4.9934 | 19.042245 | 99.11% | 0.056029 | 87.12% |
| Huihui-MoE-1.2B-A0.6B-Q4_K_M-precise | 4.9928 | 18.850755 | 99.20% | 0.047156 | 88.47% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-naive | 4.7539 | 19.547184 | 98.62% | 0.087197 | 84.89% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-mse | 4.7568 | 19.164889 | 98.99% | 0.064329 | 86.78% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-fast | 4.7563 | 19.123570 | 98.88% | 0.069254 | 86.30% |
| Huihui-MoE-1.2B-A0.6B-IQ4_NL-precise | 4.7568 | 19.203988 | 98.87% | 0.070200 | 86.04% |

Huihui-MoE-5B-A1.7B-abliterated

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-naive | 5.1797 | 14.744046 | 99.21% | 0.031757 | 92.07% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-mse | 5.1815 | 14.833509 | 99.20% | 0.032450 | 92.01% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-fast | 5.1812 | 14.879478 | 99.19% | 0.032689 | 91.96% |
| Huihui-MoE-5B-A1.7B-abliterated-Q4_K_M-precise | 5.1813 | 14.851407 | 99.20% | 0.032571 | 92.00% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-naive | 4.6252 | 14.911541 | 98.58% | 0.065290 | 89.03% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-mse | 4.6270 | 14.917422 | 98.86% | 0.050793 | 90.09% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-fast | 4.6270 | 14.787866 | 98.80% | 0.051434 | 90.12% |
| Huihui-MoE-5B-A1.7B-abliterated-IQ4_NL-precise | 4.6270 | 14.825255 | 98.79% | 0.051980 | 90.09% |

Llama-3.1-8B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Q4_K_M-naive | 5.1777 | 6.291397 | 99.47% | 0.024019 | 93.18% |
| Llama-3.1-8B-Q4_K_M-mse | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-Q4_K_M-fast | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-Q4_K_M-precise | 5.1781 | 6.243892 | 99.64% | 0.016981 | 93.88% |
| Llama-3.1-8B-IQ4_NL-naive | 4.6526 | 6.321343 | 99.33% | 0.030539 | 92.22% |
| Llama-3.1-8B-IQ4_NL-mse | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |
| Llama-3.1-8B-IQ4_NL-fast | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |
| Llama-3.1-8B-IQ4_NL-precise | 4.6534 | 6.308061 | 99.40% | 0.027789 | 92.35% |

Llama-3.2-1B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Q4_K_M-naive | 5.1779 | 10.022728 | 99.34% | 0.037496 | 90.20% |
| Llama-3.2-1B-Q4_K_M-mse | 5.1790 | 9.875390 | 99.54% | 0.026770 | 91.43% |
| Llama-3.2-1B-Q4_K_M-fast | 5.1790 | 9.864875 | 99.55% | 0.026093 | 91.55% |
| Llama-3.2-1B-Q4_K_M-precise | 5.1790 | 9.864875 | 99.55% | 0.026093 | 91.55% |
| Llama-3.2-1B-IQ4_NL-naive | 4.9535 | 10.094483 | 99.18% | 0.046429 | 88.96% |
| Llama-3.2-1B-IQ4_NL-mse | 4.9542 | 9.967031 | 99.41% | 0.033563 | 90.52% |
| Llama-3.2-1B-IQ4_NL-fast | 4.9542 | 9.961903 | 99.42% | 0.033207 | 90.54% |
| Llama-3.2-1B-IQ4_NL-precise | 4.9542 | 9.961903 | 99.42% | 0.033207 | 90.54% |

NVIDIA-Nemotron-Nano-9B-v2

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-naive | 5.8664 | 7.812578 | 99.83% | 0.008216 | 95.49% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-mse | 5.8722 | 7.828592 | 99.79% | 0.011882 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-fast | 5.8722 | 7.830588 | 99.79% | 0.011959 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M-precise | 5.8722 | 7.829907 | 99.79% | 0.012017 | 93.62% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-naive | 4.7711 | 7.884677 | 99.68% | 0.016280 | 93.65% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-mse | 4.7778 | 7.904368 | 99.61% | 0.021356 | 91.92% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-fast | 4.7772 | 7.905187 | 99.61% | 0.020880 | 92.06% |
| NVIDIA-Nemotron-Nano-9B-v2-IQ4_NL-precise | 4.7771 | 7.903981 | 99.61% | 0.021088 | 92.00% |

NVIDIA-Nemotron-Nano-12B-v2

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-naive | 4.8654 | 6.514963 | 99.70% | 0.012792 | 94.79% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-mse | 4.8702 | 6.542849 | 99.64% | 0.016636 | 93.85% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-fast | 4.8705 | 6.536551 | 99.67% | 0.014806 | 94.11% |
| NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M-precise | 4.8705 | 6.535902 | 99.68% | 0.014691 | 94.15% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-naive | 4.6177 | 6.544873 | 99.60% | 0.018025 | 93.87% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-mse | 4.7778 | 7.904368 | 99.60% | 0.021356 | 91.92% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-fast | 4.6228 | 6.520408 | 99.65% | 0.015356 | 94.29% |
| NVIDIA-Nemotron-Nano-12B-v2-IQ4_NL-precise | 4.6228 | 6.520723 | 99.65% | 0.015515 | 94.23% |

Qwen3-14B

| Model | BPW | PPL | 𝜌PPL | KLD | Same Top P |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B-Q4_K_M-naive | 4.8730 | 8.451932 | 99.47% | 0.016716 | 94.68% |
| Qwen3-14B-Q4_K_M-mse | 4.8735 | 8.432478 | 99.52% | 0.016641 | 94.40% |
| Qwen3-14B-Q4_K_M-fast | 4.8735 | 8.432038 | 99.53% | 0.016474 | 94.39% |
| Qwen3-14B-Q4_K_M-precise | 4.8735 | 8.432038 | 99.53% | 0.016474 | 94.39% |
| Qwen3-14B-IQ4_NL-naive | 4.6236 | 8.381749 | 99.35% | 0.022481 | 93.73% |
| Qwen3-14B-IQ4_NL-mse | 4.6243 | 8.371418 | 99.42% | 0.021300 | 93.71% |
| Qwen3-14B-IQ4_NL-fast | 4.6243 | 8.364448 | 99.43% | 0.020910 | 93.77% |
| Qwen3-14B-IQ4_NL-precise | 4.6243 | 8.364448 | 99.43% | 0.020910 | 93.77% |

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest error Sep 16, 2025