Adding IQ1_TN - 1.6875 bpw for TriLM ternary models #44
For the Bitnet-1.58b ternary models I had added `IQ1_BN` (1.625 bpw) and `IQ2_BN` (2.0 bpw) quants. But for TriLM I only added `IQ2_TN` (2.0625 bpw). This PR fills the gap by adding the corresponding 1.6875 bpw quantization type `IQ1_TN`.

The matrix multiplication implementation simply reuses the existing `IQ1_BN` implementation. We just need to add the multiplication with the row scale at the end of a vector dot product between a row in the left matrix and a column in the right matrix (in `IQ1_BN` there are no scales in the quantized data, and the scale is applied separately via a `ggml_scale` operation).
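To make that concrete, here is a minimal scalar sketch of the difference. It uses a simplified unpacked representation; `TernaryRow`, `ternary_dot`, and the two `dot_*` helpers are hypothetical illustrations, not the actual ik_llama.cpp kernels:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical unpacked representation: IQ1_BN-style rows store only the
// ternary values {-1, 0, +1}; IQ1_TN additionally carries a per-row scale
// (stored here as a plain float for simplicity).
struct TernaryRow {
    std::vector<int8_t> q; // ternary weights in {-1, 0, +1}
    float d;               // row scale (unused for IQ1_BN-style rows)
};

// Scale-less ternary dot product: the common core shared by both types.
static float ternary_dot(const int8_t * q, const float * x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += q[i] * x[i];
    return sum;
}

// IQ1_BN-style: no scales in the quantized data; the scale is applied
// separately, on the whole result, via a ggml_scale operation.
static float dot_iq1_bn_style(const TernaryRow & row, const float * x, int n) {
    return ternary_dot(row.q.data(), x, n);
}

// IQ1_TN-style: identical inner loop, plus one multiplication by the
// row scale at the end of the dot product.
static float dot_iq1_tn_style(const TernaryRow & row, const float * x, int n) {
    return row.d * ternary_dot(row.q.data(), x, n);
}

int main() {
    const int n = 8;
    TernaryRow row{{1, -1, 0, 1, 0, -1, 1, 1}, 0.25f};
    std::vector<float> x{1, 2, 3, 4, 5, 6, 7, 8};
    printf("IQ1_BN-style (unscaled): %g\n", dot_iq1_bn_style(row, x.data(), n));
    printf("IQ1_TN-style (scaled):   %g\n", dot_iq1_tn_style(row, x.data(), n));
    return 0;
}
```

The inner loop is identical for both types; `IQ1_TN` only adds one multiply per dot product, so its matrix multiplication performance should be essentially the same as `IQ1_BN`'s.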
While adding `IQ1_TN` to the `IQ1_BN` implementation, I noticed an optimization opportunity. As a result, this PR also improves `IQ1_BN` and `IQ2_BN` performance.

As PR-8151 has now been merged in mainline `llama.cpp`, I was curious to compare `IQ1_TN` with the corresponding `TQ1_0`, and `IQ2_TN` with the corresponding `TQ2_0`, in `llama.cpp`. The CPUs used in the comparisons below are Ryzen-7950X (Zen4), Ryzen-5975WX (AVX2) and M2-Max (NEON).
`IQ1_TN` vs `TQ1_0`, 4B TriLM model
`IQ2_TN` vs `TQ2_0`, 4B TriLM model
As `IQ2_BN` PP (prompt processing) performance is better than `IQ1_BN`'s, these tables indicate that my `IQ2_TN` implementation on Zen4/AVX2 is likely not optimal. There also seems to be a bottleneck somewhere for TG (token generation) with more than 8 threads that I need to look into.