Commit 7afb002
IQ1_M: 1.75 bpw quantization (ggml-org#6302)
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
On the very first shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect (the formula is spelled out below)
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
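
Spelled out, purely to make the comparison above concrete: since IQ1_M sits halfway between IQ1_S and IQ2_XXS in bpw, the expected PPL is the geometric mean of the two, i.e. a linear interpolation of log-PPL in bpw:

```latex
\mathrm{PPL}_{\text{expected}}(\mathrm{IQ1\_M})
  = \sqrt{\mathrm{PPL}(\mathrm{IQ1\_S}) \cdot \mathrm{PPL}(\mathrm{IQ2\_XXS})}
  = \exp\!\left(\tfrac{1}{2}\left[\ln \mathrm{PPL}(\mathrm{IQ1\_S}) + \ln \mathrm{PPL}(\mathrm{IQ2\_XXS})\right]\right)
```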
* iq1_m: go to 3-bit scales
There is a slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it (see the sketch after this item).
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
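
For context, 0.0625 bpw is exactly 1/16 of a bit per weight, which is what dropping one bit from a scale shared by 16 weights would save. The snippet below is a hedged sketch of a 256-weight super-block layout that would land at exactly 1.75 bpw for the type itself; the struct name, field names, and group sizes are assumptions for illustration, not the actual ggml definition.

```c
#include <stdint.h>

#define QK_K 256  /* assumed super-block size */

/* Hypothetical layout: 32 + 16 + 8 = 56 bytes per 256 weights,
 * i.e. 56 * 8 / 256 = 1.75 bits per weight. */
typedef struct {
    uint8_t qs[QK_K/8];      /* low bits of the grid/codebook indices          */
    uint8_t qh[QK_K/16];     /* high index bits plus the per-group-of-8 shifts */
    uint8_t scales[QK_K/32]; /* 16 x 3-bit block scales, leaving 16 spare bits
                                that could hold a packed fp16 super-block scale */
} block_iq1_m_sketch;

/* Dropping one bit from each of the 16 block scales saves 16 bits per
 * 256 weights = 0.0625 bpw, matching the size reduction quoted above. */
```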
* iq1_m: scalar dot product
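
For orientation, a scalar reference path for this kind of type walks super-blocks, decodes each group of 8, applies the block scale and the per-group shift, and accumulates against the activations. The sketch below is a simplified stand-in, not the kernel added in this commit: it decodes each group as raw +/-1 bits instead of going through the actual codebook grid, and all names, array sizes, and the float activation path are assumptions.

```c
#include <stdint.h>

/* Simplified stand-in for the grid decode: treat each bit of qs as a +/-1
 * weight. The real type decodes qs/qh through a precomputed codebook grid. */
static void decode_group8(uint8_t qs, int8_t out[8]) {
    for (int j = 0; j < 8; ++j) out[j] = ((qs >> j) & 1) ? 1 : -1;
}

/* Dot product of one 256-weight super-block against float activations x:
 * acc += d * scales[g/2] * (decoded[j] + shifts[g]) * x[8*g + j] */
static float dot_superblock(float d,                  /* super-block scale                      */
                            const uint8_t qs[32],     /* packed group data                      */
                            const float  scales[16],  /* per-16-weight block scales (assumed)   */
                            const float  shifts[32],  /* per-group-of-8 shifts (per the commit) */
                            const float  x[256]) {
    float acc = 0.0f;
    for (int g = 0; g < 32; ++g) {
        int8_t vals[8];
        decode_group8(qs[g], vals);
        const float scale = d * scales[g/2];
        for (int j = 0; j < 8; ++j) {
            acc += scale * ((float)vals[j] + shifts[g]) * x[8*g + j];
        }
    }
    return acc;
}
```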
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 t/s -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing the block scales, redo the super-block scale fit (sketched below).
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
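
A minimal sketch of what such a refit could look like, assuming a weighted least-squares objective over the already-quantized values and one block scale per 16 weights; the function name and arguments are illustrative, not the actual ggml quantization code.

```c
#include <stddef.h>
#include <stdint.h>

/* Re-fit the super-block scale d after the 3-bit block scales are fixed:
 * minimize sum_i w[i] * (x[i] - d * s[i/16] * q[i])^2 over d, which gives
 * d = sum(w*ql*x) / sum(w*ql*ql) with ql = s[i/16] * q[i]. */
static float refit_superblock_scale(const float * x,   /* original weights          */
                                    const float * w,   /* importance weights        */
                                    const int8_t * q,  /* quantized grid values     */
                                    const float * s,   /* quantized block scales    */
                                    size_t n) {
    double sumqx = 0.0, sumq2 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double ql = (double)s[i/16] * q[i];  /* dequantized up to the super-block scale */
        sumqx += w[i] * ql * x[i];
        sumq2 += w[i] * ql * ql;
    }
    return sumq2 > 0.0 ? (float)(sumqx / sumq2) : 0.0f;
}
```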
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
File tree: 16 files changed, +1006 -125 lines
- examples/quantize
- ggml-cuda
- gguf-py/gguf
- tests
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
| 29 | + | |
29 | 30 | | |
30 | 31 | | |
31 | 32 | | |
| |||
370 | 371 | | |
371 | 372 | | |
372 | 373 | | |
373 | | - | |
374 | | - | |
375 | | - | |
376 | | - | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
377 | 380 | | |
378 | 381 | | |
379 | 382 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
377 | 377 | | |
378 | 378 | | |
379 | 379 | | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
380 | 394 | | |
381 | 395 | | |
382 | 396 | | |
| |||
1050 | 1064 | | |
1051 | 1065 | | |
1052 | 1066 | | |
| 1067 | + | |
1053 | 1068 | | |
1054 | 1069 | | |
1055 | 1070 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
615 | 615 | | |
616 | 616 | | |
617 | 617 | | |
| 618 | + | |
618 | 619 | | |
619 | 620 | | |
620 | 621 | | |
| |||
643 | 644 | | |
644 | 645 | | |
645 | 646 | | |
| 647 | + | |
646 | 648 | | |
647 | 649 | | |
648 | 650 | | |
| |||
2560 | 2562 | | |
2561 | 2563 | | |
2562 | 2564 | | |
2563 | | - | |
| 2565 | + | |
2564 | 2566 | | |
2565 | 2567 | | |
2566 | 2568 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
373 | 373 | | |
374 | 374 | | |
375 | 375 | | |
376 | | - | |
| 376 | + | |
377 | 377 | | |
378 | 378 | | |
379 | 379 | | |
| |||
395 | 395 | | |
396 | 396 | | |
397 | 397 | | |
398 | | - | |
| 398 | + | |
399 | 399 | | |
400 | 400 | | |
401 | 401 | | |
| |||
416 | 416 | | |
417 | 417 | | |
418 | 418 | | |
419 | | - | |
| 419 | + | |
420 | 420 | | |
421 | 421 | | |
422 | 422 | | |
| |||
444 | 444 | | |
445 | 445 | | |
446 | 446 | | |
447 | | - | |
| 447 | + | |
448 | 448 | | |
449 | 449 | | |
450 | 450 | | |
| |||
470 | 470 | | |
471 | 471 | | |
472 | 472 | | |
473 | | - | |
| 473 | + | |
474 | 474 | | |
475 | 475 | | |
476 | 476 | | |
| |||
496 | 496 | | |
497 | 497 | | |
498 | 498 | | |
499 | | - | |
| 499 | + | |
| 500 | + | |
| 501 | + | |
| 502 | + | |
| 503 | + | |
| 504 | + | |
| 505 | + | |
| 506 | + | |
| 507 | + | |
| 508 | + | |
| 509 | + | |
| 510 | + | |
| 511 | + | |
| 512 | + | |
| 513 | + | |
| 514 | + | |
| 515 | + | |
| 516 | + | |
| 517 | + | |
| 518 | + | |
| 519 | + | |
| 520 | + | |
| 521 | + | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
500 | 530 | | |
501 | 531 | | |
502 | 532 | | |
503 | 533 | | |
| 534 | + | |
504 | 535 | | |
505 | 536 | | |
506 | 537 | | |
| |||
658 | 689 | | |
659 | 690 | | |
660 | 691 | | |
| 692 | + | |
| 693 | + | |
| 694 | + | |
| 695 | + | |
| 696 | + | |
| 697 | + | |
661 | 698 | | |
662 | 699 | | |
663 | 700 | | |
| |||
724 | 761 | | |
725 | 762 | | |
726 | 763 | | |
| 764 | + | |
| 765 | + | |
727 | 766 | | |
728 | 767 | | |
729 | 768 | | |
| |||
769 | 808 | | |
770 | 809 | | |
771 | 810 | | |
| 811 | + | |
| 812 | + | |
772 | 813 | | |
773 | 814 | | |
774 | 815 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
282 | 282 | | |
283 | 283 | | |
284 | 284 | | |
| 285 | + | |
| 286 | + | |
| 287 | + | |
| 288 | + | |
| 289 | + | |
| 290 | + | |
| 291 | + | |
| 292 | + | |
285 | 293 | | |
286 | 294 | | |
287 | 295 | | |
| |||
373 | 381 | | |
374 | 382 | | |
375 | 383 | | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
376 | 387 | | |
377 | 388 | | |
378 | 389 | | |
| |||
0 commit comments