Adding IQ1_TN - 1.6875 bpw for TriLM ternary models #44
For the Bitnet-1.58b ternary models I had added `IQ1_BN` (1.625 bpw) and `IQ2_BN` (2.0 bpw) quants. But for TriLM I only added `IQ2_TN` (2.0625 bpw). This PR fills the gap by adding the corresponding 1.6875 bpw quantization type `IQ1_TN`.

The matrix multiplication implementation simply reuses the existing `IQ1_BN` implementation. We just need to add the multiplication with the row scale at the end of a vector dot product between a row in the left matrix and a column in the right matrix (in `IQ1_BN` there are no scales in the quantized data, and the scale is applied separately via a `ggml_scale` operation).
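To make that concrete, here is a minimal scalar sketch of the difference. It uses a simplified unpacked representation; `TernaryRow`, `ternary_dot`, and the two `dot_*` helpers are hypothetical illustrations, not the actual ik_llama.cpp kernels:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical unpacked representation: IQ1_BN-style rows store only the
// ternary values {-1, 0, +1}; IQ1_TN additionally carries a per-row scale
// (stored here as a plain float for simplicity).
struct TernaryRow {
    std::vector<int8_t> q; // ternary weights in {-1, 0, +1}
    float d;               // row scale (unused for IQ1_BN-style rows)
};

// Scale-less ternary dot product: the common core shared by both types.
static float ternary_dot(const int8_t * q, const float * x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += q[i] * x[i];
    return sum;
}

// IQ1_BN-style: no scales in the quantized data; the scale is applied
// separately, on the whole result, via a ggml_scale operation.
static float dot_iq1_bn_style(const TernaryRow & row, const float * x, int n) {
    return ternary_dot(row.q.data(), x, n);
}

// IQ1_TN-style: identical inner loop, plus one multiplication by the
// row scale at the end of the dot product.
static float dot_iq1_tn_style(const TernaryRow & row, const float * x, int n) {
    return row.d * ternary_dot(row.q.data(), x, n);
}

int main() {
    const int n = 8;
    TernaryRow row{{1, -1, 0, 1, 0, -1, 1, 1}, 0.25f};
    std::vector<float> x{1, 2, 3, 4, 5, 6, 7, 8};
    printf("IQ1_BN-style (unscaled): %g\n", dot_iq1_bn_style(row, x.data(), n));
    printf("IQ1_TN-style (scaled):   %g\n", dot_iq1_tn_style(row, x.data(), n));
    return 0;
}
```

The inner loop is identical for both types; `IQ1_TN` only adds one multiply per dot product, so its matrix multiplication performance should be essentially the same as `IQ1_BN`'s.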
While adding `IQ1_TN` to the `IQ1_BN` implementation, I noticed an optimization opportunity. As a result, this PR also improves `IQ1_BN` and `IQ2_BN` performance.

As PR-8151 has now been merged in mainline `llama.cpp`, I was curious to compare `IQ1_TN` with the corresponding `TQ1_0`, and `IQ2_TN` with the corresponding `TQ2_0`, in `llama.cpp`. The CPUs used in the comparisons below are Ryzen-7950X (Zen4), Ryzen-5975WX (AVX2) and M2-Max (NEON).
`IQ1_TN` vs `TQ1_0`, 4B TriLM model
`IQ2_TN` vs `TQ2_0`, 4B TriLM model
As `IQ2_BN` PP (prompt processing) performance is better than `IQ1_BN`'s, these tables indicate that my `IQ2_TN` implementation on Zen4/AVX2 is likely not optimal. There also seems to be a bottleneck somewhere for TG (token generation) with more than 8 threads that I need to look into.