This change introduces a llamafile_mixmul() API that allows tinyBLAS to
speed up "Mixture of Experts" models. On my Threadripper, the Mixtral
8x7b F16 weights now process prompts 2x faster. I am also seeing a 60
percent improvement with Mixtral 8x22b Q4_0. Q8_0 is supported as well,
since it is also handled by tinyBLAS. MoE models spend most of their
time in MUL_MAT_ID rather than MUL_MAT, which is why llamafile_sgemm()
was not able to help them before. The new code works by decomposing the
mixmul operation into fast 2D llamafile_sgemm() calls. This change also
adds BF16 support to tinyBLAS.
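To make the decomposition concrete, here is a minimal sketch (not the actual llamafile code) of how a MUL_MAT_ID-style mixture-of-experts multiply can be reduced to ordinary 2D GEMM calls. It assumes row-major float matrices and a routing table mapping each token row to one expert; `naive_sgemm()` and `mixmul_sketch()` are hypothetical stand-ins for the fast `llamafile_sgemm()` kernel and the new `llamafile_mixmul()` entry point.

```cpp
#include <cstddef>
#include <vector>

// Reference 2-D matmul: C[m x n] = A[m x k] * B[k x n], row-major.
// In llamafile this role is played by the fast llamafile_sgemm()/tinyBLAS kernel.
static void naive_sgemm(std::size_t m, std::size_t n, std::size_t k,
                        const float *A, const float *B, float *C) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Mixture-of-experts multiply decomposed into per-row 2-D GEMM calls.
// expert_weights[e] points to the [k x n] weight matrix of expert e;
// row_to_expert[i] is the expert that token row i was routed to.
static void mixmul_sketch(std::size_t rows, std::size_t n, std::size_t k,
                          const std::vector<const float *> &expert_weights,
                          const std::vector<int> &row_to_expert,
                          const float *X,   // [rows x k] activations
                          float *Y) {       // [rows x n] output
    for (std::size_t i = 0; i < rows; ++i) {
        const float *W = expert_weights[row_to_expert[i]];
        // Each routed row reduces to an ordinary (1 x k) * (k x n) matmul,
        // so the fast 2-D kernel can be reused unchanged.
        naive_sgemm(1, n, k, X + i * k, W, Y + i * n);
    }
}
```

A real implementation would group rows by expert and batch them into larger GEMMs rather than issue one call per row, but the key point is the same: once the routing is resolved, every piece of the MoE multiply is a plain 2D matrix product that tinyBLAS already knows how to accelerate.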