Description
Problem
In a Mixture of Experts (MoE) LLM, the gating network outputs a categorical distribution over the $N$ experts (e.g. $N = 8$ for Mixtral-8x7b and Mixtral-8x22b). If the model was trained to choose only the top $n$ of these experts per token (e.g. $n = 2$ for Mixtral), then running inference with a different number of active experts $m$ changes the expected norm of the MoE block's output away from what the rest of the model was trained on.
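For reference, this is roughly what I'm assuming the Mixtral-style MoE block does per token (just a toy NumPy sketch; `router_w`, `experts`, etc. are illustrative names, and the commented-out renormalisation line is the detail I'm not sure about):

```python
import numpy as np

def moe_block(x, router_w, experts, k):
    """Toy sketch of a Mixtral-style sparse MoE block for a single token.

    x:        (hidden_dim,) input activation
    router_w: (num_experts, hidden_dim) gating/router weights
    experts:  list of per-expert MLPs (callables)
    k:        number of active experts (2 for Mixtral as trained)
    """
    logits = router_w @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # categorical distribution over all N experts
    top = np.argsort(probs)[::-1][:k]       # keep only the top-k experts

    weights = probs[top]
    # weights /= weights.sum()              # <-- whether the selected weights get
    #                                       #     renormalised like this is the part
    #                                       #     that isn't 100% clear from the paper

    return sum(w * experts[i](x) for w, i in zip(weights, top))
```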
Solution
For simplicity, let's assume that the output of each expert is an i.i.d. random vector $x_i$ with a norm of $r$, so that the MoE output with $n$ active experts is the gate-weighted sum $y_n = \sum_{i=1}^{n} g_i\, x_i$.

The expected norm of this output is:

$$\mathbb{E}\left[\lVert y_n \rVert\right] \approx \sqrt{\mathbb{E}\left[\lVert y_n \rVert^2\right]} = r \sqrt{\sum_{i=1}^{n} \mathbb{E}\left[g_i^2\right]} = \frac{\sqrt{n}\, r}{N}$$

NOTE: The last equality holds only for a balanced distribution, where each selected expert's gating weight is (approximately) the same $g_i = \frac{1}{N}$, i.e. the softmax over all $N$ experts is close to uniform and is not renormalised over the selected experts. The middle equality uses the i.i.d. assumption, which makes the cross terms $\mathbb{E}[x_i \cdot x_j]$ (approximately) drop out.

If we change the number of experts to $m$, the expected norm becomes:

$$\mathbb{E}\left[\lVert y_m \rVert\right] \approx \frac{\sqrt{m}\, r}{N}$$

To make the expected norm of the output with $m$ experts match the norm the model was trained with, we can scale the output by $\sqrt{\frac{n}{m}}$:

$$y_m' = \sqrt{\frac{n}{m}}\, y_m$$

With this scaling, the expected norm of the output with $m$ experts is:

$$\mathbb{E}\left[\lVert y_m' \rVert\right] \approx \sqrt{\frac{n}{m}} \cdot \frac{\sqrt{m}\, r}{N} = \frac{\sqrt{n}\, r}{N}$$

Which is the same as the expected norm of the output with $n$ experts.
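Here's a quick Monte Carlo sanity check of this under the above assumptions (i.i.d. expert outputs with fixed norm, uniform $\frac{1}{N}$ gating weights); all the numbers are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, r = 8, 1024, 1.0   # total experts, hidden dim, per-expert output norm (illustrative)
n, m = 2, 4              # trained vs. inference-time number of active experts
trials = 2_000

def mean_output_norm(k, scale=1.0):
    """Empirical E[|| scale * (1/N) * sum of k i.i.d. expert outputs ||]."""
    norms = []
    for _ in range(trials):
        x = rng.standard_normal((k, d))
        x *= r / np.linalg.norm(x, axis=1, keepdims=True)   # each expert output has norm r
        y = scale * x.sum(axis=0) / N                       # balanced 1/N gating weights
        norms.append(np.linalg.norm(y))
    return float(np.mean(norms))

print("n experts:          ", mean_output_norm(n))                  # ~ sqrt(n) * r / N
print("m experts, unscaled:", mean_output_norm(m))                  # ~ sqrt(m) * r / N
print("m experts, scaled:  ", mean_output_norm(m, np.sqrt(n / m)))  # ~ sqrt(n) * r / N again
```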
Scale Factor
The scale factor $\sqrt{\frac{n}{m}}$ compensates for running with $m$ active experts instead of the $n$ the model was trained with:
- When $m > n$, the scale factor $\sqrt{\frac{n}{m}}$ will be less than 1.
- When $m < n$, the scale factor $\sqrt{\frac{n}{m}}$ will be greater than 1.
- When $m = n$, the scale factor $\sqrt{\frac{n}{m}} = 1$.
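For example, for a model trained with $n = 2$ active experts (like Mixtral), the scale to apply for a few different values of $m$ (this is just the formula above evaluated, nothing model-specific):

```python
import math

def expert_scale(n_trained, m_used):
    """sqrt(n/m): scale to apply when using m active experts instead of the n trained with."""
    return math.sqrt(n_trained / m_used)

# e.g. a Mixtral-style model trained with n = 2 active experts:
for m in (1, 2, 3, 4, 6, 8):
    print(f"m = {m}: scale = {expert_scale(2, m):.3f}")
```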
(sorry for the AI generated text again - but it's so much easier than trying to write all that LaTeX!)
This all assumes I have correctly understood what the Mixtral-style MoE architecture is doing though (it's not 100% clear from the paper).
If this shows promise then the i.i.d. assumption and the discrete uniform distribution simplification can be removed by sampling the actual outputs of the expert MLPs / gating networks (the i.i.d. assumption can be improved on if we are happy to just guess values for the correlation between the different experts' outputs).
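Something like this could estimate the scale factor empirically instead (the `expert_outputs` / `gate_weights` arrays here are hypothetical stand-ins for whatever hook collects the real activations from a calibration run):

```python
import numpy as np

def empirical_scale(expert_outputs, gate_weights, n, m):
    """Estimate the scale factor from sampled activations instead of the i.i.d. assumption.

    expert_outputs: (num_tokens, num_experts, hidden_dim) sampled expert MLP outputs
    gate_weights:   (num_tokens, num_experts) sampled (pre-top-k) gating weights
    n, m:           trained vs. proposed number of active experts
    """
    def mean_norm(k):
        norms = []
        for x, g in zip(expert_outputs, gate_weights):
            top = np.argsort(g)[::-1][:k]            # top-k experts for this token
            y = (g[top, None] * x[top]).sum(axis=0)  # gate-weighted sum of their outputs
            norms.append(np.linalg.norm(y))
        return np.mean(norms)

    return mean_norm(n) / mean_norm(m)

# Placeholder data just to show the call (in practice these would come from real
# forward passes over a calibration set):
rng = np.random.default_rng(0)
fake_outputs = rng.standard_normal((100, 8, 256))
fake_gates = rng.dirichlet(np.ones(8), size=100)
print(empirical_scale(fake_outputs, fake_gates, n=2, m=4))
```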
I'm going to try this on Mixtral-8x7b-Instruct now and see if it improves the perplexity vs previous attempts:
https://rentry.org/HowtoMixtral
https://old.reddit.com/r/LocalLLaMA/comments/18m6zjz/for_exllamav2_how_many_mixtral_experts_are/
@cg123 I see you already have a parameter called `residual_scale`, so for the mergekit-moe merges it should be pretty easy to try scaling the models that weren't designed to be in a MoE by $\sqrt{\frac{n}{m}}$ too.