The previous section described MoE as it was used before the era of
deep learning. With the rise of deep learning, MoE found application in running the largest models, as a simple way to perform conditional computation: only parts of the model are used on each input, with the parts chosen according to the input. The earliest paper applying MoE to deep learning dates to 2013; it proposed using a different gating network at each layer of a deep neural network. Specifically, each gating network is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network. Since the output of the gating network is not sparse, all expert outputs are needed, and no conditional computation is performed.

The key goal when using MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE: in classical MoE, the output for each query is a weighted sum of all experts' outputs, while in deep-learning MoE, the output for each query involves only a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.
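To make the contrast concrete, here is a minimal Python/NumPy sketch of a dense MoE output versus a sparse top-k one; all names, shapes, and the use of random linear maps as "experts" are illustrative assumptions, not taken from any particular paper.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Toy setup: n experts, each a random linear map (illustrative only).
    rng = np.random.default_rng(0)
    d, n, k = 8, 4, 2
    experts = [rng.standard_normal((d, d)) for _ in range(n)]
    W_gate = rng.standard_normal((n, d))

    x = rng.standard_normal(d)
    weights = softmax(W_gate @ x)

    # Classical (dense) MoE: every expert is evaluated.
    dense_out = sum(weights[i] * (experts[i] @ x) for i in range(n))

    # Deep-learning (sparse) MoE: only the top-k experts are evaluated;
    # their weights are renormalized over the selected subset.
    top = np.argsort(weights)[-k:]
    sparse_out = sum(weights[i] * (experts[i] @ x) for i in top) / weights[top].sum()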
=== Sparsely-gated MoE layer ===
The
sparsely-gated MoE layer, published by researchers from
Google Brain, uses
feedforward networks as experts and linear-softmax gating. Similar to the previously proposed hard MoE, it achieves sparsity by taking a weighted sum of only the top-k experts' outputs, instead of a weighted sum of all of them. Specifically, an MoE layer contains
feedforward networks f_1, ..., f_n, and a gating network w. The gating network is defined by w(x) = \mathrm{softmax}(\mathrm{top}_k(W x + \text{noise})), where \mathrm{top}_k is a function that keeps the top-k entries of a vector the same, but sets all other entries to -\infty. The addition of noise helps with load balancing (a code sketch of this gating function appears at the end of this subsection). The choice of k is a hyperparameter chosen according to the application; typical values are k = 1, 2. The k = 1 version is also called the Switch Transformer. The original Switch Transformer was applied to a
T5 language model; Table 3 of the Switch Transformer paper shows that the MoE models used less inference-time compute, despite having 30x more parameters. This architectural module was published in 2017-01, a few months before the publication of the Transformer architecture (2017-06-12), and the two were combined into a multimodal architecture called MultiModel, published four days later (2017-06-16).
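The gating function above can be sketched as follows in Python/NumPy. The names are illustrative, and a fixed noise scale is assumed for brevity, whereas the original paper learns the noise magnitude per expert.

    import numpy as np

    def noisy_topk_gating(x, W, k, noise_scale=1.0, rng=None):
        """w(x) = softmax(top_k(Wx + noise)): entries outside the top-k are
        set to -inf, so after the softmax their weights are exactly zero and
        those experts need not be evaluated at all."""
        rng = rng or np.random.default_rng()
        logits = W @ x + noise_scale * rng.standard_normal(W.shape[0])
        masked = np.full_like(logits, -np.inf)
        top = np.argsort(logits)[-k:]
        masked[top] = logits[top]   # keep top-k entries, -inf elsewhere
        e = np.exp(masked - logits[top].max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    w = noisy_topk_gating(rng.standard_normal(8), rng.standard_normal((4, 8)), k=2, rng=rng)
    # w has exactly 2 nonzero entries, which sum to 1.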
=== Load balancing ===
Vanilla MoE tends to have issues with
load balancing: some experts are consulted often, while others are consulted rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions. The Switch Transformer improved this to a single
auxiliary loss function. Specifically, let n be the number of experts; then for a given batch of queries \{x_1, x_2, ..., x_T\}, the auxiliary loss for the batch is

n \sum_{i=1}^n f_i P_i

Here, f_i = \frac 1T \#(\text{queries sent to expert } i) is the fraction of tokens that chose expert i, and P_i = \frac 1T \sum_{j=1}^T \frac{w_i(x_j)}{\sum_{i' \in \text{experts}} w_{i'}(x_j)} is the fraction of weight on expert i. This loss is minimized at 1, precisely when every expert has equal weight 1/n in all situations.
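A sketch of this loss in code (Python/NumPy, illustrative names; top-1 routing as in the Switch Transformer is assumed):

    import numpy as np

    def switch_aux_loss(gate_probs, chosen):
        """Load-balancing auxiliary loss n * sum_i f_i * P_i.
        gate_probs: (T, n) gate softmax outputs for a batch of T tokens.
        chosen:     (T,)   expert index each token was routed to (top-1)."""
        T, n = gate_probs.shape
        f = np.bincount(chosen, minlength=n) / T  # fraction of tokens per expert
        P = gate_probs.mean(axis=0)               # mean gate weight per expert
        return n * np.dot(f, P)                   # equals 1 under perfect balance

    # Perfectly balanced batch with a uniform gate: the loss is exactly 1.
    T, n = 8, 4
    print(switch_aux_loss(np.full((T, n), 1.0 / n), np.arange(T) % n))  # 1.0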
Researchers at DeepSeek designed a variant of MoE, with "shared experts" that are always queried, and "routed experts" that might not be. They found that standard load balancing encourages the experts to be consulted equally, which then causes the experts to replicate the same core capabilities, such as English grammar. They proposed that the shared experts learn the core capabilities that are often used, while the routed experts learn the peripheral capabilities that are rarely used. They also proposed an "auxiliary-loss-free load balancing strategy", which does not use an auxiliary loss. Instead, each expert i has an extra "expert bias" b_i. If an expert is being neglected, its bias increases, and vice versa. During token assignment, each token picks the top-k experts, but with the bias added in. That is:

f(x) = \sum_{i \text{ in the top-}k \text{ of } \{w(x)_j + b_j\}_j} w(x)_i f_i(x)

Note that the expert bias matters for picking the experts, but not for weighting the experts' responses.
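A sketch of this bias-based routing (Python/NumPy, illustrative names; the fixed-step sign-based bias update is a simplified stand-in, not DeepSeek's exact update rule):

    import numpy as np

    def biased_topk_route(gate_probs, bias, k):
        """Pick the top-k experts by gate probability plus expert bias.
        The bias affects which experts are picked, but the returned weights
        are the unbiased gate probabilities, matching the formula above."""
        top = np.argsort(gate_probs + bias)[-k:]
        return top, gate_probs[top]

    def update_bias(bias, load, target, step=0.01):
        """Raise the bias of underloaded experts and lower it for
        overloaded ones (simplified sign-based update)."""
        return bias + step * np.sign(target - load)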
=== Capacity factor ===
Suppose there are n experts in a layer. For a given batch of queries \{x_1, x_2, ..., x_T\}, each query is routed to one or more experts. For example, if each query is routed to one expert, as in the Switch Transformer, and if the experts are load-balanced, then each expert should expect on average T/n queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches, one expert might be underworked, while in others it would be overworked. Since the inputs cannot move through the layer until every expert in the layer has finished the queries it is assigned, load balancing is important. The
capacity factor is sometimes used to enforce a hard constraint on load balancing: each expert is only allowed to process up to c \cdot T/n queries in a batch. The ST-MoE report found c \in [1.25, 2] to work well in practice.
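In code, enforcing a capacity factor amounts to truncating each expert's queue (Python/NumPy sketch with illustrative names):

    import math
    import numpy as np

    def apply_capacity(chosen, n_experts, c):
        """Keep at most c * T / n_experts tokens per expert; the rest are
        dropped. chosen[t] is the expert index token t was routed to.
        Returns a boolean mask of the tokens actually processed."""
        T = len(chosen)
        capacity = math.ceil(c * T / n_experts)
        counts = np.zeros(n_experts, dtype=int)
        keep = np.zeros(T, dtype=bool)
        for t, e in enumerate(chosen):  # earlier tokens get priority
            if counts[e] < capacity:
                counts[e] += 1
                keep[t] = True
        return keep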
Generally speaking, routing is an assignment problem: how should tokens be assigned to experts, such that a variety of constraints (such as throughput and load balancing) are satisfied? There are typically three classes of routing algorithm: the experts choose the tokens ("
expert choice"), the tokens choose the experts (the original sparsely-gated MoE), and a global assigner matching experts and tokens. During inference, the MoE works over a large batch of tokens at any time. If the tokens were to choose the experts, then some experts might get few tokens, while a few experts get so many tokens that it exceeds their maximum batch size, so they would have to ignore some of the tokens. Similarly, if the experts were to choose the tokens, then some tokens might not be picked by any expert. This is the "
token drop" problem. Dropping a token is not necessarily a serious problem, since in Transformers, due to
residual connections, if a token is "dropped", it does not disappear: its vector representation simply passes through the feedforward layer without change. Other approaches include using
reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, as in RL). The token-expert match may also involve no learning ("static routing"): it can be done by a deterministic
hash function or a random number generator.
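For instance, static routing by hashing could look like the following sketch (illustrative; real systems would hash a stable token identifier, and the choice of SHA-256 here is an assumption made purely for the example):

    import hashlib

    def hash_route(token_id: int, n_experts: int) -> int:
        """Static routing: the expert is a deterministic function of the
        token identity, so nothing about routing is learned."""
        digest = hashlib.sha256(str(token_id).encode()).digest()
        return int.from_bytes(digest[:8], "big") % n_experts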
=== Applications to transformer models ===
MoE layers are used in the largest
transformer models, for which learning and inferring over the full model is too costly. They are typically sparsely-gated, with sparsity 1 or 2. In Transformer models, the MoE layers are often used to select the
feedforward layers (typically a linear-ReLU-linear network), appearing in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger; for example, in the PaLM-540B model, 90% of the parameters are in its feedforward layers. A trained Transformer can be converted to an MoE by duplicating its feedforward layers into multiple experts, adding randomly initialized gating, and then training further, a technique called "sparse upcycling". There are a large number of design choices involved in a Transformer MoE that affect training stability and final performance; the OLMoE report describes these in some detail.
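The idea of sparse upcycling can be sketched as follows (Python/NumPy, with illustrative shapes): each expert starts as an exact copy of the trained dense feedforward layer, and only the gate is new.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts = 16, 64, 4

    # Weights of a trained dense feedforward layer: linear-ReLU-linear.
    W1 = rng.standard_normal((d_ff, d_model))
    W2 = rng.standard_normal((d_model, d_ff))

    # Sparse upcycling: every expert begins as a copy of the dense FFN,
    # while the gating network is randomly initialized from scratch.
    experts = [(W1.copy(), W2.copy()) for _ in range(n_experts)]
    W_gate = 0.02 * rng.standard_normal((n_experts, d_model))

    # The upcycled model initially computes the same function as the dense
    # one (every expert gives the same output), and is then trained further.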
Models large enough to use MoE tend to be large language models, where each expert has on the order of 10 billion parameters. Outside of language models, Vision MoE is a Transformer model with MoE layers; its authors demonstrated it by training a model with 15 billion parameters. MoE Transformers have also been applied to
diffusion models. A series of large language models from
Google used MoE. GShard uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, and the second-ranked expert is selected with probability proportional to its weight according to the gating function (see the sketch below). Later, GLaM demonstrated a language model with 1.2 trillion parameters, with each MoE layer using top-2 out of 64 experts. Switch Transformers use top-1 routing in all MoE layers. The NLLB-200 model by Meta AI uses a hierarchical MoE with two levels: on the first level, the gating function chooses either a "shared" feedforward layer or the experts; if the experts are chosen, then another gating function computes the weights and chooses the top-2 experts.
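A sketch of the GShard-style stochastic top-2 rule described above (Python/NumPy; reading "proportional" as using the runner-up's gate weight directly as its selection probability is an assumption made for this example):

    import numpy as np

    def gshard_style_top2(gate_probs, rng):
        """Always use the best expert; use the runner-up only stochastically,
        with probability given by its gate weight. Returns (index, weight)
        pairs for the experts actually consulted."""
        order = np.argsort(gate_probs)
        first, second = order[-1], order[-2]
        chosen = [(int(first), float(gate_probs[first]))]
        if rng.random() < gate_probs[second]:
            chosen.append((int(second), float(gate_probs[second])))
        return chosen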
MoE large language models can be adapted for downstream tasks by instruction tuning. In December 2023,
Mistral AI released Mixtral 8x7B under the Apache 2.0 license. It is an MoE language model with 46.7B parameters, 8 experts, and sparsity 2. They also released a version finetuned for instruction following. In March 2024, Databricks released
DBRX. It is an MoE language model with 132B parameters, 16 experts, and sparsity 4. They also released a version finetuned for instruction following.

== See also ==