The previous section described MoE as it was used before the era of
deep learning. With the rise of deep learning, MoE found application in running the largest models, as a simple way to perform conditional computation: only parts of the model are used on each input, with the parts chosen according to the input. The earliest paper applying MoE to deep learning dates to 2013; it proposed using a different gating network at each layer of a deep neural network. Specifically, each gating network is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network. Since the output of the gating network is not sparse, all expert outputs are needed, and no conditional computation is performed.

The key goal when using MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE: in classical MoE, the output for each query is a weighted sum of all experts' outputs, while in deep-learning MoE, the output for each query involves only a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.
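To make the contrast concrete, here is a minimal Python/NumPy sketch of a dense MoE output versus a sparse top-k one; all names, shapes, and the use of random linear maps as "experts" are illustrative assumptions, not taken from any particular paper.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Toy setup: n experts, each a random linear map (illustrative only).
    rng = np.random.default_rng(0)
    d, n, k = 8, 4, 2
    experts = [rng.standard_normal((d, d)) for _ in range(n)]
    W_gate = rng.standard_normal((n, d))

    x = rng.standard_normal(d)
    weights = softmax(W_gate @ x)

    # Classical (dense) MoE: every expert is evaluated.
    dense_out = sum(weights[i] * (experts[i] @ x) for i in range(n))

    # Deep-learning (sparse) MoE: only the top-k experts are evaluated;
    # their weights are renormalized over the selected subset.
    top = np.argsort(weights)[-k:]
    sparse_out = sum(weights[i] * (experts[i] @ x) for i in top) / weights[top].sum()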
=== Sparsely-gated MoE layer ===
The
sparsely-gated MoE layer, published by researchers from
Google Brain, uses
feedforward networks as experts and linear-softmax gating. Similar to the previously proposed hard MoE, it achieves sparsity by taking a weighted sum of only the top-k experts' outputs, instead of a weighted sum of all of them. Specifically, an MoE layer contains
feedforward networks f_1, ..., f_n, and a gating network w. The gating network is defined by w(x) = \mathrm{softmax}(\mathrm{top}_k(W x + \text{noise})), where \mathrm{top}_k is a function that keeps the top-k entries of a vector the same, but sets all other entries to -\infty. The addition of noise helps with load balancing (a code sketch of this gating function appears at the end of this subsection). The choice of k is a hyperparameter chosen according to the application; typical values are k = 1, 2. The k = 1 version is also called the Switch Transformer. The original Switch Transformer was applied to a
T5 language model; Table 3 of the Switch Transformer paper shows that the MoE models used less inference-time compute, despite having 30x more parameters. This architectural module was published in 2017-01, a few months before the publication of the Transformer architecture (2017-06-12), and the two were combined into a multimodal architecture called MultiModel, published four days later (2017-06-16).
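The gating function above can be sketched as follows in Python/NumPy. The names are illustrative, and a fixed noise scale is assumed for brevity, whereas the original paper learns the noise magnitude per expert.

    import numpy as np

    def noisy_topk_gating(x, W, k, noise_scale=1.0, rng=None):
        """w(x) = softmax(top_k(Wx + noise)): entries outside the top-k are
        set to -inf, so after the softmax their weights are exactly zero and
        those experts need not be evaluated at all."""
        rng = rng or np.random.default_rng()
        logits = W @ x + noise_scale * rng.standard_normal(W.shape[0])
        masked = np.full_like(logits, -np.inf)
        top = np.argsort(logits)[-k:]
        masked[top] = logits[top]   # keep top-k entries, -inf elsewhere
        e = np.exp(masked - logits[top].max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    w = noisy_topk_gating(rng.standard_normal(8), rng.standard_normal((4, 8)), k=2, rng=rng)
    # w has exactly 2 nonzero entries, which sum to 1.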
=== Load balancing ===
Vanilla MoE tends to have issues with
load balancing: some experts are consulted often, while others are consulted rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions. The Switch Transformer improved this to a single
auxiliary loss function. Specifically, let n be the number of experts; then for a given batch of queries \{x_1, x_2, ..., x_T\}, the auxiliary loss for the batch is

n \sum_{i=1}^n f_i P_i

Here, f_i = \frac 1T \#(\text{queries sent to expert } i) is the fraction of tokens that chose expert i, and P_i = \frac 1T \sum_{j=1}^T \frac{w_i(x_j)}{\sum_{i' \in \text{experts}} w_{i'}(x_j)} is the fraction of weight on expert i. This loss is minimized at 1, precisely when every expert has equal weight 1/n in all situations.
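A sketch of this loss in code (Python/NumPy, illustrative names; top-1 routing as in the Switch Transformer is assumed):

    import numpy as np

    def switch_aux_loss(gate_probs, chosen):
        """Load-balancing auxiliary loss n * sum_i f_i * P_i.
        gate_probs: (T, n) gate softmax outputs for a batch of T tokens.
        chosen:     (T,)   expert index each token was routed to (top-1)."""
        T, n = gate_probs.shape
        f = np.bincount(chosen, minlength=n) / T  # fraction of tokens per expert
        P = gate_probs.mean(axis=0)               # mean gate weight per expert
        return n * np.dot(f, P)                   # equals 1 under perfect balance

    # Perfectly balanced batch with a uniform gate: the loss is exactly 1.
    T, n = 8, 4
    print(switch_aux_loss(np.full((T, n), 1.0 / n), np.arange(T) % n))  # 1.0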
Researchers at DeepSeek designed a variant of MoE, with "shared experts" that are always queried, and "routed experts" that might not be. They found that standard load balancing encourages the experts to be consulted equally, which then causes the experts to replicate the same core capabilities, such as English grammar. They proposed that the shared experts learn the core capabilities that are often used, while the routed experts learn the peripheral capabilities that are rarely used. They also proposed an "auxiliary-loss-free load balancing strategy", which does not use an auxiliary loss. Instead, each expert i has an extra "expert bias" b_i. If an expert is being neglected, its bias increases, and vice versa. During token assignment, each token picks the top-k experts, but with the bias added in. That is:

f(x) = \sum_{i \text{ in the top-}k \text{ of } \{w(x)_j + b_j\}_j} w(x)_i f_i(x)

Note that the expert bias matters for picking the experts, but not for weighting the experts' responses.
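A sketch of this bias-based routing (Python/NumPy, illustrative names; the fixed-step sign-based bias update is a simplified stand-in, not DeepSeek's exact update rule):

    import numpy as np

    def biased_topk_route(gate_probs, bias, k):
        """Pick the top-k experts by gate probability plus expert bias.
        The bias affects which experts are picked, but the returned weights
        are the unbiased gate probabilities, matching the formula above."""
        top = np.argsort(gate_probs + bias)[-k:]
        return top, gate_probs[top]

    def update_bias(bias, load, target, step=0.01):
        """Raise the bias of underloaded experts and lower it for
        overloaded ones (simplified sign-based update)."""
        return bias + step * np.sign(target - load)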
=== Capacity factor ===
Suppose there are n experts in a layer. For a given batch of queries \{x_1, x_2, ..., x_T\}, each query is routed to one or more experts. For example, if each query is routed to one expert, as in the Switch Transformer, and if the experts are load-balanced, then each expert should expect on average T/n queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches, one expert might be underworked, while in others it would be overworked. Since the inputs cannot move through the layer until every expert in the layer has finished the queries it is assigned, load balancing is important. The
capacity factor is sometimes used to enforce a hard constraint on load balancing: each expert is only allowed to process up to c \cdot T/n queries in a batch. The ST-MoE report found c \in [1.25, 2] to work well in practice.
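In code, enforcing a capacity factor amounts to truncating each expert's queue (Python/NumPy sketch with illustrative names):

    import math
    import numpy as np

    def apply_capacity(chosen, n_experts, c):
        """Keep at most c * T / n_experts tokens per expert; the rest are
        dropped. chosen[t] is the expert index token t was routed to.
        Returns a boolean mask of the tokens actually processed."""
        T = len(chosen)
        capacity = math.ceil(c * T / n_experts)
        counts = np.zeros(n_experts, dtype=int)
        keep = np.zeros(T, dtype=bool)
        for t, e in enumerate(chosen):  # earlier tokens get priority
            if counts[e] < capacity:
                counts[e] += 1
                keep[t] = True
        return keep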
Generally speaking, routing is an assignment problem: how should tokens be assigned to experts, such that a variety of constraints (such as throughput and load balancing) are satisfied? There are typically three classes of routing algorithm: the experts choose the tokens ("
expert choice"), the tokens choose the experts (the original sparsely-gated MoE), and a global assigner matching experts and tokens. During inference, the MoE works over a large batch of tokens at any time. If the tokens were to choose the experts, then some experts might get few tokens, while a few experts get so many tokens that it exceeds their maximum batch size, so they would have to ignore some of the tokens. Similarly, if the experts were to choose the tokens, then some tokens might not be picked by any expert. This is the "
token drop" problem. Dropping a token is not necessarily a serious problem, since in Transformers, due to
residual connections, if a token is "dropped", it does not disappear: its vector representation simply passes through the feedforward layer without change. Other approaches include using
reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, as in RL). The token-expert match may also involve no learning ("static routing"): it can be done by a deterministic
hash function or a random number generator.
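For instance, static routing by hashing could look like the following sketch (illustrative; real systems would hash a stable token identifier, and the choice of SHA-256 here is an assumption made purely for the example):

    import hashlib

    def hash_route(token_id: int, n_experts: int) -> int:
        """Static routing: the expert is a deterministic function of the
        token identity, so nothing about routing is learned."""
        digest = hashlib.sha256(str(token_id).encode()).digest()
        return int.from_bytes(digest[:8], "big") % n_experts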
=== Applications to transformer models ===
MoE layers are used in the largest
transformer models, for which learning and inferring over the full model is too costly. They are typically sparsely-gated, with sparsity 1 or 2. In Transformer models, the MoE layers are often used to select the
feedforward layers (typically a linear-ReLU-linear network), appearing in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger; for example, in the PaLM-540B model, 90% of the parameters are in its feedforward layers. A trained Transformer can be converted to an MoE by duplicating its feedforward layers into multiple experts, adding randomly initialized gating, and then training further, a technique called "sparse upcycling". There are a large number of design choices involved in a Transformer MoE that affect training stability and final performance; the OLMoE report describes these in some detail.
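The idea of sparse upcycling can be sketched as follows (Python/NumPy, with illustrative shapes): each expert starts as an exact copy of the trained dense feedforward layer, and only the gate is new.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts = 16, 64, 4

    # Weights of a trained dense feedforward layer: linear-ReLU-linear.
    W1 = rng.standard_normal((d_ff, d_model))
    W2 = rng.standard_normal((d_model, d_ff))

    # Sparse upcycling: every expert begins as a copy of the dense FFN,
    # while the gating network is randomly initialized from scratch.
    experts = [(W1.copy(), W2.copy()) for _ in range(n_experts)]
    W_gate = 0.02 * rng.standard_normal((n_experts, d_model))

    # The upcycled model initially computes the same function as the dense
    # one (every expert gives the same output), and is then trained further.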
Models large enough to use MoE tend to be large language models, where each expert has on the order of 10 billion parameters. Outside of language models, Vision MoE is a Transformer model with MoE layers; its authors demonstrated it by training a model with 15 billion parameters. MoE Transformers have also been applied to
diffusion models. A series of large language models from
Google used MoE. GShard uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, and the second-ranked expert is selected with probability proportional to its weight according to the gating function (see the sketch below). Later, GLaM demonstrated a language model with 1.2 trillion parameters, with each MoE layer using top-2 out of 64 experts. Switch Transformers use top-1 routing in all MoE layers. The NLLB-200 model by Meta AI uses a hierarchical MoE with two levels: on the first level, the gating function chooses either a "shared" feedforward layer or the experts; if the experts are chosen, then another gating function computes the weights and chooses the top-2 experts.
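A sketch of the GShard-style stochastic top-2 rule described above (Python/NumPy; reading "proportional" as using the runner-up's gate weight directly as its selection probability is an assumption made for this example):

    import numpy as np

    def gshard_style_top2(gate_probs, rng):
        """Always use the best expert; use the runner-up only stochastically,
        with probability given by its gate weight. Returns (index, weight)
        pairs for the experts actually consulted."""
        order = np.argsort(gate_probs)
        first, second = order[-1], order[-2]
        chosen = [(int(first), float(gate_probs[first]))]
        if rng.random() < gate_probs[second]:
            chosen.append((int(second), float(gate_probs[second])))
        return chosen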
MoE large language models can be adapted for downstream tasks by instruction tuning. In December 2023,
Mistral AI released Mixtral 8x7B under the Apache 2.0 license. It is an MoE language model with 46.7B parameters, 8 experts, and sparsity 2. They also released a version finetuned for instruction following. In March 2024, Databricks released
DBRX. It is an MoE language model with 132B parameters, 16 experts, and sparsity 4. They also released a version finetuned for instruction following.

== See also ==