Gated Linear Units (GLUs) adapt the gating mechanism for use in feedforward neural networks, often within transformer-based architectures. They are defined as

\mathrm{GLU}(a, b) = a \odot \sigma(b)

where a and b are the first and second inputs, respectively, and \sigma is the sigmoid activation function. Replacing \sigma with other activation functions yields variants of GLU:

\begin{aligned}
\mathrm{ReGLU}(a, b) &= a \odot \text{ReLU}(b)\\
\mathrm{GEGLU}(a, b) &= a \odot \text{GELU}(b)\\
\mathrm{SwiGLU}(a, b, \beta) &= a \odot \text{Swish}_\beta(b)
\end{aligned}

where ReLU, GELU, and Swish are different activation functions, with \mathrm{Swish}_\beta(x) = x \, \sigma(\beta x).
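A minimal NumPy sketch of these element-wise definitions (the function names are illustrative, and the common tanh approximation of GELU is assumed):

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

def glu(a, b):
    return a * sigmoid(b)          # GLU(a, b) = a ⊙ σ(b)

def reglu(a, b):
    return a * np.maximum(0.0, b)  # ReGLU(a, b) = a ⊙ ReLU(b)

def geglu(a, b):
    return a * gelu(b)             # GEGLU(a, b) = a ⊙ GELU(b)

def swiglu(a, b, beta=1.0):
    return a * swish(b, beta)      # SwiGLU(a, b, β) = a ⊙ Swish_β(b)
</syntaxhighlight>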
In transformer models, such gating units are often used in the feedforward modules. For a single vector input x, this results in

\begin{aligned}
\operatorname{GLU}(x, W, V, b, c) &= \sigma(xW + b) \odot (xV + c) \\
\operatorname{Bilinear}(x, W, V, b, c) &= (xW + b) \odot (xV + c) \\
\operatorname{ReGLU}(x, W, V, b, c) &= \max(0, xW + b) \odot (xV + c) \\
\operatorname{GEGLU}(x, W, V, b, c) &= \operatorname{GELU}(xW + b) \odot (xV + c) \\
\operatorname{SwiGLU}(x, W, V, b, c, \beta) &= \operatorname{Swish}_\beta(xW + b) \odot (xV + c)
\end{aligned}

where W and V are weight matrices and b and c are bias vectors.
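A sketch of one such gated layer, here SwiGLU (the dimension names d_model and d_ff and the function name swiglu_layer are illustrative assumptions, not from the source):

<syntaxhighlight lang="python">
import numpy as np

def swiglu_layer(x, W, V, b, c, beta=1.0):
    # SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊙ (xV + c)
    gate = x @ W + b
    gate = gate / (1.0 + np.exp(-beta * gate))  # Swish_β(z) = z · σ(βz)
    return gate * (x @ V + c)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32             # toy sizes, illustrative only
x = rng.normal(size=d_model)
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
b = np.zeros(d_ff)
c = np.zeros(d_ff)

h = swiglu_layer(x, W, V, b, c)   # shape (d_ff,)
</syntaxhighlight>

In transformer feedforward blocks, the output of such a gated layer is typically passed through a further linear projection back to the model dimension.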
== Other architectures ==