== Standard scaled dot-product attention ==
For matrices Q\in\mathbb{R}^{m\times d_k}, K\in\mathbb{R}^{n\times d_k} and V\in\mathbb{R}^{n\times d_v}, the scaled dot-product, or QKV attention, is defined as: \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\in\mathbb{R}^{m\times d_v} where {}^T denotes
transpose and the
softmax function is applied independently to every row of its argument. The matrix Q contains m queries, while matrices K, V jointly contain an
unordered set of n key-value pairs. Value vectors in matrix V are weighted using the weights resulting from the softmax operation, so that the rows of the m-by-d_v output matrix are confined to the
convex hull of the points in \mathbb{R}^{d_v} given by the rows of V. To understand the permutation invariance and permutation equivariance properties of QKV attention, let A\in\mathbb{R}^{m\times m} and B\in\mathbb{R}^{n\times n} be
permutation matrices; and D\in\mathbb{R}^{m\times n} an arbitrary matrix. The softmax function is permutation equivariant in the sense that: \text{softmax}(ADB) = A\,\text{softmax}(D)B By noting that the transpose of a permutation matrix is also its inverse, it follows that: \text{Attention}(AQ, BK, BV) = A\,\text{Attention}(Q, K, V) which shows that QKV attention is
equivariant with respect to re-ordering the queries (rows of Q); and
invariant to re-ordering of the key-value pairs in K, V. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as: X\mapsto\text{Attention}(XT_q, XT_k, XT_v) is permutation equivariant with respect to re-ordering the rows of the input matrix X in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for
multi-head attention, which is defined below.
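As an illustration, the following NumPy sketch (the function and variable names are illustrative, not taken from any particular library) implements the formula above and numerically checks the equivariance and invariance properties:

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax, numerically stabilised by subtracting the row maximum.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product (QKV) attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
m, n, d_k, d_v = 4, 6, 8, 5
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# Random permutation matrices A (m x m) and B (n x n).
A = np.eye(m)[rng.permutation(m)]
B = np.eye(n)[rng.permutation(n)]

# Equivariance in the queries, invariance in the key-value pairs:
# Attention(AQ, BK, BV) == A Attention(Q, K, V).
assert np.allclose(attention(A @ Q, B @ K, B @ V), A @ attention(Q, K, V))
```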
== Masked attention ==
When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have n rows, a masked attention variant is used: \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+M\right)V where the mask M\in\mathbb{R}^{n\times n} is a
strictly upper triangular matrix, with zeros on and below the diagonal and -\infty in every element above the diagonal. The softmax output, also in \mathbb{R}^{n\times n}, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all 1\le i < j\le n, row i of the attention output is independent of row j of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
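A minimal NumPy sketch of the mask construction and the masked variant, under the same illustrative conventions as above:

```python
import numpy as np

def causal_mask(n):
    # Strictly upper triangular mask: 0 on and below the diagonal, -inf above it.
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    # Masked QKV attention: softmax(Q K^T / sqrt(d_k) + M) V.
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(n)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise; exp(-inf) becomes 0
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```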
== Multi-head attention ==
Multi-head attention is defined as: \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O where each head is computed with QKV attention as: \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) and W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices. The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices A, B: \text{MultiHead}(AQ, BK, BV) = A\,\text{MultiHead}(Q, K, V) from which we also see that multi-head self-attention: X\mapsto\text{MultiHead}(XT_q, XT_k, XT_v) is equivariant with respect to re-ordering of the rows of input matrix X.
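A possible NumPy sketch of this definition, with Wq, Wk, Wv taken to be lists of per-head projection matrices and Wo the output projection (names are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (as defined above).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    # One QKV attention computation per head, then concatenate along the
    # feature axis and project back with Wo.
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo
```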
== Bahdanau (additive) attention ==
\text{Attention}(Q, K, V) = \text{softmax}(\tanh(W_QQ + W_KK))V where W_Q and W_K are learnable weight matrices.
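The formula above leaves implicit how the m-by-n score matrix is formed from the projected queries and keys; a common realisation of additive attention, assumed in this sketch, scores each query-key pair with an additional learned vector v:

```python
import numpy as np

def additive_attention(Q, K, V, Wq, Wk, v):
    # Bahdanau-style scores: e[i, j] = v . tanh(Wq q_i + Wk k_j),
    # built for all query-key pairs via broadcasting.
    qa = Q @ Wq                                              # (m, d_a)
    ka = K @ Wk                                              # (n, d_a)
    scores = np.tanh(qa[:, None, :] + ka[None, :, :]) @ v    # (m, n)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V
```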
== Luong attention (general) ==
\text{Attention}(Q, K, V) = \text{softmax}(QWK^T)V where W is a learnable weight matrix.
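A corresponding sketch of the bilinear ("general") scoring, again with illustrative names:

```python
import numpy as np

def luong_general_attention(Q, K, V, W):
    # "General" bilinear scores Q W K^T, then a row-wise softmax over the keys.
    scores = Q @ W @ K.T
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V
```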
== Self-attention ==
Self-attention is essentially the same as cross-attention, except that query, key, and value vectors all come from the same model. Both encoder and decoder can use self-attention, but with subtle differences. For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed
lookup table. This gives a sequence of hidden vectors h_0, h_1, \dots. These can then be fed into a dot-product attention mechanism, to obtain \begin{aligned} h_0' &= \mathrm{Attention}(h_0 W^Q, HW^K, H W^V) \\ h_1' &= \mathrm{Attention}(h_1 W^Q, HW^K, H W^V) \\ &\;\,\vdots \end{aligned} or, more succinctly, H' = \mathrm{Attention}(H W^Q, HW^K, H W^V), where H is the matrix whose rows are the hidden vectors h_0, h_1, \dots. This can be applied repeatedly to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called "all-to-all attention", as the vector at every position can attend to every other.
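A toy NumPy sketch of this construction (the vocabulary size, dimensions, and parameter initialisation are hypothetical):

```python
import numpy as np

def self_attention_layer(H, Wq, Wk, Wv):
    # One layer of encoder self-attention: queries, keys and values are all
    # linear projections of the same hidden-state matrix H (one row per token).
    d_k = Wq.shape[1]
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ (H @ Wv)

rng = np.random.default_rng(0)
vocab, d = 100, 16
embedding = rng.normal(size=(vocab, d))   # the "embedding layer" lookup table
tokens = np.array([3, 17, 42, 7])
H = embedding[tokens]                      # hidden vectors h_0, h_1, ...
for _ in range(3):                         # stack layers for a multilayered encoder
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    H = self_attention_layer(H, Wq, Wk, Wv)
```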
== Masking ==
For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights w_{ij} = 0 for all i < j, called "causal masking". This attention mechanism is the "causally masked self-attention".

== See also ==