An RNN-based model can be factored into two parts: configuration and architecture. Multiple RNNs can be combined in a data flow, and the data flow itself is the configuration. Each constituent RNN may have any architecture, such as LSTM or GRU.
=== Standard ===
RNNs come in many variants. Abstractly speaking, an RNN is a function f_\theta of type (x_t, h_t) \mapsto (y_t, h_{t+1}), where
* x_t: input vector;
* h_t: hidden vector;
* y_t: output vector;
* \theta: neural network parameters.
In words, it is a neural network that maps an input x_t to an output y_t, with the hidden vector h_t playing the role of "memory", a partial record of all previous input-output pairs. At each step, it transforms an input into an output and modifies its "memory" to help it better perform future processing. Unfolded diagrams of an RNN can be misleading, because practical neural network topologies are frequently organized in "layers" and such drawings give that appearance. However, what appear to be layers are, in fact, different steps in time, "unfolded" to produce the appearance of layers.
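The abstract signature (x_t, h_t) \mapsto (y_t, h_{t+1}) can be made concrete with a short sketch. Below is a minimal NumPy implementation of an Elman-style RNN cell; the weight names (W_xh, W_hh, W_hy) and the parameter dictionary are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def rnn_step(x_t, h_t, params):
    """One step of a vanilla (Elman) RNN: (x_t, h_t) -> (y_t, h_next)."""
    # Weight names are illustrative assumptions, not a library convention.
    h_next = np.tanh(params["W_xh"] @ x_t + params["W_hh"] @ h_t + params["b_h"])
    y_t = params["W_hy"] @ h_next + params["b_y"]  # emit an output from the new state
    return y_t, h_next

def run_rnn(xs, h0, params):
    """Unfold the same cell over a whole input sequence ("unrolling in time")."""
    h, ys = h0, []
    for x_t in xs:
        y_t, h = rnn_step(x_t, h, params)  # h carries the "memory" forward
        ys.append(y_t)
    return ys, h
```

The loop in run_rnn is exactly what the layer-like drawings depict: one copy of the same cell per time step.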
=== Stacked RNN ===
A stacked RNN, or deep RNN, is composed of multiple RNNs stacked one above the other. Abstractly, it is structured as follows:
* Layer 1 has hidden vector h_{1, t}, parameters \theta_1, and maps f_{\theta_1} : (x_{0, t}, h_{1, t}) \mapsto (x_{1, t}, h_{1, t+1}).
* Layer 2 has hidden vector h_{2, t}, parameters \theta_2, and maps f_{\theta_2} : (x_{1, t}, h_{2, t}) \mapsto (x_{2, t}, h_{2, t+1}).
* ...
* Layer n has hidden vector h_{n, t}, parameters \theta_n, and maps f_{\theta_n} : (x_{n-1, t}, h_{n, t}) \mapsto (x_{n, t}, h_{n, t+1}).
Each layer operates as a stand-alone RNN, and each layer's output sequence is used as the input sequence to the layer above. There is no conceptual limit to the depth of a stacked RNN.
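Continuing the NumPy sketch above, stacking amounts to feeding each layer's output sequence into the layer above; run_rnn and the parameter format are assumptions carried over from the earlier example.

```python
def run_stacked_rnn(xs, h0s, layer_params):
    """Each layer is a stand-alone RNN; layer k's outputs are layer k+1's inputs."""
    seq = xs
    for h0, params in zip(h0s, layer_params):
        seq, _ = run_rnn(seq, h0, params)  # reuse the single-layer runner above
    return seq  # output sequence of the topmost layer
```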
=== Bidirectional ===
A bidirectional RNN (biRNN) is composed of two RNNs, one processing the input sequence in one direction, and another in the opposite direction. Abstractly, it is structured as follows:
* The forward RNN processes in one direction: f_{\theta}(x_0, h_0) = (y_0, h_1), f_{\theta}(x_1, h_1) = (y_1, h_2), \dots
* The backward RNN processes in the opposite direction: f'_{\theta'}(x_N, h_N') = (y_N', h_{N-1}'), f'_{\theta'}(x_{N-1}, h_{N-1}') = (y_{N-1}', h_{N-2}'), \dots
The two output sequences are then concatenated to give the total output: ((y_0, y_0'), (y_1, y_1'), \dots, (y_N, y_N')). A bidirectional RNN allows the model to process a token both in the context of what came before it and what came after it. By stacking multiple bidirectional RNNs together, the model can process a token increasingly contextually. The ELMo model (2018) is a stacked bidirectional LSTM that takes character-level inputs and produces word-level embeddings.
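As a sketch under the same assumptions as the earlier NumPy examples, a bidirectional RNN runs one cell forward, an independent cell backward, and concatenates the aligned outputs:

```python
def run_birnn(xs, h0_f, h0_b, params_f, params_b):
    """Forward and backward passes over the same sequence, concatenated per step."""
    ys_f, _ = run_rnn(xs, h0_f, params_f)        # left-to-right pass
    ys_b, _ = run_rnn(xs[::-1], h0_b, params_b)  # right-to-left pass
    ys_b = ys_b[::-1]                            # realign with forward order
    return [np.concatenate([f, b]) for f, b in zip(ys_f, ys_b)]
```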
=== Encoder-decoder ===
Two RNNs can be run front-to-back in an encoder-decoder configuration. The encoder RNN processes an input sequence into a sequence of hidden vectors, and the decoder RNN processes that sequence of hidden vectors into an output sequence, with an optional attention mechanism. This configuration was used to construct state-of-the-art neural machine translators during the 2014–2017 period, and was an instrumental step towards the development of transformers.
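A minimal sketch of the configuration, again reusing the helpers above and omitting attention (with attention, the decoder would also weight and combine all of the encoder's per-step hidden vectors rather than using only the final one):

```python
def run_encoder_decoder(src, h0_enc, params_enc, params_dec, start, n_steps):
    """Encode the source, then decode autoregressively from the final state."""
    _, h = run_rnn(src, h0_enc, params_enc)  # compress the input sequence into h
    x, out = start, []
    for _ in range(n_steps):
        y, h = rnn_step(x, h, params_dec)
        out.append(y)
        x = y  # greedy sketch: feed each output back as the next input
               # (assumes output and input dimensions match)
    return out
```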
=== PixelRNN ===
An RNN may process data with more than one dimension. PixelRNN processes two-dimensional data, with many possible directions. For example, the row-by-row direction processes an n \times n grid of vectors x_{i, j} in the following order: x_{1, 1}, x_{1, 2}, \dots, x_{1, n}, x_{2, 1}, x_{2, 2}, \dots, x_{2, n}, \dots, x_{n, n}.
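This row-by-row direction is an ordinary raster scan of the grid, sketched here (1-indexed to match the notation above):

```python
def row_by_row_order(n):
    """Yield grid coordinates in PixelRNN's row-by-row order."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            yield (i, j)

# list(row_by_row_order(2)) == [(1, 1), (1, 2), (2, 1), (2, 2)]
```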
The diagonal BiLSTM uses two LSTMs to process the same grid. One processes it from the top-left corner to the bottom-right, so that each x_{i, j} is processed depending on the hidden and cell states of the cells above and to the left: h_{i-1, j}, c_{i-1, j} and h_{i, j-1}, c_{i, j-1}. The other processes the grid from the top-right corner to the bottom-left.

== Architectures ==