== Previous work ==
Residual connections were observed in neuroanatomy, as early as Lorente de Nó (1938). McCulloch and Pitts (1943) proposed artificial neural networks and considered those with residual connections. In 1961, Frank Rosenblatt described a three-layer multilayer perceptron (MLP) model with skip connections. The model was referred to as a "cross-coupled system", and the skip connections were forms of cross-coupled connections.

During the late 1980s, "skip-layer" connections were sometimes used in neural networks. For example, Lang and Witbrock (1988) trained a fully connected feedforward network in which each layer skip-connects to all subsequent layers, like the later DenseNet (2016). In that work, the residual connection took the form x \mapsto F(x) + P(x), where P is a randomly-initialized projection connection. They termed it a "short-cut connection". An early neural language model used residual connections and named them "direct connections".
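A minimal sketch of this short-cut form, written as a modern PyTorch-style module (the names F and P follow the formula above; the layer contents are illustrative assumptions, not the 1988 setup):

```python
import torch
import torch.nn as nn

class ProjectionSkip(nn.Module):
    """Computes F(x) + P(x): a learned branch F plus a skip connection
    through a randomly-initialized linear projection P.

    Sketch only; the layer choices are assumptions.
    """
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim_in, dim_out), nn.Tanh())
        self.P = nn.Linear(dim_in, dim_out, bias=False)  # projection shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.F(x) + self.P(x)
```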
== Degradation problem ==
Sepp Hochreiter discovered the vanishing gradient problem in 1991 and argued that it explained why the then-prevalent forms of recurrent neural networks did not work for long sequences. He and Schmidhuber later designed the LSTM architecture to solve this problem; its "cell state" c_t can function as a generalized residual connection. The highway network (2015) applied the idea of an LSTM unfolded in time to feedforward neural networks. ResNet is equivalent to an open-gated highway network (see the sketch at the end of this section).

During the early days of deep learning, there were attempts to train increasingly deep models. Notable examples included the AlexNet (2012), which had 8 layers, and the VGG-19 (2014), which had 19 layers. However, stacking too many layers led to a steep reduction in training accuracy, known as the "degradation" problem.
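The highway network's gating, and its reduction to a plain residual connection when the gates are held open, can be sketched as follows (a minimal illustration with assumed layer sizes, not the original formulation):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Computes T(x) * H(x) + C(x) * x, with transform gate T and carry
    gate C. Holding both gates open (T = C = 1) recovers the ResNet
    form H(x) + x. Sketch only; layer choices are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.T = nn.Linear(dim, dim)  # transform gate
        self.C = nn.Linear(dim, dim)  # carry gate (often tied as C = 1 - T)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.T(x))
        c = torch.sigmoid(self.C(x))
        return t * self.H(x) + c * x
```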
== Subsequent work ==
Wide Residual Network (2016) found that using more channels and fewer layers than the original ResNet improves performance and GPU-computational efficiency, and that a block with two 3×3 convolutions is superior to other configurations of convolution blocks.
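A minimal sketch of such a block, assuming the pre-activation (BN-ReLU-conv) ordering; the channel count is taken as already multiplied by the network-wide widening factor:

```python
import torch
import torch.nn as nn

class WideBasicBlock(nn.Module):
    """A residual block with two 3x3 convolutions, the configuration
    favored by Wide ResNet. The BN-ReLU-conv ordering here is one
    common choice, not necessarily the paper's exact layout."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)
```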
DenseNet (2016) connects the output of each layer to the input of every subsequent layer: x_{\ell+1} = F(x_1, x_2, \dots, x_\ell)
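A minimal sketch of this connectivity, where each layer receives the channel-wise concatenation of all earlier outputs (the growth_rate parameter and layer contents are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """x_{l+1} = F(x_1, ..., x_l): the layer sees all earlier feature
    maps concatenated along the channel axis. Sketch only; in_channels
    must equal the total channels of the accumulated feature list."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.F = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3,
                      padding=1, bias=False),
        )

    def forward(self, features: list) -> torch.Tensor:
        # Concatenate every previous output x_1 ... x_l on the channel axis;
        # the caller appends this layer's output to the list afterwards.
        return self.F(torch.cat(features, dim=1))
```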
Stochastic depth is a
regularization method that randomly drops a subset of layers and lets the signal propagate through the identity skip connections. Also known as
DropPath, this regularizes training for deep models, such as
vision transformers.
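A minimal sketch of the training-time behavior, assuming the common per-sample Bernoulli mask with 1/(1 - p) rescaling (the function name and signature are illustrative):

```python
import torch

def drop_path(x: torch.Tensor, residual: torch.Tensor,
              drop_prob: float, training: bool) -> torch.Tensor:
    """Stochastic depth / DropPath: with probability drop_prob, drop the
    residual branch for a sample so only the identity path remains."""
    if not training or drop_prob == 0.0:
        return x + residual
    keep = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dims.
    shape = (x.shape[0],) + (1,) * (x.dim() - 1)
    mask = (torch.rand(shape, device=x.device) < keep).to(x.dtype)
    return x + residual * mask / keep  # rescale to keep the expectation
```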
ResNeXt (2017) combines the multi-branch idea of the Inception module with ResNet: each residual block aggregates a set of parallel transformations, which can be implemented efficiently as a grouped convolution.
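A minimal sketch of such a block, with the cardinality (number of parallel branches) expressed through the groups argument of a grouped convolution; the channel widths are illustrative:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Residual block with aggregated parallel branches, implemented as
    a grouped 3x3 convolution in a 1x1 -> grouped 3x3 -> 1x1 pattern.
    Sketch only; default widths are assumptions."""
    def __init__(self, channels: int, cardinality: int = 32, width: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))
```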
Squeeze-and-Excitation Networks (2018) added squeeze-and-excitation (SE) modules to ResNet. An SE module is applied after a convolution, and takes a tensor of shape \mathbb{R}^{H \times W \times C} (height, width, channels) as input. Each channel is averaged, resulting in a vector of shape \mathbb{R}^C. This is then passed through a multilayer perceptron (with an architecture such as linear-ReLU-linear-sigmoid) before it is multiplied with the original tensor. It won the
ILSVRC in 2017.
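A minimal sketch of an SE module following this description (NCHW layout as in PyTorch; the reduction ratio r is an assumed default):

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: average each channel over space
    ("squeeze"), pass the C-vector through linear-ReLU-linear-sigmoid
    ("excitation"), and rescale the input channels by the result.
    Sketch only; the reduction ratio r is an illustrative choice."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(2, 3))           # (N, C): per-channel average
        w = self.mlp(s)                  # (N, C): channel weights in (0, 1)
        return x * w[:, :, None, None]   # rescale each channel of the input
```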