=== Before 1980 ===
There are two types of artificial neural network (ANN): the feedforward neural network (FNN), or multilayer perceptron (MLP), and the recurrent neural network (RNN). RNNs have cycles in their connectivity structure; FNNs do not. In the 1920s,
Wilhelm Lenz and
Ernst Ising created the
Ising model, which is essentially a non-learning RNN architecture consisting of neuron-like threshold elements.
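The dynamics of such a network can be sketched in a few lines: binary units flip according to a fixed threshold rule, and nothing is learned. The following toy example is an illustration of that idea, not Lenz and Ising's formulation; all variable names and sizes are invented for the sketch.

```python
# Toy sketch of a non-learning recurrent network of threshold elements,
# in the spirit of the Ising model: fixed symmetric couplings, binary
# states, asynchronous updates until the state settles.
import numpy as np

rng = np.random.default_rng(0)
n = 16
W = rng.normal(size=(n, n))
W = (W + W.T) / 2            # symmetric couplings, as in the Ising model
np.fill_diagonal(W, 0.0)     # no self-coupling
s = rng.choice([-1, 1], size=n)   # binary spin/neuron states

for step in range(200):
    i = rng.integers(n)               # pick one unit at random (asynchronous)
    s[i] = 1 if W[i] @ s >= 0 else -1 # threshold rule: align with local field
print(s)
```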
In 1972, Shun'ichi Amari made this architecture adaptive. Other early
recurrent neural networks were published by Kaoru Nakano in 1971. Already in 1948,
Alan Turing produced work on "Intelligent Machinery" that was not published in his lifetime, containing "ideas related to artificial evolution and learning RNNs". In 1958, Frank Rosenblatt proposed the perceptron, an MLP with 3 layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. He later published a 1962 book that also introduced variants and computer experiments, including a version with four-layer perceptrons "with adaptive preterminal networks" where the last two layers have learned weights (here he credits H. D. Block and B. W. Knight). The book cites an earlier network by R. D. Joseph (1960) "functionally equivalent to a variation of" this four-layer system (the book mentions Joseph over 30 times). Should Joseph therefore be considered the originator of proper adaptive
multilayer perceptrons with learning hidden units? Unfortunately, the learning algorithm was not a functional one, and fell into oblivion. The first working deep learning algorithm was the
Group method of data handling, a method to train arbitrarily deep neural networks, published by
Alexey Ivakhnenko and Lapa in 1965. They regarded it as a form of polynomial regression, or a generalization of Rosenblatt's perceptron to handle more complex, nonlinear, and hierarchical relationships. A 1971 paper described a deep network with eight layers trained by this method, which is based on layer-by-layer training through regression analysis. Superfluous hidden units are pruned using a separate validation set. Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units, or "gates".
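The layer-by-layer regression scheme can be made concrete. The following is a simplified sketch, not Ivakhnenko and Lapa's exact procedure: each candidate unit is a quadratic (Kolmogorov-Gabor) polynomial of two inputs fitted by least squares, and superfluous units are pruned by validation error; the data and function names are invented for the sketch.

```python
# Minimal GMDH-style layer-by-layer training: fit quadratic polynomial
# units by regression, keep the best on a held-out validation set, stack.
import numpy as np
from itertools import combinations

def features(xi, xj):
    # columns: 1, xi, xj, xi*xj, xi^2, xj^2 (a Kolmogorov-Gabor polynomial)
    return np.stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2], axis=1)

def gmdh_layer(X_tr, y_tr, X_va, y_va, keep=4):
    units = []
    for i, j in combinations(range(X_tr.shape[1]), 2):
        A = features(X_tr[:, i], X_tr[:, j])
        coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)   # regression fit
        va_err = np.mean((features(X_va[:, i], X_va[:, j]) @ coef - y_va) ** 2)
        units.append((va_err, i, j, coef))
    units.sort(key=lambda u: u[0])      # prune by validation error
    best = units[:keep]
    out = lambda X: np.stack([features(X[:, i], X[:, j]) @ c
                              for _, i, j, c in best], axis=1)
    return out, best[0][0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]   # toy nonlinear target
X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

layer, err = gmdh_layer(X_tr, y_tr, X_va, y_va)
print("validation MSE of best first-layer unit:", err)
X_tr2, X_va2 = layer(X_tr), layer(X_va)   # stack a second layer on the first
_, err2 = gmdh_layer(X_tr2, y_tr, X_va2, y_va)
print("validation MSE after second layer:", err2)
```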
The first deep learning MLP trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned
internal representations to classify non-linearly separable pattern classes. Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.
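Stochastic gradient descent itself is compact to state: the weights are nudged after each individual example rather than after a full pass over the data. A minimal sketch on a toy linear model (illustrative only; the data is synthetic):

```python
# Stochastic gradient descent: one small weight update per example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=500)   # noisy linear targets

w, lr = np.zeros(3), 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):   # visit examples in random order
        err = X[i] @ w - y[i]           # prediction error on one example
        w -= lr * err * X[i]            # gradient step on that example alone
print(w)   # approaches [2.0, -1.0, 0.5]
```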
In 1969, Kunihiko Fukushima introduced the
ReLU (rectified linear unit)
activation function.
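The function itself is one line: it passes positive inputs through unchanged and clamps negative inputs to zero.

```python
import numpy as np

def relu(x):
    # rectified linear unit: max(0, x), applied elementwise
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```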
Deep learning architectures for convolutional neural networks (CNNs) with convolutional layers and downsampling layers began with the
Neocognitron introduced by
Kunihiko Fukushima in 1979, though not trained by backpropagation.
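The two layer types can be sketched directly. The following toy example is a simplified illustration, not the Neocognitron itself: a 2-D convolution with a shared kernel, followed by max-downsampling.

```python
# Convolution (feature detection with shared weights) followed by
# downsampling (here max-pooling, which reduces spatial resolution).
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # the same kernel weights are applied at every position
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def downsample(fmap, size=2):
    H, W = fmap.shape
    H, W = H - H % size, W - W % size        # trim to a multiple of `size`
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))           # max over each size x size block

image = np.random.default_rng(0).normal(size=(8, 8))
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude vertical-edge detector
print(downsample(conv2d(image, edge_kernel)).shape)  # (3, 3)
```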
Backpropagation is an efficient application of the
chain rule derived by
Gottfried Wilhelm Leibniz in 1673 to networks of differentiable nodes. The terminology "back-propagating errors" was actually introduced in 1962 by Rosenblatt, but he did not know how to implement it. The modern form of backpropagation was first published in
Seppo Linnainmaa's master's thesis (1970). In 1982, Paul Werbos applied backpropagation to MLPs in the way that has become standard (his 1974 PhD thesis, reprinted in a 1994 book, did not yet describe the algorithm). In 1986,
David E. Rumelhart et al. popularised backpropagation but did not cite the original work.
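The idea can be made concrete with a two-layer network: run it forward, then apply the chain rule backward through each differentiable node to obtain the weight gradients. A minimal numpy sketch (an illustration, not any historical implementation; sizes and data are invented):

```python
# Backpropagation as repeated use of the chain rule on a tiny MLP.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))               # batch of 4 inputs
t = rng.normal(size=(4, 2))               # targets
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

for step in range(200):
    # forward pass through the differentiable nodes
    h = np.tanh(x @ W1)
    y = h @ W2
    loss = 0.5 * np.sum((y - t) ** 2) / len(x)

    # backward pass: each line is one application of the chain rule
    dy = (y - t) / len(x)                 # dL/dy
    dW2 = h.T @ dy                        # dL/dW2
    dh = dy @ W2.T                        # dL/dh
    dh_pre = dh * (1 - h ** 2)            # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dW1 = x.T @ dh_pre                    # dL/dW1

    W1 -= 0.05 * dW1                      # gradient descent step
    W2 -= 0.05 * dW2
print("final loss:", loss)
```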
=== 1980s–2000s ===
The
time delay neural network (TDNN) was introduced in 1987 by
Alex Waibel to apply CNNs to phoneme recognition. It used convolutions, weight sharing, and backpropagation. In 1988, Wei Zhang applied a backpropagation-trained CNN to alphabet recognition. In 1989,
Yann LeCun et al. created a CNN called
LeNet for
recognizing handwritten ZIP codes on mail. Training required 3 days. In 1990, Wei Zhang implemented a CNN on
optical computing hardware. In 1991, a CNN was applied to medical image object segmentation and breast cancer detection in mammograms. LeNet-5 (1998), a 7-level CNN by Yann LeCun et al. that classifies digits, was applied by several banks to recognize handwritten numbers on checks digitized in 32x32 pixel images.
Recurrent neural networks (RNN) were further developed in the 1980s; a notable example is the
Elman network (1990), which applied RNNs to study problems in
cognitive psychology. In the 1980s, backpropagation did not work well for deep learning with long credit assignment paths. To overcome this problem, in 1991,
Jürgen Schmidhuber proposed a hierarchy of RNNs pre-trained one level at a time by
self-supervised learning where each RNN tries to predict its own next input, which is the next unexpected input of the RNN below. This "neural history compressor" uses
predictive coding to learn
internal representations at multiple self-organizing time scales. This can substantially facilitate downstream deep learning. The RNN hierarchy can be
collapsed into a single RNN by
distilling a higher level
chunker network into a lower level
automatizer network. The "P" in
ChatGPT refers to such pre-training.
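The compression principle can be illustrated with a deliberately simplified toy that replaces the RNNs with trivial bigram predictors (an analogy only, not Schmidhuber's architecture): each level tries to predict its next input, and only the unexpected inputs are passed up, so higher levels see shorter, compressed streams.

```python
# Toy predictive-coding hierarchy: each level forwards only the inputs
# it failed to predict, compressing the stream for the level above.
def compress(stream):
    last_seen = {}        # trivial predictor: symbol -> expected successor
    surprises = []
    prev = None
    for sym in stream:
        if prev is None or last_seen.get(prev) != sym:
            surprises.append(sym)     # prediction failed: pass the symbol up
        if prev is not None:
            last_seen[prev] = sym     # update the predictor
        prev = sym
    return surprises

stream = list("abcabcabcabxabcabc")
level1 = compress(stream)    # mostly early novelties and the 'x' disturbance
level2 = compress(level1)
print(len(stream), len(level1), len(level2))  # e.g. 18 7 7: regularity absorbed below
```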
Sepp Hochreiter's diploma thesis (1991) implemented the neural history compressor and identified and analyzed the vanishing gradient problem. Hochreiter proposed recurrent
residual connections to solve the vanishing gradient problem. This led to the
long short-term memory (LSTM), published in 1995. LSTM can learn "very deep learning" tasks with long credit assignment paths, and it became the standard RNN architecture.
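The vanishing gradient problem, and why additive connections help, can be shown numerically: backpropagating through T time steps multiplies the gradient by one Jacobian per step, which decays exponentially unless an identity path is preserved. A toy sketch (illustrative assumptions throughout; not the LSTM equations themselves):

```python
# Product of per-step Jacobians over 50 steps, with and without an
# additive connection that keeps an identity path in the Jacobian.
import numpy as np

def grad_through_time(w, T, residual):
    h, g = 0.1, 1.0
    for _ in range(T):
        new_h = np.tanh(w * h)
        jac = (1 - new_h ** 2) * w           # d h_t / d h_{t-1} through tanh
        if residual:
            new_h, jac = h + new_h, 1 + jac  # additive skip adds a unit path
        h, g = new_h, g * jac                # accumulate the gradient product
    return g

print(grad_through_time(0.5, 50, residual=False))  # ~1e-15: vanished
print(grad_through_time(0.5, 50, residual=True))   # order 10: gradient survives
```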
In 1991, Jürgen Schmidhuber also published adversarial neural networks that contest with each other in the form of a
zero-sum game, where one network's gain is the other network's loss. The first network is a
generative model that models a
probability distribution over output patterns. The second network learns by
gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity". In 2014, this principle was used in
generative adversarial networks (GANs).
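The zero-sum structure can be sketched with a toy one-dimensional example (an illustration, not the 1991 or 2014 architectures; all parameters are invented): a one-parameter generator shifts noise toward the data, while a logistic discriminator is trained in the opposite direction on the shared objective.

```python
# Toy adversarial zero-sum game: the discriminator ascends the value
# function V, the generator descends the same V, so one player's gain
# is exactly the other's loss.
import numpy as np

rng = np.random.default_rng(0)
mu_real = 3.0                   # the data distribution: N(3, 1)
theta = 0.0                     # generator: x = theta + z, z ~ N(0, 1)
a, b = 1.0, 0.0                 # discriminator: d(x) = sigmoid(a*x + b)
sig = lambda x: 1 / (1 + np.exp(-x))
lr, n = 0.05, 64

for step in range(2000):
    x_real = mu_real + rng.normal(size=n)
    x_fake = theta + rng.normal(size=n)
    p_real, p_fake = sig(a * x_real + b), sig(a * x_fake + b)

    # discriminator ascends V = E[log d(real)] + E[log(1 - d(fake))]
    a += lr * np.mean((1 - p_real) * x_real - p_fake * x_fake)
    b += lr * np.mean((1 - p_real) - p_fake)

    # generator descends the same V: dV/dtheta = E[-d(fake)] * a,
    # so the update moves theta to fool the discriminator
    theta += lr * np.mean(p_fake) * a

print(theta)   # drifts toward mu_real = 3.0
```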
During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski,
Peter Dayan,
Geoffrey Hinton, etc., including the
Boltzmann machine,
restricted Boltzmann machine,
Helmholtz machine, and the
wake-sleep algorithm. These were designed for unsupervised learning of deep generative models. However, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986 (p. 112). A 1988 network became state of the art in
protein structure prediction, an early application of deep learning to bioinformatics. Both shallow and deep learning (e.g., recurrent nets) of ANNs for
speech recognition have been explored for many years. These methods never outperformed non-uniform internal-handcrafting Gaussian
mixture model/
Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. Key difficulties have been analyzed, including gradient diminishing and weak temporal correlation structure in neural predictive models. Additional difficulties were the lack of training data and limited computing power. Most
speech recognition researchers moved away from neural nets to pursue generative modeling. An exception was at
SRI International in the late 1990s. Funded by the US government's
NSA and
DARPA, SRI researched in speech and
speaker recognition. The speaker recognition team led by
Larry Heck reported significant success with deep neural networks in speech processing in the 1998
NIST Speaker Recognition benchmark. It was deployed in the Nuance Verifier, representing the first major industrial application of deep learning. The principle of elevating "raw" features over hand-crafted optimization was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear
filter-bank features in the late 1990s.
=== 2000s ===
Neural networks entered a lull, and simpler models that use task-specific handcrafted features such as
Gabor filters and
support vector machines (SVMs) became the preferred choices in the 1990s and 2000s, because of artificial neural networks' computational cost and a lack of understanding of how the brain wires its biological networks. In 2003, LSTM became competitive with traditional speech recognizers on certain tasks. In 2006,
Alex Graves, Santiago Fernández, Faustino Gomez, and Schmidhuber combined it with
connectionist temporal classification (CTC) in stacks of LSTMs. In 2009, it became the first RNN to win a
pattern recognition contest, in connected
handwriting recognition.
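The core of CTC is a dynamic-programming forward pass that sums the probability of every alignment of the label sequence, padded with blanks, against the network's per-frame outputs. A minimal sketch of that forward computation (an illustration, not Graves et al.'s implementation; the toy inputs are invented):

```python
# CTC forward algorithm: negative log-likelihood of a label sequence
# under per-frame log-probabilities, summed over all valid alignments.
import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    # log_probs: (T, vocab) log-softmax outputs; labels: nonempty id sequence
    T = log_probs.shape[0]
    ext = [blank]                       # extended labels: blanks around symbols
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)    # log forward probabilities
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                  # stay in place
            if s > 0:
                terms.append(alpha[t - 1, s - 1])      # advance one slot
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])      # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    # a valid path ends in the last label or the trailing blank
    return -np.logaddexp(alpha[-1, -1], alpha[-1, -2])

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))        # 6 frames, vocab {blank, 1, 2, 3}
log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
print(ctc_loss(log_probs, [1, 2, 1]))   # negative log-likelihood
```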
In 2006, deep belief networks were developed by Geoffrey Hinton and collaborators for generative modeling. They are trained by training one restricted Boltzmann machine, then freezing it and training another one on top of the first one, and so on, then optionally fine-tuned using supervised backpropagation.
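The greedy layer-wise scheme can be sketched as follows (a simplified illustration with biases omitted, not Hinton's exact procedure): each restricted Boltzmann machine is trained with one step of contrastive divergence (CD-1), frozen, and the next machine is trained on its hidden activities.

```python
# Greedy layer-wise pre-training of stacked RBMs with CD-1.
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1 / (1 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    W = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        ph0 = sig(v0 @ W)                           # hidden probabilities
        h0 = (rng.random(ph0.shape) < ph0) * 1.0    # sample hidden states
        v1 = sig(h0 @ W.T)                          # reconstruction
        ph1 = sig(v1 @ W)
        # CD-1 update: data statistics minus reconstruction statistics
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
    return W

data = (rng.random((100, 20)) < 0.3) * 1.0   # toy binary data
W1 = train_rbm(data, n_hidden=10)            # train the first RBM
h1 = sig(data @ W1)                          # freeze it; propagate data up
W2 = train_rbm(h1, n_hidden=5)               # train a second RBM on top
# (the stack could now be fine-tuned with supervised backpropagation)
```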
The impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US, according to Yann LeCun. Industrial applications of deep learning to large-scale speech recognition started around 2010. The 2009 NIPS Workshop on Deep Learning for Speech Recognition was motivated by the limitations of deep generative models of speech, and the possibility that, given more capable hardware and large-scale data sets, deep neural nets might become practical. It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets. However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation, when using DNNs with large, context-dependent output layers, produced error rates dramatically lower than the then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) systems and also than more advanced generative model-based systems. The nature of the recognition errors produced by the two types of systems was characteristically different. Analysis around 2009–2010, contrasting the GMM (and other generative speech models) with DNN models, stimulated early industrial investment in deep learning for speech recognition. In 2010, researchers extended deep learning from
TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by
decision trees. Although CNNs trained by backpropagation had been around for decades, faster implementations of CNNs on GPUs were needed to progress in computer vision. Later, as deep learning became widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning. A key advance for the deep learning revolution was hardware, especially GPUs; some early work dated back to 2004. In 2011, a CNN named
DanNet by Dan Ciresan, Ueli Meier, Jonathan Masci,
Luca Maria Gambardella, and
Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3. They also showed how
max-pooling CNNs on GPU improved performance significantly. In October 2012,
AlexNet by
Alex Krizhevsky,
Ilya Sutskever, and
Geoffrey Hinton won the large-scale ImageNet competition by a significant margin over shallow methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman and Google's
Inceptionv3. The success in image classification was then extended to the more challenging task of
generating descriptions (captions) for images, often as a combination of CNNs and LSTMs. In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers. Stacking too many layers led to a steep reduction in
training accuracy, known as the "degradation" problem. In 2015, two techniques were developed to train very deep networks: the highway network was published in May 2015, and the residual neural network (ResNet) in December 2015; ResNet behaves like an open-gated highway network.
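Both remedies amount to giving each layer an identity path. A minimal sketch (illustrative parameter shapes assumed, not the published architectures): a highway layer gates between the transformed and the untouched input, while a residual layer simply adds the transformation to its input.

```python
# Highway and residual layers: two ways to preserve an identity path.
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1 / (1 + np.exp(-x))
d = 8
Wh, Wt = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def highway_layer(x):
    h = np.tanh(x @ Wh)           # candidate transformation
    t = sig(x @ Wt - 2.0)         # transform gate, biased toward carry at first
    return t * h + (1 - t) * x    # gated mix of transform and identity

def residual_layer(x):
    return x + np.tanh(x @ Wh)    # identity shortcut plus transformation

x = rng.normal(size=(1, d))
print(np.linalg.norm(highway_layer(x)), np.linalg.norm(residual_layer(x)))
for _ in range(30):               # stacking many residual layers stays well-behaved
    x = residual_layer(x)
print(np.linalg.norm(x))
```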
Around the same time, deep learning started impacting the field of art; two major examples, neural style transfer and DeepDream, were based on pretrained image classification neural networks such as VGG-19.
The generative adversarial network (GAN) by Ian Goodfellow et al. (2014), based on Jürgen Schmidhuber's principle of artificial curiosity, became a state-of-the-art method in generative modeling. Excellent image quality was achieved by Nvidia's StyleGAN (2018), based on the Progressive GAN by Tero Karras et al., in which the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GAN reached popular success, and provoked discussions concerning
deepfakes.
Diffusion models (2015) have since eclipsed GANs in generative modeling, with systems such as
DALL·E 2 (2022) and
Stable Diffusion (2022). In 2015, Google's speech recognition improved by 49% with an LSTM-based model, which the company made available through
Google Voice Search on
smartphones. Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and
automatic speech recognition (ASR). Results on commonly used evaluation sets such as
TIMIT (ASR) and
MNIST (
image classification), as well as a range of large-vocabulary speech recognition tasks have steadily improved. Convolutional neural networks were superseded for ASR by
LSTM, but are more successful in computer vision.
Yoshua Bengio,
Geoffrey Hinton and
Yann LeCun were awarded the 2018
Turing Award for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing". == Neural networks ==