===Precursors===
Backpropagation has been derived repeatedly, as it is essentially an efficient application of the chain rule (first written down by Gottfried Wilhelm Leibniz in 1676) to neural networks. The terminology "back-propagating error correction" was introduced in 1962 by Frank Rosenblatt, but he did not know how to implement it. In any case, he studied only neurons whose outputs were discrete levels, which have zero derivatives, making backpropagation impossible.
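As a schematic illustration of this chain-rule view (a sketch for exposition, not taken from any of the historical works cited here), consider a two-layer network computing \(\hat{y} = f(g(x; w_1); w_2)\) with loss \(L(\hat{y}, y)\). The chain rule gives

\[
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f}{\partial w_2},
\qquad
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f}{\partial g}\,\frac{\partial g}{\partial w_1}.
\]

Backpropagation evaluates these factors from the output backwards, reusing the shared term \(\partial L / \partial \hat{y}\); if the neuron outputs are discrete step functions, the local derivatives are zero almost everywhere, so the products vanish and no usable gradient exists.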
Precursors to backpropagation appeared in optimal control theory since the 1950s.
Yann LeCun et al. credit 1950s work by
Pontryagin and others in optimal control theory, especially the
adjoint state method, for being a continuous-time version of backpropagation.
Hecht-Nielsen credits the
Robbins–Monro algorithm (1951) and
Arthur Bryson and
Yu-Chi Ho's
Applied Optimal Control (1969) as presages of backpropagation. Other precursors were
Henry J. Kelley (1960) and
Arthur E. Bryson (1961). In 1962,
Stuart Dreyfus published a simpler derivation based only on the
chain rule. In 1973, he adapted
parameters of controllers in proportion to error gradients. Unlike modern backpropagation, these precursors used standard Jacobian matrix calculations from one stage to the previous one, neither addressing direct links across several stages nor potential additional efficiency gains due to network sparsity. The
ADALINE (1960) learning algorithm was gradient descent with a squared error loss for a single layer.
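In a standard formulation (the notation here is illustrative rather than taken from the original 1960 paper), ADALINE computes a linear output \(\hat{y} = w^{\top} x\), and gradient descent on the squared error \(E = \tfrac{1}{2}(y - \hat{y})^2\) with learning rate \(\eta\) yields the Widrow–Hoff update

\[
\Delta w = -\eta\,\frac{\partial E}{\partial w} = \eta\,(y - w^{\top} x)\,x.
\]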
The first multilayer perceptron (MLP) with more than one layer trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. The MLP had 5 layers, with 2 learnable layers, and it learned to classify patterns not linearly separable. Modern backpropagation was first published by Seppo Linnainmaa as the reverse mode of automatic differentiation (1970) for discrete connected networks of nested differentiable functions. In 1982,
Paul Werbos applied backpropagation to MLPs in the way that has become standard. Werbos described how he developed backpropagation in an interview. In 1971, during his PhD work, he developed backpropagation to mathematicize
Freud's "flow of psychic energy". He faced repeated difficulty in publishing the work, only managing in 1981. He also claimed that "the first practical application of back-propagation was for estimating a dynamic model to predict nationalism and social communications in 1974" by him. Around 1982, backpropagation and taught the algorithm to others in his research circle. He did not cite previous work as he was unaware of them. He published the algorithm first in a 1985 paper, then in a 1986
Nature paper an experimental analysis of the technique. These papers became highly cited, contributed to the popularization of backpropagation, and coincided with the resurging research interest in neural networks during the 1980s. In 1985, the method was also described by David Parker.
Yann LeCun proposed an alternative form of backpropagation for neural networks in his PhD thesis in 1987. Gradient descent took a considerable amount of time to reach acceptance. Some early objections were: there were no guarantees that gradient descent could reach a global minimum rather than only a local minimum; and neurons were "known" by physiologists to emit discrete signals (0/1), not continuous ones, and with discrete signals there is no gradient to take. See the interview with Geoffrey Hinton.
===Early successes===
Contributing to the acceptance were several applications in training neural networks via backpropagation, sometimes achieving popularity outside research circles. In 1987,
NETtalk learned to convert English text into pronunciation. Sejnowski tried training it with both backpropagation and the Boltzmann machine, but found backpropagation significantly faster, so he used it for the final NETtalk. In 1989, Dean A. Pomerleau published ALVINN, a neural network trained to
drive autonomously using backpropagation. The
LeNet was published in 1989 to recognize handwritten zip codes. In 1992,
TD-Gammon achieved top human level play in backgammon. It was a reinforcement learning agent with a two-layer neural network, trained by backpropagation. In 1993, Eric Wan won an international pattern recognition contest through backpropagation.
===After backpropagation===
During the 2000s backpropagation fell out of favour, but returned in the 2010s, benefiting from cheap, powerful
GPU-based computing systems. This has been especially so in
speech recognition,
machine vision,
natural language processing, and language structure learning research (in which it has been used to explain a variety of phenomena related to first and second language learning). Error backpropagation has been suggested to explain human brain
event-related potential (ERP) components like the
N400 and
P600. In 2023, a backpropagation algorithm was implemented on a
photonic processor by a team at
Stanford University.

==See also==