===Precursors===
Backpropagation has been derived repeatedly, as it is essentially an efficient application of the chain rule (first written down by Gottfried Wilhelm Leibniz in 1676) to neural networks. The terminology "back-propagating error correction" was introduced in 1962 by Frank Rosenblatt, but he did not know how to implement it. In any case, he studied only neurons whose outputs were discrete levels, which have zero derivatives, making backpropagation impossible.
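As a schematic illustration of this chain-rule view (a sketch for exposition, not taken from any of the historical works cited here), consider a two-layer network computing \(\hat{y} = f(g(x; w_1); w_2)\) with loss \(L(\hat{y}, y)\). The chain rule gives

\[
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f}{\partial w_2},
\qquad
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f}{\partial g}\,\frac{\partial g}{\partial w_1}.
\]

Backpropagation evaluates these factors from the output backwards, reusing the shared term \(\partial L / \partial \hat{y}\); if the neuron outputs are discrete step functions, the local derivatives are zero almost everywhere, so the products vanish and no usable gradient exists.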
Precursors to backpropagation appeared in optimal control theory since the 1950s.
Yann LeCun et al. credit 1950s work by
Pontryagin and others in optimal control theory, especially the
adjoint state method, for being a continuous-time version of backpropagation.
Hecht-Nielsen credits the
Robbins–Monro algorithm (1951) and
Arthur Bryson and
Yu-Chi Ho's
Applied Optimal Control (1969) as presages of backpropagation. Other precursors were
Henry J. Kelley (1960) and
Arthur E. Bryson (1961). In 1962,
Stuart Dreyfus published a simpler derivation based only on the
chain rule. In 1973, he adapted
parameters of controllers in proportion to error gradients. Unlike modern backpropagation, these precursors used standard Jacobian matrix calculations from one stage to the previous one, neither addressing direct links across several stages nor potential additional efficiency gains due to network sparsity. The
ADALINE (1960) learning algorithm was gradient descent with a squared error loss for a single layer.
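In a standard formulation (the notation here is illustrative rather than taken from the original 1960 paper), ADALINE computes a linear output \(\hat{y} = w^{\top} x\), and gradient descent on the squared error \(E = \tfrac{1}{2}(y - \hat{y})^2\) with learning rate \(\eta\) yields the Widrow–Hoff update

\[
\Delta w = -\eta\,\frac{\partial E}{\partial w} = \eta\,(y - w^{\top} x)\,x.
\]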
The first multilayer perceptron (MLP) with more than one layer trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. The MLP had 5 layers, with 2 learnable layers, and it learned to classify patterns not linearly separable. Modern backpropagation was first published by Seppo Linnainmaa as the reverse mode of automatic differentiation (1970) for discrete connected networks of nested differentiable functions. In 1982,
Paul Werbos applied backpropagation to MLPs in the way that has become standard. Werbos described how he developed backpropagation in an interview. In 1971, during his PhD work, he developed backpropagation to mathematicize
Freud's "flow of psychic energy". He faced repeated difficulty in publishing the work, only managing in 1981. He also claimed that "the first practical application of back-propagation was for estimating a dynamic model to predict nationalism and social communications in 1974" by him. Around 1982, backpropagation and taught the algorithm to others in his research circle. He did not cite previous work as he was unaware of them. He published the algorithm first in a 1985 paper, then in a 1986
Nature paper an experimental analysis of the technique. These papers became highly cited, contributed to the popularization of backpropagation, and coincided with the resurging research interest in neural networks during the 1980s. In 1985, the method was also described by David Parker.
Yann LeCun proposed an alternative form of backpropagation for neural networks in his PhD thesis in 1987. Gradient descent took a considerable amount of time to reach acceptance. Some early objections were: there were no guarantees that gradient descent could reach a global minimum rather than only a local minimum; and neurons were "known" by physiologists to emit discrete signals (0/1), not continuous ones, and with discrete signals there is no gradient to take. See the interview with Geoffrey Hinton.
===Early successes===
Contributing to the acceptance were several applications in training neural networks via backpropagation, sometimes achieving popularity outside research circles. In 1987,
NETtalk learned to convert English text into pronunciation. Sejnowski tried training it with both backpropagation and the Boltzmann machine, but found backpropagation significantly faster, so he used it for the final NETtalk. In 1989, Dean A. Pomerleau published ALVINN, a neural network trained to
drive autonomously using backpropagation. The
LeNet was published in 1989 to recognize handwritten zip codes. In 1992,
TD-Gammon achieved top human level play in backgammon. It was a reinforcement learning agent with a two-layer neural network, trained by backpropagation. In 1993, Eric Wan won an international pattern recognition contest through backpropagation.
===After backpropagation===
During the 2000s backpropagation fell out of favour, but returned in the 2010s, benefiting from cheap, powerful
GPU-based computing systems. This has been especially so in
speech recognition,
machine vision,
natural language processing, and language structure learning research (in which it has been used to explain a variety of phenomena related to first and second language learning). Error backpropagation has been suggested to explain human brain
event-related potential (ERP) components like the
N400 and
P600. In 2023, a backpropagation algorithm was implemented on a
photonic processor by a team at
Stanford University.

==See also==