Cross-entropy can be used to define a loss function in
machine learning and
optimization. Mao, Mohri, and Zhong (2023) give an extensive analysis of the properties of the family of cross-entropy loss functions in machine learning, including theoretical learning guarantees and extensions to
adversarial learning. When cross-entropy is used as a loss function, the true probability p_i is the true label, and the given distribution q_i is the prediction of the current model. This is also known as the
log loss (or
logarithmic loss or
logistic loss); the terms "log loss" and "cross-entropy loss" are used interchangeably. More specifically, consider a
binary regression model which can be used to classify observations into two possible classes (often simply labelled 0 and 1). The output of the model for a given observation, given a vector of input features x, can be interpreted as a probability, which serves as the basis for classifying the observation. In
logistic regression, the probability is modeled using the
logistic function g(z) = 1/(1+e^{-z}), where z is some function of the input vector x, commonly just a linear function. The probability of the output y=1 is given by q_{y=1} = \hat{y} \equiv g(\mathbf{w}\cdot\mathbf{x}) = \frac{1}{1+e^{-\mathbf{w}\cdot\mathbf{x}}}, where the vector of weights \mathbf{w} is optimized through some appropriate algorithm such as gradient descent.
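As an illustration, the following is a minimal Python sketch of this probability model; the weight and feature vectors below are hypothetical example values, not taken from the text.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """Predicted probability q_{y=1} = y_hat = g(w . x) for one observation."""
    return logistic(np.dot(w, x))

# Hypothetical weights and input features:
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, 0.5])

y_hat = predict_proba(w, x)    # probability that y = 1
print(y_hat, 1.0 - y_hat)      # q_{y=1} and q_{y=0}
```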
Similarly, the complementary probability of finding the output y=0 is simply given by q_{y=0} = 1-\hat{y}. Having set up our notation, p\in\{y,1-y\} and q\in\{\hat{y},1-\hat{y}\}, we can use cross-entropy to get a measure of dissimilarity between p and q:
\begin{align} H(p,q) &= -\sum_m p_m \log q_m = -y\log\hat{y} - (1-y) \log(1-\hat{y}). \end{align}
Logistic regression typically optimizes the log loss for all the observations on which it is trained, which is the same as optimizing the average cross-entropy in the sample. Other loss functions that penalize errors differently can also be used for training, resulting in models with different final test accuracy.
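A small Python sketch of this loss follows, both for individual observations and averaged over a set of observations as in the objective J(\mathbf{w}) given below; the labels and predicted probabilities are made-up values for illustration only.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Log loss H(p, q) = -y*log(y_hat) - (1 - y)*log(1 - y_hat).

    y is the true label (0 or 1) and y_hat the predicted probability of y = 1.
    A small eps guards against log(0) for predictions at exactly 0 or 1.
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical labels and predicted probabilities for N = 4 observations:
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.2, 0.6, 0.4])

per_sample_loss = binary_cross_entropy(y, y_hat)  # one loss value per observation
average_loss = per_sample_loss.mean()             # average cross-entropy over the sample
print(per_sample_loss, average_loss)
```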
For example, suppose we have N samples with each sample indexed by i=1,\dots,N. The average of the loss function is then given by
\begin{align} J(\mathbf{w}) &= \frac{1}{N} \sum_{i=1}^N H(p_i,q_i) \\ &= -\frac{1}{N} \sum_{i=1}^N \left[y_i \log \hat y_i + (1 - y_i) \log (1 - \hat y_i)\right], \end{align}
where \hat{y}_i \equiv g(\mathbf{w}\cdot\mathbf{x}_i) = 1/(1+e^{-\mathbf{w}\cdot\mathbf{x}_i}), with g(z) the logistic function as before.

== Relation to linear regression ==