File:Bias-variance decomposition.png|thumb|Bias–variance decomposition in the case of mean squared loss. The green dots are samples of the test label y at a fixed test feature x. Their variance around the mean \mathbb{E}_{y \sim p(\cdot | x)}[y] is the irreducible error \sigma^2. The red dots are test label predictions \hat{f}\!(x;D) as the training set D is randomly sampled. Their variance around the mean \mathbb{E}_D[\hat{f}\!(x;D)] is the variance \operatorname{Var}_D\big[\hat{f}\!(x;D)\big]. The difference between the red dash and the green dash is the bias \operatorname{Bias}_D\big[\hat{f}\!(x;D)\big]. The bias–variance decomposition is then visually clear: the mean squared error between the red dots and the green dots is the sum of the three components.

Suppose that we have a training set consisting of a set of points x_1, \dots, x_n and real-valued labels y_i associated with the points x_i. We assume that the data is generated by a function f(x) such that y = f(x) + \varepsilon, where the noise \varepsilon has zero mean and variance \sigma^2. That is, y_i = f(x_i) + \varepsilon_i, where \varepsilon_i is a noise sample. We want to find a function \hat{f}\!(x;D) that approximates the true function f(x) as well as possible, by means of some learning algorithm based on a training dataset (sample) D=\{(x_1,y_1), \dots, (x_n, y_n)\}. We make "as well as possible" precise by measuring the
mean squared error between y and \hat{f}\!(x;D): we want (y - \hat{f}\!(x;D))^2 to be minimal, both for x_1, \dots, x_n
and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the y_i contain noise \varepsilon; this means we must be prepared to accept an
irreducible error in any function we come up with. Finding an \hat{f} that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning.
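As a concrete illustration of this setup, the following sketch (in Python) draws one such training set D and fits an estimator \hat{f}\!(x;D); the particular true function, noise level, sample size, and learning algorithm (polynomial least squares) are arbitrary choices made for the example, not part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not prescribed by the text): a true function f and a noise level sigma.
def f_true(x):
    return np.sin(2 * np.pi * x)

sigma = 0.3   # standard deviation of the noise epsilon
n = 20        # training-set size

def sample_training_set():
    """Draw D = {(x_1, y_1), ..., (x_n, y_n)} with y_i = f(x_i) + eps_i."""
    xs = rng.uniform(0.0, 1.0, size=n)
    ys = f_true(xs) + rng.normal(0.0, sigma, size=n)
    return xs, ys

def fit(D, degree=3):
    """A stand-in learning algorithm: least-squares polynomial fit."""
    xs, ys = D
    return np.polynomial.Polynomial.fit(xs, ys, deg=degree)

D = sample_training_set()
f_hat = fit(D)                    # this is  f_hat(x; D)
print(f_hat(0.5), f_true(0.5))    # prediction vs. noiseless truth at one test point
```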
It turns out that whichever function \hat{f} we select, we can decompose its expected error on an unseen sample x (i.e. conditional on x) as follows:

\mathbb{E}_{D, \varepsilon} \Big[\big(y - \hat{f}\!(x;D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}\!(x;D)\big]\Big)^2 + \operatorname{Var}_D\big[\hat{f}\!(x;D)\big] + \sigma^2

where

\begin{align}
\operatorname{Bias}_D\big[\hat{f}\!(x;D)\big] &\triangleq \mathbb{E}_D\big[\hat{f}\!(x;D) - f(x)\big] \\
&= \mathbb{E}_D\big[\hat{f}\!(x;D)\big] \, - \, f(x) \\
&= \mathbb{E}_D\big[\hat{f}\!(x;D)\big] \, - \, \mathbb{E}_{y|x}\big[y\big]
\end{align}

and

\operatorname{Var}_D\big[\hat{f}\!(x;D)\big] \triangleq \mathbb{E}_D\!\left[ \left( \mathbb{E}_D[\hat{f}\!(x;D)] - \hat{f}\!(x;D) \right)^2 \right]

and

\sigma^2 = \mathbb{E}_y \Big[ \big( y - \underbrace{f(x)}_{\mathbb{E}_{y|x}[y]} \big)^2 \Big].

The expectation ranges over different choices of the training set D=\{(x_1,y_1), \dots, (x_n, y_n)\}, all sampled from the same joint distribution P(x,y); such resampling can be carried out, for example, via bootstrapping. The three terms represent:

* the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. For example, when approximating a non-linear function f(x) using a learning method for linear models, there will be error in the estimates \hat{f}\!(x) due to this assumption;
* the variance of the learning method, or, intuitively, how much the predictions \hat{f}\!(x) move around their mean as the training set varies;
* the irreducible error \sigma^2.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.
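The decomposition can also be checked numerically. The self-contained Monte Carlo sketch below (the true function, noise level, sample size, test point, and learning algorithm are again arbitrary illustrative choices) repeatedly redraws the training set D, estimates \operatorname{Bias}_D, \operatorname{Var}_D and \sigma^2 at a fixed x, and compares their sum with a direct estimate of the expected squared error \mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}\!(x;D))^2\big].

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.sin(2 * np.pi * x)   # assumed true function (illustrative)
sigma, n, n_trials = 0.3, 20, 5000         # noise level, training-set size, Monte Carlo replications
x0 = 0.5                                   # fixed test feature x

preds = np.empty(n_trials)                 # f_hat(x0; D) for each resampled training set D
sq_err = np.empty(n_trials)                # (y - f_hat(x0; D))^2 with fresh test noise
for t in range(n_trials):
    xs = rng.uniform(0.0, 1.0, size=n)     # draw a fresh training set D
    ys = f_true(xs) + rng.normal(0.0, sigma, size=n)
    f_hat = np.polynomial.Polynomial.fit(xs, ys, deg=3)
    preds[t] = f_hat(x0)
    y0 = f_true(x0) + rng.normal(0.0, sigma)   # noisy test label at x0
    sq_err[t] = (y0 - preds[t]) ** 2

bias2 = (preds.mean() - f_true(x0)) ** 2   # squared bias  Bias_D[f_hat(x0; D)]^2
var = preds.var()                          # variance      Var_D[f_hat(x0; D)]
print(bias2 + var + sigma**2)              # bias^2 + variance + irreducible error
print(sq_err.mean())                       # direct Monte Carlo estimate of the expected squared error
```

Up to Monte Carlo error, the two printed values agree; the derivation that follows shows why this holds in general.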
For convenience, we drop the D subscript in the following lines, such that \hat{f}\!(x;D) = \hat{f}\!(x). Let us write the mean squared error of our model:

\begin{align}
\text{MSE}(x) &\triangleq \mathbb{E}\Big[\big(y - \hat{f}\!(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) + \varepsilon - \hat{f}\!(x)\big)^2\Big] && \text{since } y \triangleq f(x) + \varepsilon \\
&= \mathbb{E}\Big[\big(f(x) - \hat{f}\!(x)\big)^2\Big] \, + \, 2\,\mathbb{E}\Big[\big(f(x) - \hat{f}\!(x)\big)\varepsilon\Big] \, + \, \mathbb{E}[\varepsilon^2]
\end{align}

We can show that the second term of this equation is zero:

\begin{align}
\mathbb{E}\Big[\big(f(x) - \hat{f}\!(x)\big)\varepsilon\Big] &= \mathbb{E}\big[f(x) - \hat{f}\!(x)\big]\ \mathbb{E}\big[\varepsilon\big] && \text{since } \varepsilon \text{ is independent of } D \text{, and hence of } \hat{f}\!(x) \\
&= 0 && \text{since } \mathbb{E}\big[\varepsilon\big] = 0
\end{align}

Moreover, the third term of this equation is nothing but \sigma^2, the variance of \varepsilon. Let us now expand the remaining term:

\begin{align}
\mathbb{E}\left[\left(f(x) - \hat{f}\!(x)\right)^2\right]
&= \mathbb{E}\left[\left(f(x) - \mathbb{E}[\hat{f}\!(x)] + \mathbb{E}[\hat{f}\!(x)] - \hat{f}\!(x)\right)^2\right] \\
&= {\color{Blue} \mathbb{E}\left[\left(f(x) - \mathbb{E}[\hat{f}\!(x)]\right)^2\right]} \, + \, \mathbb{E}\left[\left(\mathbb{E}[\hat{f}\!(x)] - \hat{f}\!(x)\right)^2\right] \\
&\quad\, + \, 2\,{\color{PineGreen} \mathbb{E}\left[\left(f(x) - \mathbb{E}[\hat{f}\!(x)]\right)\left(\mathbb{E}[\hat{f}\!(x)] - \hat{f}\!(x)\right)\right]}
\end{align}

We show that:

\begin{align}
{\color{Blue} \mathbb{E}\Big[\big(f(x) - \mathbb{E}\big[\hat{f}\!(x)\big]\big)^2\Big]}
&= \mathbb{E}\big[f(x)^2\big] \, - \, 2\,\mathbb{E}\Big[f(x)\,\mathbb{E}\big[\hat{f}\!(x)\big]\Big] \, + \, \mathbb{E}\Big[\mathbb{E}\big[\hat{f}\!(x)\big]^2\Big] \\
&= f(x)^2 \, - \, 2\,f(x)\,\mathbb{E}\big[\hat{f}\!(x)\big] \, + \, \mathbb{E}\big[\hat{f}\!(x)\big]^2 \\
&= \Big(f(x) - \mathbb{E}\big[\hat{f}\!(x)\big]\Big)^2
\end{align}

This last series of equalities comes from the fact that f(x) is not a random variable, but a fixed, deterministic function of x. Therefore, \mathbb{E}\left[f(x)\right] = f(x). Similarly, \mathbb{E}\left[f(x)^2\right] = f(x)^2, and \mathbb{E}\left[f(x)\,\mathbb{E}[\hat{f}\!(x)]\right] = f(x)\,\mathbb{E}\left[\mathbb{E}[\hat{f}\!(x)]\right] = f(x)\,\mathbb{E}[\hat{f}\!(x)].

Using the same reasoning, we can expand the cross term (in green) and show that it is zero:

\begin{align}
{\color{PineGreen} \mathbb{E}\left[\left(f(x) - \mathbb{E}[\hat{f}\!(x)]\right)\left(\mathbb{E}[\hat{f}\!(x)] - \hat{f}\!(x)\right)\right]}
&= \mathbb{E}\left[f(x)\,\mathbb{E}[\hat{f}\!(x)] \, - \, f(x)\,\hat{f}\!(x) \, - \, \mathbb{E}[\hat{f}\!(x)]^2 \, + \, \mathbb{E}[\hat{f}\!(x)]\,\hat{f}\!(x)\right] \\
&= f(x)\,\mathbb{E}[\hat{f}\!(x)] \, - \, f(x)\,\mathbb{E}[\hat{f}\!(x)] \, - \, \mathbb{E}[\hat{f}\!(x)]^2 \, + \, \mathbb{E}[\hat{f}\!(x)]^2 \\
&= 0
\end{align}

Finally, we plug these derivations back into the original equation and identify each term:

\begin{align}
\text{MSE}(x) &= \left(f(x) - \mathbb{E}[\hat{f}\!(x)]\right)^2 + \mathbb{E}\left[\left(\mathbb{E}[\hat{f}\!(x)] - \hat{f}\!(x)\right)^2\right] + \sigma^2 \\
&= \operatorname{Bias}\big[\hat{f}\!(x)\big]^2 \, + \, \operatorname{Var}\big[\hat{f}\!(x)\big] \, + \, \sigma^2
\end{align}

The MSE loss function (or, for Gaussian noise, the negative log-likelihood up to constants) is then obtained by taking the expectation over x \sim P:

\text{MSE} = \mathbb{E}_x\left[\text{MSE}(x)\right] = \mathbb{E}_x\left\{\operatorname{Bias}_D\!\left[\hat{f}\!(x;D)\right]^2 + \operatorname{Var}_D\left[\hat{f}\!(x;D)\right]\right\} + \sigma^2.
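This last formula averages the pointwise decomposition over the test distribution. The sketch below (same illustrative assumptions as above, with an arbitrary grid of test points standing in for P(x) and a few candidate polynomial degrees) approximates \mathbb{E}_x[\operatorname{Bias}^2] and \mathbb{E}_x[\operatorname{Var}] and reports their sum plus \sigma^2 for each model complexity.

```python
import numpy as np

rng = np.random.default_rng(2)
f_true = lambda x: np.sin(2 * np.pi * x)   # assumed true function (illustrative)
sigma, n, n_trials = 0.3, 20, 2000         # noise level, training-set size, Monte Carlo replications
x_grid = np.linspace(0.0, 1.0, 50)         # test points approximating the expectation over x ~ P

for degree in (1, 3, 9):                   # hypothetical model complexities
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        xs = rng.uniform(0.0, 1.0, size=n)                 # fresh training set D
        ys = f_true(xs) + rng.normal(0.0, sigma, size=n)
        preds[t] = np.polynomial.Polynomial.fit(xs, ys, deg=degree)(x_grid)
    bias2 = ((preds.mean(axis=0) - f_true(x_grid)) ** 2).mean()   # E_x[Bias^2]
    var = preds.var(axis=0).mean()                                # E_x[Var]
    print(degree, bias2, var, bias2 + var + sigma**2)             # last column is the expected MSE
```

In such an experiment the bias term typically dominates for the most rigid model and the variance term for the most flexible one, which is the trade-off that the approaches below aim to manage.

==Approaches==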