
Law of total variance

The law of total variance is a fundamental result in probability theory that expresses the variance of a random variable Y in terms of its conditional variances and conditional means given another random variable X. Informally, it states that the overall variability of Y can be split into an “unexplained” component and an “explained” component.

Explanation
Let Y be a random variable and X another random variable on the same probability space. The law of total variance states that

\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}(Y \mid X)\right] + \operatorname{Var}\left(\operatorname{E}[Y \mid X]\right).

It can be understood by noting:
• \operatorname{Var}(Y \mid X) measures how much Y varies around its conditional mean \operatorname{E}[Y\mid X].
• Taking the expectation of this conditional variance across all values of X gives \operatorname{E}[\operatorname{Var}(Y \mid X)], often termed the “unexplained” or within-group part.
• The variance of the conditional mean, \operatorname{Var}(\operatorname{E}[Y\mid X]), measures how much these conditional means differ across values of X (the “explained” or between-group part).
Adding these components yields the total variance \operatorname{Var}(Y), mirroring how analysis of variance partitions variation.
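The within-group/between-group split can be checked numerically on a small grouped dataset; the data values below are illustrative assumptions, not from the text:

```python
# Illustrative data: X is a group label, Y a numeric outcome.
data = [("a", 1.0), ("a", 3.0), ("b", 6.0), ("b", 10.0)]

ys = [y for _, y in data]
n = len(ys)

def pvar(vals):
    """Population variance (divide by len, matching Var above)."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

total_var = pvar(ys)

# Group the observations by X.
groups = {}
for x, y in data:
    groups.setdefault(x, []).append(y)

# E[Var(Y|X)]: within-group variances weighted by group probability.
within = sum(len(g) / n * pvar(g) for g in groups.values())

# Var(E[Y|X]): spread of the group means around the grand mean,
# weighted by group probability.
grand_mean = sum(ys) / n
between = sum(
    len(g) / n * (sum(g) / len(g) - grand_mean) ** 2
    for g in groups.values()
)

# The two parts add up to the total variance.
assert abs(total_var - (within + between)) < 1e-12
```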
Examples
Example 1 (Exam scores)
Suppose five students take an exam scored 0–100. Let Y be the student’s score and let X indicate whether the student is *international* or *domestic*:
• Mean and variance for international: \operatorname{E}[Y\mid X=\text{Intl}] = 50,\; \operatorname{Var}(Y\mid X=\text{Intl}) \approx 1266.7.
• Mean and variance for domestic: \operatorname{E}[Y\mid X=\text{Dom}] = 50,\; \operatorname{Var}(Y\mid X=\text{Dom}) = 100.
Both groups share the same mean (50), so the explained variance \operatorname{Var}(\operatorname{E}[Y\mid X]) is 0, and the total variance equals the average of the within-group variances (weighted by group size), i.e. 800.

Example 2 (Mixture of two Gaussians)
Let X be a coin flip taking the value Heads with probability h and Tails with probability 1 - h. Given Heads, Y \sim \mathrm{Normal}(\mu_h,\sigma_h^2); given Tails, Y \sim \mathrm{Normal}(\mu_t,\sigma_t^2). Then
\operatorname{E}[\operatorname{Var}(Y\mid X)] = h\,\sigma_h^2 + (1 - h)\,\sigma_t^2,
\operatorname{Var}(\operatorname{E}[Y\mid X]) = h\,(1 - h)\,(\mu_h - \mu_t)^2,
so
\operatorname{Var}(Y) = h\,\sigma_h^2 + (1 - h)\,\sigma_t^2 \;+\; h\,(1 - h)\,(\mu_h-\mu_t)^2.

Example 3 (Dice and coins)
Consider a two-stage experiment:
• Roll a fair die (values 1–6) to choose one of six biased coins.
• Flip the chosen coin; let Y = 1 if Heads, 0 if Tails.
Then \operatorname{E}[Y\mid X=i] = p_i and \operatorname{Var}(Y\mid X=i)=p_i(1-p_i). The overall variance of Y becomes
\operatorname{Var}(Y) = \operatorname{E}\bigl[p_X(1 - p_X)\bigr] + \operatorname{Var}\bigl(p_X\bigr),
with p_X uniform on \{p_1,\dots,p_6\}.
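The mixture identity of Example 2 can be verified exactly from the mixture’s first two moments; the parameter values below are illustrative assumptions:

```python
# Hypothetical mixture parameters (not from the text).
h = 0.3                      # P(Heads)
mu_h, sig2_h = 2.0, 1.5      # mean / variance of Y given Heads
mu_t, sig2_t = -1.0, 0.5     # mean / variance of Y given Tails

# Mixture moments: E[Y] and E[Y^2] combine the components' moments,
# using E[Y^2 | component] = variance + mean^2.
mean = h * mu_h + (1 - h) * mu_t
second = h * (sig2_h + mu_h**2) + (1 - h) * (sig2_t + mu_t**2)
var_y = second - mean**2

# The two parts of the law of total variance:
within = h * sig2_h + (1 - h) * sig2_t         # E[Var(Y|X)]
between = h * (1 - h) * (mu_h - mu_t) ** 2     # Var(E[Y|X])

assert abs(var_y - (within + between)) < 1e-12
```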
Proof
Discrete/finite proof
Let (X_i,Y_i), i=1,\ldots,n, be observed pairs. Define \overline{Y} = \operatorname{E}[Y]. Then
\begin{align} \operatorname{Var}(Y) &= \frac{1}{n} \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2 \\[1ex] &= \frac{1}{n} \sum_{i=1}^n \left[\left(Y_i - \overline{Y}_{\!\!X_i}\right) + \left(\overline{Y}_{\!\!X_i} - \overline{Y}\right)\right]^2, \end{align}
where \overline{Y}_{X_i}=\operatorname{E}[Y\mid X=X_i]. Expanding the square and noting that the cross term cancels in summation yields
\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}(Y\mid X)\right] + \operatorname{Var}\left(\operatorname{E}[Y\mid X]\right).

General case
Using \operatorname{Var}(Y) = \operatorname{E}[Y^2] - \operatorname{E}[Y]^2 and the law of total expectation:
\begin{align} \operatorname{E}[Y^2] &= \operatorname{E}\left[\operatorname{E}(Y^2 \mid X)\right] \\ &= \operatorname{E}\left[\operatorname{Var}(Y\mid X) + (\operatorname{E}[Y\mid X])^2 \right]. \end{align}
Subtracting \operatorname{E}[Y]^2 = \left(\operatorname{E}[\operatorname{E}(Y\mid X)]\right)^2 and regrouping gives
\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}(Y\mid X)\right] + \operatorname{Var}\left(\operatorname{E}[Y\mid X]\right).
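The cancellation of the cross term in the discrete proof can be made explicit by grouping the summands by their value of X:

```latex
\sum_{i=1}^n \left(Y_i - \overline{Y}_{X_i}\right)\left(\overline{Y}_{X_i} - \overline{Y}\right)
  = \sum_{x} \left(\overline{Y}_{x} - \overline{Y}\right)
    \sum_{i:\,X_i = x} \left(Y_i - \overline{Y}_{x}\right)
  = 0,
```

since within each group \{i : X_i = x\} the deviations of the Y_i from their group mean \overline{Y}_{x} sum to zero by definition of the mean.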
Applications
Analysis of Variance (ANOVA)
In a one-way analysis of variance, the total sum of squares (proportional to \operatorname{Var}(Y)) is split into a “between-group” sum of squares (\operatorname{Var}(\operatorname{E}[Y\mid X])) plus a “within-group” sum of squares (\operatorname{E}[\operatorname{Var}(Y\mid X)]). The F-test examines whether the explained component is sufficiently large to indicate that X has a significant effect on Y.

Regression and R²
In linear regression and related models, if \hat{Y}=\operatorname{E}[Y\mid X], the fraction of variance explained is
\begin{align} R^2 = \frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)} &= \frac{\operatorname{Var}(\operatorname{E}[Y\mid X])}{\operatorname{Var}(Y)} \\[1ex] &= 1 - \frac{\operatorname{E}[\operatorname{Var}(Y\mid X)]}{\operatorname{Var}(Y)}. \end{align}
In the simple linear case (one predictor), R^2 also equals the square of the Pearson correlation coefficient between X and Y.

Machine learning and Bayesian inference
In many Bayesian and ensemble methods, prediction uncertainty is decomposed via the law of total variance. For a Bayesian neural network with random parameters \theta:
\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}(Y\mid \theta)\right] + \operatorname{Var}\left(\operatorname{E}[Y\mid \theta]\right),
often referred to as “aleatoric” (within-model) vs. “epistemic” (between-model) uncertainty.

Actuarial science
Credibility theory uses the same partitioning: the expected value of process variance (EVPV), \operatorname{E}[\operatorname{Var}(Y\mid X)], and the variance of hypothetical means (VHM), \operatorname{Var}(\operatorname{E}[Y\mid X]). The ratio of explained to total variance determines how much “credibility” to give to individual risk classifications. In non-Gaussian settings, a high explained-variance ratio still indicates significant information about Y contained in X.

Generalizations
The law of total variance generalizes to multiple or nested conditionings.
For example, with two conditioning variables X_1 and X_2:
\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}(Y\mid X_1,X_2)\right] + \operatorname{E}\left[\operatorname{Var}(\operatorname{E}[Y\mid X_1,X_2]\mid X_1)\right] + \operatorname{Var}(\operatorname{E}[Y\mid X_1]).
More generally, the law of total cumulance extends this approach to higher moments.
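A minimal sketch, assuming a small hypothetical discrete joint distribution, can confirm the three-term identity by exact enumeration:

```python
from itertools import product

# Hypothetical setup: X1, X2 uniform on {0,1} x {0,1}, and
# Y | X1, X2 ~ Bernoulli(q) with q depending on (X1, X2).
py_given = {  # P(Y=1 | X1, X2), illustrative values
    (0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9,
}

# Full joint distribution p[(x1, x2, y)].
p = {}
for x1, x2 in product([0, 1], repeat=2):
    q = py_given[(x1, x2)]
    p[(x1, x2, 1)] = 0.25 * q
    p[(x1, x2, 0)] = 0.25 * (1 - q)

def E(f):
    """Expectation of f(x1, x2, y) under the joint distribution."""
    return sum(prob * f(x1, x2, y) for (x1, x2, y), prob in p.items())

ey = E(lambda x1, x2, y: y)
var_y = E(lambda x1, x2, y: (y - ey) ** 2)

# Term 1: E[Var(Y | X1, X2)]; Y|X1,X2 is Bernoulli(q), variance q(1-q).
t1 = sum(0.25 * q * (1 - q) for q in py_given.values())

# E[Y | X1] averages q over X2 (here X2 is uniform, independent of X1).
m_x1 = {x1: 0.5 * (py_given[(x1, 0)] + py_given[(x1, 1)]) for x1 in (0, 1)}

# Term 2: E[ Var( E[Y | X1, X2] | X1 ) ].
t2 = sum(
    0.5 * (0.5 * (py_given[(x1, 0)] - m_x1[x1]) ** 2
           + 0.5 * (py_given[(x1, 1)] - m_x1[x1]) ** 2)
    for x1 in (0, 1)
)

# Term 3: Var( E[Y | X1] ).
t3 = 0.5 * (m_x1[0] - ey) ** 2 + 0.5 * (m_x1[1] - ey) ** 2

# The three terms reproduce the total variance.
assert abs(var_y - (t1 + t2 + t3)) < 1e-12
```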