The kurtosis is the fourth
standardized moment, defined as
\begin{align} \operatorname{Kurt}[X] &:= \tilde{\mu}_4 \equiv \frac{\mu_4}{\sigma^4}\\ &= \operatorname{E}\left[{\left(\frac{X - \mu}{\sigma}\right)}^4\right] = \frac{\operatorname{E}\left[(X - \mu)^4\right]}{\left(\operatorname{E}\left[(X - \mu)^2\right]\right)^2}, \end{align}
where \mu_4 is the fourth central moment and \sigma is the standard deviation. Several letters are used in the literature to denote the kurtosis. A very common choice is \kappa, which is fine as long as it is clear that it does not refer to a cumulant. Other choices include \gamma_2, to be similar to the notation for skewness, although sometimes this is instead reserved for the excess kurtosis. Pearson systematically used \beta_2. The kurtosis is bounded below by the squared skewness plus 1:
\frac{\mu_4}{\sigma^4} \geq \left(\frac{\mu_3}{\sigma^3}\right)^2 + 1,
where \mu_3 is the third central moment. The lower bound is realized by the Bernoulli distribution. There is no upper limit to the kurtosis of a general probability distribution, and it may be infinite.
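The following is a minimal NumPy sketch (not part of the source; the helpers kurtosis and skewness are ours) that estimates the fourth standardized moment from a sample and checks numerically that a Bernoulli sample sits on the lower bound \operatorname{Kurt} \geq \operatorname{Skew}^2 + 1.

```python
import numpy as np

def kurtosis(x):
    """Fourth standardized moment E[((X - mu)/sigma)^4] of a sample."""
    z = (x - x.mean()) / x.std()   # population (ddof=0) standard deviation
    return np.mean(z**4)

def skewness(x):
    """Third standardized moment E[((X - mu)/sigma)^3] of a sample."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.3, size=1_000_000)  # Bernoulli(0.3) draws

# The two printed values agree up to Monte Carlo error, since the
# Bernoulli distribution attains the bound with equality.
print(kurtosis(sample), skewness(sample)**2 + 1)
```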
A reason why some authors favor the excess kurtosis is that cumulants are extensive. Formulas related to the extensive property are more naturally expressed in terms of the excess kurtosis. For example, let X_1, \ldots, X_n be independent random variables for which the fourth moment exists, and let Y be the random variable defined by the sum of the X_i. The excess kurtosis of Y is
\operatorname{Kurt}[Y] - 3 = \frac{\sum_{i=1}^n \sigma_i^{\,4} \cdot \left(\operatorname{Kurt}\left[X_i\right] - 3\right)}{\left( \sum_{j=1}^n \sigma_j^{\,2}\right)^2},
where \sigma_i is the standard deviation of X_i. In particular if all of the X_i have the same variance, then this simplifies to
\operatorname{Kurt}[Y] - 3 = \frac{1}{n^2} \sum_{i=1}^n \left(\operatorname{Kurt}\left[X_i\right] - 3\right).
The reason not to subtract 3 is that the bare moment better generalizes to multivariate distributions, especially when independence is not assumed.
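As a numerical check of the sum formula (a simulation sketch, not from the source; the distributions and sample size are arbitrary choices), one can compare the simulated excess kurtosis of a sum of independent variables against the variance-weighted formula above:

```python
import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

n = 2_000_000
x1 = rng.exponential(scale=2.0, size=n)  # variance 4, excess kurtosis 6
x2 = rng.uniform(-3.0, 3.0, size=n)      # variance 3, excess kurtosis -1.2
y = x1 + x2

# Right-hand side of the formula, using the known population values.
variances = np.array([4.0, 3.0])
excess = np.array([6.0, -1.2])
rhs = np.sum(variances**2 * excess) / np.sum(variances)**2

print(excess_kurtosis(y), rhs)  # close, up to Monte Carlo error
```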
The cokurtosis between pairs of variables is an order four tensor. For a bivariate normal distribution, the cokurtosis tensor has off-diagonal terms that are neither 0 nor 3 in general, so attempting to "correct" for an excess becomes confusing. It is true, however, that the joint cumulants of degree greater than two for any multivariate normal distribution are zero. For two random variables, X and Y, not necessarily independent, the kurtosis of the sum, X + Y, is
\begin{align} \operatorname{Kurt}[X+Y] &= \frac{1}{\sigma_{X+Y}^4} \big(\sigma_X^4\operatorname{Kurt}[X] \\ & {} + 4\sigma_X^3 \sigma_Y \operatorname{Cokurt}[X,X,X,Y] \\[6pt] & {} + 6\sigma_X^2 \sigma_Y^2 \operatorname{Cokurt}[X,X,Y,Y] \\[6pt] & {} + 4\sigma_X \sigma_Y^3 \operatorname{Cokurt}[X,Y,Y,Y] \\[6pt] & {} + \sigma_Y^4 \operatorname{Kurt}[Y] \big). \end{align}
Note that the fourth-power
binomial coefficients (1, 4, 6, 4, 1) appear in the above equation.
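Because the expansion of \operatorname{E}\left[(X + Y - \mu_X - \mu_Y)^4\right] is an algebraic identity, it can be verified directly on sample moments. Below is a short sketch (ours, with arbitrarily chosen dependent variables) that estimates each cokurtosis term and reproduces the kurtosis of the sum:

```python
import numpy as np

rng = np.random.default_rng(2)

def std_cross_moment(x, y, a, b):
    """Standardized cross moment E[Zx^a * Zy^b]; Cokurt terms when a+b=4."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return np.mean(zx**a * zy**b)

n = 1_000_000
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_t(df=10, size=n)  # correlated with x

sx, sy = x.std(), y.std()
rhs = (sx**4 * std_cross_moment(x, y, 4, 0)
       + 4 * sx**3 * sy * std_cross_moment(x, y, 3, 1)
       + 6 * sx**2 * sy**2 * std_cross_moment(x, y, 2, 2)
       + 4 * sx * sy**3 * std_cross_moment(x, y, 1, 3)
       + sy**4 * std_cross_moment(x, y, 0, 4)) / (x + y).std()**4

lhs = std_cross_moment(x + y, x + y, 4, 0)  # Kurt[X+Y]
print(lhs, rhs)  # equal up to floating-point rounding
```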
=== Interpretation ===
The interpretation of the Pearson measure of kurtosis (or excess kurtosis) was once debated, but it is now well-established. As noted by Westfall in 2014, "... its unambiguous interpretation relates to tail extremity". Specifically, it reflects either the presence of existing outliers (for sample kurtosis) or the tendency to produce outliers (for the kurtosis of a probability distribution). The underlying logic is straightforward: kurtosis represents the average (or
expected value) of standardized data raised to the fourth power. Standardized values less than 1, corresponding to data within one standard deviation of the mean (where the peak occurs), contribute minimally to kurtosis, since raising a number less than 1 to the fourth power brings it closer to zero. The meaningful contributors to kurtosis are data values outside the peak region, i.e., the outliers. Therefore, kurtosis primarily measures outliers and provides no information about the central peak.
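This tail dominance is easy to demonstrate numerically. The sketch below (ours; the Student's t example is an arbitrary heavy-tailed choice) splits the sample kurtosis into the contribution from observations within one standard deviation of the mean and the contribution from the rest:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_t(df=10, size=1_000_000)  # heavy-tailed sample
z = (x - x.mean()) / x.std()

center = np.abs(z) <= 1                # within one sd of the mean
kurt = np.mean(z**4)
center_part = np.sum(z[center]**4) / z.size
tail_part = np.sum(z[~center]**4) / z.size

# tail_part accounts for nearly all of kurt, even though the central
# region contains the large majority of the observations.
print(kurt, center_part, tail_part)
```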
Numerous misconceptions about kurtosis relate to notions of peakedness. One such misconception is that kurtosis measures both the peakedness of a distribution and the heaviness of its tails. Other incorrect interpretations include notions like lack of shoulders (where the shoulder refers vaguely to the area between the peak and the tail, or more specifically, the region about one standard deviation from the mean) or bimodality. Balanda and MacGillivray argue that the standard definition of kurtosis "poorly captures the kurtosis, peakedness, or tail weight of a distribution." Instead, they propose a vague definition of kurtosis as the location- and scale-free movement of probability mass from the distribution's shoulders into its center and tails.
=== Moors' interpretation ===
In 1986, Moors gave an interpretation of kurtosis. Let
Z = \frac{X - \mu}{\sigma},
where X is a random variable, \mu is the mean and \sigma is the standard deviation. Now by definition of the kurtosis \kappa, and by the well-known identity \operatorname{E}\left[V^2\right] = \operatorname{var}[V] + \operatorname{E}[V]^2,
\begin{align} \kappa & = \operatorname{E}\left[ Z^4 \right] \\ & = \operatorname{var}\left[ Z^2 \right] + \operatorname{E}{\!\left[Z^2\right]}^2 \\ & = \operatorname{var}\left[ Z^2 \right] + \operatorname{var}[Z]^2 = \operatorname{var}\left[ Z^2 \right] + 1, \end{align}
since \operatorname{E}\left[Z^2\right] = \operatorname{var}[Z] = 1 for a standardized variable. The kurtosis can now be seen as a measure of the dispersion of Z^2 around its expectation. Alternatively it can be seen to be a measure of the dispersion of Z around +1 and -1. \kappa attains its minimal value in a symmetric two-point distribution. In terms of the original variable X, the kurtosis is a measure of the dispersion of X around the two values \mu \pm \sigma. High values of \kappa arise in two circumstances: where the probability mass is concentrated around the mean and the data-generating process produces occasional values far from the mean, or where the probability mass is concentrated in the tails of the distribution.
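Moors' identity \kappa = \operatorname{var}\left[Z^2\right] + 1 holds exactly for sample moments as well, which the following one-line check illustrates (a NumPy sketch of ours; the Laplace distribution, with kurtosis 6, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.laplace(size=1_000_000)  # Laplace: population kurtosis 6
z = (x - x.mean()) / x.std()     # standardized sample

# mean(z**4) and var(z**2) + 1 coincide up to floating-point rounding,
# because mean(z**2) = 1 by construction.
print(np.mean(z**4), np.var(z**2) + 1)
```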
=== Maximal entropy ===
The entropy of a distribution p(x) is
-\!\int p(x) \ln p(x) \, dx.
For any \mu \in \R^n, \Sigma \in \R^{n\times n} with \Sigma positive definite, among all probability distributions on \R^n with mean \mu and covariance \Sigma, the normal distribution \mathcal N(\mu, \Sigma) has the largest entropy. Since mean \mu and covariance \Sigma are the first two moments, it is natural to consider extension to higher moments. In fact, by the Lagrange multiplier method, for any prescribed first n moments, if there exists some probability distribution of the form
p(x) \propto e^{\sum_i a_i x_i + \sum_{ij} b_{ij} x_i x_j + \cdots + \sum_{i_1 \cdots i_n} c_{i_1 \cdots i_n} x_{i_1} \cdots x_{i_n}}
that has the prescribed moments, then it is the maximal entropy distribution under the given constraints. By serial expansion,
\begin{align} \int \frac{1}{\sqrt{2\pi}} e^{-\frac 12 x^2 - \frac 14 gx^4} x^{2n} \, dx &= \sum_k \frac{1}{k!} \left(-\frac{g}{4}\right)^k (2n+4k-1)!! \\[6pt] &= (2n-1)!! - \tfrac{1}{4} g (2n+3)!! + O(g^2), \end{align}
so if a random variable has probability distribution p(x) = e^{-\frac 12 x^2 - \frac 14 gx^4}/Z, where Z is a normalization constant, then its kurtosis is 3 - 6g + O(g^2).
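To see the first-order result numerically, one can integrate the unnormalized density directly. The sketch below (ours, using scipy.integrate.quad) computes the kurtosis of p(x) \propto e^{-x^2/2 - gx^4/4} for a few small values of g and compares it with 3 - 6g:

```python
import numpy as np
from scipy.integrate import quad

def moment(n, g):
    """n-th moment of the unnormalized density exp(-x^2/2 - g*x^4/4)."""
    f = lambda x: x**n * np.exp(-0.5 * x**2 - 0.25 * g * x**4)
    value, _ = quad(f, -np.inf, np.inf)
    return value

for g in (0.01, 0.05, 0.1):
    z = moment(0, g)                      # normalization constant Z
    kurt = (moment(4, g) / z) / (moment(2, g) / z)**2
    print(g, kurt, 3 - 6 * g)             # matches to first order in g
```

== Excess kurtosis ==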