The power transformation is defined as a continuous function of the power parameter λ, typically given in piecewise form so that it is continuous at the point of singularity (λ = 0). For data vectors (y_1, ..., y_n) in which each y_i > 0, the power transform is

: y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda (\operatorname{GM}(y))^{\lambda - 1}}, & \text{if } \lambda \neq 0, \\[12pt] \operatorname{GM}(y) \ln y_i, & \text{if } \lambda = 0, \end{cases}

where

: \operatorname{GM}(y) = \left(\prod_{i=1}^n y_i\right)^{1/n} = \sqrt[n]{y_1 y_2 \cdots y_n}

is the geometric mean of the observations y_1, ..., y_n. The case \lambda = 0 is the limit as \lambda approaches 0. To see this, note that y_i^\lambda = \exp(\lambda \ln y_i) = 1 + \lambda \ln y_i + O((\lambda \ln y_i)^2), using the Taylor series of the exponential. Then \dfrac{y_i^\lambda - 1}{\lambda} = \ln y_i + O(\lambda), and everything but \ln y_i becomes negligible for \lambda sufficiently small; since \operatorname{GM}(y)^{\lambda - 1} \to \operatorname{GM}(y)^{-1} at the same time, the whole expression tends to \operatorname{GM}(y) \ln y_i. The inclusion of the (λ − 1)th power of the geometric mean in the denominator simplifies the scientific interpretation of any equation involving y_i^{(\lambda)}, because the units of measurement do not change as λ changes.
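As an illustration, the transform is straightforward to compute directly. The following is a minimal NumPy sketch of the formula above (the helper names <code>gm</code> and <code>power_transform</code> are ours, not from any library):

<syntaxhighlight lang="python">
import numpy as np

def gm(y):
    """Geometric mean of a vector of positive observations."""
    return np.exp(np.mean(np.log(y)))

def power_transform(y, lam):
    """GM-scaled power transform y_i^(lambda); assumes every y_i > 0."""
    y = np.asarray(y, dtype=float)
    g = gm(y)
    if lam == 0:
        return g * np.log(y)                    # the lambda = 0 limit case
    return (y**lam - 1) / (lam * g**(lam - 1))

print(power_transform([1.0, 2.0, 4.0], 0.5))    # same units as y for any lambda
</syntaxhighlight>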
Box and Cox (1964) introduced the geometric mean into this transformation by first including with the likelihood the Jacobian of the rescaled power transformation

: \frac{y^\lambda - 1}{\lambda}.

This Jacobian is as follows:

: J(\lambda; y_1, \ldots, y_n) = \prod_{i=1}^n \left| \frac{d}{d y_i} \left( \frac{y_i^\lambda - 1}{\lambda} \right) \right| = \prod_{i=1}^n y_i^{\lambda - 1} = \operatorname{GM}(y)^{n(\lambda - 1)}.

This allows the normal log likelihood at its maximum to be written as follows:

: \begin{align} \log(\mathcal{L}(\hat\mu, \hat\sigma)) & = -\frac{n}{2} \left( \log(2\pi \hat\sigma^2) + 1 \right) + n(\lambda - 1) \log(\operatorname{GM}(y)) \\[5pt] & = -\frac{n}{2} \left( \log\!\left( \frac{2\pi \hat\sigma^2}{\operatorname{GM}(y)^{2(\lambda - 1)}} \right) + 1 \right). \end{align}

From here, absorbing \operatorname{GM}(y)^{2(\lambda - 1)} into the expression for \hat\sigma^2 shows that minimizing the sum of squared residuals from y_i^{(\lambda)} is equivalent to maximizing the normal log likelihood of deviations from (y^\lambda - 1)/\lambda plus the log of the Jacobian of the transformation.
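This equivalence gives a direct numerical recipe for choosing λ: minimize the residual sum of squares of the GM-scaled variable, or equivalently maximize the profile log likelihood. Below is a hedged sketch for the simplest location-only model y_i^{(\lambda)} \sim N(\mu, \sigma^2), with made-up data; in practice, <code>scipy.stats.boxcox</code> performs the corresponding maximum-likelihood fit for the unscaled transform.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

def neg_profile_loglik(lam, y, eps=1e-8):
    """Negative profile log likelihood of lambda (up to constants) for the
    location-only model; minimizing the RSS of the GM-scaled transform is
    equivalent to maximum likelihood, as derived above."""
    g = np.exp(np.mean(np.log(y)))              # geometric mean
    if abs(lam) < eps:                          # lambda ~ 0 branch
        z = g * np.log(y)
    else:
        z = (y**lam - 1) / (lam * g**(lam - 1))
    rss = np.sum((z - z.mean())**2)             # residuals about the mean
    return 0.5 * len(y) * np.log(rss / len(y))

y = np.array([0.6, 1.2, 2.5, 4.8, 9.9, 20.1])   # made-up positive data
res = minimize_scalar(neg_profile_loglik, bounds=(-2.0, 2.0),
                      args=(y,), method="bounded")
print(res.x)   # the lambda that maximizes the profile likelihood
</syntaxhighlight>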
The value of the transformation at Y = 1 is 0 for any λ, and the derivative with respect to Y there is 1 for any λ. Sometimes Y is a version of some other variable, scaled to give Y = 1 at some sort of average value. The transformation is a power transformation, but done in such a way as to make it continuous with the parameter λ at λ = 0. It has proved popular in regression analysis, including econometrics.
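Both properties can be checked directly: for any \lambda \neq 0,

: \frac{1^\lambda - 1}{\lambda} = 0 \qquad \text{and} \qquad \left.\frac{d}{dY}\left(\frac{Y^\lambda - 1}{\lambda}\right)\right|_{Y=1} = \left. Y^{\lambda - 1} \right|_{Y=1} = 1,

and likewise \ln 1 = 0 and \left. d(\ln Y)/dY \right|_{Y=1} = 1 in the \lambda = 0 case.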
Box and Cox also proposed a more general form of the transformation that incorporates a shift parameter α:

: \tau(y_i; \lambda, \alpha) = \begin{cases} \dfrac{(y_i + \alpha)^\lambda - 1}{\lambda (\operatorname{GM}(y + \alpha))^{\lambda - 1}} & \text{if } \lambda \neq 0, \\[12pt] \operatorname{GM}(y + \alpha) \ln(y_i + \alpha) & \text{if } \lambda = 0, \end{cases}

which holds if y_i + α > 0 for all i. If τ(Y; λ, α) follows a truncated normal distribution, then Y is said to follow a Box–Cox distribution.
Bickel and Doksum eliminated the need to use a truncated distribution by extending the range of the transformation to all y, as follows:

: \tau(y_i; \lambda, \alpha) = \begin{cases} \dfrac{\operatorname{sgn}(y_i + \alpha) |y_i + \alpha|^\lambda - 1}{\lambda (\operatorname{GM}(y + \alpha))^{\lambda - 1}} & \text{if } \lambda \neq 0, \\[12pt] \operatorname{GM}(y + \alpha) \operatorname{sgn}(y_i + \alpha) \ln|y_i + \alpha| & \text{if } \lambda = 0, \end{cases}

where sgn(·) is the sign function. This change in definition has little practical import as long as \alpha is less than \operatorname{min}(y_i), which it usually is.{{Cite journal | last1 = Bickel | first1 = Peter J. | last2 = Doksum | first2 = Kjell A. | year = 1981 | title = An analysis of transformations revisited | journal = Journal of the American Statistical Association | volume = 76 | issue = 374 | pages = 296–311 }}
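Under this reading, a minimal NumPy sketch of the extended transform (the helper name <code>extended_transform</code> is ours; taking the geometric mean over |y_i + α| is an assumption made to keep the scaling factor real-valued):

<syntaxhighlight lang="python">
import numpy as np

def extended_transform(y, lam, alpha):
    """Sketch of the sign-extended (Bickel-Doksum-style) transform with
    shift alpha; defined for all real y_i + alpha != 0."""
    s = np.asarray(y, dtype=float) + alpha
    g = np.exp(np.mean(np.log(np.abs(s))))    # geometric mean of |y + alpha| (assumption)
    if lam == 0:
        return g * np.sign(s) * np.log(np.abs(s))
    return (np.sign(s) * np.abs(s)**lam - 1) / (lam * g**(lam - 1))

print(extended_transform([-1.5, 0.4, 2.0], lam=0.5, alpha=0.0))  # negative values allowed
</syntaxhighlight>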
Bickel and Doksum also proved that the parameter estimates are consistent and asymptotically normal under appropriate regularity conditions, though the standard Cramér–Rao lower bound can substantially underestimate the variance when parameter values are small relative to the noise variance. However, this underestimation may not be a substantive problem in many applications.{{Citation | last = Sakia | first = R. M. | year = 1992 | title = The Box–Cox transformation technique: a review | journal = The Statistician | volume = 41 | issue = 2 | pages = 169–178 }}

==Box–Cox transformation==