
Beta distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or (0, 1) in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

Definitions
Probability density function

The probability density function (PDF) of the beta distribution, for 0 ≤ x ≤ 1 or 0 < x < 1, and shape parameters \alpha , \beta > 0 , is a power function of the variable x and of its reflection (1-x) as follows: \begin{align} f(x;\alpha,\beta) & = \mathrm{constant}\cdot x^{\alpha-1}(1-x)^{\beta-1} \\[3pt] & = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\displaystyle \int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du} \\[6pt] & = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} \\[6pt] & = \frac{1}{\Beta(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} \end{align} where \Gamma(z) is the gamma function. The beta function, \Beta, is a normalization constant to ensure that the total probability is 1. In the above equations x is a realization—an observed value that actually occurred—of a random variable X . A random variable X that is beta-distributed with shape parameters α and β is denoted by X \sim \operatorname{Beta}(\alpha, \beta); several authors, including N. L. Johnson and S. Kotz, use the alternative notation X \sim \beta_{\alpha, \beta}.

Cumulative distribution function

The cumulative distribution function is F(x;\alpha,\beta) = \frac{\Beta{}(x;\alpha,\beta)}{\Beta{}(\alpha,\beta)} = I_x(\alpha,\beta) where \Beta(x;\alpha,\beta) is the incomplete beta function and I_x(\alpha,\beta) is the regularized incomplete beta function. For positive integers α and β, the cumulative distribution function of a beta distribution can be expressed in terms of the cumulative distribution function of a binomial distribution: F_{\text{beta}}(x;\alpha,\beta) = F_{\text{binomial}}(\beta-1;\alpha+\beta-1,1-x).

Alternative parameterizations

Two parameters

Mean and sample size

The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and the sum of the two shape parameters ν = α + β > 0. Denoting by αPosterior and βPosterior the shape parameters of the posterior beta distribution resulting from applying Bayes' theorem to a binomial likelihood function and a prior probability, the interpretation of the sum of both shape parameters as a sample size, ν = αPosterior + βPosterior, is only correct for the Haldane prior probability Beta(0,0). Specifically, for the Bayes (uniform) prior Beta(1,1) the correct interpretation would be sample size = αPosterior + βPosterior − 2, or ν = (sample size) + 2. For sample size much larger than 2, the difference between these two priors becomes negligible. (See the section on Bayesian inference for further details.) ν = α + β is referred to as the "sample size" of a beta distribution, but one should remember that it is, strictly speaking, the "sample size" of a binomial likelihood function only when using a Haldane Beta(0,0) prior in Bayes' theorem. This parametrization may be useful in Bayesian parameter estimation. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤ 1) is drawn from a population-level beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via α = μν and β = (1 − μ)ν.

Mode and concentration

The mode ω = (α − 1)/(α + β − 2) and the "concentration" κ = α + β can also be used as parameters; the shape parameters are then \begin{align} \alpha &= \omega (\kappa - 2) + 1 \\ \beta &= (1 - \omega)(\kappa - 2) + 1 \end{align} For the mode, 0 < ω < 1, to be well-defined, we need \alpha,\beta>1, or equivalently \kappa>2. If instead we define the concentration as c=\alpha+\beta-2, the condition simplifies to c>0, and the beta density at \alpha=1+c\omega and \beta=1+c(1-\omega) can be written as: f(x;\omega,c) = \frac{x^{c\omega}(1-x)^{c(1-\omega)}}{\Beta\bigl(1+c\omega,1+c(1-\omega)\bigr)} where c directly scales the sufficient statistics, \log(x) and \log(1-x). Note also that in the limit, c\to0, the distribution becomes flat.
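A minimal numerical sketch (assuming NumPy and SciPy are available; the parameter values are illustrative) of the PDF and CDF formulas given in the Definitions above, evaluating the gamma-function form of the density and checking the binomial identity for the CDF with integer shape parameters:

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.special import gammaln

a, b, x = 3, 5, 0.4   # illustrative integer shape parameters and a point in (0, 1)

# PDF from Gamma(a+b)/(Gamma(a)Gamma(b)) * x^(a-1) * (1-x)^(b-1), computed in log space
log_pdf = (gammaln(a + b) - gammaln(a) - gammaln(b)
           + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x))
print(np.exp(log_pdf), beta.pdf(x, a, b))                    # both give the same density

# CDF = regularized incomplete beta I_x(a, b); for integer a, b it matches a binomial CDF
print(beta.cdf(x, a, b), binom.cdf(b - 1, a + b - 1, 1 - x))  # both give the same value
```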
Mean and variance

Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean (μ) and the variance (var): \begin{align} \nu &= \alpha + \beta = \frac{\mu(1-\mu)}{\mathrm{var}}-1, \text{ where }\nu =(\alpha + \beta) >0, \text{ therefore } \mathrm{var} < \mu(1-\mu) \\ \alpha &= \mu \nu = \mu \left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \text{ if } \mathrm{var} < \mu(1-\mu) \\ \beta &= (1-\mu)\nu = (1-\mu)\left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \text{ if } \mathrm{var} < \mu(1-\mu) \end{align} This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the original parameters α and β. For example, one can express the mode, skewness, excess kurtosis and differential entropy in terms of the mean and the variance.
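A minimal sketch (assuming NumPy/SciPy; the function name and parameter values are illustrative) of recovering the shape parameters from a given mean and variance, which is valid only when var < μ(1 − μ):

```python
import numpy as np
from scipy.stats import beta

def shape_from_mean_var(mu, var):
    # alpha = mu*nu, beta = (1-mu)*nu, with nu = mu*(1-mu)/var - 1
    if not (0 < mu < 1 and 0 < var < mu * (1 - mu)):
        raise ValueError("need 0 < mu < 1 and 0 < var < mu*(1 - mu)")
    nu = mu * (1 - mu) / var - 1
    return mu * nu, (1 - mu) * nu

a, b = shape_from_mean_var(0.3, 0.01)
m, v = beta.stats(a, b, moments="mv")   # round-trip check
print(a, b, m, v)                       # mean ~ 0.3, variance ~ 0.01
```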
Four parameters

A beta distribution with the two shape parameters α and β is supported on the range [0,1] or (0,1). It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum c (c > a), values of the distribution, by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c] or (a,c)) and the parameters a and c: y = x(c-a) + a, \text{ therefore } x = \frac{y-a}{c-a}. The probability density function of the four parameter beta distribution is equal to the two parameter distribution, scaled by the range (c − a) (so that the total area under the density curve equals a probability of one), and with the "y" variable shifted and scaled as follows: \begin{align} f(y; \alpha, \beta, a, c) = \frac{f(x;\alpha,\beta)}{c-a} &= \frac{\left(\frac{y-a}{c-a}\right)^{\alpha-1} \left (\frac{c-y}{c-a} \right)^{\beta-1} }{(c-a)B(\alpha, \beta)} \\[1ex] &= \frac{ (y-a)^{\alpha-1} (c-y)^{\beta-1} }{(c-a)^{\alpha+\beta-1}B(\alpha, \beta)}. \end{align} That a random variable Y is beta-distributed with four parameters α, β, a, and c will be denoted by: Y \sim \operatorname{Beta}(\alpha, \beta, a, c). Some measures of central location are scaled (by (c − a)) and shifted (by a), as follows: \begin{align} \mu_Y &= \mu_X(c-a) + a \\[1ex] & = \frac{\alpha}{\alpha+\beta} \left(c-a\right) + a = \frac{\alpha c+ \beta a}{\alpha+\beta} \end{align} \begin{align} \text{mode}(Y) &=\text{mode}(X)(c-a) + a \\[1ex] & = \frac{\alpha - 1}{\alpha+\beta - 2} \left(c-a\right) + a \\[1ex] & = \frac{(\alpha-1) c+(\beta-1) a}{\alpha+\beta-2}\ , & \text{ if } \alpha,\, \beta>1 \end{align} \begin{align} \text{median}(Y) &= \text{median}(X)(c-a) + a \\[1ex] & = I_{\frac{1}{2}}^{[-1]}(\alpha,\beta) \left(c-a\right)+a \end{align} Note: the geometric mean and harmonic mean cannot be transformed by a linear transformation in the way that the mean, median and mode can.

The shape parameters of Y can be written in terms of its mean and variance as \begin{align} \alpha &= \frac{\left(a - \mu_Y\right) \left(a \, c - a \, \mu_Y - c \, \mu_Y + \mu_Y^2 + \sigma_Y^2\right)}{\sigma_Y^2(c-a)} \\ \beta &= -\frac{\left(c - \mu_Y\right) \left(a \, c - a \, \mu_Y - c \, \mu_Y + \mu_Y^2 + \sigma_Y^2\right)}{\sigma_Y^2(c-a)} \end{align} The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered on the mean) by the range (c − a), linearly for the mean deviation and nonlinearly for the variance: \begin{align} &\text{(mean deviation around mean)}(Y) \\[1ex] &= (\text{(mean deviation around mean)}(X))(c-a) \\ &= \frac{2 \alpha^\alpha \beta^\beta}{\Beta(\alpha,\beta)(\alpha + \beta)^{\alpha + \beta + 1}}(c-a) \end{align} \text{var}(Y) = \text{var}(X)(c-a)^2 =\frac{\alpha\beta (c-a)^2}{(\alpha+\beta)^2(\alpha+\beta+1)}. Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered on the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0,1] or (0,1)): \text{skewness}(Y) =\text{skewness}(X) = \frac{2 (\beta - \alpha) \sqrt{\alpha + \beta + 1} }{(\alpha + \beta + 2) \sqrt{\alpha \beta}}. \text{kurtosis excess}(Y) =\text{kurtosis excess}(X) = \frac{6\left[(\alpha - \beta)^2 (\alpha +\beta + 1) - \alpha \beta (\alpha + \beta + 2)\right]} {\alpha \beta (\alpha + \beta + 2) (\alpha + \beta + 3)}
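A minimal sketch (assuming SciPy; the numerical values are illustrative) of the four-parameter beta on [a, c], which SciPy expresses through the loc and scale arguments of the two-parameter distribution:

```python
from scipy.stats import beta

alpha, b_shape, a, c = 2.0, 5.0, 10.0, 30.0
Y = beta(alpha, b_shape, loc=a, scale=c - a)   # support [a, c] = [10, 30]

print(Y.mean())   # (alpha*c + beta*a)/(alpha+beta) = (2*30 + 5*10)/7 ~ 15.71
print(Y.var())    # alpha*beta*(c-a)^2 / ((alpha+beta)^2 (alpha+beta+1)) ~ 10.20
```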
Properties
Measures of central tendency

Mode

The mode of a beta distributed random variable X with α, β > 1 is the most likely value of the distribution (corresponding to the peak in the PDF), and is given by the following expression: \frac{\alpha - 1} {\alpha + \beta - 2} . When both parameters are less than one (α, β < 1), this expression instead gives the anti-mode: the lowest point of the probability density curve. Letting α = β, the expression simplifies to 1/2, showing that for α = β > 1 the mode (resp. anti-mode when α, β < 1) is at the center of the distribution: it is symmetric in those cases. See the Shapes section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite. For example, in the case of α = 2, β = 1 (or α = 1, β = 2), the density function becomes a right-triangle distribution which is finite at both ends. In several other cases there is a singularity at one end, where the value of the density function approaches infinity. For example, in the case α = β = 1/2, the beta distribution simplifies to become the arcsine distribution. There is debate among mathematicians about some of these cases and whether the ends (x = 0 and x = 1) can be called modes or not:
• Whether the ends are part of the domain of the density function
• Whether a singularity can ever be called a mode
• Whether cases with two maxima should be called bimodal

Median

The median of the beta distribution is the unique real number x = I_{1/2}^{[-1]}(\alpha,\beta) for which the regularized incomplete beta function I_x(\alpha,\beta) = \tfrac{1}{2} . There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:
• For symmetric cases α = β, median = 1/2.
• For α = 1 and β > 0, median = 1-2^{-1/\beta} (this case is the mirror-image of the power function distribution)
• For α > 0 and β = 1, median = 2^{-1/\alpha} (this case is the power function distribution)

For short-tailed distributions, one author remarks (p. 207) that "the average of the two extreme observations uses all the sample information. This illustrates how, for short-tailed distributions, the extreme observations should get more weight." By contrast, it follows that the median of "U-shaped" bimodal distributions with modes at the edge of the distribution (with Beta(α, β) such that α, β < 1) is not robust, as the sample median drops the extreme sample observations from consideration. A practical application of this occurs for example for random walks, since the probability for the time of the last visit to the origin in a random walk is distributed as the arcsine distribution Beta(1/2, 1/2): the mean of a number of realizations of a random walk is a much more robust estimator than the median (which is an inappropriate sample measure estimate in this case).
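A minimal sketch (assuming NumPy/SciPy; parameter values are illustrative) computing the mode from the closed form above (valid for α, β > 1) and the median as the inverse of the regularized incomplete beta function:

```python
from scipy.stats import beta
from scipy.special import betaincinv

a, b = 2.0, 5.0
mode = (a - 1) / (a + b - 2)               # closed form, requires a, b > 1
median = betaincinv(a, b, 0.5)             # solves I_x(a, b) = 1/2
print(mode, median, beta.ppf(0.5, a, b))   # ppf(0.5) gives the same median
```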
Geometric mean

The logarithm of the geometric mean GX of a distribution with random variable X is the arithmetic mean of ln(X), or, equivalently, its expected value: \ln G_X = \operatorname{E}[\ln X] For a beta distribution, the expected value integral gives: \begin{align} \operatorname{E}[\ln X] &= \int_0^1 \ln x\, f(x;\alpha,\beta)\,dx \\[4pt] &= \int_0^1 \ln x \,\frac{ x^{\alpha-1}(1-x)^{\beta-1}}{\Beta(\alpha,\beta)}\,dx \\[4pt] &= \frac{1}{\Beta(\alpha,\beta)} \, \int_0^1 \frac{\partial x^{\alpha-1}(1-x)^{\beta-1}}{\partial \alpha}\,dx \\[4pt] &= \frac{1}{\Beta(\alpha,\beta)} \frac{\partial}{\partial \alpha} \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx \\[4pt] &= \frac{1}{\Beta(\alpha,\beta)} \frac{\partial \Beta(\alpha,\beta)}{\partial \alpha} \\[4pt] &= \frac{\partial \ln \Beta(\alpha,\beta)}{\partial \alpha} \\[4pt] &= \frac{\partial \ln \Gamma(\alpha)}{\partial \alpha} - \frac{\partial \ln \Gamma(\alpha + \beta)}{\partial \alpha} \\[4pt] &= \psi(\alpha) - \psi(\alpha + \beta) \end{align} where ψ is the digamma function. Therefore, the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows: G_X = e^{\operatorname{E}[\ln X]}= e^{\psi(\alpha) - \psi(\alpha + \beta)} While for a beta distribution with equal shape parameters α = β it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < G_X < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln(X) strongly tends towards negative infinity as X approaches zero, while ln(X) flattens towards zero as X → 1. Along a line α = β, the following limits apply: \begin{align} &\lim_{\alpha = \beta \to 0} G_X = 0 \\ &\lim_{\alpha = \beta \to \infty} G_X =\tfrac{1}{2} \end{align} Following are the limits with one parameter finite (non-zero) and the other approaching these limits: \begin{align} \lim_{\beta \to 0} G_X = \lim_{\alpha \to \infty} G_X = 1\\ \lim_{\alpha\to 0} G_X = \lim_{\beta \to \infty} G_X = 0 \end{align} The accompanying plot shows the difference between the mean and the geometric mean for shape parameters α and β from zero to 2. Besides the fact that the difference between them approaches zero as α and β approach infinity and that the difference becomes large for values of α and β approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters α and β. The difference between the geometric mean and the mean is larger for small values of α in relation to β than when exchanging the magnitudes of β and α. N. L. Johnson and S. Kotz suggest, for α, β > 1, the logarithmic approximation to the digamma function ψ(α) ≈ ln(α − 1/2), which yields the approximation G_X \approx \frac{\alpha - \frac{1}{2}}{\alpha+\beta-\frac{1}{2}}\text{ if } \alpha, \beta > 1. This is relevant because the beta distribution is a suitable model for the random behavior of percentages and it is particularly suitable to the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation, see section "Parameter estimation, maximum likelihood."
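A minimal sketch (assuming NumPy/SciPy; parameter values and sample size are illustrative) comparing the closed-form geometric mean G_X = exp(ψ(α) − ψ(α + β)) with a Monte Carlo estimate and with the arithmetic mean:

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import beta

a, b = 2.0, 3.0
G_X = np.exp(digamma(a) - digamma(a + b))

samples = beta.rvs(a, b, size=200_000, random_state=0)
print(G_X, np.exp(np.mean(np.log(samples))))   # formula vs. empirical geometric mean
print(a / (a + b))                             # arithmetic mean, larger than G_X
```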
When performing maximum likelihood estimation, besides the geometric mean GX based on the random variable X, another geometric mean appears naturally: the geometric mean based on the linear transformation (1 − X), the mirror-image of X, denoted by G(1−X): G_{1-X} = e^{\operatorname{E}[\ln(1-X)] } = e^{\psi(\beta) - \psi(\alpha + \beta)} Along a line α = β, the following limits apply: \begin{align} &\lim_{\alpha = \beta \to 0} G_{1-X} =0 \\ &\lim_{\alpha = \beta \to \infty} G_{1-X} =\tfrac{1}{2} \end{align} Following are the limits with one parameter finite (non-zero) and the other approaching these limits: \begin{align} \lim_{\beta \to 0} G_{(1-X)} = \lim_{\alpha \to \infty} G_{(1-X)} = 0\\ \lim_{\alpha\to 0} G_{(1-X)} = \lim_{\beta \to \infty} G_{(1-X)} = 1 \end{align} It has the following approximate value: G_{(1-X)} \approx \frac{\beta - \frac{1}{2}}{\alpha+\beta-\frac{1}{2}}\text{ if } \alpha, \beta > 1. Although both GX and G1−X are asymmetric, in the case that both shape parameters are equal (α = β) the geometric means are equal: GX = G(1−X). This equality follows from the following symmetry displayed between both geometric means: G_X (\Beta(\alpha, \beta) ) = G_{1-X}(\Beta(\beta, \alpha) ).

Harmonic mean

The inverse of the harmonic mean (HX) of a distribution with random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean (HX) of a beta distribution with shape parameters α and β is: \begin{align} H_X &= \frac{1}{\operatorname{E}\left[\frac{1}{X}\right]} \\ &=\frac{1}{\int_0^1 \frac{f(x;\alpha,\beta)}{x}\,dx} \\ &=\frac{1}{\int_0^1 \frac{x^{\alpha-1}(1-x)^{\beta-1}}{x \Beta(\alpha,\beta)}\,dx} \\ &= \frac{\alpha - 1}{\alpha + \beta - 1}\text{ if } \alpha > 1 \text{ and } \beta > 0 \\ \end{align} The harmonic mean (HX) of a beta distribution with α = β is H_X = \frac{\alpha-1}{2\alpha-1}, showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞. Following are the limits with one parameter finite (non-zero) and the other approaching these limits: \begin{align} &\lim_{\alpha\to 0} H_X \text{ is undefined} \\ &\lim_{\alpha\to 1} H_X = \lim_{\beta \to \infty} H_X = 0 \\ &\lim_{\beta \to 0} H_X = \lim_{\alpha \to \infty} H_X = 1 \end{align} The harmonic mean plays a role in maximum likelihood estimation for the four parameter case, in addition to the geometric mean. Actually, when performing maximum likelihood estimation for the four parameter case, besides the harmonic mean HX based on the random variable X, another harmonic mean appears naturally: the harmonic mean based on the linear transformation (1 − X), the mirror-image of X, denoted by H1 − X: H_{1-X} = \frac{1}{\operatorname{E} \left[\frac 1 {1-X}\right]} = \frac{\beta - 1}{\alpha + \beta-1} \text{ if } \beta > 1, \text{ and } \alpha> 0. The harmonic mean (H(1 − X)) of a beta distribution with α = β is H_{(1-X)} = \frac{\beta-1}{2\beta-1}, showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞. Following are the limits with one parameter finite (non-zero) and the other approaching these limits: \begin{align} &\lim_{\beta\to 0} H_{1-X} \text{ is undefined} \\ &\lim_{\beta\to 1} H_{1-X} = \lim_{\alpha\to \infty} H_{1-X} = 0 \\ &\lim_{\alpha\to 0} H_{1-X} = \lim_{\beta\to \infty} H_{1-X} = 1 \end{align} Although both HX and H1−X are asymmetric, in the case that both shape parameters are equal (α = β) the harmonic means are equal: HX = H1−X. This equality follows from the following symmetry displayed between both harmonic means: H_X (\Beta(\alpha, \beta) )=H_{1-X}(\Beta(\beta, \alpha) ) \text{ if } \alpha, \beta> 1.
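A minimal sketch (assuming NumPy/SciPy; parameter values are illustrative) of the two harmonic means from their closed forms, checked against Monte Carlo estimates:

```python
import numpy as np
from scipy.stats import beta

a, b = 3.0, 4.0
H_X   = (a - 1) / (a + b - 1)    # harmonic mean of X,     requires alpha > 1
H_1mX = (b - 1) / (a + b - 1)    # harmonic mean of 1 - X, requires beta  > 1

samples = beta.rvs(a, b, size=200_000, random_state=0)
print(H_X,   1 / np.mean(1 / samples))          # formula vs. empirical value
print(H_1mX, 1 / np.mean(1 / (1 - samples)))
```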
Measures of statistical dispersion

Variance

The variance (the second moment centered on the mean) of a beta distribution random variable X with parameters α and β is: \operatorname{var}(X) = \operatorname{E}\left[(X - \mu)^2\right] = \frac{\alpha \beta}{\left(\alpha + \beta\right)^2 \left(\alpha + \beta + 1\right)} Letting α = β in the above expression one obtains \operatorname{var}(X) = \frac{1}{4(2\beta + 1)}, showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum variance var(X) = 1/4, which is only reached in the limit as α = β → 0.

Kurtosis

Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it's much more sensitive to the signal generated by human footsteps than other signals generated by vehicles, winds, noise, etc. Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping use the symbol γ2 for the excess kurtosis, but Abramowitz and Stegun use different terminology. To prevent confusion between kurtosis (the fourth moment centered on the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows: \begin{align} \text{excess kurtosis} &=\text{kurtosis} - 3\\ &=\frac{\operatorname{E}[(X - \mu)^4]}{{(\operatorname{var}(X))^{2}}}-3\\ &=\frac{6[\alpha^3-\alpha^2(2\beta - 1) + \beta^2(\beta + 1) - 2\alpha\beta(\beta + 2)]}{\alpha \beta (\alpha + \beta + 2)(\alpha + \beta + 3)}\\ &=\frac{6[(\alpha - \beta)^2 (\alpha +\beta + 1) - \alpha \beta (\alpha + \beta + 2)]} {\alpha \beta (\alpha + \beta + 2) (\alpha + \beta + 3)} . \end{align} Letting α = β in the above expression one obtains \text{excess kurtosis} =- \frac{6}{3+2\alpha} \text{ if } \alpha = \beta. Therefore, for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "potential outliers" (or "potential rare, extreme values") of the probability distribution, is correct for all distributions including the beta distribution. The more that rare, extreme values can occur in the beta distribution, the higher its kurtosis; otherwise, the kurtosis is lower. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because the side away from the mode will produce occasional extreme values.
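A minimal sketch (assuming NumPy/SciPy; parameter values are illustrative) that checks the closed-form variance and excess kurtosis above against SciPy, which reports excess (Fisher) kurtosis directly:

```python
from scipy.stats import beta

a, b = 2.0, 6.0
var = a * b / ((a + b) ** 2 * (a + b + 1))
exk = (6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
       / (a * b * (a + b + 2) * (a + b + 3)))

m, v, s, k = beta.stats(a, b, moments="mvsk")
print(var, v)    # should agree
print(exk, k)    # should agree
```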
Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends. Using the parametrization in terms of mean μ and sample size ν = α + β: \begin{align} \alpha & {} = \mu \nu ,\text{ where }\nu =(\alpha + \beta) >0\\ \beta & {} = (1 - \mu) \nu , \text{ where }\nu =(\alpha + \beta) >0. \end{align} one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows: \text{excess kurtosis} =\frac{6}{3 + \nu}\bigg (\frac{(1 - 2 \mu)^2 (1 + \nu)}{\mu (1 - \mu) (2 + \nu)} - 1 \bigg ) The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the sample size ν as follows: \text{excess kurtosis} =\frac{6}{(3 + \nu)(2 + \nu)}\left(\frac{1}{\text{var}} - 6 - 5 \nu \right)\text{ if }\text{var} < \mu(1-\mu) and, in terms of the variance var and the mean μ as follows: \text{excess kurtosis} =\frac{6 \text{ var } (1 - \text{ var } - 5 \mu (1 - \mu) )}{(\text{var } + \mu (1 - \mu))(2\text{ var } + \mu (1 - \mu) )}\text{ if }\text{var} < \mu(1-\mu) The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case α = β → 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them. On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end. Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν as follows: \text{excess kurtosis} =\frac{6}{3 + \nu}\bigg(\frac{(2 + \nu)}{4} (\text{skewness})^2 - 1\bigg)\text{ if } (\text{skewness})^2-2 < \text{excess kurtosis} < \tfrac{3}{2} (\text{skewness})^2 From this last expression, one can obtain the same limits published over a century ago by Karl Pearson for the beta distribution (see the section "Kurtosis bounded by the square of the skewness" below): (\text{skewness})^2-2 < \text{excess kurtosis} < \tfrac{3}{2} (\text{skewness})^2

Characteristic function

The characteristic function is the Fourier transform of the probability density function; for the beta distribution it is Kummer's confluent hypergeometric function (of the first kind): \begin{align} \varphi_X(\alpha;\beta;t) &= \operatorname{E}\left[e^{itX}\right]\\ &= \int_0^1 e^{itx} f(x;\alpha,\beta) \, dx \\ &={}_1F_1(\alpha; \alpha+\beta; it)\!\\ &=\sum_{n=0}^\infty \frac {\alpha^\overline{n} (it)^n} {(\alpha+\beta)^\overline{n} n!}\\ &= 1 +\sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha+r}{\alpha+\beta+r} \right) \frac{(it)^k}{k!} \end{align} where x^\overline{n}=x(x+1)(x+2)\cdots(x+n-1) is the rising factorial. The value of the characteristic function for t = 0 is one: \varphi_X(\alpha;\beta;0)={}_1F_1(\alpha; \alpha+\beta; 0) = 1.
Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of variable t: \operatorname{Re} \left [ {}_1F_1(\alpha; \alpha+\beta; it) \right ] = \operatorname{Re} \left [ {}_1F_1(\alpha; \alpha+\beta; - it) \right ] \operatorname{Im} \left [ {}_1F_1(\alpha; \alpha+\beta; it) \right ] = - \operatorname{Im} \left [ {}_1F_1(\alpha; \alpha+\beta; - it) \right ] The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α + β = 2α the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind I_{\alpha-\frac 1 2} ) using Kummer's second transformation as follows: \begin{align} {}_1F_1(\alpha;2\alpha; it) &= e^{\frac{it}{2}} {}_0F_1 \left(; \alpha+\tfrac{1}{2}; \frac{(it)^2}{16} \right) \\ &= e^{\frac{it}{2}} \left(\frac{it}{4}\right)^{\frac{1}{2}-\alpha} \Gamma\left(\alpha+\tfrac{1}{2}\right) I_{\alpha-\frac 1 2} \left(\frac{it}{2}\right).\end{align} In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.

Other moments

Moment generating function

It also follows that the moment generating function is M_X(\alpha; \beta; t) = \operatorname{E}\left[e^{tX}\right] = {}_1F_1(\alpha; \alpha+\beta; t).

Moments of logarithmically transformed random variables

One can also show the following expectations for logarithmically and logit-transformed random variables; such transformations usually turn various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable: \begin{align} \operatorname{E}\left[\ln \frac{X}{1-X} \right] &= \psi(\alpha) - \psi(\beta)= \operatorname{E}[\ln X] +\operatorname{E} \left[\ln \frac{1}{1-X} \right],\\ \operatorname{E}\left [\ln \frac{1-X}{X} \right ] &=\psi(\beta) - \psi(\alpha)= - \operatorname{E} \left[\ln \frac{X}{1-X} \right] . \end{align} Johnson considered the distribution of the logit-transformed variable ln(X/(1 − X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞). The logit of a beta variate has the logistic-beta distribution. Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions as follows: \begin{align} \operatorname{E} \left [\ln^2(X) \right ] &= (\psi(\alpha) - \psi(\alpha + \beta))^2+\psi_1(\alpha)-\psi_1(\alpha+\beta), \\ \operatorname{E} \left [\ln^2(1-X) \right ] &= (\psi(\beta) - \psi(\alpha + \beta))^2+\psi_1(\beta)-\psi_1(\alpha+\beta), \\ \operatorname{E} \left [\ln (X)\ln(1-X) \right ] &=(\psi(\alpha) - \psi(\alpha + \beta))(\psi(\beta) - \psi(\alpha + \beta)) -\psi_1(\alpha+\beta). 
\end{align} therefore the variance of the logarithmic variables and covariance of ln(X) and ln(1−X) are: \begin{align} \operatorname{cov}[\ln X, \ln(1-X)] &= \operatorname{E}\left[\ln X \ln(1-X)\right] - \operatorname{E}[\ln X]\operatorname{E}[\ln(1-X)] \\ &= -\psi_1(\alpha+\beta) \\ & \\ \operatorname{var}[\ln X] &= \operatorname{E}[\ln^2 X] - (\operatorname{E}[\ln X])^2 \\ &= \psi_1(\alpha) - \psi_1(\alpha + \beta) \\ &= \psi_1(\alpha) + \operatorname{cov}[\ln X, \ln(1-X)] \\ & \\ \operatorname{var}[\ln (1-X)] &= \operatorname{E}[\ln^2 (1-X)] - (\operatorname{E}[\ln (1-X)])^2 \\ &= \psi_1(\beta) - \psi_1(\alpha + \beta) \\ &= \psi_1(\beta) + \operatorname{cov}[\ln X, \ln(1-X)] \end{align} where the trigamma function, denoted ψ1(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function: \psi_1(\alpha) = \frac{d^2\ln\Gamma(\alpha)}{d\alpha^2}= \frac{d \psi(\alpha)}{d\alpha}. The variances and covariance of the logarithmically transformed variables X and (1 − X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1 − X), as the logarithm approaches negative infinity for the variable approaching zero. These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation). The variances of the log inverse variables are identical to the variances of the log variables: \begin{align} \operatorname{var}\left[\ln \frac{1}{X} \right] &=\operatorname{var}[\ln X] = \psi_1(\alpha) - \psi_1(\alpha + \beta), \\ \operatorname{var}\left[\ln \frac{1}{1-X} \right] &=\operatorname{var}[\ln (1-X)] = \psi_1(\beta) - \psi_1(\alpha + \beta), \\ \operatorname{cov}\left[\ln \frac{1}{X} ,\, \ln \frac{1}{1-X} \right] &=\operatorname{cov}[\ln X, \ln(1-X)]= -\psi_1(\alpha + \beta).\end{align} It also follows that the variances of the logit-transformed variables are \begin{align} \operatorname{var}\left[\ln \frac{X}{1-X} \right] &= \operatorname{var}\left[\ln \frac{1-X}{X} \right] \\ &= -\operatorname{cov}\left [\ln \frac{X}{1-X}, \, \ln \frac{1-X}{X} \right] \\[1ex] &= \psi_1(\alpha) + \psi_1(\beta). \end{align} Quantities of information (entropy) Given a beta distributed random variable, X ~ Beta(αβ), the differential entropy of X is (measured in nats), the expected value of the negative of the logarithm of the probability density function: \begin{align} h(X) &= \operatorname{E}\left[-\ln f(X;\alpha,\beta)\right] \\[4pt] &= \int_0^1 -f(x;\alpha,\beta) \ln f(x;\alpha,\beta) \, dx \\[4pt] &= \ln \Beta(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1) \psi(\beta)+(\alpha+\beta-2) \psi(\alpha+\beta) \end{align} where f(x; α, β) is the probability density function of the beta distribution: f(x;\alpha,\beta) = \frac{x^{\alpha-1} \left(1-x\right)^{\beta-1}}{\Beta(\alpha,\beta)} The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers which follows from the integral: \int_0^1 \frac {1-x^{\alpha-1}}{1-x} \, dx = \psi(\alpha)-\psi(1) The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. 
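A minimal sketch (assuming NumPy/SciPy; parameter values and sample size are illustrative) of the trigamma-based variances/covariance of ln X and ln(1 − X) and of the differential entropy formula above, each checked by Monte Carlo:

```python
import numpy as np
from scipy.special import polygamma, digamma, betaln
from scipy.stats import beta

a, b = 2.0, 3.0
trigamma = lambda z: polygamma(1, z)

var_lnX  = trigamma(a) - trigamma(a + b)
cov_logs = -trigamma(a + b)
h = (betaln(a, b) - (a - 1) * digamma(a) - (b - 1) * digamma(b)
     + (a + b - 2) * digamma(a + b))            # differential entropy in nats

x = beta.rvs(a, b, size=300_000, random_state=0)
print(var_lnX, np.var(np.log(x)))                             # close to each other
print(cov_logs, np.cov(np.log(x), np.log(1 - x))[0, 1])       # close to each other
print(h, -np.mean(beta.logpdf(x, a, b)), beta.entropy(a, b))  # three estimates of h(X)
```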
It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable. For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else. The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part of the same paper where he defined the discrete entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy. Given two beta distributed random variables, X1 ~ Beta(α, β) and X2 ~ Beta(α′, β′), the cross-entropy is (measured in nats) \begin{align} H(X_1,X_2) &= \int_0^1 - f(x;\alpha,\beta) \ln f(x;\alpha',\beta') \,dx \\[4pt] &= \ln \Beta(\alpha',\beta') - (\alpha'-1)\psi(\alpha) - (\beta'-1)\psi(\beta) + \left(\alpha'+\beta'-2\right) \psi(\alpha+\beta). \end{align} The cross entropy has been used as an error metric to measure the distance between two hypotheses. Its absolute value is minimum when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood (see the section on parameter estimation by maximum likelihood).

Mean, mode and median relationship

If 1 < α < β then mode ≤ median ≤ mean. Expressing the mode (only for α, β > 1) and the mean in terms of α and β: \frac{ \alpha - 1 }{ \alpha + \beta - 2 } \le \text{median} \le \frac{ \alpha }{ \alpha + \beta } . If 1 < β < α the order of these inequalities is reversed. For α, β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, for the (pathological) case of α and β both approaching 1, for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder". For example, for α = 1.0001 and β = 1.00000001:
• mode = 0.9999; PDF(mode) = 1.00010
• mean = 0.500025; PDF(mean) = 1.00003
• median = 0.500035; PDF(median) = 1.00003
• mean − mode = −0.499875
• mean − median = −9.65538 × 10−6
where PDF stands for the value of the probability density function.

Mean, geometric mean and harmonic mean relationship

It follows from Jensen's inequality that the harmonic mean, the geometric mean and the (arithmetic) mean of a beta distribution satisfy H_X \le G_X \le \mu_X.

Kurtosis bounded by the square of the skewness

As remarked by Feller, in the Pearson system the beta probability density appears as type I. Karl Pearson, in a paper published in 1916, presented a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.
The region occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane, or the (skewness², excess kurtosis) plane: (\text{skewness})^2+1 < \text{kurtosis} < \tfrac{3}{2} (\text{skewness})^2 + 3 or, equivalently, (\text{skewness})^2-2 < \text{excess kurtosis} < \tfrac{3}{2} (\text{skewness})^2 At a time when there were no powerful digital computers, Karl Pearson accurately computed further boundaries. Pearson (1895, pp. 357, 360, 373–376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k".) Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as is apparent from the fact that for the chi-squared distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution. An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α = 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value of −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards.) Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region". The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal U-shaped distributions for which the parameters α and β approach zero and hence all the probability density is concentrated at the ends: x = 0, 1 with practically nothing in between them. Since for α ≈ 0 and β ≈ 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a Bernoulli distribution, where the two only possible outcomes occur with respective probabilities p and q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈ 1/2.
For cases approaching this limit boundary with skewness (α ≠ β), excess kurtosis ≈ −2 + skewness², and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p = \tfrac{\beta}{\alpha + \beta} at the left end x = 0 and q = 1-p = \tfrac{\alpha}{\alpha + \beta} at the right end x = 1.

Symmetry

All statements are conditional on α, β > 0:
• Probability density function reflection symmetry: f(x;\alpha,\beta) = f(1-x;\beta,\alpha)
• Cumulative distribution function reflection symmetry plus unitary translation: F(x;\alpha,\beta) = I_x(\alpha,\beta) = 1- F(1- x;\beta,\alpha) = 1 - I_{1-x}(\beta,\alpha)
• Mode reflection symmetry plus unitary translation: \operatorname{mode}(\Beta(\alpha, \beta))= 1-\operatorname{mode}(\Beta(\beta, \alpha)),\text{ if }\Beta(\beta, \alpha)\ne \Beta(1,1)
• Median reflection symmetry plus unitary translation: \operatorname{median} (\Beta(\alpha, \beta) )= 1 - \operatorname{median} (\Beta(\beta, \alpha))
• Mean reflection symmetry plus unitary translation: \mu (\Beta(\alpha, \beta) )= 1 - \mu (\Beta(\beta, \alpha) )
• Geometric means: each is individually asymmetric, but the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection 1−X: G_X (\Beta(\alpha, \beta) ) = G_{1-X}(\Beta(\beta, \alpha) )
• Harmonic means: each is individually asymmetric, but the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection 1−X: H_X (\Beta(\alpha, \beta) ) = H_{1-X}(\Beta(\beta, \alpha) ) \text{ if } \alpha, \beta > 1.
• Variance symmetry: \operatorname{var} (\Beta(\alpha, \beta) )=\operatorname{var} (\Beta(\beta, \alpha) )
• Geometric variances: each is individually asymmetric, but the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection 1−X: \ln(\operatorname{var}_{GX} (\Beta(\alpha, \beta))) = \ln(\operatorname{var}_{G(1-X)}(\Beta(\beta, \alpha)))
• Geometric covariance symmetry: \ln \operatorname{cov}_{GX,(1-X)}(\Beta(\alpha, \beta))=\ln \operatorname{cov}_{GX,(1-X)}(\Beta(\beta, \alpha))
• Mean absolute deviation around the mean symmetry: \operatorname{E}[|X - E[X]| ] (\Beta(\alpha, \beta))=\operatorname{E}[| X - E[X]|] (\Beta(\beta, \alpha))
• Skewness skew-symmetry: \operatorname{skewness} (\Beta(\alpha, \beta) )= - \operatorname{ skewness} (\Beta(\beta, \alpha) )
• Excess kurtosis symmetry: \text{excess kurtosis} (\Beta(\alpha, \beta) )= \text{excess kurtosis} (\Beta(\beta, \alpha) )
• Characteristic function symmetry of the real part (with respect to the origin of variable "t"): \text{Re} [{}_1F_1(\alpha; \alpha+\beta; it) ] = \text{Re} [ {}_1F_1(\alpha; \alpha+\beta; - it)]
• Characteristic function skew-symmetry of the imaginary part (with respect to the origin of variable "t"): \text{Im} [{}_1F_1(\alpha; \alpha+\beta; it) ] = - \text{Im} [ {}_1F_1(\alpha; \alpha+\beta; - it) ]
• Characteristic function symmetry of the absolute value (with respect to the origin of variable "t"): \text{Abs} [ {}_1F_1(\alpha; \alpha+\beta; it) ] = \text{Abs} [ {}_1F_1(\alpha; \alpha+\beta; - it) ]
• Differential entropy symmetry: h(\Beta(\alpha, \beta) )= h(\Beta(\beta, \alpha) )
• Relative entropy (also called Kullback–Leibler divergence) symmetry: D_{\mathrm{KL}}(X_1\parallel X_2) = D_{\mathrm{KL}}(X_2\parallel X_1), \text{ if }h(X_1) = h(X_2)\text{, for (skewed) }\alpha \neq \beta
• Fisher information matrix symmetry: {\mathcal{I}}_{i, j} = {\mathcal{I}}_{j, i}
Geometry of the probability density function

Inflection points

For certain values of the shape parameters α and β, the probability density function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution. Defining the following quantity: \kappa =\frac{\sqrt{\frac{(\alpha-1)(\beta-1)}{\alpha+\beta-3}}}{\alpha+\beta-2} Points of inflection, when they exist, occur at x = \text{mode} \pm \kappa. For bell-shaped densities with α, β > 2 there are two inflection points, one on either side of the mode; in other cases there may be a single inflection point inside the support, or none.

Shapes

The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β.

Symmetric (α = β): the density function is symmetric about 1/2, and mean = median = 1/2.
• α = β < 1
• U-shaped, bimodal: left mode = 0, right mode = 1, anti-mode = 1/2
• 1/12 < var(X) < 1/4 and −2 < excess kurtosis(X) < −6/5
• α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1.
• \lim_{\alpha = \beta \to 0} \operatorname{var}(X) = \tfrac{1}{4}
• \lim_{\alpha = \beta \to 0} \operatorname{excess \ kurtosis}(X) = - 2 (a lower value than this is impossible for any distribution to reach)
• The differential entropy approaches a minimum value of −∞
• α = β = 1
• the continuous uniform distribution on [0, 1]
• no mode
• var(X) = 1/12
• excess kurtosis(X) = −6/5
• The (negative anywhere else) differential entropy reaches its maximum value of zero
• CF = Sinc (t)
• α = β > 1
• symmetric unimodal
• mode = 1/2
• 0 < var(X) < 1/12 and −6/5 < excess kurtosis(X) < 0
• α = β > 2 is bell-shaped, with inflection points located to either side of the mode
• \lim_{\alpha = \beta \to \infty} \operatorname{var}(X) = 0
• \lim_{\alpha = \beta \to \infty} \operatorname{excess \ kurtosis}(X) = 0
• The differential entropy approaches a minimum value of −∞

Skewed (α ≠ β)

The density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve. Some more specific cases:
• α < 1, β < 1
• U-shaped, with positive skew for α < β and negative skew for α > β
• bimodal: left mode = 0, right mode = 1, anti-mode = \tfrac{\alpha-1}{\alpha + \beta-2}
• 0 < median < 1
• 0 < var(X) < 1/4
• α > 1, β > 1
• unimodal (magenta & cyan plots), with positive skew for α < β and negative skew for α > β
• \text{mode}= \tfrac{\alpha-1}{\alpha + \beta-2}
• 0 < median < 1
• 0 < var(X) < 1/12
• α < 1, β ≥ 1
• reverse J-shaped with a right tail, strictly decreasing, positively skewed, mode = 0
• 0 < var(X) (maximum variance occurs for \alpha=\tfrac{-1+\sqrt{5}}{2}, \beta=1, or α = Φ the golden ratio conjugate)
• α ≥ 1, β < 1
• J-shaped with a left tail, strictly increasing, negatively skewed, mode = 1
• 0 < var(X) (maximum variance occurs for \alpha=1, \beta=\tfrac{-1+\sqrt{5}}{2}, or β = Φ the golden ratio conjugate)
• α = 1, β > 1
• positively skewed
• strictly decreasing (red plot)
• a reversed (mirror-image) power function distribution
• mean = 1 / (β + 1)
• median = 1 − 2^{−1/β}
• mode = 0
• α = 1, 1 < β < 2
• concave
• 1-\tfrac{1}{\sqrt{2}} < median < \tfrac{1}{2}
• 1/18 < var(X) < 1/12
• α = 1, β = 2
• a straight line with slope −2, the right-triangular distribution with right angle at the left end
• \text{median}=1-\tfrac {1}{\sqrt{2}}
• var(X) = 1/18
• α = 1, β > 2
• reverse J-shaped with a right tail
• convex
• 0 < median < 1-\tfrac{1}{\sqrt{2}}
• 0 < var(X) < 1/18
• α > 1, β = 1
• negatively skewed
• strictly increasing (green plot)
• the power function distribution
• mean = α / (α + 1)
• median = 2^{−1/α}
• mode = 1
• 2 > α > 1, β = 1
• concave
• \tfrac{1}{2} < median < \tfrac{1}{\sqrt{2}}
• 1/18 < var(X) < 1/12
• α = 2, β = 1
• a straight line with slope +2, the right-triangular distribution with right angle at the right end
• \text{median}=\tfrac {1}{\sqrt{2}}
• var(X) = 1/18
• α > 2, β = 1
• J-shaped with a left tail, convex
• \tfrac{1}{\sqrt{2}} < median < 1
• 0 < var(X) < 1/18
Related distributions
Transformations
• If X ~ Beta(α, β) then 1 − X ~ Beta(β, α), mirror-image symmetry
• If X ~ Beta(α, β) then \tfrac{X}{1-X} \sim {\beta'}(\alpha,\beta), the beta prime distribution, also called "beta distribution of the second kind".
• If X\sim\text{Beta}(\alpha,\beta), then Y=\log\frac{X}{1-X} has a generalized logistic distribution, with density \frac{\sigma(y)^\alpha\sigma(-y)^\beta}{B(\alpha,\beta)}, where \sigma is the logistic sigmoid.
• If X ~ Beta(α, β) then \tfrac{1}{X} -1 \sim {\beta'}(\beta,\alpha).
• If X\sim\text{Beta}(\alpha_1,\beta_1) and Y\sim\text{Beta}(\alpha_2,\beta_2) then Z = \tfrac{X}{Y} has density \tfrac{B(\alpha_1 +\alpha_2, \beta_2) z^{\alpha_1 - 1} {}_2F_1(\alpha_1 + \alpha_2, 1- \beta_1; \alpha_1 +\alpha_2 + \beta_2; z) }{B(\alpha_1, \beta_1)B(\alpha_2, \beta_2)} for 0 < z \leq 1 and \tfrac{B(\alpha_1 +\alpha_2, \beta_1) z^{-(\alpha_2 + 1)} {}_2F_1(\alpha_1 + \alpha_2, 1- \beta_2; \alpha_1 +\alpha_2 + \beta_1; \tfrac{1}{z})}{B(\alpha_1, \beta_1)B(\alpha_2, \beta_2)} for z \geq 1 , where {}_2F_1(a, b; c; x) is the hypergeometric function.
• If X ~ Beta(n/2, m/2) then \tfrac{mX}{n(1-X)} \sim F(n,m) (assuming n > 0 and m > 0), the Fisher–Snedecor F distribution.
• If X \sim \operatorname{Beta}\left(1+\lambda\tfrac{m-\min}{\max-\min}, 1 + \lambda\tfrac{\max-m}{\max-\min}\right) then min + X(max − min) ~ PERT(min, max, m, λ) where PERT denotes a PERT distribution used in PERT analysis, and m = most likely value. Traditionally λ = 4 in PERT analysis.
• If X ~ Beta(1, β) then X ~ Kumaraswamy distribution with parameters (1, β)
• If X ~ Beta(α, 1) then X ~ Kumaraswamy distribution with parameters (α, 1)
• If X ~ Beta(α, 1) then −ln(X) ~ Exponential(α)

Special and limiting cases

The Beta(1/2, 1/2) arcsine probability density was proposed by Harold Jeffreys to represent uncertainty for a Bernoulli or a binomial distribution in Bayesian inference, and is now commonly referred to as Jeffreys prior: p−1/2(1 − p)−1/2. This distribution also appears in several random walk fundamental theorems.
• Beta(1, 1) ~ U(0, 1) with density 1 on that interval.
• Beta(n, 1) ~ Maximum of n independent rvs. with U(0, 1), sometimes called a standard power function distribution with density n x^{n−1} on that interval.
• Beta(1, n) ~ Minimum of n independent rvs. with U(0, 1), with density n(1 − x)^{n−1} on that interval.
• If X ~ Beta(3/2, 3/2) and r > 0 then 2rX − r ~ Wigner semicircle distribution.
• Beta(1/2, 1/2) is equivalent to the arcsine distribution. This distribution is also Jeffreys prior probability for the Bernoulli and binomial distributions.
• \lim_{n \to \infty} n \operatorname{Beta}(1,n) = \operatorname{Exponential}(1) the exponential distribution.
• \lim_{n \to \infty} n \operatorname{Beta}(k,n) = \operatorname{Gamma}(k,1) the gamma distribution.
• For large n, \operatorname{Beta}(\alpha n,\beta n) \to \mathcal{N}\left(\frac{\alpha}{\alpha+\beta},\frac{\alpha\beta}{(\alpha+\beta)^3}\frac{1}{n}\right) the normal distribution. More precisely, if X_n \sim \operatorname{Beta}(\alpha n,\beta n) then \sqrt{n}\left(X_n -\tfrac{\alpha}{\alpha+\beta}\right) converges in distribution to a normal distribution with mean 0 and variance \tfrac{\alpha\beta}{(\alpha+\beta)^3} as n increases.

Derived from other distributions
• The kth order statistic of a sample of size n from the uniform distribution is a beta random variable, U(k) ~ Beta(k, n+1−k).
• Gamma distribution: If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then \tfrac{X}{X+Y} \sim \operatorname{Beta}(\alpha, \beta)\, (see the numerical sketch after this list).
• Chi-squared distribution: If X \sim \chi^2(\alpha)\, and Y \sim \chi^2(\beta)\, are independent, then \tfrac{X}{X+Y} \sim \operatorname{Beta}(\tfrac{\alpha}{2}, \tfrac{\beta}{2}).
• The power transformation for the uniform distribution: If X ~ U(0, 1) and α > 0 then X^{1/α} ~ Beta(α, 1).
• Cauchy distribution: If X ~ Cauchy(0, 1) then \tfrac{1}{1+X^2} \sim \operatorname{Beta}\left(\tfrac12, \tfrac12\right)\,

Combination with other distributions
• If X ~ Beta(α, β) and Y ~ F(2β, 2α) then \Pr(X \leq \tfrac \alpha {\alpha+\beta x}) = \Pr(Y \geq x)\, for all x > 0.

Compounding with other distributions
• If p ~ Beta(α, β) and X ~ Bin(k, p) then X ~ beta-binomial distribution
• If p ~ Beta(α, β) and X ~ NB(r, p) then X ~ beta negative binomial distribution

Generalisations
• The generalization to multiple variables, i.e. a multivariate Beta distribution, is called a Dirichlet distribution. Univariate marginals of the Dirichlet distribution have a beta distribution. The beta distribution is conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is conjugate to the multinomial distribution and categorical distribution.
• The Pearson type I distribution is identical to the beta distribution (except for arbitrary shifting and re-scaling that can also be accomplished with the four parameter parametrization of the beta distribution).
• The beta distribution is the special case of the noncentral beta distribution where \lambda = 0: \operatorname{Beta}(\alpha, \beta) = \operatorname{NonCentralBeta}(\alpha,\beta,0).
• The generalized beta distribution is a five-parameter distribution family which has the beta distribution as a special case.
• The matrix variate beta distribution is a distribution for positive-definite matrices.
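A minimal sketch (assuming NumPy/SciPy; parameter values and sample size are illustrative) of the gamma-ratio construction from the list above, with a Kolmogorov–Smirnov check that the resulting ratio follows the expected beta distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.5, 4.0
x = rng.gamma(shape=a, scale=1.0, size=100_000)
y = rng.gamma(shape=b, scale=1.0, size=100_000)
z = x / (x + y)                                # should be Beta(a, b) distributed

print(stats.kstest(z, stats.beta(a, b).cdf))   # large p-value: consistent with Beta(a, b)
```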
Statistical inference
Parameter estimation

Method of moments

Two unknown parameters

Two unknown parameters (\hat{\alpha}, \hat{\beta}) of a beta distribution supported in the [0,1] interval can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let: \text{sample mean(X)}=\bar{x} = \frac{1}{N}\sum_{i=1}^N X_i be the sample mean estimate and \text{sample variance(X)} =\bar{v} = \frac{1}{N-1}\sum_{i=1}^N \left(X_i - \bar{x}\right)^2 be the sample variance estimate. The method-of-moments estimates of the parameters are \hat{\alpha} = \bar{x} \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right)\ \text{if}\ \bar{v} < \bar{x}(1-\bar{x}) and \hat{\beta} = (1-\bar{x}) \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right)\ \text{if}\ \bar{v} < \bar{x}(1-\bar{x}). When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace \bar{x} with \frac{\bar{y}-a}{c-a}, and \bar{v} with \frac{\bar{v}_Y}{(c-a)^2} in the above couple of equations for the shape parameters (see the "Four unknown parameters" section below), where: \text{sample mean(Y)}=\bar{y} = \frac{1}{N}\sum_{i=1}^N Y_i \text{sample variance(Y)} = \bar{v}_Y = \frac{1}{N-1}\sum_{i=1}^N \left(Y_i - \bar{y}\right)^2

Four unknown parameters

All four parameters (\hat{\alpha}, \hat{\beta}, \hat{a}, \hat{c}) of a beta distribution supported in the [a, c] interval (see section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis). The excess kurtosis was expressed in terms of the square of the skewness and the sample size ν = α + β (see previous section "Kurtosis") as follows: \text{excess kurtosis} =\frac{6}{3 + \nu}\left(\frac{(2 + \nu)}{4} (\text{skewness})^2 - 1\right)\text{ if } (\text{skewness})^2-2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2 One can use this equation to solve for the sample size ν = α + β in terms of the square of the skewness and the excess kurtosis as follows: \hat{\nu} = \hat{\alpha} + \hat{\beta} = 3\,\frac{(\text{sample excess kurtosis}) - (\text{sample skewness})^2 + 2}{\frac{3}{2}(\text{sample skewness})^2 - (\text{sample excess kurtosis})}, \text{ if } (\text{sample skewness})^2-2 < \text{sample excess kurtosis} < \tfrac{3}{2}(\text{sample skewness})^2 As Bowman and Shenton remark, sampling in the neighborhood of the line (sample excess kurtosis − (3/2)(sample skewness)² = 0) (the just-J-shaped portion of the rear edge where blue meets beige) "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero and hence ν approaches infinity as that line is approached. Bowman and Shenton concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS and PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill). Gnanadesikan et al. give numerical solutions for a few cases.
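A minimal sketch (assuming NumPy/SciPy; the function name and data are illustrative) of the two-parameter method-of-moments estimates from data on [0, 1], valid only when the sample variance is below x̄(1 − x̄):

```python
import numpy as np
from scipy.stats import beta

def beta_mom(x):
    x_bar = np.mean(x)
    v_bar = np.var(x, ddof=1)
    if v_bar >= x_bar * (1 - x_bar):
        raise ValueError("sample variance too large for a beta model")
    common = x_bar * (1 - x_bar) / v_bar - 1   # this is nu = alpha + beta
    return x_bar * common, (1 - x_bar) * common

data = beta.rvs(2.0, 5.0, size=10_000, random_state=1)
print(beta_mom(data))    # estimates should be near the true (2.0, 5.0)
```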
Maximum likelihood

Two unknown parameters

For N iid observations X1, ..., XN, maximizing the log likelihood with respect to the shape parameters gives the coupled maximum likelihood equations \psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^N \ln X_i = \ln \hat{G}_X and \psi(\hat{\beta}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^N \ln (1-X_i) = \ln \hat{G}_{(1-X)}, where ψ is the digamma function and \hat{G}_X, \hat{G}_{(1-X)} are the sample geometric means based on X and on its mirror image (1 − X). These equations have no closed-form solution in general. Subtracting the second equation from the first and following N. L. Johnson and S. Kotz, \hat{\alpha} can be obtained from the inverse digamma function of the right hand side of this equation: \psi(\hat{\alpha})=\frac{1}{N}\sum_{i=1}^N \ln\frac{X_i}{1-X_i} + \psi(\hat{\beta}) \hat{\alpha} = \psi^{-1} \left(\ln \hat{G}_X - \ln \hat{G}_{(1-X)} + \psi(\hat{\beta})\right) In particular, if one of the shape parameters has a value of unity, for example for \hat{\beta} = 1 (the power function distribution with bounded support [0,1]), using the identity ψ(x + 1) = ψ(x) + 1/x in the equation \psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta})= \ln \hat{G}_X, the maximum likelihood estimator for the unknown parameter \hat{\alpha} is \hat{\alpha} = -\frac{1}{\ln \hat{G}_X} = -\frac{N}{\sum_{i=1}^N \ln X_i}.

Fisher information matrix

Let a random variable X have a probability density f(x; α) depending on a parameter α. The partial derivative with respect to α of the log likelihood function is called the score, and the second moment of the score is the Fisher information: \mathcal{I}(\alpha) = \operatorname{E}\left[\left(\frac{\partial}{\partial\alpha} \ln \mathcal{L}(\alpha\mid X)\right)^2\right]. If the log likelihood function is twice differentiable with respect to the parameter α, and under certain regularity conditions, then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes): \mathcal{I}(\alpha) = - \operatorname{E} \left [\frac{\partial^2}{\partial\alpha^2} \ln \mathcal{L}(\alpha\mid X) \right]. Thus, the Fisher information is the negative of the expectation of the second derivative with respect to the parameter α of the log likelihood function. Therefore, Fisher information is a measure of the curvature of the log likelihood function of α. A low curvature (and therefore high radius of curvature), flatter log likelihood function curve has low Fisher information; while a log likelihood function curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is computed at the estimates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a Taylor's series approximation, taken as far as the quadratic terms. The word information, in the context of Fisher information, refers to information about the parameters. Information such as: estimation, sufficiency and properties of variances of estimators. The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any estimator of a parameter α: \operatorname{var}[\hat\alpha] \geq \frac{1}{\mathcal{I}(\alpha)}. The precision to which one can estimate the estimator of a parameter α is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses of a parameter. When there are N parameters \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_N \end{bmatrix}, then the Fisher information takes the form of an N×N positive semidefinite symmetric matrix, the Fisher information matrix, with typical element: (\mathcal{I}(\theta))_{i, j} = \operatorname{E} \left [\frac{\partial \ln \mathcal{L}}{\partial\theta_i} \cdot \frac {\partial \ln \mathcal{L}} {\partial\theta_j} \right ]. Under certain regularity conditions, it can further be shown that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set.
Two parameters

For X1, ..., XN independent random variables each having a beta distribution parametrized with shape parameters α and β, the joint log likelihood function for N iid observations is: \ln \mathcal{L} (\alpha, \beta\mid X) = (\alpha - 1)\sum_{i=1}^N \ln X_i + (\beta- 1)\sum_{i=1}^N \ln (1-X_i)- N \ln \Beta(\alpha,\beta) therefore the joint log likelihood function per N iid observations is \frac{1}{N} \ln \mathcal{L} (\alpha, \beta \mid X) = (\alpha - 1)\frac{1}{N}\sum_{i=1}^N \ln X_i + (\beta- 1) \frac{1}{N}\sum_{i=1}^N \ln (1-X_i)-\, \ln \Beta(\alpha,\beta). For the two parameter case, the Fisher information has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, only one of these off-diagonal components is independent. Therefore, the Fisher information matrix has 3 independent components (2 diagonal and 1 off diagonal). Aryal and Nadarajah calculated Fisher's information matrix for the four-parameter case, from which the two parameter case can be obtained as follows: - \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \alpha^2}= \operatorname{var}[\ln (X)]= \psi_1(\alpha) - \psi_1(\alpha + \beta) ={\mathcal{I}}_{\alpha, \alpha}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \alpha^2} \right ] = \ln \operatorname{var}_{GX} - \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\,\partial \beta^2} = \operatorname{var}[\ln (1-X)] = \psi_1(\beta) - \psi_1(\alpha + \beta) ={\mathcal{I}}_{\beta, \beta}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \beta^2} \right]= \ln \operatorname{var}_{G(1-X)} - \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N \, \partial \alpha \, \partial \beta} = \operatorname{cov}[\ln X,\ln(1-X)] = -\psi_1(\alpha+\beta) ={\mathcal{I}}_{\alpha, \beta}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\,\partial \alpha\,\partial \beta} \right] = \ln \operatorname{cov}_{G{X,(1-X)}} Since the Fisher information matrix is symmetric \mathcal{I}_{\alpha, \beta}= \mathcal{I}_{\beta, \alpha}= \ln \operatorname{cov}_{G{X,(1-X)}} The Fisher information components are equal to the log geometric variances and log geometric covariance. Therefore, they can be expressed as trigamma functions, denoted ψ1(α), the second of the polygamma functions, defined as the derivative of the digamma function: \psi_1(\alpha) = \frac{d^2\ln\Gamma(\alpha)}{d\alpha^2}=\, \frac{d \psi(\alpha)}{d\alpha}. These quantities are derived in the section titled "Moments of logarithmically transformed random variables", which gives formulas for the moments, variances and covariance of ln(X) and ln(1 − X) in terms of digamma and trigamma functions. The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability).
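A minimal sketch (assuming NumPy/SciPy; the data, starting point and variable names are illustrative) of solving the coupled digamma (maximum likelihood) equations numerically and assembling the per-observation Fisher information matrix from trigamma values:

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.optimize import fsolve
from scipy.stats import beta

data = beta.rvs(2.0, 5.0, size=5_000, random_state=2)
lnGx, lnG1mx = np.mean(np.log(data)), np.mean(np.log(1 - data))

def ml_equations(p):
    a, b = p
    return [digamma(a) - digamma(a + b) - lnGx,
            digamma(b) - digamma(a + b) - lnG1mx]

a_hat, b_hat = fsolve(ml_equations, x0=[1.0, 1.0])
print(a_hat, b_hat)              # near the true (2.0, 5.0)

tri = lambda z: polygamma(1, z)  # trigamma
fisher = np.array([[tri(a_hat) - tri(a_hat + b_hat), -tri(a_hat + b_hat)],
                   [-tri(a_hat + b_hat),             tri(b_hat) - tri(a_hat + b_hat)]])
print(np.linalg.det(fisher))     # positive: the matrix is positive-definite
```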
From the expressions for the individual components of the Fisher information matrix, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution is: \begin{align} \det(\mathcal{I}(\alpha, \beta))&= \mathcal{I}_{\alpha, \alpha} \mathcal{I}_{\beta, \beta}-\mathcal{I}_{\alpha, \beta} \mathcal{I}_{\alpha, \beta} \\[4pt] &=(\psi_1(\alpha) - \psi_1(\alpha + \beta))(\psi_1(\beta) - \psi_1(\alpha + \beta))-( -\psi_1(\alpha+\beta))( -\psi_1(\alpha+\beta))\\[4pt] &= \psi_1(\alpha)\psi_1(\beta)-( \psi_1(\alpha)+\psi_1(\beta))\psi_1(\alpha + \beta)\\[4pt] \lim_{\alpha\to 0} \det(\mathcal{I}(\alpha, \beta)) &=\lim_{\beta \to 0} \det(\mathcal{I}(\alpha, \beta)) = \infty\\[4pt] \lim_{\alpha\to \infty} \det(\mathcal{I}(\alpha, \beta)) &=\lim_{\beta \to \infty} \det(\mathcal{I}(\alpha, \beta)) = 0 \end{align} From Sylvester's criterion (positivity of the leading principal minors), it follows that the Fisher information matrix for the two-parameter case is positive-definite (under the standard condition that the shape parameters are positive, α > 0 and β > 0). Four parameters If Y1, ..., YN are independent random variables each having a beta distribution with four parameters: the exponents α and β, and also a (the minimum of the distribution range) and c (the maximum of the distribution range) (see the section titled "Alternative parametrizations, Four parameters"), with probability density function: f(y; \alpha, \beta, a, c) = \frac{f(x;\alpha,\beta)}{c-a} =\frac{ \left (\frac{y-a}{c-a} \right )^{\alpha-1} \left (\frac{c-y}{c-a} \right)^{\beta-1} }{(c-a)B(\alpha, \beta)}=\frac{ (y-a)^{\alpha-1} (c-y)^{\beta-1} }{(c-a)^{\alpha+\beta-1}B(\alpha, \beta)}, the joint log likelihood function per N iid observations is: \frac{1}{N} \ln \mathcal{L} (\alpha, \beta, a, c\mid Y) = \frac{\alpha -1}{N}\sum_{i=1}^N \ln (Y_i - a) + \frac{\beta -1}{N}\sum_{i=1}^N \ln (c - Y_i)- \ln \Beta(\alpha,\beta) - (\alpha+\beta -1) \ln (c-a) For the four-parameter case, the Fisher information has 4 × 4 = 16 components, 12 of which are off-diagonal. Since the Fisher information matrix is symmetric, half of these off-diagonal components (12/2 = 6) are independent, so the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components; Aryal and Nadarajah calculated this matrix for the four-parameter case. Rule of succession A classic use of the beta distribution in Bayesian inference is Laplace's rule of succession, introduced in the 18th century in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent Bernoulli trials with probability p, the estimate of the expected value in the next trial is \frac{s+1}{n+2}. This estimate is the expected value of the posterior distribution over p, namely Beta(s+1, n−s+1), which is given by Bayes' rule if one assumes a uniform prior probability over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials. Laplace's rule of succession has been criticized by prominent scientists. R. T. Cox described Laplace's application of the rule of succession to the sunrise problem ( p. 89) as "a travesty of the proper use of the principle". Keynes remarks ( Ch.XXX, p. 382) "indeed this is so foolish a theorem that to entertain it is discreditable". Karl Pearson showed that the probability that the next (n + 1) trials will all be successes, after n successes in n trials, is only 50%, which has been considered too low by scientists like Jeffreys and unacceptable as a representation of the scientific process of experimentation to test a proposed scientific law.
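As a concrete illustration of the rule of succession before turning to its critics, the posterior mean under a uniform Beta(1,1) prior can be computed directly. A minimal sketch (the function name rule_of_succession is an illustrative choice):

```python
from scipy.stats import beta

def rule_of_succession(s, n):
    """Laplace's rule of succession: posterior mean of p after s successes
    in n Bernoulli trials, starting from a uniform Beta(1, 1) prior.
    The posterior is Beta(s + 1, n - s + 1), whose mean is (s + 1)/(n + 2).
    """
    posterior = beta(s + 1, n - s + 1)
    return posterior.mean()

print(rule_of_succession(s=10, n=10))   # 11/12, not 1, despite 100% successes
print((10 + 1) / (10 + 2))              # same value from the closed form
```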
As pointed out by Jeffreys, Laplace's rule of succession establishes a high probability of success ((n+1)/(n+2)) in the next trial, but only a moderate probability (50%) that a further sample (n+1) comparable in size will be equally successful. As pointed out by Perks, "The rule of succession itself is hard to accept. It assigns a probability to the next trial which implies the assumption that the actual run observed is an average run and that we are always at the end of an average run. It would, one would think, be more reasonable to assume that we were in the middle of an average run. Clearly a higher value for both probabilities is necessary if they are to accord with reasonable belief." These problems with Laplace's rule of succession motivated Haldane, Perks, Jeffreys and others to search for other forms of prior probability (see the next section). According to Jaynes, the prior probability representing complete uncertainty should be proportional to p−1(1−p)−1. The function p−1(1−p)−1 can be viewed as the limit of the numerator of the beta distribution as both shape parameters approach zero: α, β → 0. The Beta function (in the denominator of the beta distribution) approaches infinity as both parameters approach zero, α, β → 0. Therefore, p−1(1−p)−1 divided by the Beta function approaches a 2-point Bernoulli distribution with equal probability 1/2 at each end, at 0 and 1, and nothing in between, as α, β → 0: in effect a coin toss, with one face of the coin at 0 and the other face at 1. The Haldane prior probability distribution Beta(0,0) is an "improper prior" because its integration (from 0 to 1) fails to strictly converge to 1 due to the singularities at each end. However, this is not an issue for computing posterior probabilities unless the sample size is very small. Furthermore, Zellner points out that on the log-odds scale (the logit transformation \log(p/(1-p))), the Haldane prior is the uniformly flat prior. The fact that a uniform prior probability on the logit-transformed variable ln(p/(1 − p)) (with domain (−∞, ∞)) is equivalent to the Haldane prior on the domain [0, 1] was pointed out by Harold Jeffreys in the first edition (1939) of his book Theory of Probability. Jeffreys prior probability (Beta(1/2,1/2) for a Bernoulli or a binomial distribution) Jeffreys proposed to use an uninformative prior probability measure that should be invariant under reparameterization: proportional to the square root of the determinant of Fisher's information matrix. For the Bernoulli distribution, this can be shown as follows: for a coin that is "heads" with probability p ∈ [0, 1] and is "tails" with probability 1 − p, for a given (H,T) ∈ {(0,1), (1,0)} the probability is pH(1 − p)T. Since T = 1 − H, the Bernoulli distribution is pH(1 − p)1 − H. Considering p as the only parameter, it follows that the log likelihood for the Bernoulli distribution is \ln \mathcal{L} (p\mid H) = H \ln p + (1-H) \ln(1-p). The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: p), therefore: \begin{align} \sqrt{\mathcal{I}(p)} &= \sqrt{\operatorname{E}\!\left[ \left( \frac{d}{dp} \ln \mathcal{L} (p\mid H) \right)^2\right]} \\[6pt] &= \sqrt{\operatorname{E}\!\left[ \left( \frac{H}{p} - \frac{1-H}{1-p}\right)^2 \right]} \\[6pt] &= \sqrt{p^1 (1-p)^0 \left( \frac{1}{p} - \frac{0}{1-p}\right)^2 + p^0 (1-p)^1 \left(\frac{0}{p} - \frac{1}{1-p}\right)^2} \\ &= \frac{1}{\sqrt{p(1-p)}}. \end{align} Similarly, for the Binomial distribution with n Bernoulli trials, it can be shown that \sqrt{\mathcal{I}(p)}= \sqrt{\frac{n}{p(1-p)}}.
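The single-trial Fisher information derived above can be checked numerically by summing the squared score over the two Bernoulli outcomes. A minimal sketch (the function name bernoulli_fisher_info is an illustrative choice):

```python
def bernoulli_fisher_info(p):
    """Fisher information of one Bernoulli trial, computed directly as
    E[(d/dp log L)^2] by summing over the outcomes H in {0, 1}.
    The closed form derived above is 1 / (p (1 - p)).
    """
    def score(h):
        # d/dp of the log likelihood H ln p + (1 - H) ln(1 - p)
        return h / p - (1 - h) / (1 - p)
    return p * score(1) ** 2 + (1 - p) * score(0) ** 2

for p in (0.1, 0.3, 0.5):
    print(bernoulli_fisher_info(p), 1 / (p * (1 - p)))   # the two columns agree
```

The square root of this quantity, 1/sqrt(p(1 − p)), is (up to the 1/π normalizing constant) the Beta(1/2,1/2) density discussed next.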
Thus, for the Bernoulli and binomial distributions, Jeffreys prior is proportional to \scriptstyle \frac{1}{\sqrt{p(1-p)}}, which happens to be proportional to a beta distribution with domain variable x = p, and shape parameters α = β = 1/2, the arcsine distribution: \operatorname{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) = \frac{1}{\pi \sqrt{p(1-p)}}. It will be shown in the next section that the normalizing constant for Jeffreys prior is immaterial to the final result because the normalizing constant cancels out in Bayes' theorem for the posterior probability. Hence Beta(1/2,1/2) is used as the Jeffreys prior for both Bernoulli and binomial distributions. As shown in the next section, when using this expression as a prior probability times the likelihood in Bayes' theorem, the posterior probability turns out to be a beta distribution. It is important to realize, however, that Jeffreys prior is proportional to \frac{1}{\sqrt{p(1-p)}} for the Bernoulli and binomial distributions, but not for the beta distribution. Jeffreys prior for the beta distribution is given by the square root of the determinant of Fisher's information for the beta distribution, which, as shown in the previous section, is a function of the trigamma function ψ1 of shape parameters α and β as follows: \begin{align} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &= \sqrt{\psi_1(\alpha)\psi_1(\beta)-(\psi_1(\alpha)+\psi_1(\beta))\psi_1(\alpha + \beta)} \\ \lim_{\alpha\to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = \infty\\ \lim_{\alpha\to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = 0 \end{align} As previously discussed, Jeffreys prior for the Bernoulli and binomial distributions is proportional to the arcsine distribution Beta(1/2,1/2), a one-dimensional curve that looks like a basin as a function of the parameter p of the Bernoulli and binomial distributions. The walls of the basin are formed by p approaching the singularities at the ends p → 0 and p → 1, where Beta(1/2,1/2) approaches infinity. Jeffreys prior for the beta distribution is a 2-dimensional surface (embedded in a three-dimensional space) that looks like a basin with only two of its walls meeting at the corner α = β = 0 (and missing the other two walls) as a function of the shape parameters α and β of the beta distribution. The two adjoining walls of this 2-dimensional surface are formed by the shape parameters α and β approaching the singularities (of the trigamma function) at α, β → 0. It has no walls for α, β → ∞ because in this case the determinant of Fisher's information matrix for the beta distribution approaches zero. It will be shown in the next section that Jeffreys prior probability results in posterior probabilities (when multiplied by the binomial likelihood function) that are intermediate between the posterior probability results of the Haldane and Bayes prior probabilities. Jeffreys prior may be difficult to obtain analytically, and in some cases it simply does not exist (even for simple distribution functions like the asymmetric triangular distribution). Berger, Bernardo and Sun, in a 2009 paper, defined a reference prior probability distribution that (unlike Jeffreys prior) exists for the asymmetric triangular distribution.
They cannot obtain a closed-form expression for their reference prior, but numerical calculations show it to be nearly perfectly fitted by the (proper) prior \operatorname{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) \propto \frac{1}{\sqrt{\theta(1-\theta)}} where θ is the vertex variable for the asymmetric triangular distribution with support [0, 1] (corresponding to the following parameter values in Wikipedia's article on the triangular distribution: vertex c = θ, left end a = 0, and right end b = 1). Berger et al. also give a heuristic argument that Beta(1/2,1/2) could indeed be the exact Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution. Therefore, Beta(1/2,1/2) not only is Jeffreys prior for the Bernoulli and binomial distributions, but also seems to be the Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution (for which the Jeffreys prior does not exist), a distribution used in project management and PERT analysis to describe the cost and duration of project tasks. Clarke and Barron prove that, among continuous positive priors, Jeffreys prior (when it exists) asymptotically maximizes Shannon's mutual information between a sample of size n and the parameter, and therefore Jeffreys prior is the most uninformative prior (measuring information as Shannon information). The proof rests on an examination of the Kullback–Leibler divergence between probability density functions for iid random variables. Effect of different prior probability choices on the posterior beta distribution If samples are drawn from the population of a random variable X that result in s successes and f failures in n Bernoulli trials (n = s + f), then the likelihood function for parameters s and f given x = p (the notation x = p in the expressions below will emphasize that the domain x stands for the value of the parameter p in the binomial distribution), is the following binomial distribution: \mathcal{L}(s,f\mid x=p) = {s+f \choose s} x^s(1-x)^f = {n \choose s} x^s(1-x)^{n - s}.
If beliefs about prior probability information are reasonably well approximated by a beta distribution with parameters α Prior and β Prior, then: {\operatorname{PriorProbability}}(x=p;\alpha \operatorname{Prior},\beta \operatorname{Prior}) = \frac{ x^{\alpha \operatorname{Prior}-1}(1-x)^{\beta \operatorname{Prior}-1}}{\Beta(\alpha \operatorname{Prior},\beta \operatorname{Prior})} According to Bayes' theorem for a continuous event space, the posterior probability density is given by the product of the prior probability and the likelihood function (given the evidence s and f = n − s), normalized so that the area under the curve equals one, as follows: \begin{align} & \text{posterior probability density}(x=p\mid s,n-s) \\[6pt] = {} & \frac{\operatorname{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p)} {\int_0^1\text{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p) \, dx} \\[6pt] = {} & \frac{{{n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} / \Beta(\alpha \operatorname{prior},\beta \operatorname{prior})}}{\int_0^1 \left({n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} /\Beta(\alpha \operatorname{prior}, \beta \operatorname{prior})\right) \, dx} \\[6pt] = {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\int_0^1 \left(x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}\right) \, dx} \\[6pt] = {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\Beta(s+\alpha \operatorname{prior},n-s+\beta \operatorname{prior})}. \end{align} The binomial coefficient {s+f \choose s}={n \choose s}=\frac{(s+f)!}{s! f!}=\frac{n!}{s!(n-s)!} appears both in the numerator and the denominator of the posterior probability, and it does not depend on the integration variable x, hence it cancels out, and it is irrelevant to the final result. Similarly the normalizing factor for the prior probability, the beta function B(αPrior,βPrior) cancels out and it is immaterial to the final result. The same posterior probability result can be obtained if one uses an un-normalized prior x^{\alpha \operatorname{prior}-1}(1-x)^{\beta \operatorname{prior}-1} because the normalizing factors all cancel out. Several authors (including Jeffreys himself) thus use an un-normalized prior formula since the normalization constant cancels out. The numerator of the posterior probability ends up being just the (un-normalized) product of the prior probability and the likelihood function, and the denominator is its integral from zero to one. The beta function in the denominator, B(s + α Prior, n − s + β Prior), appears as a normalization constant to ensure that the total posterior probability integrates to unity. The ratio s/n of the number of successes to the total number of trials is a sufficient statistic in the binomial case, which is relevant for the following results. 
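Before specializing to particular priors, the conjugate update derived above can be sketched in code: a Beta(a, b) prior combined with s successes in n trials yields a Beta(a + s, b + n − s) posterior, all normalizing constants having cancelled. A minimal sketch (the helper name beta_binomial_update is an illustrative choice):

```python
from scipy.stats import beta

def beta_binomial_update(a_prior, b_prior, s, n):
    """Conjugate Bayesian update described above: a Beta(a, b) prior times a
    binomial likelihood with s successes in n trials gives a
    Beta(a + s, b + n - s) posterior.
    """
    return beta(a_prior + s, b_prior + n - s)

posterior = beta_binomial_update(a_prior=1, b_prior=1, s=7, n=10)
print(posterior.mean())         # (s + 1)/(n + 2) = 8/12 for the uniform prior
print(posterior.interval(0.95)) # a 95% credible interval for p
```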
For the Bayes' prior probability (Beta(1,1)), the posterior probability is: \operatorname{posterior probability}(p=x\mid s,f) = \frac{x^{s}(1-x)^{n-s}}{\Beta(s+1,n-s+1)}, \text{ with mean }=\frac{s+1}{n+2},\text{ (and mode}=\frac{s}{n}\text{ if } 0 < s < n). For the Jeffreys' prior probability (Beta(1/2,1/2)), the posterior probability is: \operatorname{posterior probability}(p=x\mid s,f) = {x^{s-\tfrac{1}{2}}(1-x)^{n-s-\frac{1}{2}} \over \Beta(s+\tfrac{1}{2},n-s+\tfrac{1}{2})} ,\text{ with mean} = \frac{s+\tfrac{1}{2}}{n+1},\text{ (and mode}=\frac{s-\tfrac{1}{2}}{n-1}\text{ if } \tfrac{1}{2} < s < n - \tfrac{1}{2}), and for the Haldane prior probability (Beta(0,0)), the posterior probability is: \operatorname{posterior probability}(p=x\mid s,f) = \frac{x^{s-1}(1-x)^{n-s-1}}{\Beta(s,n-s)}, \text{ with mean} = \frac{s}{n},\text{ (and mode}=\frac{s-1}{n-2}\text{ if } 1 < s < n-1). From the above expressions it follows that for s/n = 1/2 all three of the above prior probabilities result in the identical location for the posterior probability mean = mode = 1/2. For s/n < 1/2, mean for Bayes prior > mean for Jeffreys prior > mean for Haldane prior, while for s/n > 1/2 the order of these inequalities is reversed, so that the Haldane prior probability results in the largest posterior mean. The Haldane prior probability Beta(0,0) results in a posterior probability density with mean (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials; therefore, the Haldane prior results in a posterior probability with expected value in the next trial equal to the maximum likelihood estimate. The Bayes prior probability Beta(1,1) results in a posterior probability density with mode identical to the ratio s/n (the maximum likelihood estimate). In the case that 100% of the trials have been successful (s = n), the Bayes prior probability Beta(1,1) results in a posterior expected value equal to the rule of succession (n + 1)/(n + 2), the Haldane prior Beta(0,0) results in a posterior expected value of 1 (absolute certainty of success in the next trial), and the Jeffreys prior probability results in a posterior expected value equal to (n + 1/2)/(n + 1). Perks (p. 144 of 1900 edition) maintained that the Bayes Beta(1,1) uniform prior was not a complete-ignorance prior, and that it should be used only when prior information justified "distributing our ignorance equally". K. Pearson wrote: "Yet the only supposition that we appear to have made is this: that, knowing nothing of nature, routine and anomy (from the Greek ανομία, namely: a- "without", and nomos "law") are to be considered as equally likely to occur. Now we were not really justified in making even this assumption, for it involves a knowledge that we do not possess regarding nature. We use our experience of the constitution and action of coins in general to assert that heads and tails are equally probable, but we have no right to assert before experience that, as we know nothing of nature, routine and breach are equally probable. In our ignorance we ought to consider before experience that nature may consist of all routines, all anomies (normlessness), or a mixture of the two in any proportion whatever, and that all such are equally probable. Which of these constitutions after experience is the most probable must clearly depend on what that experience has been like."
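The orderings of the posterior means described above can be checked numerically. A minimal sketch (the helper name posterior_mean is an illustrative choice; the prior pseudo-counts 1, 1/2 and 0 correspond to the Bayes, Jeffreys and Haldane priors):

```python
def posterior_mean(s, n, a_prior, b_prior):
    # Mean of the Beta(a + s, b + n - s) posterior for p.
    return (s + a_prior) / (n + a_prior + b_prior)

s, n = 3, 10   # s/n < 1/2, so the ordering Bayes > Jeffreys > Haldane should hold
print("Bayes    Beta(1,1):     ", posterior_mean(s, n, 1.0, 1.0))   # (s + 1)/(n + 2)
print("Jeffreys Beta(1/2,1/2): ", posterior_mean(s, n, 0.5, 0.5))   # (s + 1/2)/(n + 1)
print("Haldane  Beta(0,0):     ", posterior_mean(s, n, 0.0, 0.0))   # s/n
```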
If there is sufficient sampling data, and the posterior probability mode is not located at one of the extremes of the domain (x = 0 or x = 1), the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar posterior probability densities. Otherwise, as Gelman et al. (p. 65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution", or as Berger (p. 125) points out, "when different reasonable priors yield substantially different answers, can it be right to state that there is a single answer? Would it not be better to admit that there is scientific uncertainty, with the conclusion depending on prior beliefs?"
Occurrence and applications
Order statistics The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the kth smallest of a sample of size n from a continuous uniform distribution has a beta distribution. This result is summarized as U_{(k)} \sim \operatorname{Beta}(k,n+1-k). From this, and application of the theory related to the probability integral transform, the distribution of any individual order statistic from any continuous distribution can be derived. Wavelet analysis A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract information from many different kinds of data, including, but certainly not limited to, audio signals and images. Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets are localized in both time and frequency, whereas the standard Fourier transform is only localized in frequency. Therefore, standard Fourier transforms are only applicable to stationary processes, while wavelets are applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta wavelets can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α and β. Population genetics The Balding–Nichols model is a two-parameter parametrization of the beta distribution used in population genetics. It is a statistical description of the allele frequencies in the components of a sub-divided population: \begin{align} \alpha &= \mu \nu,\\ \beta &= (1 - \mu) \nu, \end{align} where \nu =\alpha+\beta= \frac{1-F}{F} and 0 < F < 1; here F is (Wright's) genetic distance between two populations. Project management: task cost and schedule modeling The beta distribution can be used to model events which are constrained to take place within an interval defined by a minimum and maximum value. For this reason, the beta distribution, along with the triangular distribution, is used extensively in PERT, critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project management/control systems to describe the time to completion and the cost of a task. In project management, shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution: \begin{align} \mu(X) & = \frac{a + 4b + c}{6} \\[8pt] \sigma(X) & = \frac{c-a}{6} \end{align} where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1). The above estimate for the mean \mu(X)= \frac{a + 4b + c}{6} is known as the PERT three-point estimation and it is exact for either of the following values of β (for arbitrary α within these ranges): :β = α > 1 (symmetric case) with standard deviation \sigma(X) = \frac{c-a}{2 \sqrt {1+2\alpha}}, skewness = 0, and excess kurtosis = \frac{-6}{3+2 \alpha} or :β = 6 − α for 5 > α > 1 (skewed case) with standard deviation \sigma(X) = \frac{(c-a)\sqrt{\alpha(6-\alpha)}}{6 \sqrt 7}, skewness =\frac{(3-\alpha) \sqrt 7}{2\sqrt{\alpha(6-\alpha)}}, and excess kurtosis =\frac{21}{\alpha (6- \alpha)} - 3. The above estimate for the standard deviation σ(X) = (c − a)/6 is exact for either of the following values of α and β: :α = β = 4 (symmetric) with skewness = 0, and excess kurtosis = −6/11.
:β = 6 − α and \alpha = 3 - \sqrt2 (right-tailed, positive skew) with skewness =\frac{1}{\sqrt 2}, and excess kurtosis = 0 :β = 6 − α and \alpha = 3 + \sqrt2 (left-tailed, negative skew) with skewness = \frac{-1}{\sqrt 2}, and excess kurtosis = 0 Otherwise, these can be poor approximations for beta distributions with other values of α and β, exhibiting average errors of 40% in the mean and 549% in the variance.
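The PERT shorthand can be compared against the exact moments of a four-parameter beta distribution on [a, c]. A minimal sketch (the function name pert_estimates and the chosen endpoints are illustrative; the case α = β = 4 is one of the exact cases listed above, so both columns agree):

```python
from scipy.stats import beta

def pert_estimates(a, b, c):
    """PERT three-point shorthand for the mean and standard deviation of a
    beta distribution on [a, c] with most likely value (mode) b:
        mean  ~ (a + 4 b + c) / 6
        sigma ~ (c - a) / 6
    """
    return (a + 4 * b + c) / 6, (c - a) / 6

a, c = 10.0, 70.0                 # minimum and maximum of the task duration
alpha_, beta_ = 4.0, 4.0          # one of the exact symmetric cases
mode = a + (alpha_ - 1) / (alpha_ + beta_ - 2) * (c - a)   # most likely value
exact = beta(alpha_, beta_, loc=a, scale=c - a)
print(pert_estimates(a, mode, c))   # (40.0, 10.0)
print(exact.mean(), exact.std())    # 40.0 10.0 for this symmetric case
```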
Random variate generation
If X and Y are independent, with X \sim \Gamma(\alpha, \theta) and Y \sim \Gamma(\beta, \theta), then \frac{X}{X+Y} \sim \Beta(\alpha, \beta). So one algorithm for generating beta variates is to generate \frac{X}{X + Y}, where X is a gamma variate with parameters (α, 1) and Y is an independent gamma variate with parameters (β, 1). In fact, here \frac{X}{X+Y} and X+Y are independent, and X+Y \sim \Gamma(\alpha + \beta, \theta). If Z \sim \Gamma(\gamma, \theta) and Z is independent of X and Y, then \frac{X+Y}{X+Y+Z} \sim \Beta(\alpha+\beta,\gamma) and \frac{X+Y}{X+Y+Z} is independent of \frac{X}{X+Y}. This shows that the product of independent \Beta(\alpha,\beta) and \Beta(\alpha+\beta,\gamma) random variables is a \Beta(\alpha,\beta+\gamma) random variable. Also, the kth order statistic of n uniformly distributed variates is \Beta(k, n+1-k), so an alternative if α and β are small integers is to generate α + β − 1 uniform variates and choose the α-th smallest. Another way to generate the beta distribution is by the Pólya urn model. According to this method, one starts with an "urn" containing α "black" balls and β "white" balls and draws uniformly with replacement; at every trial an additional ball is added with the same color as the ball last drawn. Asymptotically, the proportion of black balls is distributed according to the beta distribution, and each repetition of the experiment produces a different limiting value. It is also possible to use inverse transform sampling.
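The first two of these recipes are easy to sketch with NumPy's random generator (the function names beta_from_gammas and beta_from_order_statistic are illustrative choices; the check against the theoretical mean α/(α+β) is only a sanity test):

```python
import numpy as np

def beta_from_gammas(alpha, beta_, size, rng):
    # Beta(alpha, beta) variates as X/(X+Y) with independent gamma variates.
    x = rng.gamma(alpha, 1.0, size)
    y = rng.gamma(beta_, 1.0, size)
    return x / (x + y)

def beta_from_order_statistic(alpha, beta_, size, rng):
    # For small integer alpha, beta: the alpha-th smallest of
    # alpha + beta - 1 uniform variates is Beta(alpha, beta) distributed.
    u = rng.uniform(size=(size, alpha + beta_ - 1))
    return np.sort(u, axis=1)[:, alpha - 1]

rng = np.random.default_rng(0)
a, b = 2, 5
print(beta_from_gammas(a, b, 100_000, rng).mean())           # ~ a/(a+b) = 0.2857
print(beta_from_order_statistic(a, b, 100_000, rng).mean())  # ~ 0.2857
```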
Normal approximation to the Beta distribution
A beta distribution \Beta(\alpha,\beta) with \alpha \approx \beta and \alpha, \beta \gg 1 is approximately normal with mean 1/2 and variance 1/(4(2\alpha + 1)). If \alpha \geq \beta, the normal approximation can be improved by taking the cube-root of the logarithm of the reciprocal of \Beta(\alpha,\beta).
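The quality of the basic approximation can be checked by comparing the two cumulative distribution functions. A minimal sketch under the stated symmetric assumption α = β (the chosen value α = 50 and the quantiles are illustrative):

```python
import numpy as np
from scipy.stats import beta, norm

# Compare a symmetric Beta(a, a) with its normal approximation
# N(mean = 1/2, variance = 1/(4 (2a + 1))) at a few quantiles.
a = 50.0
approx = norm(loc=0.5, scale=np.sqrt(1.0 / (4.0 * (2.0 * a + 1.0))))
exact = beta(a, a)
for q in (0.55, 0.60, 0.65):
    print(q, exact.cdf(q), approx.cdf(q))   # the two CDF columns nearly coincide
```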
History
Thomas Bayes, in a posthumous paper published in 1763 by Richard Price, obtained a beta distribution as the density of the probability of success in Bernoulli trials (see the section on Bayesian inference), but the paper does not analyze any of the moments of the beta distribution or discuss any of its properties. The first systematic modern discussion of the beta distribution is probably due to Karl Pearson, who analyzed it as the solution of Type I of the Pearson distributions. Elderton, in his 1906 monograph, further analyzes the beta distribution as Pearson's Type I distribution, including a full discussion of the method of moments for the four-parameter case, and diagrams of (what Elderton describes as) U-shaped, J-shaped, twisted J-shaped, "cocked-hat" shapes, horizontal and angled straight-line cases. Elderton wrote "I am chiefly indebted to Professor Pearson, but the indebtedness is of a kind for which it is impossible to offer formal thanks." In a later paper (published three years after Pearson's retirement from University College, London, where his position had been divided between Fisher and Pearson's son Egon), Pearson writes: "I read (Koshai's paper in the Journal of the Royal Statistical Society, 1933) which as far as I am aware is the only case at present published of the application of Professor Fisher's method. To my astonishment that method depends on first working out the constants of the frequency curve by the (Pearson) Method of Moments and then superposing on it, by what Fisher terms "the Method of Maximum Likelihood" a further approximation to obtain, what he holds, he will thus get, 'more efficient values' of the curve constants". David and Edwards's treatise on the history of statistics cites the first modern treatment of the beta distribution, in 1911, using the beta designation that has become standard, due to Corrado Gini, an Italian statistician, demographer, and sociologist, who developed the Gini coefficient. N. L. Johnson and S. Kotz, in their comprehensive and very informative monograph on leading historical personalities in statistical sciences, credit Corrado Gini as "an early Bayesian ... who dealt with the problem of eliciting the parameters of an initial Beta distribution, by singling out techniques which anticipated the advent of the so-called empirical Bayes approach."