
Fisher's noncentral hypergeometric distribution

In probability theory and statistics, Fisher's noncentral hypergeometric distribution is a generalization of the hypergeometric distribution in which sampling probabilities are modified by weight factors. It can also be defined as the conditional distribution of two or more binomially distributed variables given that their sum is fixed.

Univariate distribution
{{Probability distribution
 | name = Univariate Fisher's noncentral hypergeometric distribution
 | type = mass
 | pdf_image =
 | cdf_image =
 | parameters = m_1, m_2 \in \mathbb{N}; N = m_1 + m_2; n \in [0,N); \omega \in \mathbb{R}_+
 | support = x \in [x_\min, x_\max], where x_\min = \max(0, n-m_2) and x_\max = \min(n, m_1)
 | pdf = \frac{\binom{m_1}{x} \binom{m_2}{n-x} \omega^x}{P_0}, where P_0 = \sum_{y=x_\min}^{x_\max} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y
 | cdf =
 | mean = \frac{P_1}{P_0}, where P_k = \sum_{y=x_\min}^{x_\max} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y \, y^k
 | median =
 | mode = \left\lfloor \frac{-2C}{B - \sqrt{B^2-4AC}} \right\rfloor, where A = \omega-1, B = m_1 + n - N - (m_1+n+2)\omega, C = (m_1+1)(n+1)\omega
 | variance = \frac{P_2}{P_0} - \left( \frac{P_1}{P_0} \right)^2, where P_k is given above
 | skewness =
 | kurtosis =
 | entropy =
 | mgf =
 | char =
}}

The probability function, mean and variance are given in the adjacent table. An alternative expression of the distribution has both the number of balls taken of each color and the number of balls not taken as random variables, whereby the expression for the probability becomes symmetric.

The calculation time for the probability function can be high when the sum in P_0 has many terms. It can be reduced by calculating the terms in the sum recursively relative to the term for y = x and ignoring negligible terms in the tails (Liao and Rosen, 2001).

The mean can be approximated by:
:\mu \approx \frac{-2c}{b - \sqrt{b^2-4ac}} \, ,
where a = \omega-1, b = m_1 + n - N - (m_1+n)\omega, c = m_1 n \omega.

The variance can be approximated by:
:\sigma^2 \approx \frac{N}{N-1} \bigg/ \left( \frac{1}{\mu}+ \frac{1}{m_1-\mu}+ \frac{1}{n-\mu}+ \frac{1}{\mu+m_2-n} \right) .

Better approximations to the mean and variance are given by Levin (1984, 1990), McCullagh and Nelder (1989), Liao (1992), and Eisinga and Pelzer (2011).
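The formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`support`, `p_sum`, `pmf`, `exact_mean_var`, `approx_mean`, `approx_var`) are hypothetical, the sums are evaluated by brute force rather than by the recursive scheme of Liao and Rosen, and the approximate mean is taken as the root of a\mu^2 + b\mu + c = 0 written in the form -2c/(b - \sqrt{b^2-4ac}), which stays finite at \omega = 1. It is intended for small parameter values, since very large binomial coefficients cannot be converted to floating point.

```python
from math import comb, sqrt


def support(m1, m2, n):
    """Return (x_min, x_max), the smallest and largest possible value of x."""
    return max(0, n - m2), min(n, m1)


def p_sum(k, m1, m2, n, omega):
    """P_k = sum over the support of C(m1,y) * C(m2,n-y) * omega**y * y**k."""
    lo, hi = support(m1, m2, n)
    return sum(comb(m1, y) * comb(m2, n - y) * omega**y * y**k
               for y in range(lo, hi + 1))


def pmf(x, m1, m2, n, omega):
    """Probability mass at x; zero outside the support."""
    lo, hi = support(m1, m2, n)
    if not lo <= x <= hi:
        return 0.0
    return comb(m1, x) * comb(m2, n - x) * omega**x / p_sum(0, m1, m2, n, omega)


def exact_mean_var(m1, m2, n, omega):
    """Mean P_1/P_0 and variance P_2/P_0 - (P_1/P_0)**2."""
    p0, p1, p2 = (p_sum(k, m1, m2, n, omega) for k in range(3))
    mean = p1 / p0
    return mean, p2 / p0 - mean**2


def approx_mean(m1, m2, n, omega):
    """Approximate mean: root of a*mu**2 + b*mu + c = 0 inside the support."""
    N = m1 + m2
    a = omega - 1
    b = m1 + n - N - (m1 + n) * omega
    c = m1 * n * omega
    return -2 * c / (b - sqrt(b * b - 4 * a * c))


def approx_var(m1, m2, n, omega):
    """Approximate variance; assumes the approximate mean is strictly inside the support."""
    N = m1 + m2
    mu = approx_mean(m1, m2, n, omega)
    return N / (N - 1) / (1 / mu + 1 / (m1 - mu) + 1 / (n - mu) + 1 / (mu + m2 - n))


# Illustrative parameter values (not taken from the article).
print(exact_mean_var(10, 15, 8, 2.5))
print(approx_mean(10, 15, 8, 2.5), approx_var(10, 15, 8, 2.5))
```

For \omega = 1 both approximations reduce to the mean and variance of the central hypergeometric distribution, which gives a quick sanity check of the sketch.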
The saddlepoint methods for approximating the mean and the variance suggested by Eisinga and Pelzer (2011) offer extremely accurate results.

Properties

The following symmetry relations apply:
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(n-x;n,m_2,N,1/\omega)\,.
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x;m_1,n,N,\omega)\,.
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(m_1-x;N-n,m_1,N,1/\omega)\,.

Recurrence relation:
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x-1;n,m_1,N,\omega) \frac{(m_1-x+1)(n-x+1)}{x(m_2-n+x)}\omega\,.

The distribution is affectionately called "finchy-pig," based on the abbreviation convention above.

Derivation

The univariate noncentral hypergeometric distribution may alternatively be derived as a conditional distribution of two binomially distributed random variables, for example when considering the response to a particular treatment in two different groups of patients participating in a clinical trial. An important application of the noncentral hypergeometric distribution in this context is the computation of exact confidence intervals for the odds ratio comparing treatment response between the two groups.

Suppose X and Y are binomially distributed random variables counting the number of responders in two corresponding groups of size m_X and m_Y respectively,
: X \sim \operatorname{Bin}(m_X, \pi_X),\quad Y \sim \operatorname{Bin}(m_Y, \pi_Y) \, .

Their odds ratio is given as
: \omega = \frac{\omega_X}{\omega_Y} = \frac{\pi_X/(1-\pi_X)}{\pi_Y/(1-\pi_Y)} .

The responder prevalence \pi_i is fully defined in terms of the odds \omega_i, i \in \{X,Y\}, which correspond to the sampling bias in the urn scheme above, i.e.
:\pi_i = \frac{\omega_i}{1+\omega_i}.

The trial can be summarized and analyzed in terms of the following contingency table.

{| class="wikitable"
! !! Responder !! Non-responder !! Total
|-
! Group X
| x || . || m_X
|-
! Group Y
| y || . || m_Y
|-
! Total
| n || . || N
|}
In the table, n = x + y corresponds to the total number of responders across groups, and N to the total number of patients recruited into the trial. The dots denote corresponding frequency counts of no further relevance.

The sampling distribution of responders in group X conditional upon the trial outcome and prevalences, \Pr(X = x \; | \; X+Y = n,m_X,m_Y,\omega_X,\omega_Y), is noncentral hypergeometric:

: \begin{align}
F(X,\omega) :&= \Pr(X = x \; | \; X+Y = n,m_X,m_Y,\omega_X,\omega_Y)\\
&= \frac{\Pr(X = x, X+Y = n \; | \; m_X,m_Y,\omega_X,\omega_Y)}{\Pr(X+Y = n \; | \; m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\Pr(X = x \; | \; m_X,\omega_X)\Pr(Y = n-x \; | \; m_Y,\omega_Y,X=x)}{\Pr(X+Y = n \; | \; m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\binom{m_X}{x}\pi_X^x(1-\pi_X)^{m_X-x}\binom{m_Y}{n-x}\pi_Y^{n-x}(1-\pi_Y)^{m_Y-(n-x)}}{\Pr(X+Y = n \; | \; m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\binom{m_X}{x}\omega_X^x(1-\pi_X)^{m_X}\binom{m_Y}{n-x}\omega_Y^{n-x}(1-\pi_Y)^{m_Y}}{\Pr(X+Y = n \; | \; m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\binom{m_X}{x}\binom{m_Y}{n-x}\omega^x(1-\pi_X)^{m_X}\omega_Y^{n}(1-\pi_Y)^{m_Y}}{(1-\pi_X)^{m_X}\omega_Y^{n}(1-\pi_Y)^{m_Y}\sum_{u=\max(0,n-m_Y)}^{\min(m_X,n)}\binom{m_X}{u}\binom{m_Y}{n-u}\omega^u}\\
&= \frac{\binom{m_X}{x}\binom{m_Y}{n-x}\omega^x}{\sum_{u=\max(0,n-m_Y)}^{\min(m_X,n)}\binom{m_X}{u}\binom{m_Y}{n-u}\omega^u}
\end{align}

Note that the denominator is essentially just the numerator, summed over all events of the joint sample space (X,Y) for which X+Y = n. Terms independent of X can be factored out of the sum and cancel with the numerator.
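The derivation can be checked numerically. The following Python sketch computes Pr(X = x | X + Y = n) directly from two independent binomial variables and compares it with the closed form on the last line of the derivation. It also checks the recurrence relation from the Properties subsection. The helper names and the group sizes and response probabilities are illustrative assumptions, not values from any real trial.

```python
from math import comb


def binom_pmf(k, m, p):
    """Binomial probability Pr(K = k) for K ~ Bin(m, p); zero outside 0..m."""
    if not 0 <= k <= m:
        return 0.0
    return comb(m, k) * p**k * (1 - p)**(m - k)


def fisher_pmf(x, m_x, m_y, n, omega):
    """Closed form: C(m_X,x) C(m_Y,n-x) omega^x divided by the sum over u."""
    lo, hi = max(0, n - m_y), min(m_x, n)
    if not lo <= x <= hi:
        return 0.0

    def weight(u):
        return comb(m_x, u) * comb(m_y, n - u) * omega**u

    return weight(x) / sum(weight(u) for u in range(lo, hi + 1))


def conditional_pmf(x, n, m_x, m_y, pi_x, pi_y):
    """Pr(X = x | X + Y = n) computed directly from the two binomials."""

    def joint(u):
        return binom_pmf(u, m_x, pi_x) * binom_pmf(n - u, m_y, pi_y)

    return joint(x) / sum(joint(u) for u in range(n + 1))


def recurrence_ratio(x, m_x, m_y, n, omega):
    """Factor that takes the probability at x - 1 to the probability at x."""
    return (m_x - x + 1) * (n - x + 1) / (x * (m_y - n + x)) * omega


m_x, m_y, n = 12, 9, 7    # illustrative group sizes and responder total
pi_x, pi_y = 0.6, 0.35    # illustrative response probabilities
omega = (pi_x / (1 - pi_x)) / (pi_y / (1 - pi_y))

# Largest absolute difference between the two ways of computing the pmf.
max_gap = max(abs(conditional_pmf(x, n, m_x, m_y, pi_x, pi_y)
                  - fisher_pmf(x, m_x, m_y, n, omega))
              for x in range(n + 1))
print(max_gap)
```

The two computations should agree up to floating-point rounding, which illustrates that only the odds ratio \omega, and not the individual prevalences, enters the conditional distribution.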
Multivariate distribution
{{Probability distribution
 | name = Multivariate Fisher's noncentral hypergeometric distribution
 | type = mass
 | pdf_image =
 | cdf_image =
 | parameters = c \in \mathbb{N}; \mathbf{m}=(m_1,\ldots,m_c) \in \mathbb{N}^c; N = \sum_{i=1}^c m_i; n \in [0,N); \boldsymbol{\omega} = (\omega_1,\ldots,\omega_c) \in \mathbb{R}_+^c
 | support = \mathrm{S} = \left\{ \mathbf{x} \in \mathbb{Z}_{0+}^c \, : \, \sum_{i=1}^{c} x_i = n \right\}
 | pdf = \frac{1}{P_0}\prod_{i=1}^{c} \binom{m_i}{x_i}\omega_i^{x_i}, where P_0 = \sum_{(y_1,\ldots,y_c)\in \mathrm{S}}\prod_{i=1}^{c} \binom{m_i}{y_i}\omega_i^{y_i}
 | cdf =
 | mean = The mean \mu_i of x_i can be approximated by \mu_i = \frac{m_i r \omega_i}{r \omega_i + 1}, where r is the unique positive solution to \sum_{i=1}^{c}\mu_i = n
 | median =
 | mode =
 | variance =
 | skewness =
 | kurtosis =
 | entropy =
 | mgf =
 | char =
}}

The distribution can be expanded to any number of colors c of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability function and a simple approximation to the mean are given to the right. Better approximations to the mean and variance are given by McCullagh and Nelder (1989).

Properties

The order of the colors is arbitrary, so any colors can be swapped.

The weights can be arbitrarily scaled:
:\operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, \boldsymbol{\omega}) = \operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, r\boldsymbol{\omega})\,\, for all r \in \mathbb{R}_+.

Colors with zero number (m_i = 0) or zero weight (\omega_i = 0) can be omitted from the equations.
Colors with the same weight can be joined:
: \begin{align}
& {} \operatorname{mfnchypg}\left(\mathbf{x};n,\mathbf{m}, (\omega_1,\ldots,\omega_{c-1},\omega_{c-1})\right) \\
& {} = \operatorname{mfnchypg}\left((x_1,\ldots,x_{c-1}+x_c); n,(m_1,\ldots,m_{c-1}+m_c), (\omega_1,\ldots,\omega_{c-1})\right)\, \cdot \\
& \qquad \operatorname{hypg}(x_c; x_{c-1}+x_c, m_c, m_{c-1}+m_c)
\end{align}
where \operatorname{hypg}(x;n,m,N) is the (univariate, central) hypergeometric distribution probability.
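The multivariate probability function and the properties above can be illustrated with a brute-force Python sketch. It enumerates the whole support, so it is only practical for a handful of colors and small counts. The function names are hypothetical, and the bisection used to solve for r in the mean approximation is one possible choice of root finder, assumed here for simplicity and valid for 0 < n < N.

```python
from itertools import product
from math import comb, prod


def support(m, n):
    """All count vectors x with 0 <= x_i <= m_i and sum(x) == n."""
    return [x for x in product(*(range(mi + 1) for mi in m)) if sum(x) == n]


def weight(x, m, omega):
    """Unnormalised probability: product of C(m_i, x_i) * omega_i**x_i."""
    return prod(comb(mi, xi) * wi**xi for xi, mi, wi in zip(x, m, omega))


def mv_pmf(x, m, n, omega):
    """Multivariate probability mass at x; zero outside the support."""
    if sum(x) != n or any(not 0 <= xi <= mi for xi, mi in zip(x, m)):
        return 0.0
    p0 = sum(weight(y, m, omega) for y in support(m, n))
    return weight(x, m, omega) / p0


def hypg(x, n, m, N):
    """Central hypergeometric probability hypg(x; n, m, N)."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)


def approx_means(m, n, omega, iterations=200):
    """Approximate means mu_i = m_i r w_i / (r w_i + 1), with r found by bisection."""

    def total(r):
        return sum(mi * r * wi / (r * wi + 1) for mi, wi in zip(m, omega))

    lo, hi = 0.0, 1.0
    while total(hi) < n:      # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if total(mid) < n:
            lo = mid
        else:
            hi = mid
    r = (lo + hi) / 2.0
    return [mi * r * wi / (r * wi + 1) for mi, wi in zip(m, omega)]


# Illustrative values: three colors, five balls taken.
m, n, omega = (3, 4, 5), 5, (1.0, 2.0, 0.5)
x = (1, 3, 1)
print(mv_pmf(x, m, n, omega))
print(mv_pmf(x, m, n, tuple(3.0 * w for w in omega)))  # weights scaled by r = 3
print(approx_means(m, n, omega))
```

The first two printed values should coincide up to rounding, reflecting the scale invariance of the weights, and the approximate means sum to n by construction.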
Applications
Fisher's noncentral hypergeometric distribution is useful for models of biased sampling or biased selection where the individual items are sampled independently of each other with no competition. The bias or odds can be estimated from an experimental value of the mean. Use Wallenius' noncentral hypergeometric distribution instead if items are sampled one by one with competition.

Fisher's noncentral hypergeometric distribution is used mostly for tests in contingency tables where a conditional distribution for fixed margins is desired. This can be useful, for example, for testing or measuring the effect of a medicine. See McCullagh and Nelder (1989).
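As a rough illustration of estimating the odds from a mean, the quadratic a\mu^2 + b\mu + c = 0 behind the approximate mean of the univariate distribution can be solved for \omega instead of \mu, which gives \omega \approx \mu(m_2 - n + \mu) / ((m_1 - \mu)(n - \mu)). The Python sketch below applies this to a mean computed from the exact probability function. The function names are hypothetical, and the estimator inherits the error of the mean approximation, so it is not an exact or maximum-likelihood estimate.

```python
from math import comb


def exact_mean(m1, m2, n, omega):
    """Exact mean P_1 / P_0 of the univariate distribution (brute-force sums)."""
    lo, hi = max(0, n - m2), min(n, m1)
    weights = [(y, comb(m1, y) * comb(m2, n - y) * omega**y)
               for y in range(lo, hi + 1)]
    p0 = sum(w for _, w in weights)
    p1 = sum(y * w for y, w in weights)
    return p1 / p0


def odds_from_mean(mu, m1, m2, n):
    """Approximate odds from a mean mu lying strictly inside the support."""
    return mu * (m2 - n + mu) / ((m1 - mu) * (n - mu))


# Illustrative values: the true odds are 2.5; the estimate is only approximate.
m1, m2, n, true_odds = 10, 15, 8, 2.5
mu = exact_mean(m1, m2, n, true_odds)
print(mu, odds_from_mean(mu, m1, m2, n))
```

When the mean equals the central hypergeometric mean n m_1 / N, the estimate is exactly 1, as expected for unbiased sampling.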
Software available
• FisherHypergeometricDistribution in Mathematica.
• The R package BiasedUrn implements the distribution. It includes univariate and multivariate probability mass functions, distribution functions, quantiles, random variate generators, mean and variance.
• The R package MCMCpack includes the univariate probability mass function and a random variate generator.
• SAS System includes the univariate probability mass function and distribution function.
• An implementation in C++ is available from www.agner.org.
• Calculation methods are described by Liao and Rosen (2001) and Fog (2008).