== Working example ==
The classical application of the hypergeometric distribution is
sampling without replacement. Think of an
urn with two colors of
marbles, red and green. Define drawing a green marble as a success and drawing a red marble as a failure. Let
N denote the total number of marbles in the urn (see the contingency table below) and K the number of green marbles; then N − K is the number of
red marbles. Now, standing next to the urn, you close your eyes and draw n marbles without replacement. Define
X as a
random variable whose outcome is
k, the number of green marbles drawn in the experiment. This situation is illustrated by the following
contingency table:

             drawn          not drawn          total
 green       k              K − k              K
 red         n − k          N − K − (n − k)    N − K
 total       n              N − n              N

We are interested in calculating the probability of drawing exactly k green marbles in n draws, given that there are K green marbles among the N marbles in total. For this example, assume that there are
5 green and
45 red marbles in the urn. Standing next to the urn, you close your eyes and draw
10 marbles without replacement. What is the probability that exactly
4 of the
10 are green? This problem is summarized by the following contingency table:

             drawn   not drawn   total
 green       4       1           5
 red         6       39          45
 total       10      40          50

To find the probability of drawing exactly k green marbles in n draws from an urn containing N marbles, K of which are green, we identify X as a hypergeometric random variable and use the formula
: P(X=k) = f(k;N,K,n) = {{{K \choose k} {{N-K} \choose {n-k}}}\over {N \choose n}}.
To explain the formula intuitively, consider the two symmetric problems represented by the identity
: {{K \choose k} {N-K \choose n-k}\over {N \choose n}} = {{{n \choose k} {{N-n} \choose {K-k}}} \over {N \choose K}}
• left-hand side: we draw only n of the N marbles from the urn. We want the probability of drawing k green marbles out of the K green marbles and n − k red marbles out of the N − K red marbles in these n draws.
• right-hand side: alternatively, we draw all N marbles from the urn. We want the probability that k of the green marbles fall in the first n draws and the remaining K − k green marbles in the other N − n draws.
Back to the calculations, we use the formula above to compute the probability of drawing exactly 4 green marbles:
: P(X=4) = f(4;50,5,10) = {{{5 \choose 4} {{45} \choose {6}}}\over {50 \choose 10}} = {5\cdot 8145060\over 10272278170} = 0.003964583\dots
Intuitively, we would expect it to be even more unlikely that all 5 green marbles are among the 10 drawn:
: P(X=5) = f(5;50,5,10) = {{{5 \choose 5} {{45} \choose {5}}}\over {50 \choose 10}} = {1\cdot 1221759 \over 10272278170} = 0.0001189375\dots
As expected, the probability of drawing 5 green marbles is roughly 33 times smaller than that of drawing 4.
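These probabilities are easy to check numerically. Below is a minimal Python sketch (assuming Python 3.8+ for math.comb; the helper hypergeom_pmf is ours, not a standard library function) that reproduces both values from the formula above:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) for X ~ Hypergeometric(N, K, n): k successes in n
    draws without replacement from N items, K of which are successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Urn from the worked example: N = 50 marbles, K = 5 green, n = 10 drawn.
p4 = hypergeom_pmf(4, 50, 5, 10)
p5 = hypergeom_pmf(5, 50, 5, 10)
print(p4)        # 0.003964583...
print(p5)        # 0.0001189375...
print(p4 / p5)   # ~33.3: drawing 5 greens is roughly 33x less likely
```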
== Symmetries ==
Swapping the roles of green and red marbles:
: f(k;N,K,n) = f(n-k;N,N-K,n)
Swapping the roles of drawn and not-drawn marbles:
: f(k;N,K,n) = f(K-k;N,K,N-n)
Swapping the roles of green and drawn marbles:
: f(k;N,K,n) = f(k;N,n,K)
These symmetries generate the dihedral group D_4.
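All three identities can be verified numerically; the following sketch (reusing the same hypothetical hypergeom_pmf helper from the worked example) checks them on the example urn:

```python
from math import comb, isclose

def hypergeom_pmf(k, N, K, n):
    # P(X = k) for X ~ Hypergeometric(N, K, n)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n, k = 50, 5, 10, 4
base = hypergeom_pmf(k, N, K, n)
# green <-> red
assert isclose(base, hypergeom_pmf(n - k, N, N - K, n))
# drawn <-> not drawn
assert isclose(base, hypergeom_pmf(K - k, N, K, N - n))
# green <-> drawn
assert isclose(base, hypergeom_pmf(k, N, n, K))
print("all three symmetries hold at", base)
```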
== Order of draws ==
The probability of drawing any given sequence of green and red marbles (the hypergeometric distribution) depends only on the numbers of green and red marbles drawn, not on the order in which they appear; i.e., it is an exchangeable distribution. As a result, the probability of drawing a green marble in the i^{\text{th}} draw is
: P(G_i) = \frac{K}{N}.
This is an ex ante probability: it is based on not knowing the results of the previous draws.
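A short simulation makes the claim concrete: across many random draws, the empirical frequency of green at each draw position i stays near K/N = 0.1, regardless of i. This is a sketch using only Python's standard library:

```python
import random

N, K, n = 50, 5, 10
trials = 100_000
green_at_position = [0] * n

for _ in range(trials):
    urn = [1] * K + [0] * (N - K)   # 1 = green, 0 = red
    draws = random.sample(urn, n)   # n draws without replacement
    for i, marble in enumerate(draws):
        green_at_position[i] += marble

# Every position should be close to K/N = 0.1.
for i, count in enumerate(green_at_position, start=1):
    print(f"P(G_{i}) ~= {count / trials:.4f}")
```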
== Tail bounds ==
Let X \sim \operatorname{Hypergeometric}(N,K,n) and p=K/N. Then for 0 < t < K/N we can derive the following bounds:
: \begin{align} \Pr[X\le (p - t)n] &\le e^{-n\text{D}(p-t\parallel p)} \le e^{-2t^2n}\\ \Pr[X\ge (p+t)n] &\le e^{-n\text{D}(p+t\parallel p)} \le e^{-2t^2n}\\ \end{align}\!
where
: D(a\parallel b)=a\log\frac{a}{b}+(1-a)\log\frac{1-a}{1-b}
is the Kullback–Leibler divergence, and we use the fact that D(a\parallel b) \ge 2(a-b)^2.
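To get a feel for how tight these bounds are, here is a short sketch (again using the hypothetical hypergeom_pmf helper defined in the worked example) that compares the exact lower-tail probability with the Kullback–Leibler bound and the weaker e^{-2t^2 n} bound:

```python
from math import comb, exp, floor, log

def hypergeom_pmf(k, N, K, n):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def kl(a, b):
    # Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

N, K, n = 100, 30, 20
p = K / N
t = 0.15                       # must satisfy 0 < t < K/N

# Exact lower tail: Pr[X <= (p - t) n]
cutoff = floor((p - t) * n)
exact = sum(hypergeom_pmf(k, N, K, n) for k in range(cutoff + 1))

kl_bound = exp(-n * kl(p - t, p))
hoeffding_bound = exp(-2 * t**2 * n)

print(f"exact          = {exact:.6f}")
print(f"KL bound       = {kl_bound:.6f}")
print(f"e^(-2t^2 n)    = {hoeffding_bound:.6f}")
assert exact <= kl_bound <= hoeffding_bound
```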
Note: In order to derive the previous bounds, one starts by observing that X = \sum_{i=1}^n Y_i, where the Y_i are dependent indicator variables that share the same marginal distribution. Because most theorems bounding sums of random variables concern independent sequences, one first constructs a sequence Z_i of independent random variables with that same marginal distribution and applies the theorems to X' = \sum_{i=1}^{n} Z_i. Hoeffding's argument then yields
: \begin{align} \Pr[X\le (p - t)n] &\le e^{-(N-n)\text{D}(p+\tfrac{tn}{N-n}\parallel p)} \le e^{-2 t^2 n \tfrac{n}{N-n}}\\ \Pr[X\ge (p+t)n] &\le e^{-(N-n)\text{D}(p-\tfrac{tn}{N-n}\parallel p)} \le e^{-2 t^2 n \tfrac{n}{N-n}}\\ \end{align}\!
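The two families of bounds can be compared numerically. Below is a rough sketch (reusing the hypothetical kl helper from the previous example) that evaluates both exponential bounds on the same lower-tail event; which one is tighter depends on how large n is relative to N:

```python
from math import exp, log

def kl(a, b):
    # Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

N, K, n = 100, 30, 20
p, t = K / N, 0.15

# First family: exponent scales with the sample size n.
bound_1 = exp(-n * kl(p - t, p))
# Second family: exponent scales with N - n, the complementary sample size.
bound_2 = exp(-(N - n) * kl(p + t * n / (N - n), p))

print(f"e^(-n D(p-t || p))            = {bound_1:.6f}")
print(f"e^(-(N-n) D(p+tn/(N-n) || p)) = {bound_2:.6f}")
```

== Statistical Inference ==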