Law of large numbers

In probability theory, the law of large numbers is a mathematical law which states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists. More formally, the law of large numbers states that given a sample of independent and identically distributed values, the sample mean converges to the true mean.

Examples

A single roll of a six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. Therefore, the expected value of the roll is:

\frac{1+2+3+4+5+6}{6} = 3.5

According to the law of large numbers, if a large number of six-sided dice are rolled, the average of their values (sometimes called the sample mean) will approach 3.5, with the precision increasing as more dice are rolled.

It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the theoretical probability of success, and the average of n such variables (assuming they are independent and identically distributed (i.i.d.)) is precisely the relative frequency.

For example, a fair coin toss is a Bernoulli trial. When a fair coin is flipped once, the theoretical probability that the outcome will be heads is equal to 1/2. Therefore, according to the law of large numbers, the proportion of heads in a "large" number of coin flips "should be" roughly 1/2. In particular, the proportion of heads after n flips will almost surely converge to 1/2 as n approaches infinity.

Although the proportion of heads (and tails) approaches 1/2, almost surely the absolute difference in the number of heads and tails will become large as the number of flips becomes large. That is, the probability that the absolute difference is a small number approaches zero as the number of flips becomes large. Also, almost surely the ratio of the absolute difference to the number of flips will approach zero. Intuitively, the expected difference grows, but at a slower rate than the number of flips.

Another good example of the law of large numbers is the Monte Carlo method. These methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The larger the number of repetitions, the better the approximation tends to be. The method is important mainly because it is sometimes difficult or impossible to use other approaches.
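The dice and coin examples above can be explored with a small simulation. The following is only a sketch: the function names, sample sizes and seeds are illustrative assumptions, and a pseudo-random generator stands in for physical dice and coins.

```python
import random


def dice_sample_mean(n, seed=0):
    """Average of n simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n


def coin_summary(n, seed=0):
    """Return (proportion of heads, |heads - tails|) after n fair flips."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    tails = n - heads
    return heads / n, abs(heads - tails)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        proportion, difference = coin_summary(n)
        # The sample mean should drift toward 3.5 and the proportion toward 1/2,
        # while difference / n shrinks even if the raw difference grows.
        print(n, dice_sample_mean(n), proportion, difference, difference / n)
```

For large n the printed sample mean should sit near 3.5 and the proportion near 1/2, while the ratio of the absolute difference to n becomes small; individual runs will vary with the seed.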

Limitation

The average of the results obtained from a large number of trials may fail to converge in some cases. For instance, the average of n results taken from the Cauchy distribution or some Pareto distributions (α<1) will not converge as n becomes larger; the reason is heavy tails. The Cauchy distribution and the Pareto distribution represent two cases: the Cauchy distribution does not have an expectation, whereas the expectation of the Pareto distribution (α<1) is infinite.

One way to generate the Cauchy-distributed example is to take random numbers equal to the tangent of an angle uniformly distributed between −90° and +90°. The median is zero, but the expected value does not exist, and indeed the average of n such variables has the same distribution as one such variable. It does not converge in probability toward zero (or any other value) as n goes to infinity.

If the trials embed a selection bias, typical in human economic/rational behaviour, the law of large numbers does not help in solving the bias: even if the number of trials is increased, the selection bias remains.
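The Cauchy construction described above (tangent of a uniformly distributed angle) is easy to simulate. The sketch below, with illustrative function names and parameters that are assumptions rather than part of the text, measures the spread of the sample mean across many replications; for Cauchy data the interquartile range of the mean is expected to stay near 2 however large n is, instead of shrinking.

```python
import math
import random
import statistics


def cauchy_draw(rng):
    """Tangent of an angle uniform on (-90 degrees, +90 degrees)."""
    return math.tan(rng.uniform(-math.pi / 2, math.pi / 2))


def sample_mean(n, rng):
    """Mean of n independent Cauchy draws."""
    return sum(cauchy_draw(rng) for _ in range(n)) / n


def iqr_of_means(n, reps, seed=0):
    """Interquartile range of the sample mean over `reps` replications."""
    rng = random.Random(seed)
    means = [sample_mean(n, rng) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # The spread does not shrink with n, unlike for finite-mean data.
        print(n, iqr_of_means(n, reps=400))
```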

History

[Figure: Diffusion is an example of the law of large numbers. Initially, there are solute molecules on the left side of a barrier (magenta line) and none on the right. The barrier is removed, and the solute diffuses to fill the whole container. Top: With a single molecule, the motion appears to be quite random.]

The Italian mathematician Gerolamo Cardano (1501–1576) stated without proof that the accuracies of empirical statistics tend to improve with the number of trials.

Later contributors to the refinement of the law include Markov, Borel, Cantelli, Kolmogorov and Khinchin. These further studies have given rise to two prominent forms of the law of large numbers. One is called the "weak" law and the other the "strong" law, in reference to two different modes of convergence of the cumulative sample means to the expected value; in particular, as explained below, the strong form implies the weak.

Forms

There are two different versions of the law of large numbers, described below: the strong law of large numbers and the weak law of large numbers. Mutual independence of the random variables can be replaced by pairwise independence or exchangeability in both versions of the law. The difference between the strong and the weak version concerns the mode of convergence being asserted. For interpretation of these modes, see Convergence of random variables.

Weak law

The weak law of large numbers (also called Khinchin's law) states that given a collection of independent and identically distributed (i.i.d.) samples from a random variable with finite mean, the sample mean converges in probability to the expected value:

\overline{X}_n\ \overset{P}{\rightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.

That is, for any positive number ε,

\lim_{n\to\infty}\Pr\!\left(\,|\overline{X}_n-\mu| < \varepsilon\,\right) = 1.

Interpreting this result, the weak law states that for any nonzero margin specified (ε), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.

As mentioned earlier, the weak law applies in the case of i.i.d. random variables, but it also applies in some other cases. For example, the variance may be different for each random variable in the series, keeping the expected value constant. If the variances are bounded, then the law applies, as shown by Chebyshev as early as 1867. (If the expected values change during the series, then we can simply apply the law to the average deviation from the respective expected values. The law then states that this converges in probability to zero.) In fact, Chebyshev's proof works so long as the variance of the average of the first n values goes to zero as n goes to infinity.
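To make the weak-law statement concrete, the sketch below estimates Pr(|X̄_n − μ| < ε) by simulation in the non-identically-distributed setting mentioned above: the draws share the mean μ, but their standard deviations rotate through 1, 2 and 3, so the variances differ and are bounded by 9. The distribution, the values of μ and ε, and the function name are all illustrative assumptions.

```python
import random


def prob_within_margin(n, eps, trials=2000, mu=1.0, seed=0):
    """Estimate P(|mean of n draws - mu| < eps) by repeated simulation.

    Draw k is normal with mean mu and standard deviation 1, 2 or 3 in
    rotation (an assumed setup with bounded, non-identical variances).
    """
    rng = random.Random(seed)
    sigmas = (1.0, 2.0, 3.0)
    hits = 0
    for _ in range(trials):
        total = sum(rng.gauss(mu, sigmas[k % 3]) for k in range(n))
        if abs(total / n - mu) < eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The estimated probability should climb toward 1 as n grows.
        print(n, prob_within_margin(n, eps=0.5, trials=500))
```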
Strong law

The strong law of large numbers states that the sample average converges almost surely to the expected value:

\overline{X}_n\ \overset{\text{a.s.}}{\longrightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.

That is,

\Pr\!\left( \lim_{n\to\infty}\overline{X}_n = \mu \right) = 1.

What this means is that, as the number of trials n goes to infinity, the probability that the average of the observations converges to the expected value is equal to one. The modern proof of the strong law is more complex than that of the weak law, and relies on passing to an appropriate sub-sequence.

If the summands are independent but not identically distributed, then

\overline{X}_n - \operatorname{E}\big[\overline{X}_n\big]\ \overset{\text{a.s.}}{\longrightarrow}\ 0,

provided that each X_k has a finite second moment and

\sum_{k=1}^{\infty} \frac{1}{k^2} \operatorname{Var}[X_k] < \infty.

This statement is known as Kolmogorov's strong law.

Differences between the weak law and the strong law

The weak law states that for a specified large n, the average \overline{X}_n is likely to be near μ. Thus, it leaves open the possibility that |\overline{X}_n -\mu| > \varepsilon happens an infinite number of times, although at infrequent intervals. (Not necessarily |\overline{X}_n -\mu| \neq 0 for all n.)

The strong law shows that this almost surely will not occur. That is, with probability 1, for any \varepsilon > 0 the inequality |\overline{X}_n -\mu| < \varepsilon holds for all large enough n.

The strong law does not hold in the following cases, but the weak law does.

1. Let X be an exponentially distributed random variable with parameter 1. The random variable \sin(X)e^X/X has no expected value in the sense of Lebesgue integration, but interpreting the integral as a conditionally convergent improper integral gives
E\left(\frac{\sin(X)e^X}{X}\right) =\ \int_{x=0}^{\infty}\frac{\sin(x)e^x}{x}e^{-x}dx = \frac{\pi}{2}
2. Let X be a geometrically distributed random variable with success probability 1/2. The random variable 2^X(-1)^X/X has no expected value in the conventional sense because the series is not absolutely convergent, but using conditional convergence gives
E\left(\frac{2^X(-1)^X}{X}\right) =\ \sum_{x=1}^{\infty}\frac{2^x(-1)^x}{x}2^{-x}=-\ln(2)
3. If the cumulative distribution function F of a random variable is
\begin{cases} 1-F(x)&=\frac{e}{2x\ln(x)},&x \ge e \\ F(x)&=\frac{e}{-2x\ln(-x)},&x \le -e \end{cases}
then it has no expected value, but the weak law is true.

Uniform laws of large numbers

There are extensions of the law of large numbers to collections of estimators, where the convergence is uniform over the collection; thus the name uniform law of large numbers.

Suppose f(x,θ) is some function defined for θ ∈ Θ, and continuous in θ. Then for any fixed θ, the sequence {f(X_1,θ), f(X_2,θ), ...} will be a sequence of independent and identically distributed random variables, such that the sample mean of this sequence converges in probability to E[f(X,θ)]. This is the pointwise (in θ) convergence.

A particular example of a uniform law of large numbers states the conditions under which the convergence happens uniformly in θ. If

• Θ is compact,
• f(x,θ) is continuous at each θ ∈ Θ for almost all x, and a measurable function of x at each θ,
• there exists a dominating function d(x) such that E[d(X)] < \infty and \left\| f(x,\theta) \right\| \leq d(x) \quad\text{for all}\ \theta\in\Theta,

then E[f(X,θ)] is continuous in θ, and

\sup_{\theta\in\Theta} \left\| \frac 1 n \sum_{i=1}^n f(X_i,\theta) - \operatorname{E}[f(X,\theta)] \right\| \overset{\mathrm{P}}{\rightarrow} \ 0.

This result is useful to derive consistency of a large class of estimators (see Extremum estimator).

Borel's law of large numbers

Borel's law of large numbers, named after Émile Borel, states that if an experiment is repeated a large number of times, independently under identical conditions, then the proportion of times that any specified event is expected to occur approximately equals the probability of the event's occurrence on any particular trial; the larger the number of repetitions, the better the approximation tends to be. More precisely, if E denotes the event in question, p its probability of occurrence, and N_n(E) the number of times E occurs in the first n trials, then with probability one,

\frac{N_n(E)}{n}\to p\text{ as }n\to\infty.

This theorem makes rigorous the intuitive notion of probability as the expected long-run relative frequency of an event's occurrence. It is a special case of any of several more general laws of large numbers in probability theory.
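As a rough numerical illustration of the uniform law of large numbers stated earlier in this section, the sketch below assumes f(x,θ) = cos(θx) with X standard normal and Θ = [0, 2]; these choices are illustrative and not taken from the text. Under them |f| ≤ 1, so d(x) = 1 is a dominating function, and E[f(X,θ)] = e^{−θ²/2}. The supremum over Θ is only approximated on a finite grid.

```python
import math
import random

# Finite grid standing in for the compact set Theta = [0, 2] (an approximation).
THETA_GRID = [i * 0.05 for i in range(41)]


def sup_deviation(n, thetas=THETA_GRID, seed=0):
    """Largest grid value of |(1/n) sum_i cos(theta X_i) - exp(-theta^2 / 2)|."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    worst = 0.0
    for theta in thetas:
        sample_mean = sum(math.cos(theta * x) for x in xs) / n
        expected = math.exp(-theta * theta / 2)  # E[cos(theta X)] for X ~ N(0, 1)
        worst = max(worst, abs(sample_mean - expected))
    return worst


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        # The worst-case deviation over the grid should tend to shrink with n.
        print(n, sup_deviation(n))
```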

Proof of the weak law

Given X_1, X_2, ... an infinite sequence of i.i.d. random variables with finite expected value E(X_1)=E(X_2)=\cdots=\mu, we are interested in the convergence of the sample average

\overline{X}_n=\tfrac1n(X_1+\cdots+X_n).

The weak law of large numbers states:

\overline{X}_n\ \overset{P}{\rightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.

Proof using Chebyshev's inequality assuming finite variance

This proof uses the assumption of finite variance \operatorname{Var} (X_i)=\sigma^2 (for all i). The independence of the random variables implies no correlation between them, and we have that

\operatorname{Var}(\overline{X}_n) = \operatorname{Var}(\tfrac1n(X_1+\cdots+X_n)) = \frac{1}{n^2} \operatorname{Var}(X_1+\cdots+X_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.

The common mean μ of the sequence is the mean of the sample average:

E(\overline{X}_n) = \mu.

Using Chebyshev's inequality on \overline{X}_n results in

\operatorname{P}( \left| \overline{X}_n-\mu \right| \geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2}.

This may be used to obtain the following:

\operatorname{P}( \left| \overline{X}_n-\mu \right| < \varepsilon) = 1 - \operatorname{P}( \left| \overline{X}_n-\mu \right| \geq \varepsilon) \geq 1 - \frac{\sigma^2}{n \varepsilon^2}.

As n approaches infinity, the expression approaches 1. By the definition of convergence in probability, we have obtained

\overline{X}_n\ \overset{P}{\rightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.

Proof using convergence of characteristic functions

By Taylor's theorem for complex functions, the characteristic function of any random variable, X, with finite mean μ, can be written as

\varphi_X(t) = 1 + it\mu + o(t), \quad t \rightarrow 0.

All X_1, X_2, ... have the same characteristic function, so we will simply denote this \varphi_X.

Among the basic properties of characteristic functions there are

\varphi_{\frac 1 n X}(t)= \varphi_X(\tfrac t n) \quad \text{and} \quad \varphi_{X+Y}(t) = \varphi_X(t) \varphi_Y(t) \quad \text{if } X \text{ and } Y \text{ are independent.}

These rules can be used to calculate the characteristic function of \overline{X}_n in terms of \varphi_X:

\varphi_{\overline{X}_n}(t)= \left[\varphi_X\left({t \over n}\right)\right]^n = \left[1 + i\mu{t \over n} + o\left({t \over n}\right)\right]^n \, \rightarrow \, e^{it\mu}, \quad \text{as} \quad n \to \infty.

The limit e^{it\mu} is the characteristic function of the constant random variable μ, and hence by the Lévy continuity theorem, \overline{X}_n converges in distribution to μ:

\overline{X}_n \, \overset{\mathcal D}{\rightarrow} \, \mu \qquad\text{for}\qquad n \to \infty.

μ is a constant, which implies that convergence in distribution to μ and convergence in probability to μ are equivalent (see Convergence of random variables). Therefore,

\overline{X}_n\ \overset{P}{\rightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.

This shows that the sample mean converges in probability to \mu = -i\varphi_X'(0), that is, to the derivative of the characteristic function at the origin divided by i, as long as the latter exists.
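The Chebyshev step in the first proof can be checked numerically. The sketch below assumes fair six-sided dice (μ = 3.5, σ² = 35/12) and compares the simulated frequency of |X̄_n − μ| ≥ ε with the bound σ²/(nε²); the function names, ε and trial counts are illustrative assumptions. The bound is typically far from tight, which is all the proof needs.

```python
import random

MU = 3.5          # mean of a fair six-sided die
SIGMA2 = 35 / 12  # variance of a fair six-sided die


def tail_frequency(n, eps, trials=2000, seed=0):
    """Simulated frequency of |sample mean - MU| >= eps for n dice."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - MU) >= eps:
            misses += 1
    return misses / trials


def chebyshev_bound(n, eps):
    """Upper bound sigma^2 / (n eps^2), capped at 1 since it bounds a probability."""
    return min(1.0, SIGMA2 / (n * eps ** 2))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The simulated tail frequency should stay below the Chebyshev bound.
        print(n, tail_frequency(n, 0.25, trials=500), chebyshev_bound(n, 0.25))
```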

Proof of the strong law

We give a relatively simple proof of the strong law under the assumptions that the X_i are i.i.d., {\mathbb E}[X_i] =: \mu < \infty, \operatorname{Var} (X_i)=\sigma^2 < \infty, and {\mathbb E}[X_i^4] =: \tau < \infty.

Let us first note that without loss of generality we can assume that \mu = 0 by centering. In this case, writing S_n = X_1 + \cdots + X_n, the strong law says that

\Pr\!\left( \lim_{n\to\infty}\overline{X}_n = 0 \right) = 1,

or

\Pr\left(\omega: \lim_{n\to\infty}\frac{S_n(\omega)}n = 0 \right) = 1.

It is equivalent to show that

\Pr\left(\omega: \lim_{n\to\infty}\frac{S_n(\omega)}n \neq 0 \right) = 0.

Note that

\lim_{n\to\infty}\frac{S_n(\omega)}n \neq 0 \iff \exists\epsilon>0, \left|\frac{S_n(\omega)}n\right| \ge \epsilon\ \mbox{infinitely often},

and thus to prove the strong law we need to show that for every \epsilon > 0, we have

\Pr\left(\omega: |S_n(\omega)| \ge n\epsilon \mbox{ infinitely often} \right) = 0.

Define the events A_n = \{\omega : |S_n| \ge n\epsilon\}. If we can show that

\sum_{n=1}^\infty \Pr(A_n) < \infty,

then the Borel-Cantelli lemma implies the result. So let us estimate \Pr(A_n).

We compute

{\mathbb E}[S_n^4] = {\mathbb E}\left[\left(\sum_{i=1}^n X_i\right)^4\right] = {\mathbb E}\left[\sum_{1 \le i,j,k,l\le n} X_iX_jX_kX_l\right].

We first claim that every term of the form X_i^3X_j, X_i^2X_jX_k, X_iX_jX_kX_l where all subscripts are distinct, must have zero expectation. This is because {\mathbb E}[X_i^3X_j] = {\mathbb E}[X_i^3]{\mathbb E}[X_j] by independence, and the last factor is zero; the other terms are handled similarly. Therefore the only terms in the sum with nonzero expectation are {\mathbb E}[X_i^4] and {\mathbb E}[X_i^2X_j^2]. Since the X_i are identically distributed, all of these are the same, and moreover {\mathbb E}[X_i^2X_j^2]=({\mathbb E}[X_i^2])^2.

There are n terms of the form {\mathbb E}[X_i^4] and 3 n (n-1) terms of the form ({\mathbb E}[X_i^2])^2, and so

{\mathbb E}[S_n^4] = n \tau + 3n(n-1)\sigma^4.

Note that the right-hand side is a quadratic polynomial in n, and as such there exists a C>0 such that {\mathbb E}[S_n^4] \le Cn^2 for n sufficiently large. By Markov's inequality applied to S_n^4,

\Pr(|S_n| \ge n \epsilon) \le \frac1{(n\epsilon)^4}{\mathbb E}[S_n^4] \le \frac{C}{\epsilon^4 n^2},

for n sufficiently large, and therefore this series is summable. Since this holds for any \epsilon > 0, we have established the strong law of large numbers.

The proof can be strengthened immensely by dropping all finiteness assumptions on the second and fourth moments. It can also be extended, for example, to discuss partial sums of distributions without any finite moments. Such proofs use more intricate arguments to prove the same Borel-Cantelli predicate, a strategy attributed to Kolmogorov to conceptually bring the limit inside the probability parentheses.
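The fourth-moment identity E[S_n^4] = nτ + 3n(n−1)σ⁴ used in the proof can be verified exactly in a simple special case. The sketch below assumes each X_i is ±1 with probability 1/2 (so μ = 0, σ² = 1, τ = 1); this distribution and the function names are illustrative choices, not part of the proof. In that case S_n = 2K − n with K binomial, so the expectation is a finite sum that can be computed with exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def fourth_moment_exact(n):
    """Exact E[S_n^4] when each X_i is +1 or -1 with probability 1/2."""
    # S_n = 2K - n, where K ~ Binomial(n, 1/2) counts the +1 outcomes.
    total = sum(comb(n, k) * (2 * k - n) ** 4 for k in range(n + 1))
    return Fraction(total, 2 ** n)


def fourth_moment_formula(n, tau=1, sigma2=1):
    """n * tau + 3 n (n - 1) sigma^4, the closed form from the proof."""
    return n * tau + 3 * n * (n - 1) * sigma2 ** 2


def markov_bound(n, eps):
    """Bound E[S_n^4] / (n eps)^4 on P(|S_n| >= n eps) for the +/-1 case."""
    return fourth_moment_formula(n) / (n * eps) ** 4


if __name__ == "__main__":
    for n in (1, 2, 5, 50):
        print(n, fourth_moment_exact(n), fourth_moment_formula(n))
    # Partial sums of the Markov bounds stay bounded, as Borel-Cantelli requires.
    print(sum(markov_bound(n, 0.5) for n in range(1, 10_000)))
```

In this special case the closed form is 3n² − 2n, so C = 3 works for every n.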

Consequences

The law of large numbers makes it possible to recover, from a realization of the sequence, not only the expectation of an unknown distribution but also any feature of the probability distribution. By applying Borel's law of large numbers, one could easily obtain the probability mass function. For each event in the objective probability mass function, one could approximate the probability of the event's occurrence with the proportion of times that the event occurs. The larger the number of repetitions, the better the approximation.

As for the continuous case, take C=(a-h,a+h] for small positive h. Thus, for large n:

\frac{N_n(C)}{n}\thickapprox p = P(X\in C) = \int_{a-h}^{a+h} f(x) \, dx \thickapprox 2hf(a)

With this method, one can cover the whole x-axis with a grid (with grid size 2h) and obtain a bar graph which is called a histogram.
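The approximation N_n(C)/n ≈ 2hf(a) can be turned into a simple density estimate, f(a) ≈ N_n(C)/(2hn). The sketch below tries this on standard normal samples, where the true density is known; the choice of distribution, the values of a and h, and the function names are assumptions made for illustration.

```python
import math
import random


def draw_normal_samples(n, seed=0):
    """n pseudo-random draws from the standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def density_estimate(samples, a, h):
    """N_n(C) / (2 h n) for the cell C = (a - h, a + h]."""
    count = sum(1 for x in samples if a - h < x <= a + h)
    return count / (2 * h * len(samples))


def normal_pdf(x):
    """Density of the standard normal distribution, for comparison."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


if __name__ == "__main__":
    samples = draw_normal_samples(200_000)
    for a in (-1.0, 0.0, 1.0):
        # The estimate should be close to the true density at a.
        print(a, density_estimate(samples, a, h=0.1), normal_pdf(a))
```

Repeating the estimate over a grid of cells of width 2h gives the bar heights of a histogram normalized to approximate the density.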

Applications

One application of the law of large numbers is an important method of approximation known as the Monte Carlo method. Using the Monte Carlo method and the LLN, we can see that as the number of samples increases, the numerical estimate gets ever closer to the target value (0.4180233 in the example considered).
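A minimal Monte Carlo integration sketch follows. The integrand is an assumption: the worked example behind the value 0.4180233 is not shown here, so the code uses f(x) = (e^x − 1)/(e − 1) on [0, 1], whose exact integral (e − 2)/(e − 1) happens to agree with that value to seven decimal places. The function names and sample sizes are likewise illustrative.

```python
import math
import random


def integrand(x):
    """Assumed example integrand; not taken from the text above."""
    return (math.exp(x) - 1) / (math.e - 1)


# Exact value of the integral of `integrand` over [0, 1].
EXACT = (math.e - 2) / (math.e - 1)


def monte_carlo_integral(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] as (b - a) times the sample mean of f(U)."""
    rng = random.Random(seed)
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(n)) / n


if __name__ == "__main__":
    for n in (25, 250, 25_000):
        # By the LLN the estimate should approach EXACT as n grows.
        print(n, monte_carlo_integral(integrand, 0.0, 1.0, n), EXACT)
```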

Source: Wikipedia ↗
