
Uniform convergence in probability

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge uniformly to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as to machine learning as part of statistical learning theory. In particular, the Glivenko–Cantelli theorem and the classes of functions named after it (Glivenko–Cantelli classes) are fundamentally related to uniform convergence in probability.

Definitions
For a class of predicates H defined on a set X and a set of samples x=(x_1,x_2,\dots,x_m), where x_i\in X, the empirical frequency of h\in H on x is

: \widehat{Q}_x(h)=\frac 1 m |\{i:1\leq i\leq m, h(x_i)=1\}|.

The theoretical probability of h\in H is defined as

: Q_P(h) = P\{y\in X : h(y)=1\}.

The Uniform Convergence Theorem states, roughly, that if H is "simple" and we draw samples independently (with replacement) from X according to any distribution P, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability. Here "simple" means that the Vapnik–Chervonenkis dimension of the class H is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole. The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis using the concept of growth function.
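To make the two quantities concrete, here is a minimal sketch in Python with NumPy. The class of threshold predicates h_t(y)=\mathbf{1}[y\leq t] under the uniform distribution on [0,1] is a hypothetical choice for the example; for it, Q_P(h_t)=t exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicate class: thresholds h_t(y) = 1 if y <= t, else 0.
# Under P = Uniform[0, 1], the theoretical probability is Q_P(h_t) = t.
def empirical_frequency(t, x):
    """Q_hat_x(h_t): the fraction of sample points on which h_t equals 1."""
    return np.mean(x <= t)

m = 1_000
x = rng.uniform(0.0, 1.0, size=m)  # a sample of m i.i.d. draws from P

for t in (0.25, 0.5, 0.9):
    print(f"t={t}: empirical={empirical_frequency(t, x):.3f}  theoretical={t}")
```

For each fixed t the empirical frequency concentrates near t; the point of the theorem is that this happens simultaneously for all predicates in the class.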
Uniform Convergence Theorem
The statement of the Uniform Convergence Theorem is as follows:

If H is a set of \{0,1\}-valued functions defined on a set X and P is a probability distribution on X, then for \varepsilon>0 and m a positive integer, we have

: P^m\{|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\}\leq 4\Pi_H(2m)e^{-\varepsilon^2 m/8}.

In the above, for any x\in X^m, Q_P(h)=P\{y\in X:h(y)=1\}, \widehat{Q}_x(h)=\frac 1 m |\{i:1\leq i\leq m,h(x_{i})=1\}|, and |x|=m. P^m indicates that the probability is taken over x consisting of m i.i.d. draws from the distribution P.

The growth function \Pi_H is defined in the following way: identifying each h\in H with the subset \{y\in X:h(y)=1\} of X on which it takes the value 1, for any natural number m,

: \Pi_H(m)=\max\{|\{h\cap D: h\in H\}| : D\subseteq X, |D|=m\}.

From the point of view of learning theory, one can consider H to be the concept/hypothesis class defined over the instance set X. Crucially, the Sauer–Shelah lemma implies that \Pi_H(m)=O(m^{d}), where d is the VC dimension of H; the growth function is therefore only polynomial in m, so the right-hand side of the theorem tends to zero as m grows.
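To get a feel for the bound, the sketch below (Python; the helper names are illustrative) evaluates the right-hand side 4\Pi_H(2m)e^{-\varepsilon^2 m/8}, upper-bounding \Pi_H(2m) by the Sauer–Shelah sum \sum_{i=0}^{d}\binom{2m}{i}. The bound is vacuous for small samples and decays exponentially once m is large relative to d/\varepsilon^2.

```python
from math import comb, exp

def growth_upper_bound(m, d):
    """Sauer-Shelah upper bound on the growth function Pi_H(m)
    for a class of VC dimension d: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

def uniform_convergence_bound(m, d, eps):
    """Right-hand side of the theorem: 4 * Pi_H(2m) * exp(-eps^2 * m / 8)."""
    return 4 * growth_upper_bound(2 * m, d) * exp(-eps**2 * m / 8)

# Vacuous (>> 1) for small m, then decays exponentially in m.
for m in (100, 1_000, 10_000, 100_000):
    print(f"m={m:>6}: bound = {uniform_convergence_bound(m, d=5, eps=0.1):.3e}")
```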
Proof of the Uniform Convergence Theorem
Symmetrization

Lemma: Let V=\{x\in X^m:|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\} and

: R=\{(r,s)\in X^m \times X^m:|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq \varepsilon /2 \text{ for some } h\in H\}.

Then for m\geq\frac 2 {\varepsilon^2}, P^m(V)\leq 2P^{2m}(R).

Proof: By the triangle inequality, if |Q_{P}(h)-\widehat{Q}_r(h)|\geq\varepsilon and |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon /2 then |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon /2. Therefore,

: \begin{align} & P^{2m}(R) \\[5pt] \geq {} & P^{2m}\{\exists h\in H,|Q_{P}(h)-\widehat{Q}_r(h)| \geq \varepsilon \text{ and } |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon /2\} \\[5pt] = {} & \int_V P^m\{s:\exists h\in H,|Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon \text{ and } |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon /2\} \, dP^m(r) \\[5pt] = {} & A \end{align}

since r and s are independent. Now for r\in V fix an h\in H such that |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon. For this h, we shall show that

: P^m \left\{ |Q_P(h)-\widehat{Q}_s(h)|\leq\frac \varepsilon 2\right\} \geq\frac 1 2.

Thus for any r\in V, A\geq\frac{P^m(V)}2 and hence P^{2m}(R)\geq\frac{P^m(V)}2, which completes the first step of the high-level idea.

To prove the remaining claim, notice that m\cdot \widehat{Q}_s(h) is a binomial random variable with expectation m\cdot Q_{P}(h) and variance m\cdot Q_P(h)(1-Q_P(h)). By Chebyshev's inequality we get

: P^m \left\{ |Q_P(h)-\widehat{Q}_s(h)| > \frac \varepsilon 2\right\} \leq \frac{m\cdot Q_P(h)(1-Q_P(h))}{(m\varepsilon/2)^2} \leq \frac 1 {\varepsilon^2 m} \leq \frac 1 2,

where the second inequality uses Q_P(h)(1-Q_P(h))\leq\frac 1 4 and the last uses m\geq\frac 2 {\varepsilon^2}.

Permutations

Lemma: Let \Gamma_m be the set of all permutations of \{1,2,\ldots,2m\} that swap i and m+i for the indices i in some subset of \{1,\ldots,m\} and fix all other indices (so |\Gamma_m|=2^m). Then

: P^{2m}(R)\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]),

where the probability is over \sigma drawn uniformly at random from \Gamma_m.

Proof: Since the coordinates of x are i.i.d. draws from P, the distribution P^{2m} is invariant under any permutation of the coordinates, so P^{2m}(R)=P^{2m}\{x:\sigma(x)\in R\} for every \sigma\in\Gamma_m. Therefore,

: \begin{align} & P^{2m}(R) \\[5pt] = {} & \frac 1 {|\Gamma_m|}\sum_{\sigma\in\Gamma_m} \int_{X^{2m}} 1_R(\sigma(x)) \, dP^{2m}(x) \\[5pt] = {} & \int_{X^{2m}} \frac 1 {|\Gamma_m|}\sum_{\sigma\in\Gamma_m} 1_R (\sigma(x)) \, dP^{2m}(x) \quad \text{(because } |\Gamma_{m}| \text{ is finite)} \\[5pt] = {} & \int_{X^{2m}} \Pr[\sigma(x)\in R] \, dP^{2m}(x) \quad \text{(the expectation)} \\[5pt] \leq {} & \max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]). \end{align}

The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.

Reduction to a finite class

Lemma: Based on the previous lemma,

: \max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 2\Pi_H(2m)e^{-\varepsilon^2 m/8}.

Proof: Let us define x=(x_1,x_2,\ldots,x_{2m}) and let t=|H|_x|, which is at most \Pi_H(2m). This means there are functions h_1,h_2,\ldots,h_t\in H such that for any h\in H there exists an i between 1 and t with h_i(x_k)=h(x_k) for 1\leq k\leq 2m.

We see that \sigma(x)\in R iff some h\in H satisfies

: \left|\frac{1}{m}|\{1\leq i\leq m:h(x_{\sigma_{i}})=1\}|-\frac{1}{m}|\{m+1\leq i\leq 2m:h(x_{\sigma_{i}})=1\}|\right|\geq\frac{\varepsilon}{2}.

Define w^{j}_{i}=1 if h_{j}(x_{i})=1 and w^{j}_{i}=0 otherwise, for 1\leq i\leq 2m and 1\leq j\leq t. Then \sigma(x)\in R iff some j in \{1,\ldots,t\} satisfies

: \left|\frac 1 m \left(\sum_i w^j_{\sigma(i)}-\sum_i w^j_{\sigma(m+i)}\right)\right|\geq\frac \varepsilon 2 .

By the union bound we get

: \Pr[\sigma(x)\in R]\leq t\cdot \max_j\left(\Pr\left[\left|\frac 1 m \left(\sum_i w^j_{\sigma_i} - \sum_i w^j_{\sigma_{m+i}}\right)\right| \geq \frac \varepsilon 2\right]\right) \leq \Pi_{H}(2m)\cdot \max_j\left(\Pr\left[ \left| \frac 1 m \left(\sum_i w^j_{\sigma_i}-\sum_i w^j_{\sigma_{m+i}}\right)\right| \geq \frac \varepsilon 2 \right] \right).

Since the distribution over the permutations \sigma is uniform, for each i the difference w^j_{\sigma_i}-w^j_{\sigma_{m+i}} equals \pm |w^j_i-w^j_{m+i}| with equal probability, independently across i.
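The sign-flip observation above can be checked directly in simulation. A minimal sketch (Python with NumPy; the pattern vector w is an arbitrary hypothetical choice standing in for one column w^j) compares the empirical tail probability of \frac 1 m \sum_i |w_i-w_{m+i}|\beta_i against the Hoeffding bound 2e^{-m\varepsilon^2/8} used in the next step:

```python
import numpy as np

rng = np.random.default_rng(1)

m, eps, trials = 200, 0.4, 100_000

# Hypothetical fixed 0/1 pattern vector w of length 2m (one column w^j
# of the reduction step); any fixed choice works for the illustration.
w = rng.integers(0, 2, size=2 * m)
a = np.abs(w[:m] - w[m:])            # |w_i - w_{m+i}|, fixed once x is fixed

# Random signs beta_i in {-1, +1}, mirroring the coordinate swaps.
beta = rng.choice([-1, 1], size=(trials, m))
deviation = np.abs(beta @ a) / m     # |(1/m) sum_i |w_i - w_{m+i}| beta_i|

empirical = np.mean(deviation >= eps / 2)
hoeffding = 2 * np.exp(-m * eps**2 / 8)
print(f"empirical tail: {empirical:.5f}  Hoeffding bound: {hoeffding:.5f}")
```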
Thus,

: \Pr\left[\left|\frac 1 m \left(\sum_i \left(w^j_{\sigma_i}-w^j_{\sigma_{m+i}}\right)\right)\right|\geq\frac \varepsilon 2\right] = \Pr\left[ \left| \frac 1 m \left( \sum_i|w^j_i-w^j_{m+i}|\beta_i\right)\right|\geq\frac \varepsilon 2\right],

where the probability on the right is over the independent uniform random signs \beta_{i}\in\{-1,+1\}. By Hoeffding's inequality, this is at most 2e^{-m\varepsilon^2/8}, which proves the lemma.

Finally, combining the three parts of the proof,

: P^m(V)\leq 2P^{2m}(R)\leq 2\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 4\Pi_H(2m)e^{-\varepsilon^2 m/8},

which is exactly the statement of the Uniform Convergence Theorem.
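The conclusion can also be observed empirically. A minimal end-to-end sketch (Python with NumPy; the threshold class on [0,1] under the uniform distribution is a hypothetical example of a class of VC dimension 1) shows the worst deviation over the whole class shrinking as the sample grows, as the theorem guarantees:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical class: thresholds h_t(y) = 1[y <= t] on X = [0, 1] with
# P uniform, so Q_P(h_t) = t; this class has VC dimension 1.
thresholds = np.linspace(0.0, 1.0, 201)

def worst_deviation(m):
    """sup over the (discretized) class of |Q_P(h) - Q_hat_x(h)|."""
    x = rng.uniform(0.0, 1.0, size=m)
    empirical = np.mean(x[None, :] <= thresholds[:, None], axis=1)
    return np.max(np.abs(empirical - thresholds))

for m in (100, 1_000, 10_000, 100_000):
    print(f"m={m:>6}: sup deviation ~ {worst_deviation(m):.4f}")
```

The printed suprema shrink roughly like 1/\sqrt{m}, which is the qualitative behaviour the Glivenko–Cantelli theorem describes for empirical distribution functions.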