Let X be a space which we call the input space, and Y be a space which we call the output space, and let Z denote the product X\times Y. For example, in the setting of binary classification, X is typically a finite-dimensional vector space and Y is the set \{-1,1\}. Fix a hypothesis space \mathcal H of functions h\colon X\to Y. A learning algorithm over \mathcal H is a computable map from Z^*=\bigcup_{n\geq 1}Z^n to \mathcal H. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X to Y. Typical learning algorithms include empirical risk minimization, with or without Tikhonov regularization. Fix a loss function \mathcal{L}\colon Y\times Y\to\mathbb{R}_{\geq 0}, for example the square loss \mathcal{L}(y, y') = (y - y')^2, where y' = h(x) is the predicted value. For a given distribution \rho on X\times Y, the
expected risk of a hypothesis (a function) h\in\mathcal H is
:\mathcal E(h) :=\mathbb E_\rho[\mathcal{L}(h(x),y)]=\int_{X\times Y} \mathcal{L}(h(x),y)\,d\rho(x,y).
In our setting, we have h=\mathcal{A}(S_n), where \mathcal{A} is a learning algorithm and S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n is a sequence of n samples, each drawn independently from \rho. Define the optimal risk
:\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
Set h_n=\mathcal{A}(S_n) for each
sample size n. Each h_n is a random variable, since it depends on the random variable S_n, which is drawn from the distribution \rho^n. The algorithm \mathcal{A} is called consistent if \mathcal E(h_n) converges in probability to \mathcal E_\mathcal H^*. In other words, for all \epsilon, \delta > 0, there exists a positive integer N such that, for all sample sizes n \geq N, we have
:\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\epsilon] < \delta.
The
sample complexity of \mathcal{A} is then the minimum N for which this holds, as a function of \rho, \epsilon, and \delta. We write the sample complexity as N(\rho, \epsilon, \delta) to emphasize that this value of N depends on \rho, \epsilon, and \delta. If \mathcal{A} is
not consistent, then we set N(\rho,\epsilon,\delta)=\infty. If there exists an algorithm for which N(\rho,\epsilon,\delta) is finite, then we say that the hypothesis space \mathcal H is
learnable. In other words, the sample complexity N(\rho,\epsilon,\delta) defines the rate of consistency of the algorithm: given a desired accuracy \epsilon and confidence \delta, one needs to sample N(\rho,\epsilon,\delta) data points to guarantee that the risk of the output function is within \epsilon of the best possible, with probability at least 1-\delta. In
probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is
polynomial, that is, whether N(\rho,\epsilon,\delta) is bounded by a polynomial in 1/\epsilon and 1/\delta, uniformly over the distribution \rho. If N(\rho,\epsilon,\delta) is polynomial in this sense for some learning algorithm, then one says that the hypothesis space \mathcal H is
PAC-learnable. This is a stronger notion than being learnable.

==Unrestricted hypothesis space: infinite sample complexity==
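For contrast with the unrestricted case treated in this section, it may help to sketch why every finite hypothesis space is learnable. The bound below is a standard Hoeffding-plus-union-bound argument, stated here under the added assumption that the loss \mathcal{L} takes values in [0,1]. For any distribution \rho, with probability at least 1-\delta, every h\in\mathcal H has empirical risk within \epsilon/2 of its expected risk once
:n \geq \frac{2}{\epsilon^2}\ln\frac{2|\mathcal H|}{\delta},
and on that event empirical risk minimization returns h_n with \mathcal E(h_n)-\mathcal E^*_\mathcal{H}\leq\epsilon. Since this threshold does not depend on \rho and is polynomial in 1/\epsilon and 1/\delta, a finite hypothesis space is PAC-learnable, whereas, as the section title indicates, no finite sample complexity is available when \mathcal H is unrestricted.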