Bootstrapping (statistics)

Bootstrapping is a procedure for estimating the distribution of an estimator by resampling one's data or a model estimated from the data. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

History
The bootstrap was first described by Bradley Efron in "Bootstrap methods: another look at the jackknife" (1979), inspired by earlier work on the jackknife. Improved estimates of the variance were developed later. A Bayesian extension was developed in 1981. The bias-corrected and accelerated (BC_a) bootstrap was developed by Efron in 1987.
Approach
The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference about a sample from resampled data (resampled → sample). As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable.

More formally, the bootstrap works by treating inference of the true probability distribution J, given the original data, as being analogous to inference of the empirical distribution Ĵ, given the resampled data. The accuracy of inferences regarding Ĵ using the resampled data can be assessed because we know Ĵ. If Ĵ is a reasonable approximation to J, then the quality of inference on J can in turn be inferred.

As an example, assume we are interested in the average (or mean) height of people worldwide. We cannot measure all the people in the global population, so instead, we sample only a tiny part of it, and measure that. Assume the sample is of size N; that is, we measure the heights of N individuals. From that single sample, only one estimate of the mean can be obtained. In order to reason about the population, we need some sense of the variability of the mean that we have computed. The simplest bootstrap method involves taking the original data set of heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N. The bootstrap sample is taken from the original by using sampling with replacement (e.g. we might 'resample' 5 times from [1,2,3,4,5] and get [2,5,4,4,1]), so, assuming N is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original "real" sample. This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples, we compute its mean (each of these is called a "bootstrap estimate"). We now can create a histogram of bootstrap means. This histogram provides an estimate of the shape of the distribution of the sample mean, from which we can answer questions about how much the mean varies across samples. (The method here, described for the mean, can be applied to almost any other statistic or estimator.)
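To make the procedure concrete, here is a minimal sketch in Python using NumPy. The placeholder height data, the seed, and the choice of 10,000 resamples are illustrative assumptions, not part of the text above:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, size=100)   # placeholder data of size N = 100

B = 10_000                                # number of bootstrap resamples
boot_means = np.empty(B)
for b in range(B):
    # Sample N observations with replacement from the original sample.
    resample = rng.choice(heights, size=heights.size, replace=True)
    boot_means[b] = resample.mean()

# The spread of boot_means estimates the variability of the sample mean.
print(boot_means.std(ddof=1))             # bootstrap standard error of the mean
```

A histogram of `boot_means` is the bootstrap estimate of the sampling distribution of the mean; replacing `.mean()` with another statistic applies the same scheme to that statistic.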
Discussion
Advantages
A great advantage of the bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients. Moreover, bootstrapping can be applied to complex sampling designs (e.g. for a population divided into s strata with n_s observations per stratum; one example is a dose-response experiment, where bootstrapping can be applied within each stratum).

Disadvantages
Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees. The result may depend on how representative the sample is. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples, or a large enough sample size) where these would be more formally stated in other approaches. Also, bootstrapping can be time-consuming, and historically little software was available for it, as the method is difficult to automate using traditional statistical computer packages. On the other hand, according to the original developer of the bootstrapping method, even setting the number of samples at 50 is likely to lead to fairly good standard error estimates.

Recommendations
Adèr et al. recommend the bootstrap procedure for situations such as when the theoretical distribution of a statistic of interest is complicated or unknown, or when the sample size is insufficient for straightforward statistical inference. However, Athreya has shown that if one performs a naive bootstrap on the sample mean when the underlying population lacks a finite variance (for example, a power law distribution), then the bootstrap distribution will not converge to the same limit as the sample mean. As a result, confidence intervals on the basis of a Monte Carlo simulation of the bootstrap could be misleading. Athreya states that "Unless one is reasonably sure that the underlying distribution is not heavy tailed, one should hesitate to use the naive bootstrap".
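Athreya's caveat can be illustrated with a small, hedged experiment. The sketch below (Python/NumPy; the Pareto shape of 1.5, sample size, and seed are arbitrary choices) applies the naive bootstrap to the mean of a sample whose population has infinite variance; the resulting bootstrap distribution is typically strongly skewed rather than approximately normal:

```python
import numpy as np

rng = np.random.default_rng(1)
# Classical Pareto draws with shape alpha = 1.5 have a finite mean but
# infinite variance, so the CLT scaling behind the naive bootstrap fails.
data = 1 + rng.pareto(1.5, size=200)

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5_000)
])
# The bootstrap distribution of the mean is dominated by whether the few
# extreme observations happen to be resampled, so percentile intervals
# computed from it can be misleading.
print(np.percentile(boot_means, [2.5, 50, 97.5]))
```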
Types of bootstrap scheme
In univariate problems, it is usually acceptable to resample the individual observations with replacement ("case resampling" below), unlike subsampling, in which resampling is without replacement and is valid under much weaker conditions than the bootstrap. In small samples, a parametric bootstrap approach might be preferred. For other problems, a smooth bootstrap will likely be preferred. For regression problems, various other alternatives are available.

Estimating the distribution of sample mean
Consider a coin-flipping experiment. We flip the coin and record whether it lands heads or tails. Let x_1, x_2, \ldots, x_{10} be 10 observations from the experiment, with x_i = 1 if the i-th flip lands heads and x_i = 0 otherwise. By invoking the assumption that the average of the coin flips is normally distributed, we can use the t-statistic to estimate the distribution of the sample mean,

: \bar{x} = \frac{1}{10} (x_1 + x_2 + \cdots + x_{10}).

Such a normality assumption can be justified either as an approximation of the distribution of each individual coin flip or as an approximation of the distribution of the average of a large number of coin flips. The former is a poor approximation because the true distribution of the coin flips is Bernoulli instead of normal. The latter is a valid approximation in infinitely large samples due to the central limit theorem.

However, if we are not ready to make such a justification, then we can use the bootstrap instead. Using case resampling, we can derive the distribution of \bar{x}. We first resample the data to obtain a bootstrap resample X_1^*, which will typically contain some duplicates, since a bootstrap resample comes from sampling with replacement from the data. Also, the number of data points in a bootstrap resample is equal to the number of data points in our original observations. Then we compute the mean of this resample and obtain the first bootstrap mean: μ1*. We repeat this process to obtain the second resample X2* and compute the second bootstrap mean μ2*. If we repeat this 100 times, then we have μ1*, μ2*, ..., μ100*. This represents an empirical bootstrap distribution of the sample mean. From this empirical distribution, one can derive a bootstrap confidence interval for the purpose of hypothesis testing.

Regression
In regression problems, case resampling refers to the simple scheme of resampling individual cases – often rows of a data set. For regression problems, as long as the data set is fairly large, this simple scheme is often acceptable. However, the method is open to criticism. In regression problems, the explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.

Bayesian bootstrap
Bootstrapping can be interpreted in a Bayesian framework using a scheme that creates new data sets through reweighting the initial data. Given a set of N data points, the weighting assigned to data point i in a new data set \mathcal{D}^J is w^J_i = x^J_i - x^J_{i-1}, where \mathbf{x}^J is a low-to-high ordered list of N-1 uniformly distributed random numbers on [0,1], preceded by 0 and succeeded by 1. The distributions of a parameter inferred from considering many such data sets \mathcal{D}^J are then interpretable as posterior distributions on that parameter.
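A minimal sketch of the Bayesian bootstrap weighting in Python/NumPy follows. It relies on the fact that the gaps between sorted uniforms (with 0 and 1 appended) are exactly the weights described above; the placeholder data, seed, and 10,000 draws are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=50)        # placeholder data set of N = 50 points

def bayesian_bootstrap_weights(n, rng):
    """Gaps w_i = x_i - x_{i-1} between N-1 sorted uniforms on [0,1],
    preceded by 0 and succeeded by 1; the weights sum to 1."""
    u = np.sort(rng.uniform(size=n - 1))
    x = np.concatenate(([0.0], u, [1.0]))
    return np.diff(x)

posterior_means = np.array([
    np.dot(bayesian_bootstrap_weights(data.size, rng), data)
    for _ in range(10_000)
])
# posterior_means approximates a posterior distribution for the mean.
print(posterior_means.mean(), posterior_means.std(ddof=1))
```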
Smooth bootstrap
Under this scheme, a small amount of (usually normally distributed) zero-centered random noise is added to each resampled observation. This is equivalent to sampling from a kernel density estimate of the data. Assume K to be a symmetric kernel density function with unit variance. The standard kernel estimator \hat{f\,}_h(x) of f(x) is

: \hat{f\,}_h(x)={1\over nh}\sum_{i=1}^nK\left({x-X_i\over h}\right),

where h is the smoothing parameter. The corresponding distribution function estimator \hat{F\,}_h(x) is

: \hat{F\,}_h(x)=\int_{-\infty}^x \hat{f\,}_h(t)\,dt.

Parametric bootstrap
The use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.

Resampling residuals
Another approach to bootstrapping in regression problems is to resample residuals. The method proceeds as follows.
1. Fit the model and retain the fitted values \widehat{y\,}_i and the residuals \widehat{\varepsilon\,}_i = y_i - \widehat{y\,}_i, (i = 1,\dots, n).
2. For each pair, (x_i, y_i), in which x_i is the (possibly multivariate) explanatory variable, add a randomly resampled residual, \widehat{\varepsilon\,}_j, to the fitted value \widehat{y\,}_i. In other words, create synthetic response variables y^*_i = \widehat{y\,}_i + \widehat{\varepsilon\,}_j where j is selected randomly from the list (1, ..., n) for every i.
3. Refit the model using the fictitious response variables y^*_i, and retain the quantities of interest (often the parameters, \widehat\mu^*_i, estimated from the synthetic y^*_i).
4. Repeat steps 2 and 3 a large number of times.
This scheme has the advantage that it retains the information in the explanatory variables. However, a question arises as to which residuals to resample. Raw residuals are one option; another is studentized residuals (in linear regression). Although there are arguments in favor of using studentized residuals, in practice it often makes little difference, and it is easy to compare the results of both schemes. A minimal sketch of this scheme follows.
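The sketch below implements residual resampling for ordinary least squares in Python/NumPy; the toy data, seed, and B = 2,000 replicates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder simple linear regression data.
x = np.linspace(0, 1, 40)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])        # design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None) # step 1: fit the model
fitted = X @ beta_hat
resid = y - fitted                               # retain the residuals

B = 2_000
boot_betas = np.empty((B, 2))
for b in range(B):
    # Step 2: add randomly resampled residuals to the fitted values.
    y_star = fitted + rng.choice(resid, size=resid.size, replace=True)
    # Step 3: refit the model on the synthetic responses.
    boot_betas[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

# Bootstrap standard errors for the intercept and slope.
print(boot_betas.std(axis=0, ddof=1))
```

Note that the explanatory variable x is never resampled, which is exactly how this scheme retains the information in the regressors.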
Gaussian process regression bootstrap
When data are temporally correlated, straightforward bootstrapping destroys the inherent correlations. This method uses Gaussian process regression (GPR) to fit a probabilistic model from which replicates may then be drawn. GPR is a Bayesian non-linear regression method. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian (normal) distribution. A GP is defined by a mean function and a covariance function, which specify the mean vectors and covariance matrices for each finite collection of the random variables. Regression model:

: y(x)=f(x)+\varepsilon,\ \ \varepsilon\sim \mathcal{N}(0,\sigma^2),

where \varepsilon is a noise term. Gaussian process prior: for any finite collection of variables x_1, ..., x_n, the function outputs f(x_1),\ldots,f(x_n) are jointly distributed according to a multivariate Gaussian with mean m=[m(x_1),\ldots,m(x_n)]^\intercal and covariance matrix (K)_{ij}=k(x_i,x_j). Assume f(x)\sim \mathcal{GP}(m,k). Then y(x)\sim \mathcal{GP}(m,l), where l(x_i,x_j)=k(x_i,x_j)+\sigma^2\delta(x_i,x_j), and \delta(x_i,x_j) is the standard Kronecker delta function.

Wild bootstrap
The wild bootstrap is suited when the model exhibits heteroskedasticity. The idea is, as with the residual bootstrap, to leave the regressors at their sample values, but to resample the response variable based on the residual values. That is, for each replicate, one computes a new y based on

: y^*_i = \widehat{y\,}_i + \widehat{\varepsilon\,}_i v_i,

so the residuals are randomly multiplied by a random variable v_i with mean 0 and variance 1. For most distributions of v_i (but not Mammen's), this method assumes that the 'true' residual distribution is symmetric, and it can offer advantages over simple residual sampling for smaller sample sizes. Different forms are used for the random variable v_i, such as:
• The standard normal distribution.
• A distribution suggested by Mammen (1993):

: v_i = \begin{cases} -(\sqrt{5} -1)/2 & \text{with probability } (\sqrt{5} +1)/(2\sqrt{5}), \\ (\sqrt{5} +1)/2 & \text{with probability } (\sqrt{5} -1)/(2\sqrt{5}). \end{cases}

Approximately, Mammen's distribution takes the value -0.6180 with probability 0.7236 and +1.6180 with probability 0.2764.
• The simpler two-point distribution, linked to the Rademacher distribution:

: v_i =\begin{cases} -1 & \text{with probability } 1/2, \\ +1 & \text{with probability } 1/2. \end{cases}

Block bootstrap
The block bootstrap is used when the data, or the errors in a model, are correlated. In this case, simple case or residual resampling will fail, as it is not able to replicate the correlation in the data. The block bootstrap tries to replicate the correlation by resampling blocks of data (see Blocking (statistics)). The block bootstrap has been used mainly with data correlated in time (i.e. time series) but can also be used with data correlated in space, or among groups (so-called cluster data).

Time series: Simple block bootstrap
In the (simple) block bootstrap, the variable of interest is split into non-overlapping blocks.

Time series: Moving block bootstrap
In the moving block bootstrap, introduced by Künsch (1989), data is split into n − b + 1 overlapping blocks of length b: observations 1 to b form block 1, observations 2 to b + 1 form block 2, and so on. From these n − b + 1 blocks, n/b blocks are drawn at random with replacement; aligning these n/b blocks in the order they were picked gives the bootstrap observations (a minimal sketch appears after this group of subsections). This bootstrap works with dependent data; however, the bootstrapped observations will no longer be stationary by construction. It has been shown that varying the block length randomly can avoid this problem; this method is known as the stationary bootstrap. Other related modifications of the moving block bootstrap are the Markovian bootstrap and a stationary bootstrap method that matches subsequent blocks based on standard deviation matching.

Time series: Maximum entropy bootstrap
Vinod (2006) presents a method that bootstraps time series data using maximum entropy principles satisfying the ergodic theorem with mean-preserving and mass-preserving constraints. There is an R package, meboot, that utilizes the method, which has applications in econometrics and computer science.
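Here is a minimal sketch of the moving block bootstrap in Python/NumPy; the toy series, block length b = 10, seed, and 2,000 replicates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy series with some serial dependence (arbitrary construction).
series = rng.normal(size=100).cumsum() * 0.1 + rng.normal(size=100)

def moving_block_bootstrap(x, block_len, rng):
    """Draw n/b of the n - b + 1 overlapping blocks of length b with
    replacement and concatenate them, per Künsch (1989)."""
    n, b = len(x), block_len
    starts = rng.integers(0, n - b + 1, size=n // b)  # block start indices
    return np.concatenate([x[s:s + b] for s in starts])

boot_means = np.array([
    moving_block_bootstrap(series, block_len=10, rng=rng).mean()
    for _ in range(2_000)
])
# A standard error for the mean that respects short-range dependence.
print(boot_means.std(ddof=1))
```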
Cluster data: block bootstrap
Cluster data describes data where many observations per unit are observed. This could be observing many firms in many states or observing students in many classes. In such cases, the correlation structure is simplified: one usually assumes that data are correlated within a group/cluster but independent between groups/clusters. The structure of the block bootstrap is then easily obtained (where a block just corresponds to a group), and usually only the groups are resampled, while the observations within the groups are left unchanged. Cameron et al. (2008) discuss this for clustered errors in linear regression.
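A minimal sketch of group-level resampling in Python/NumPy follows; the synthetic clusters, the group-effect construction, and the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
# Placeholder cluster data: 20 groups of 5 observations each, with a
# shared group effect inducing within-cluster correlation.
groups = np.repeat(np.arange(20), 5)
y = rng.normal(size=groups.size) + rng.normal(size=20)[groups]

labels = np.unique(groups)
boot_means = np.empty(2_000)
for b in range(boot_means.size):
    # Resample whole groups with replacement; observations within each
    # chosen group are kept intact to preserve the correlation.
    chosen = rng.choice(labels, size=labels.size, replace=True)
    boot_means[b] = np.concatenate([y[groups == g] for g in chosen]).mean()

print(boot_means.std(ddof=1))
```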
Methods for improving computational efficiency
The bootstrap is a powerful technique, although it may require substantial computing resources in both time and memory. Some techniques have been developed to reduce this burden. They can generally be combined with many of the different types of bootstrap schemes and various choices of statistic.

Parallel processing
Most bootstrap methods are embarrassingly parallel algorithms. That is, the statistic of interest for each bootstrap sample does not depend on other bootstrap samples. Such computations can therefore be performed on separate CPUs or compute nodes, with the results from the separate nodes eventually aggregated for final analysis.

Poisson bootstrap
The nonparametric bootstrap samples items from a list of size n with counts drawn from a multinomial distribution. If W_i denotes the number of times element i is included in a given bootstrap sample, then each W_i follows a Binomial(n, 1/n) distribution with mean 1, but W_i is not independent of W_j for i \neq j. The Poisson bootstrap instead draws samples assuming all W_i's are independently and identically distributed as Poisson variables with mean 1. The rationale is that the limit of the binomial distribution is Poisson:

:\lim_{n\to \infty} \operatorname{Binomial}(n,1/n) = \operatorname{Poisson}(1).

The Poisson bootstrap was proposed by Hanley and MacGibbon as potentially useful for non-statisticians using software like SAS and SPSS, which lacked the bootstrap packages of the R and S-Plus programming languages. The same authors report that for large enough n the results are relatively similar to the nonparametric bootstrap estimates, but note that the Poisson bootstrap has seen minimal use in applications. Another proposed advantage of the Poisson bootstrap is that the independence of the W_i makes the method easier to apply for large datasets that must be processed as streams. Empirical investigation has shown this method can yield good results. A minimal sketch appears at the end of this section.

Bag of Little Bootstraps
For massive data sets, it is often computationally prohibitive to hold all the sample data in memory and resample from the sample data. The Bag of Little Bootstraps (BLB) provides a method of pre-aggregating data before bootstrapping to reduce computational constraints. This works by partitioning the data set into b equal-sized buckets and aggregating the data within each bucket. This pre-aggregated data set becomes the new sample data over which to draw samples with replacement. This method is similar to the block bootstrap, but the motivations and definitions of the blocks are very different. Under certain assumptions, the sample distribution should approximate the full bootstrapped scenario. One constraint is the number of buckets b=n^\gamma, where \gamma \in [0.5, 1], and the authors recommend b=n^{0.7} as a general solution.
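The following is the promised sketch of the Poisson bootstrap in Python/NumPy. Using the Poisson counts as weights in a weighted mean is one common convention, assumed here; the data, seed, and number of replicates are also arbitrary. (For large n, the chance that all weights are zero is negligible.)

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(size=1_000)    # placeholder sample of size n

B = 2_000
boot_means = np.empty(B)
for b in range(B):
    # Each observation receives an independent Poisson(1) count,
    # replacing the jointly multinomial counts of the usual bootstrap.
    w = rng.poisson(1.0, size=data.size)
    boot_means[b] = np.dot(w, data) / w.sum()   # weighted-resample mean

print(boot_means.std(ddof=1))    # Poisson-bootstrap standard error
```

Because each weight depends only on its own observation, the same loop can be run over a data stream, generating one Poisson count per element as it arrives.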
Choice of statistic
The bootstrap distribution of a point estimator of a population parameter has been used to produce a bootstrapped confidence interval for the parameter's true value if the parameter can be written as a function of the population's distribution. Population parameters are estimated with many point estimators. Popular families of point estimators include mean-unbiased minimum-variance estimators, median-unbiased estimators, Bayesian estimators (for example, the posterior distribution's mode, median, or mean), and maximum-likelihood estimators. A Bayesian point estimator and a maximum-likelihood estimator have good performance when the sample size is infinite, according to asymptotic theory. For practical problems with finite samples, other estimators may be preferable. Asymptotic theory suggests techniques that often improve the performance of bootstrapped estimators; the bootstrapping of a maximum-likelihood estimator may often be improved using transformations related to pivotal quantities.
Deriving confidence intervals from the bootstrap distribution
The bootstrap distribution of a parameter-estimator is often used to calculate confidence intervals for its population parameter.
• Basic bootstrap (reverse percentile interval). The basic bootstrap is a simple scheme to construct the confidence interval: one simply takes the empirical quantiles from the bootstrap distribution of the parameter (see Davison and Hinkley 1997, equ. 5.6 p. 194):

:: (2\widehat{\theta\,} -\theta^{*}_{(1-\alpha/2)},\; 2\widehat{\theta\,} -\theta^{*}_{(\alpha/2)}),

where \theta^{*}_{(1-\alpha/2)} denotes the 1-\alpha/2 percentile of the bootstrapped coefficients \theta^{*}.
• Percentile bootstrap. The percentile bootstrap proceeds in a similar way to the basic bootstrap, using percentiles of the bootstrap distribution, but with a different formula (note the inversion of the left and right quantiles):

:: (\theta^{*}_{(\alpha/2)},\theta^{*}_{(1-\alpha/2)}),

where \theta^{*}_{(1-\alpha/2)} denotes the 1-\alpha/2 percentile of the bootstrapped coefficients \theta^{*}. See Davison and Hinkley (1997, equ. 5.18 p. 203) and Efron and Tibshirani (1993, equ. 13.5 p. 171). This method can be applied to any statistic. It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic, and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute-value loss function). When working with small sample sizes (i.e., less than 50), the basic / reversed percentile and percentile confidence intervals for (for example) the variance statistic will be too narrow. For example, with a sample of 20 points, a 90% confidence interval will include the true variance only 78% of the time. The basic / reverse percentile confidence intervals are easier to justify mathematically, but they are generally less accurate than percentile confidence intervals.
• Bias-corrected and accelerated (BC_a) bootstrap. The BC_a bootstrap, developed by Efron in 1987 (see History), adjusts for both bias and skewness in the bootstrap distribution. This approach is accurate in a wide variety of settings, has reasonable computation requirements, and produces reasonably narrow intervals.
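The basic and percentile intervals can be computed directly from the bootstrap replicates, as in this Python/NumPy sketch (placeholder data, seed, B = 10,000 replicates, and α = 0.05 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=5.0, scale=2.0, size=60)   # placeholder sample
theta_hat = data.mean()                          # statistic of interest

boot = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

alpha = 0.05
lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

percentile_ci = (lo, hi)                             # (θ*_{α/2}, θ*_{1-α/2})
basic_ci = (2 * theta_hat - hi, 2 * theta_hat - lo)  # (2θ̂ - θ*_{1-α/2}, 2θ̂ - θ*_{α/2})
print(percentile_ci, basic_ci)
```

Note how the basic interval reverses the roles of the upper and lower bootstrap quantiles, which is the inversion remarked on above.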
Bootstrap hypothesis testing
Efron and Tibshirani suggest the following algorithm for comparing the means of two independent samples. Let x_1, \ldots, x_n be a random sample from distribution F with sample mean \bar{x} and sample variance \sigma_x^2, and let y_1, \ldots, y_m be another, independent random sample from distribution G with mean \bar{y} and variance \sigma_y^2.
1. Calculate the test statistic t = \frac{\bar{x}-\bar{y}}{\sqrt{\sigma_x^2/n + \sigma_y^2/m}}.
2. Create two new data sets whose values are x_i' = x_i - \bar{x} + \bar{z} and y_i' = y_i - \bar{y} + \bar{z}, where \bar{z} is the mean of the combined sample.
3. Draw a random sample (x_i^*) of size n with replacement from x_i' and another random sample (y_i^*) of size m with replacement from y_i'.
4. Calculate the test statistic t^* = \frac{\bar{x}^*-\bar{y}^*}{\sqrt{\sigma_x^{*2}/n + \sigma_y^{*2}/m}}.
5. Repeat steps 3 and 4 B times (e.g. B = 1000) to collect B values of the test statistic.
6. Estimate the p-value as p = \frac{\sum_{i=1}^B I\{t_i^* \geq t\}}{B}, where I\{\text{condition}\} = 1 when the condition is true and 0 otherwise.
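A minimal sketch of this algorithm in Python/NumPy follows; the placeholder samples, the seed, B = 1000, and the use of the unbiased sample variance (ddof=1) for \sigma^2 are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=30)    # placeholder sample from F
y = rng.normal(0.5, 1.2, size=40)    # placeholder sample from G

def t_stat(a, b):
    # t = (mean(a) - mean(b)) / sqrt(var(a)/n + var(b)/m)
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / a.size
                                           + b.var(ddof=1) / b.size)

t_obs = t_stat(x, y)                            # step 1
z_bar = np.concatenate([x, y]).mean()
x_null = x - x.mean() + z_bar                   # step 2: recenter both
y_null = y - y.mean() + z_bar                   # samples on the common mean

B = 1_000
t_boot = np.array([                             # steps 3-5
    t_stat(rng.choice(x_null, size=x.size, replace=True),
           rng.choice(y_null, size=y.size, replace=True))
    for _ in range(B)
])
p_value = np.mean(t_boot >= t_obs)              # step 6: one-sided p-value
print(t_obs, p_value)
```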
Example applications
Smoothed bootstrap
In 1878, Simon Newcomb took observations on the speed of light. The data set contains two outliers, which greatly influence the sample mean. (The sample mean need not be a consistent estimator for any population mean, because no mean need exist for a heavy-tailed distribution.) A well-defined and robust statistic for the central tendency is the sample median, which is consistent and median-unbiased for the population median.

The bootstrap distribution for Newcomb's data appears below. We can reduce the discreteness of the bootstrap distribution by adding a small amount of random noise to each bootstrap sample. A conventional choice is to add noise with a standard deviation of \sigma/\sqrt{n} for a sample of size n; this noise is often drawn from a Student-t distribution with n-1 degrees of freedom. This results in an approximately unbiased estimator for the variance of the sample mean. This means that samples taken from the bootstrap distribution will have a variance which is, on average, equal to the variance of the total population.

Histograms of the bootstrap distribution and the smooth bootstrap distribution appear below. The bootstrap distribution of the sample median has only a small number of values. The smoothed bootstrap distribution has a richer support. However, note that whether the smoothed or standard bootstrap procedure is favorable is case-by-case and has been shown to depend on both the underlying distribution function and on the quantity being estimated. In this example, the bootstrapped 95% (percentile) confidence interval for the population median is (26, 28.5), which is close to the interval (25.98, 28.46) for the smoothed bootstrap.
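A minimal sketch of a smoothed bootstrap for the median in Python/NumPy follows. The data here are a synthetic placeholder, not Newcomb's actual measurements; the noise scale \sigma/\sqrt{n} (with \sigma estimated by the sample standard deviation) and the Student-t noise follow the convention described above:

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.normal(25, 5, size=66)    # placeholder stand-in sample

n = data.size
h = data.std(ddof=1) / np.sqrt(n)    # noise scale sigma / sqrt(n)

boot_medians = np.array([
    np.median(rng.choice(data, size=n, replace=True)
              + h * rng.standard_t(df=n - 1, size=n))  # add smoothing noise
    for _ in range(10_000)
])
# The smoothed bootstrap distribution of the median has much richer
# support than the unsmoothed version, which takes only a few values.
print(np.percentile(boot_medians, [2.5, 97.5]))
```

Dropping the noise term recovers the standard bootstrap of the median for comparison.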
Relation to other approaches to inference
Relationship to other resampling methods
The bootstrap is distinguished from:
• the jackknife procedure, used to estimate biases of sample statistics and to estimate variances, and
• cross-validation, in which the parameters (e.g., regression weights, factor loadings) that are estimated in one subsample are applied to another subsample.
Bootstrap aggregating (bagging) is a meta-algorithm based on averaging the predictions of models trained on multiple bootstrap samples.

U-statistics
In situations where an obvious statistic can be devised to measure a required characteristic using only a small number, r, of data items, a corresponding statistic based on the entire sample can be formulated. Given an r-sample statistic, one can create an n-sample statistic by something similar to bootstrapping (taking the average of the statistic over all subsamples of size r). This procedure is known to have certain good properties, and the result is a U-statistic. The sample mean and sample variance are of this form, for r = 1 and r = 2 respectively.
Asymptotic theory
Under certain conditions, the bootstrap has desirable asymptotic properties. The asymptotic properties most often described are weak convergence / consistency of the sample paths of the bootstrap empirical process and the validity of confidence intervals derived from the bootstrap. This section describes the convergence of the empirical bootstrap.

Stochastic convergence
This paragraph summarizes more complete descriptions of stochastic convergence in van der Vaart and Wellner and in Kosorok. The bootstrap defines a stochastic process, a collection of random variables indexed by some set T, where T is typically the real line (\mathbb{R}) or a family of functions. Processes of interest are those with bounded sample paths, i.e., sample paths in \ell^\infty(T), the set of all uniformly bounded functions from T to \mathbb{R}. When equipped with the uniform distance, \ell^\infty(T) is a metric space, and when T = \mathbb{R}, two subspaces of \ell^\infty(T) are of particular interest: C[0,1], the space of all continuous functions from T to the unit interval [0,1], and D[0,1], the space of all càdlàg functions from T to [0,1]. This is because C[0,1] contains the distribution functions of all continuous random variables, and D[0,1] contains the distribution functions of all random variables. Statements about the consistency of the bootstrap are statements about the convergence of the sample paths of the bootstrap process as random elements of the metric space \ell^\infty(T) or some subspace thereof, especially C[0,1] or D[0,1].

Consistency
Horowitz, in a recent review, recommends using a theorem from Mammen that provides easier-to-check necessary and sufficient conditions for consistency for statistics of a certain common form. In particular, let \{X_i : i=1, \ldots, n\} be the random sample. If

: T_n = \frac{\sum_{i=1}^n g_n(X_i) - t_n}{\sigma_n}

for sequences of numbers t_n and \sigma_n, then the bootstrap estimate of the cumulative distribution function of T_n consistently estimates its true cumulative distribution function if and only if T_n converges in distribution to the standard normal distribution.

Strong consistency
Convergence in (outer) probability, as described above, is also called weak consistency. It can also be shown, with slightly stronger assumptions, that the bootstrap is strongly consistent, where convergence in (outer) probability is replaced by convergence (outer) almost surely. When only one type of consistency is described, it is typically weak consistency. This is adequate for most statistical applications, since it implies that confidence bands derived from the bootstrap are asymptotically valid. The Glivenko–Cantelli theorem provides theoretical background for the bootstrap method.
Finite populations
Finite populations and drawing without replacement require adaptations of the bootstrap, owing to the violation of the i.i.d. assumption. One example is the "population bootstrap".