In univariate problems, it is usually acceptable to resample the individual observations with replacement ("case resampling" below). Subsampling, by contrast, draws resamples without replacement and is valid under much weaker conditions than the bootstrap. In small samples, a parametric bootstrap approach might be preferred, while for other problems a smooth bootstrap will likely be preferred. For regression problems, various other alternatives are available.
=== Estimating the distribution of the sample mean ===
Consider a coin-flipping experiment. We flip the coin and record whether it lands heads or tails. Let X = x_1, x_2, \ldots, x_{10} be 10 observations from the experiment, with x_i = 1 if the i-th flip lands heads and 0 otherwise. By invoking the assumption that the average of the coin flips is normally distributed, we can use the t-statistic to estimate the distribution of the sample mean,
: \bar{x} = \frac{1}{10} (x_1 + x_2 + \cdots + x_{10}).
Such a normality assumption can be justified either as an approximation of the distribution of each
individual coin flip or as an approximation of the distribution of the
average of a large number of coin flips. The former is a poor approximation because the true distribution of the coin flips is
Bernoulli instead of normal. The latter is a valid approximation in
infinitely large samples due to the
central limit theorem. However, if we are not ready to make such a justification, then we can use the bootstrap instead. Using case resampling, we can derive the distribution of \bar{x}. We first resample the data to obtain a bootstrap resample. An example of the first resample might look like this: X_1^* = x_2, x_1, x_{10}, x_{10}, x_3, x_4, x_6, x_7, x_1, x_9. There are some duplicates, since a bootstrap resample comes from sampling with replacement from the data. The number of data points in a bootstrap resample is also equal to the number of data points in our original observations. We then compute the mean of this resample and obtain the first bootstrap mean, \mu_1^*. We repeat this process to obtain the second resample X_2^* and compute the second bootstrap mean, \mu_2^*. If we repeat this 100 times, we have \mu_1^*, \mu_2^*, \ldots, \mu_{100}^*. This represents an empirical bootstrap distribution of the sample mean. From this empirical distribution, one can derive a bootstrap confidence interval for the purpose of hypothesis testing.
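As a minimal sketch of this procedure (with illustrative coin-flip data, not from any particular experiment), the code below draws 100 case resamples and collects the bootstrap means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten illustrative coin flips: 1 for heads, 0 for tails.
x = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1])

n = len(x)
n_boot = 100

# Each resample draws n observations with replacement from the data,
# so duplicates are expected; each resample yields one bootstrap mean.
boot_means = np.array([
    rng.choice(x, size=n, replace=True).mean()
    for _ in range(n_boot)
])

# The empirical bootstrap distribution of the sample mean; a percentile
# interval gives one simple form of bootstrap confidence interval.
print(boot_means.mean(), np.percentile(boot_means, [2.5, 97.5]))
```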
=== Regression ===
In regression problems,
case resampling refers to the simple scheme of resampling individual cases – often rows of a
data set. For regression problems, as long as the data set is fairly large, this simple scheme is often acceptable. However, the method is open to criticism. In regression problems, the
explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.
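For reference, a minimal sketch of the simple case-resampling scheme just described, on synthetic data and with np.polyfit standing in for whatever fitting routine would be used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (illustrative only).
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

n, n_boot = x.size, 1000
slopes = np.empty(n_boot)
for b in range(n_boot):
    # Resample whole cases (x_i, y_i) with replacement...
    idx = rng.integers(0, n, size=n)
    # ...and refit the model on the resampled rows.
    slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]

# Percentile interval for the slope from the case-resampling bootstrap.
print(np.percentile(slopes, [2.5, 97.5]))
```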
=== Bayesian bootstrap ===
Bootstrapping can be interpreted in a
Bayesian framework using a scheme that creates new data sets through reweighting the initial data. Given a set of N data points, the weighting assigned to data point i in a new data set \mathcal{D}^J is w^J_i = x^J_i - x^J_{i-1}, where \mathbf{x}^J is a low-to-high ordered list of N-1 uniformly distributed random numbers on [0,1], preceded by 0 and succeeded by 1. The distributions of a parameter inferred from considering many such data sets \mathcal{D}^J are then interpretable as
posterior distributions on that parameter.
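A short sketch of this reweighting scheme, here tracking the mean as the inferred parameter (data values are illustrative); the gaps between the sorted uniforms give the weights w^J_i:

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.array([2.1, 3.5, 1.7, 4.2, 2.9])  # illustrative observations
N = data.size

n_rep = 1000
means = np.empty(n_rep)
for j in range(n_rep):
    # N-1 sorted uniforms on [0, 1], preceded by 0 and succeeded by 1;
    # successive differences give N nonnegative weights summing to one.
    u = np.concatenate(([0.0], np.sort(rng.uniform(size=N - 1)), [1.0]))
    w = np.diff(u)
    means[j] = np.dot(w, data)  # weighted statistic for this data set

# 'means' approximates a posterior distribution for the mean.
print(means.mean(), np.percentile(means, [2.5, 97.5]))
```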
=== Smooth bootstrap ===
Under this scheme, a small amount of (usually normally distributed) zero-centered random noise is added onto each resampled observation. This is equivalent to sampling from a
kernel density estimate of the data. Assume
K to be a symmetric kernel density function with unit variance. The standard kernel estimator \hat{f\,}_h(x) of f(x) is
: \hat{f\,}_h(x)={1\over nh}\sum_{i=1}^nK\left({x-X_i\over h}\right),
where h is the smoothing parameter. The corresponding distribution function estimator \hat{F\,}_h(x) is
: \hat{F\,}_h(x)=\int_{-\infty}^x \hat f_h(t)\,dt.
By contrast, the use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.
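A minimal sketch of the smooth bootstrap described above, assuming a Gaussian kernel and an arbitrary illustrative bandwidth h:

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.array([2.1, 3.5, 1.7, 4.2, 2.9, 3.1])  # illustrative sample
n, h = data.size, 0.3                            # h: smoothing parameter

# A smooth bootstrap resample: ordinary resampling with replacement,
# plus zero-centered Gaussian noise scaled by h. This is equivalent to
# sampling from the Gaussian kernel density estimate of the data.
resample = rng.choice(data, size=n, replace=True) + h * rng.normal(size=n)
print(resample)
```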
=== Resampling residuals ===
Another approach to bootstrapping in regression problems is to resample residuals. The method proceeds as follows.
1. Fit the model and retain the fitted values \widehat{y\,}_i and the residuals \widehat{\varepsilon\,}_i = y_i - \widehat{y\,}_i, (i = 1,\dots, n).
2. For each pair (x_i, y_i), in which x_i is the (possibly multivariate) explanatory variable, add a randomly resampled residual, \widehat{\varepsilon\,}_j, to the fitted value \widehat{y\,}_i. In other words, create synthetic response variables y^*_i = \widehat{y\,}_i + \widehat{\varepsilon\,}_j, where j is selected randomly from the list (1, \dots, n) for every i.
3. Refit the model using the fictitious response variables y^*_i, and retain the quantities of interest (often the parameters, \widehat\mu^*_i, estimated from the synthetic y^*_i).
4. Repeat steps 2 and 3 a large number of times.
This scheme has the advantage that it retains the information in the explanatory variables. However, a question arises as to which residuals to resample. Raw residuals are one option; another is studentized residuals (in linear regression). Although there are arguments in favor of using studentized residuals, in practice it often makes little difference, and it is easy to compare the results of both schemes.
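The steps above in code, assuming a simple linear model fit with np.polyfit on illustrative data (raw residuals are resampled here for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(0, 1, size=x.size)

# Step 1: fit the model, retain fitted values and residuals.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    # Steps 2-3: add randomly resampled residuals to the fitted values
    # (the explanatory variable x is left untouched), then refit.
    y_star = fitted + rng.choice(resid, size=resid.size, replace=True)
    boot_slopes[b] = np.polyfit(x, y_star, 1)[0]

# Step 4 is the loop above; summarize the quantity of interest.
print(np.percentile(boot_slopes, [2.5, 97.5]))
```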
=== Gaussian process regression bootstrap ===
When data are temporally correlated, straightforward bootstrapping destroys the inherent correlations. This method uses Gaussian process regression (GPR) to fit a probabilistic model from which replicates may then be drawn. GPR is a Bayesian non-linear regression method. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian (normal) distribution. A GP is defined by a mean function and a covariance function, which specify the mean vectors and covariance matrices for each finite collection of the random variables.
Regression model:
: y(x)=f(x)+\varepsilon,\ \ \varepsilon\sim \mathcal{N}(0,\sigma^2),
where \varepsilon is a noise term.

Gaussian process prior: For any finite collection of variables x_1, \ldots, x_n, the function outputs f(x_1),\ldots,f(x_n) are jointly distributed according to a multivariate Gaussian with mean m=[m(x_1),\ldots,m(x_n)]^\intercal and covariance matrix (K)_{ij}=k(x_i,x_j). Assume f(x)\sim \mathcal{GP}(m,k). Then y(x)\sim \mathcal{GP}(m,l), where l(x_i,x_j)=k(x_i,x_j)+\sigma^2\delta(x_i,x_j), and \delta(x_i,x_j) is the standard Kronecker delta function.
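One possible realization of this scheme uses scikit-learn's GaussianProcessRegressor (an assumption; the section does not prescribe a library), drawing replicates from the fitted model with sample_y:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

# Illustrative temporally correlated data.
t = np.linspace(0, 10, 60).reshape(-1, 1)
y = np.sin(t).ravel() + rng.normal(0, 0.2, size=t.shape[0])

# RBF covariance k for f(x), plus a white-noise term for epsilon.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.04)
gpr = GaussianProcessRegressor(kernel=kernel).fit(t, y)

# Draw bootstrap replicates from the fitted probabilistic model; each
# column is one replicate series that preserves the time correlation.
replicates = gpr.sample_y(t, n_samples=100, random_state=0)
print(replicates.shape)  # (60, 100)
```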
=== Wild bootstrap ===
The wild bootstrap is suited when the model exhibits heteroskedasticity. The idea is, as with the residual bootstrap, to leave the regressors at their sample values, but to resample the response variable based on the residual values. That is, for each replicate, one computes a new y based on
:y^*_i = \widehat{y\,}_i + \widehat{\varepsilon\,}_i v_i
so each residual is randomly multiplied by a random variable v_i with mean 0 and variance 1. For most distributions of v_i (but not Mammen's), this method assumes that the 'true' residual distribution is symmetric, and it can offer advantages over simple residual sampling for smaller sample sizes. Different forms are used for the random variable v_i, such as:
:*The standard normal distribution.
:*A distribution suggested by Mammen (1993):
::: v_i = \begin{cases} -(\sqrt{5} -1)/2 & \text{with probability } (\sqrt{5} +1)/(2\sqrt{5}), \\ (\sqrt{5} +1)/2 & \text{with probability } (\sqrt{5} -1)/(2\sqrt{5}). \end{cases}
::Approximately, Mammen's distribution is:
::: v_i = \begin{cases} -0.6180\quad\text{(with a 0 in the units' place)} & \text{with probability } 0.7236, \\ +1.6180\quad\text{(with a 1 in the units' place)} & \text{with probability } 0.2764. \end{cases}
:*Or the simpler distribution, linked to the Rademacher distribution:
::: v_i =\begin{cases} -1 & \text{with probability } 1/2, \\ +1 & \text{with probability } 1/2. \end{cases}
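A sketch of the wild bootstrap with Rademacher weights (Mammen's two-point distribution could be substituted), again on illustrative heteroskedastic data with np.polyfit as the fitting routine:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 40)
# Heteroskedastic noise: the error variance grows with x.
y = 1.0 + 0.8 * x + rng.normal(0, 0.2 + 0.1 * x)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    # Rademacher weights: +/-1 with probability 1/2 (mean 0, variance 1).
    v = rng.choice([-1.0, 1.0], size=resid.size)
    # Each residual stays attached to its own observation and is only
    # rescaled by v_i, so the local error variance is preserved.
    y_star = fitted + resid * v
    boot_slopes[b] = np.polyfit(x, y_star, 1)[0]

print(np.percentile(boot_slopes, [2.5, 97.5]))
```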
=== Block bootstrap ===
The block bootstrap is used when the data, or the errors in a model, are correlated. In this case, simple case or residual resampling will fail, as it is not able to replicate the correlation in the data. The block bootstrap tries to replicate the correlation by resampling inside blocks of data (see
Blocking (statistics)). The block bootstrap has been used mainly with data correlated in time (i.e. time series) but can also be used with data correlated in space, or among groups (so-called cluster data).
==== Time series: Simple block bootstrap ====
In the (simple) block bootstrap, the variable of interest is split into non-overlapping blocks.
==== Time series: Moving block bootstrap ====
In the moving block bootstrap, introduced by Künsch (1989), the data are split into n − b + 1 overlapping blocks of length b: observations 1 to b form block 1, observations 2 to b + 1 form block 2, and so on. From these n − b + 1 blocks, n/b blocks are then drawn at random with replacement. Aligning these n/b blocks in the order they were picked gives the bootstrap observations. This bootstrap works with dependent data; however, the bootstrapped observations will no longer be stationary by construction. It has been shown, though, that randomly varying the block length can avoid this problem. This method is known as the
stationary bootstrap. Other related modifications of the moving block bootstrap are the
Markovian bootstrap and a stationary bootstrap method that matches subsequent blocks based on standard deviation matching.
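A sketch of the moving block bootstrap for a series of length n with block length b (b = 5 is an arbitrary illustrative choice; for simplicity the sketch assumes b divides n):

```python
import numpy as np

rng = np.random.default_rng(7)
series = rng.normal(size=100).cumsum()  # illustrative correlated series
n, b = series.size, 5                   # b: block length (divides n here)

# All n - b + 1 overlapping blocks of length b.
blocks = np.array([series[i:i + b] for i in range(n - b + 1)])

# Draw n/b blocks at random with replacement and concatenate them in
# the order they were picked to form one bootstrap series.
picked = rng.integers(0, n - b + 1, size=n // b)
boot_series = np.concatenate(blocks[picked])
print(boot_series.shape)  # (100,)
```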
==== Time series: Maximum entropy bootstrap ====
Vinod (2006) presents a method that bootstraps time series data using maximum entropy principles satisfying the Ergodic theorem with mean-preserving and mass-preserving constraints. There is an R package,
meboot, that utilizes the method, which has applications in econometrics and computer science.
==== Cluster data: block bootstrap ====
Cluster data arise when many observations per unit are observed. This could be observing many firms in many states, or observing students in many classes. In such cases, the correlation structure is simplified, and one usually assumes that data are correlated within a group/cluster but independent between groups/clusters. The structure of the block bootstrap is easily obtained (where the block just corresponds to the group), and usually only the groups are resampled, while the observations within the groups are left unchanged, as in the sketch below.
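A minimal sketch in which whole clusters are resampled with replacement and the observations within each chosen cluster are kept intact (the cluster structure is illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative cluster data: 10 clusters (e.g. classes), 20 students each.
n_clusters, m = 10, 20
cluster_effect = rng.normal(0, 1, size=n_clusters)
y = cluster_effect[:, None] + rng.normal(0, 0.5, size=(n_clusters, m))

n_boot = 1000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample clusters (the blocks) with replacement; observations
    # within each chosen cluster are left unchanged.
    picked = rng.integers(0, n_clusters, size=n_clusters)
    boot_means[b] = y[picked].mean()

print(np.percentile(boot_means, [2.5, 97.5]))
```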
Cameron et al. (2008) discuss this for clustered errors in linear regression.

== Methods for improving computational efficiency ==