The basic tools used to describe and measure robustness are the
breakdown point, the
influence function and the
sensitivity curve.
=== Breakdown point ===
Intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g. arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. Usually, the asymptotic (infinite sample) limit is quoted as the breakdown point, although the finite-sample breakdown point may be more useful. For example, given n independent random variables X_1,\dots,X_n and the corresponding realizations x_1,\dots,x_n, we can use \overline{X_n}:=\frac{X_1+\cdots+X_n}{n} to estimate the mean. Such an estimator has a breakdown point of 0 (or finite-sample breakdown point of 1/n) because we can make \overline{x_n} arbitrarily large just by changing any one of x_1,\dots,x_n.

The higher the breakdown point of an estimator, the more robust it is. Intuitively, the breakdown point cannot exceed 50%: if more than half of the observations are contaminated, it is not possible to distinguish between the underlying distribution and the contaminating distribution. Therefore, the maximum breakdown point is 0.5, and there are estimators which achieve it. For example, the median has a breakdown point of 0.5, and the X% trimmed mean has a breakdown point of X% for the chosen level of X. The level and power breakdown points of tests have also been investigated. Statistics with high breakdown points are sometimes called resistant statistics.

==== Example: speed-of-light data ====
In the speed-of-light example, removing the two lowest observations causes the mean to change from 26.2 to 27.75, a change of 1.55. The estimate of scale produced by the Qn method is 6.3. Dividing this by the square root of the sample size gives a robust standard error of 0.78. Thus, the change in the mean resulting from removing two outliers is approximately twice the robust standard error.

The 10% trimmed mean for the speed-of-light data is 27.43. Removing the two lowest observations and recomputing gives 27.67. The trimmed mean is less affected by the outliers and has a higher breakdown point: if we replace the lowest observation, −44, by −1000, the mean becomes 11.73, whereas the 10% trimmed mean is still 27.43. In many areas of applied statistics, data are commonly log-transformed to make them nearly symmetrical. Very small values become large negative values when log-transformed, and zeroes become negatively infinite. This example is therefore of practical interest.
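A minimal sketch of this kind of comparison in Python, using scipy.stats.trim_mean (which trims the stated proportion from each tail). The sample values below are illustrative stand-ins; the full speed-of-light dataset is not reproduced here:

```python
import numpy as np
from scipy.stats import trim_mean

# Illustrative sample in the spirit of the speed-of-light data,
# including one gross outlier (not the full dataset).
data = np.array([28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
                 29, 22, 24, 21, 25, 30, 23, 29, 31, 19], dtype=float)

print(np.mean(data))           # ordinary mean: pulled towards the outlier
print(trim_mean(data, 0.1))    # 10% trimmed mean: discards the extreme
                               # 10% of observations in each tail

contaminated = data.copy()
contaminated[np.argmin(contaminated)] = -1000.0  # push the outlier further out
print(np.mean(contaminated))          # the mean moves dramatically
print(trim_mean(contaminated, 0.1))   # the trimmed mean barely changes
```

Because 10% trimming removes the two most extreme observations in each tail of a sample of 20, the contaminated point is discarded before averaging, which is exactly the breakdown-point behaviour described above.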
=== Empirical influence function ===
The empirical influence function is a measure of the dependence of the estimator on the value of any one of the points in the sample. It is a model-free measure in the sense that it simply relies on calculating the estimator again with a different sample. Tukey's biweight function is, as we will see later, an example of what a "good" (in a sense defined below) empirical influence function should look like.

In mathematical terms, an influence function is defined as a vector in the space of the estimator, which is in turn defined for a sample which is a subset of the population:
• (\Omega,\mathcal{A},P) is a probability space,
• (\mathcal{X},\Sigma) is a measurable space (state space),
• \Theta is a parameter space of dimension p\in\mathbb{N}^*,
• (\Gamma,S) is a measurable space.

For example,
• (\Omega,\mathcal{A},P) is any probability space,
• (\mathcal{X},\Sigma) = (\mathbb{R},\mathcal{B}),
• \Theta = \mathbb{R}\times\mathbb{R}^+,
• (\Gamma,S) = (\mathbb{R},\mathcal{B}).

The empirical influence function is defined as follows. Let n\in\mathbb{N}^*, let X_1,\dots,X_n:(\Omega, \mathcal{A}) \to (\mathcal{X},\Sigma) be i.i.d. random variables, and let (x_1,\dots,x_n) be a sample from these variables. Let T_n: (\mathcal{X}^n,\Sigma^n) \to (\Gamma,S) be an estimator, and let i\in\{1,\dots,n\}. The empirical influence function EIF_i at observation i is defined by:
EIF_i: x\in\mathcal{X} \mapsto n\cdot(T_n(x_1,\dots,x_{i-1},x,x_{i+1},\dots,x_n)-T_n(x_1,\dots,x_{i-1},x_i,x_{i+1},\dots,x_n))
What this means is that we are replacing the i-th value in the sample by an arbitrary value and looking at the output of the estimator. Alternatively, the EIF is defined as the effect, scaled by n+1 instead of n, on the estimator of adding the point x to the sample. A sketch of this computation follows.
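The definition translates directly into code. A minimal sketch in Python (the sample, seed, and helper name empirical_influence are illustrative assumptions, not part of the original text):

```python
import numpy as np

def empirical_influence(estimator, sample, i, x):
    """Empirical influence function EIF_i at observation i:
    n * (T_n(sample with x_i replaced by x) - T_n(sample))."""
    n = len(sample)
    perturbed = np.array(sample, dtype=float)
    perturbed[i] = x       # replace the i-th value by the arbitrary value x
    return n * (estimator(perturbed) - estimator(sample))

rng = np.random.default_rng(0)   # assumed seed, for reproducibility
sample = rng.normal(size=20)     # hypothetical i.i.d. sample

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x,
          empirical_influence(np.mean, sample, 0, x),
          empirical_influence(np.median, sample, 0, x))
```

Running this shows the mean's EIF growing without bound in x, while the median's EIF stays bounded: the model-free counterpart of the influence-function behaviour discussed next.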
=== Influence function and sensitivity curve ===
Instead of relying solely on the data, we could use the distribution of the random variables. The approach is quite different from that of the previous paragraph. What we are now trying to do is to see what happens to an estimator when we change the distribution of the data slightly: it assumes a distribution, and measures sensitivity to change in this distribution. By contrast, the empirical influence function assumes a sample set, and measures sensitivity to change in the samples.

Let A be a convex subset of the set of all finite signed measures on \Sigma. We want to estimate the parameter \theta\in\Theta of a distribution F in A. Let the functional T:A\to\Gamma be the asymptotic value of some estimator sequence (T_n)_{n\in\mathbb{N}}. We will suppose that this functional is Fisher consistent, i.e. \forall \theta\in\Theta, T(F_\theta)=\theta. This means that at the model F, the estimator sequence asymptotically measures the correct quantity.

Let G be some distribution in A. What happens when the data do not follow the model F exactly but another, slightly different distribution, "going towards" G? We are looking at:
dT_{G-F}(F) = \lim_{t\to 0^+}\frac{T(tG + (1 - t)F) - T(F)}{t},
which is the one-sided Gateaux derivative of T at F, in the direction of G-F.

Let x\in\mathcal{X}, and let \Delta_x be the probability measure which gives mass 1 to \{x\}. Choosing G=\Delta_x, the influence function is then defined by:
IF(x; T; F) := \lim_{t\to 0^+}\frac{T(t\Delta_x+(1-t)F) - T(F)}{t}.
It describes the effect of an infinitesimal contamination at the point x on the estimate we are seeking, standardized by the mass t of the contamination (the asymptotic bias caused by contamination in the observations). For a robust estimator, we want a bounded influence function, that is, one which does not go to infinity as x becomes arbitrarily large. The empirical influence function uses the empirical distribution function \hat{F} instead of the distribution function F, making use of the plug-in principle.
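As a worked example (a standard calculation under the definitions above, not tied to any one source), consider the mean functional T(F) = \int u \, dF(u). Since integration is linear in the measure,
T(t\Delta_x + (1-t)F) = t\,x + (1-t)\,T(F),
so
IF(x; T; F) = \lim_{t\to 0^+}\frac{t\,x + (1-t)\,T(F) - T(F)}{t} = x - T(F).
This is unbounded in x, confirming that the mean is not robust in this sense: an infinitesimal contamination placed far from T(F) has an arbitrarily large standardized effect on the estimate.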
=== Desirable properties ===
Properties of an influence function that bestow it with desirable performance are:
• Finite rejection point \rho^*,
• Small gross-error sensitivity \gamma^*,
• Small local-shift sensitivity \lambda^*.
==== Rejection point ====
\rho^*(T;F) := \inf\{r>0 : IF(x;T;F)=0 \text{ for all } |x|>r\}
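For instance, Tukey's biweight function mentioned earlier vanishes outside a finite interval [-c, c], giving a finite rejection point \rho^* = c. A minimal sketch in Python (the tuning constant c = 4.685 is the conventional choice for 95% efficiency at the normal model; the function name is illustrative):

```python
import numpy as np

def tukey_biweight_psi(x, c=4.685):
    """Tukey's biweight psi function: x * (1 - (x/c)^2)^2 inside [-c, c],
    and exactly zero outside, so its rejection point is rho* = c."""
    x = np.asarray(x, dtype=float)
    inside = np.abs(x) <= c
    return np.where(inside, x * (1 - (x / c) ** 2) ** 2, 0.0)

print(tukey_biweight_psi([1.0, 4.0, 10.0]))  # the last value maps to exactly 0
```

Observations further than c from the centre contribute nothing to the estimate, which is what a finite rejection point formalizes.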
==== Gross-error sensitivity ====
\gamma^*(T;F) := \sup_{x\in\mathcal{X}}|IF(x; T; F)|
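A small sketch contrasting the two estimators discussed above, using the standard closed-form influence functions at the normal model (for the median at a distribution F with density f and median m, IF(x) = sign(x - m)/(2 f(m))):

```python
import numpy as np
from scipy.stats import norm

# Influence functions at the standard normal model F = N(0, 1):
#   mean:   IF(x) = x                       (unbounded -> gamma* is infinite)
#   median: IF(x) = sign(x) / (2 * f(0))    (bounded -> gamma* = 1/(2 f(0)))
xs = np.linspace(-100, 100, 100001)
if_mean = xs
if_median = np.sign(xs) / (2 * norm.pdf(0))

print(np.max(np.abs(if_mean)))    # grows with the grid: no finite supremum
print(np.max(np.abs(if_median)))  # ~1.2533, no matter how wide the grid
```

The median's finite gross-error sensitivity of 1/(2 f(0)) ≈ 1.2533 is what makes it robust to gross errors, while the mean's is infinite.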
==== Local-shift sensitivity ====
\lambda^*(T;F) := \sup_{\substack{(x,y)\in\mathcal{X}^2 \\ x\neq y}}\left\|\frac{IF(y; T; F) - IF(x; T; F)}{y-x}\right\|
This value, which looks a lot like a Lipschitz constant, represents the effect of shifting an observation slightly from x to a neighbouring point y, i.e., adding an observation at y and removing one at x. For example, the mean's influence function x - T(F) (derived above) gives \lambda^* = 1, whereas the median's influence function jumps at the centre of the distribution, so the median's local-shift sensitivity is infinite.

== M-estimators ==