If we set E:=1 independently of the data, we get a
trivial e-value: it is an e-variable by definition, but it will never allow us to reject the null hypothesis. This example shows that some e-variables may be better than others, in a sense to be defined below. Intuitively, a good e-variable is one that tends to be large (much larger than 1) if the alternative is true. This is analogous to the situation with p-values: both e-values and p-values can be defined without referring to an alternative, but
if an alternative is available, we would like them to be small (p-values) or large (e-values)
with high probability. In standard hypothesis tests, the quality of a valid test is formalized by the notion of
statistical power, but this notion has to be suitably modified in the context of e-values. The standard notion of quality of an e-variable relative to a given alternative H_1 , used by most authors in the field, is a generalization of the
Kelly criterion in economics and (since it exhibits close relations to classical power) is sometimes called
e-power; the optimal e-variable in this sense is known as
log-optimal or
growth-rate optimal (often abbreviated to GRO).

== Simple alternative, composite null ==
Now let the alternative H_1 = \{ Q \} be simple, with density q, and let the null H_0 = \{ P_{\theta} : \theta \in \Theta_0 \} be composite, such that all elements of H_0 \cup H_1 have densities relative to the same underlying measure. It has been shown that, under no regularity conditions at all, E= \frac{q(Y)}{\sup_{P \in H_0} p(Y)} \left( = \frac{q(Y)}{{p}_{\hat{\theta} \mid Y } (Y) } \right) is an e-variable (with the second equality holding if the MLE (
maximum likelihood estimator) \hat\theta \mid Y based on data Y is always well-defined). This way of constructing e-variables has been called the
universal inference (UI) method, "universal" referring to the fact that no regularity conditions are required.
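As a concrete illustration, the UI e-variable can be computed directly whenever the alternative density and the null-restricted MLE are available in closed form. The sketch below is an assumption-laden example, not part of the construction above: it takes the null to be H_0 : X_i \sim N(\theta, 1) with \theta \leq 0 and the simple alternative to be N(1, 1); the names ui_evalue and mu_alt are illustrative.

```python
import numpy as np

def gauss_logpdf(x, mu):
    """Log-density of N(mu, 1) evaluated at the points x."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def ui_evalue(y, mu_alt=1.0):
    """Universal-inference e-value E = q(Y) / sup_{P in H0} p(Y) for
    H0: X_i ~ N(theta, 1) with theta <= 0 (composite null) against the
    simple alternative X_i ~ N(mu_alt, 1).  Illustrative sketch: the
    Gaussian model and mu_alt are assumptions, not part of the article."""
    y = np.asarray(y, dtype=float)
    theta_hat = min(y.mean(), 0.0)            # MLE restricted to the null {theta <= 0}
    log_q = gauss_logpdf(y, mu_alt).sum()     # numerator:   q(Y)
    log_p = gauss_logpdf(y, theta_hat).sum()  # denominator: p_{theta_hat | Y}(Y)
    return float(np.exp(log_q - log_p))
```

By the UI argument, the expectation of this quantity is at most 1 under every distribution in the null, while it tends to grow when the data favor the alternative (for instance, ui_evalue([1.0, 1.0]) exceeds 1, whereas ui_evalue([-1.0, -1.0]) falls well below 1).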
== Composite alternative, simple null ==
Now let H_0 = \{ P \} be simple and H_1 = \{Q_{\theta}: \theta \in \Theta_1 \} be composite, such that all elements of H_0 \cup H_1 have densities relative to the same underlying measure. There are now two generic, closely related ways of obtaining e-variables that are close to growth-optimal (appropriately redefined for composite H_1 ): the method of mixtures and the plug-in method. The latter was, in essence, re-discovered by
Philip Dawid as "prequential plug-in" and
Jorma Rissanen as "predictive
MDL". The method of mixtures essentially amounts to "being Bayesian about the numerator" (the reason it is not called "Bayesian method" is that, when both null and alternative are composite, the numerator may often not be a Bayes marginal): we posit any prior distribution W on \Theta_1 , set \bar{q}_W(Y) := \int_{\Theta_1} q_{\theta} (Y) dW(\theta) , and use the e-variable \bar{q}_W(Y)/p(Y) .

To explicate the plug-in method, suppose that Y= (X_1, \ldots, X_n) where X_1, X_2, \ldots constitute a stochastic process, and let \breve\theta \mid X^{i} be an estimator of \theta \in \Theta_1 based on data X^i=(X_1, \ldots, X_i) for i \geq 0 . In practice one usually takes a "smoothed"
maximum likelihood estimator (such as, for example, the regression coefficients in
ridge regression), initially set to some "default value" \breve\theta \mid X^{0}:= \theta_0 . One now recursively constructs a density \bar{q}_{\breve\theta} for X^n by setting \bar{q}_{\breve\theta}(X^n) = \prod_{i=1}^n q_{\breve\theta \mid X^{i-1}}(X_i \mid X^{i-1}) . Effectively, both the method of mixtures and the plug-in method can be thought of as learning a specific instantiation of the alternative that explains the data well.

When null and alternative are both composite, near-growth-optimal e-variables can in principle be obtained via the reverse information projection (RIPr); however, in many statistical testing problems it is currently (2023) unknown whether fast implementations of the RIPr exist, and they may very well not exist (e.g. generalized linear models without the model-X assumption). In
nonparametric settings (such as testing a mean as in the example above, or nonparametric 2-sample testing), it is often more natural to consider e-variables of the 1 + \lambda U type. However, while these superficially look very different from likelihood ratios, they can often still be interpreted as such, and sometimes can even be re-interpreted as implementing a version of the RIPr-construction.

== Calibrators ==
E-variables can also be obtained by applying a suitable transformation to a p-value; functions that achieve this are called p-to-e calibrators. Formally, a calibrator is a nonnegative decreasing function f : [0, 1] \rightarrow [0, \infty] which, when applied to a p-variable (a random variable whose value is a
p-value), yields an e-variable. A calibrator f is said to dominate another calibrator g if f \geq g, and this domination is strict if the inequality is strict. An admissible calibrator is one that is not strictly dominated by any other calibrator. One can show that for a function to be a calibrator, it must have an integral of at most 1 over the uniform
probability measure. One family of admissible calibrators is given by the set of functions \{f_{\kappa} : 0 < \kappa < 1 \} with f_\kappa(p) := \kappa p^{\kappa -1} . Another calibrator is obtained by integrating out \kappa :

: \int_0^1 \kappa p^{\kappa -1} d\kappa = \frac{1-p+p \log p}{p(-\log p)^2} .

Conversely, an e-to-p calibrator transforms e-values back into p-variables. Interestingly, the following e-to-p calibrator dominates all other e-to-p calibrators:

: f(t) := \min(1, 1/t).

While of theoretical importance, calibration is not much used in the practical design of e-variables, since the resulting e-variables are often far from growth-optimal for any given H_1 .

== E-processes ==