There are infinitely many scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.
Categorical variables For a categorical response variable with m mutually exclusive events, Y \in \Omega = \{1, \ldots, m\}, a probabilistic forecaster or algorithm will return a
probability vector \mathbf{p} \in [0,1]^m with probabilities for each of the m outcomes. If y=i materializes, one often abbreviates the score as \mathbf{S}(\mathbf{p}, i).
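Logarithmic score The logarithmic scoring rule is a strictly proper and local scoring rule. It is the negative of the surprisal of the observed outcome, and its expected value under the forecast distribution is the negative of the
Shannon entropy, which is commonly used as a scoring criterion in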
Bayesian inference. This scoring rule has strong foundations in
information theory. :\mathbf{S}(\mathbf{p}, i) = \ln(p_i) Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. That is, a prediction of 80% that correctly proved true would receive a score of \ln(0.8) \approx -0.22. This same prediction also assigns 20% likelihood to the opposite case, and so if the prediction proves false, it would receive a score based on the 20%: \ln(0.2) \approx -1.6. The goal of a forecaster is to maximize the score, and −0.22 is indeed larger than −1.6. If one treats the truth or falsity of the prediction as a variable x with value 1 or 0 respectively, and the expressed probability as p, then one can write the logarithmic scoring rule as x \ln(p) + (1 - x) \ln(1 - p). Note that any logarithmic base b > 1 may be used, since strictly proper scoring rules remain strictly proper under positive linear transformation. That is: :L(\mathbf{p},i) = \log_b(p_i) is strictly proper for all b>1.
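The 80% example can be checked numerically. The following is a minimal sketch; the function names `log_score` and `binary_log_score` are illustrative choices and do not come from any particular library, and outcomes are indexed from 0.

```python
import math


def log_score(p, i):
    """Logarithmic score ln(p_i) for probability vector p and observed index i (0-based)."""
    return math.log(p[i])


def binary_log_score(x, p):
    """Binary form x*ln(p) + (1-x)*ln(1-p), with x = 1 if the prediction proved true, else 0."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)


forecast = [0.8, 0.2]  # 80% for the event, 20% against it
score_if_true = log_score(forecast, 0)   # ln(0.8)
score_if_false = log_score(forecast, 1)  # ln(0.2)
print(round(score_if_true, 2), round(score_if_false, 2))  # -0.22 -1.61
```

Both functions agree in the binary case, so either form can be used depending on how the outcome is encoded.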
Brier/Quadratic score The quadratic scoring rule is a strictly proper scoring rule :\mathbf{S}_Q(\mathbf{p},i) = 2p_i - \mathbf{p}\cdot \mathbf{p} = 2p_i -\sum_{j=1}^m p_j^2 where p_i is the probability assigned to the correct answer i. The
Brier score, originally proposed by
Glenn W. Brier in 1950, can be obtained by an
affine transform from the quadratic scoring rule. :\mathbf{S}_B(\mathbf{p},i) = \sum_{j=1}^m (y_j-p_j)^2 where y_j = 1 when the jth event is correct and y_j = 0 otherwise. It can be thought of as a generalization of
mean squared error to probabilistic forecasts. An important difference between these two rules is that a forecaster should strive to maximize the quadratic score \mathbf{S}_Q yet minimize the Brier score \mathbf{S}_B. This is due to the negative sign in the affine transformation between them: expanding the square gives \mathbf{S}_B(\mathbf{p},i) = 1 - \mathbf{S}_Q(\mathbf{p},i).
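The relation between the two rules can be illustrated with a short sketch. The function names and the example forecast are assumptions made for illustration; the identity S_B = 1 - S_Q follows from expanding the square in the Brier score when the probabilities sum to one.

```python
def quadratic_score(p, i):
    """Quadratic score 2*p_i - sum_j p_j^2 (higher is better); i is 0-based."""
    return 2 * p[i] - sum(pj ** 2 for pj in p)


def brier_score(p, i):
    """Brier score sum_j (y_j - p_j)^2 with one-hot y (lower is better); i is 0-based."""
    return sum(((1.0 if j == i else 0.0) - pj) ** 2 for j, pj in enumerate(p))


p = [0.7, 0.2, 0.1]  # hypothetical forecast over three classes
for outcome in range(len(p)):
    s_q = quadratic_score(p, outcome)
    s_b = brier_score(p, outcome)
    print(outcome, round(s_q, 4), round(s_b, 4), round(1 - s_q, 4))
```

For every outcome the last two printed columns coincide, which is why maximizing one score is the same as minimizing the other.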
Spherical score The spherical scoring rule is also a strictly proper scoring rule :\mathbf{S}(\mathbf{p},i) = \frac{p_i}{\lVert \mathbf{p} \rVert} = \frac{p_i}{\sqrt{p_1^2 + \cdots + p_m^2}} Its generalization with \alpha > 1 is also strictly proper, and reduces to the spherical score for \alpha = 2: :\mathbf{S}(\mathbf{p},i) = \frac{p_i^{\alpha-1}}{\left(\sum_{j=1}^m p_j^\alpha\right)^{(\alpha-1)/\alpha}}
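A minimal sketch of both formulas follows. The names `spherical_score`, `generalized_spherical_score` and `expected_score` are illustrative, and the grid search at the end is only a numerical illustration of propriety for one assumed true distribution, not a proof.

```python
import math


def spherical_score(p, i):
    """Spherical score p_i / ||p||_2 for observed index i (0-based)."""
    return p[i] / math.sqrt(sum(pj ** 2 for pj in p))


def generalized_spherical_score(p, i, alpha=2.0):
    """Generalized form; requires alpha > 1, and alpha = 2 gives the spherical score."""
    norm = sum(pj ** alpha for pj in p) ** ((alpha - 1.0) / alpha)
    return p[i] ** (alpha - 1.0) / norm


def expected_score(score, p, q):
    """Expected score of forecast p when outcomes are distributed according to q."""
    return sum(qi * score(p, i) for i, qi in enumerate(q))


q = [0.3, 0.7]  # assumed true distribution for the illustration
grid = [[k / 100, 1 - k / 100] for k in range(1, 100)]
best = max(grid, key=lambda p: expected_score(spherical_score, p, q))
print(best)  # the grid forecast with the highest expected spherical score
```

On this grid the expected score is maximized by reporting the true distribution, as strict propriety requires.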
Ranked Probability Score The ranked probability score (RPS) is a strictly proper scoring rule that can be expressed as: :RPS(\mathbf{p},i) = \sum_{k=1}^{m-1} \left(\sum_{j=1}^k (p_j - y_j)\right)^2 where y_j = 1 when the jth event is correct and y_j = 0 otherwise, and m is the number of classes. Unlike the other scoring rules above, the ranked probability score takes the ordering of the classes into account, i.e. classes 1 and 2 are considered closer than classes 1 and 3. The score is negatively oriented (lower is better) and assigns better scores to probabilistic forecasts with high probabilities assigned to classes close to the correct class. For example, when considering probabilistic forecasts \mathbf{p}_1 = (0.5, 0.5, 0) and \mathbf{p}_2 = (0.5, 0, 0.5), we find that RPS(\mathbf{p}_1,1) = 0.25, while RPS(\mathbf{p}_2,1) = 0.5, despite both probabilistic forecasts assigning identical probability to the correct class.
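The worked example can be reproduced with a short sketch. The function name is an illustrative choice, and the code uses 0-based class indices, so class 1 in the text corresponds to index 0 here.

```python
def ranked_probability_score(p, i):
    """RPS: sum over k = 1..m-1 of the squared difference between cumulative
    forecast probability and cumulative one-hot outcome; i is 0-based."""
    m = len(p)
    cum_p = 0.0
    cum_y = 0.0
    score = 0.0
    for k in range(m - 1):
        cum_p += p[k]
        cum_y += 1.0 if k == i else 0.0
        score += (cum_p - cum_y) ** 2
    return score


rps_1 = ranked_probability_score([0.5, 0.5, 0.0], 0)
rps_2 = ranked_probability_score([0.5, 0.0, 0.5], 0)
print(rps_1, rps_2)  # 0.25 0.5
```

The second forecast is penalized more because its misplaced probability sits two classes away from the observed class rather than one.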
Comparison of categorical strictly proper scoring rules Shown below on the left is a graphical comparison of the Logarithmic, Quadratic, and Spherical scoring rules for a
binary classification problem. The
x-axis indicates the reported probability for the event that actually occurred. It is important to note that each of the scores has a different magnitude and location. The magnitude differences are not relevant, however, as scores remain proper under positive affine transformation. Therefore, to compare different scores it is necessary to move them to a common scale. A reasonable choice of normalization is shown in the picture, where all scores pass through the points (0.5,0) and (1,1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores also yield 1 when the true class is assigned a probability of 1.
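The normalization described above can be sketched for the binary case as follows. The function names are illustrative; the rescaling is the positive affine map that sends each raw score at p = 0.5 to 0 and at p = 1 to 1.

```python
import math


def raw_scores(p):
    """Raw binary scores when the event that occurred was assigned probability p (0 < p <= 1)."""
    q = 1.0 - p
    return {
        "logarithmic": math.log(p),
        "quadratic": 2 * p - (p * p + q * q),
        "spherical": p / math.sqrt(p * p + q * q),
    }


def normalized_scores(p):
    """Rescale each score so that it equals 0 at p = 0.5 and 1 at p = 1."""
    at_half = raw_scores(0.5)
    at_one = raw_scores(1.0)
    raw = raw_scores(p)
    return {
        name: (raw[name] - at_half[name]) / (at_one[name] - at_half[name])
        for name in raw
    }


for prob in (0.1, 0.5, 0.9, 1.0):
    print(prob, {k: round(v, 3) for k, v in normalized_scores(prob).items()})
```

Because the scaling factor is positive for all three rules, the normalized versions remain strictly proper.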
Univariate continuous variables The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are univariate
continuous probability distributions, i.e. the predicted distributions F are defined over a univariate target variable Y \in \mathbb{R} and have a
probability density function f: \mathbb{R} \to \mathbb{R}_+. They can be categorized into three groups: • Scoring rules for predictions of the probability density f • Scoring rules for predictions of the CDF F • Scoring rules depending on the first and second moments only
Logarithmic score for continuous variables The logarithmic score is a local, strictly proper scoring rule. It is defined as :L(F,y) = - \ln(f(y)). With this sign convention the score is negatively oriented, so lower values are better. The logarithmic score for continuous variables has strong ties to
maximum likelihood estimation and to the
Kullback–Leibler divergence.
Quadratic score for continuous variables The quadratic scoring rule for continuous variables reads :S(f,y)= 2 f(y) - \|f\|_2^2 It is strictly proper for densities for which the squared norm \|f\|_2^2 = \int f(y)^2 \, dy is finite.
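As an illustration of the two density-based scores, the following sketch evaluates them for a Gaussian forecast. The choice of a Gaussian and the function names are assumptions made for the example; the closed form 1 / (2 \sigma \sqrt{\pi}) for the squared norm of a Gaussian density is a standard integral.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_log_score(y, mu, sigma):
    """Negatively oriented logarithmic score -ln f(y); lower is better."""
    return -math.log(gaussian_pdf(y, mu, sigma))


def gaussian_quadratic_score(y, mu, sigma):
    """Positively oriented quadratic score 2 f(y) - ||f||_2^2; higher is better."""
    squared_norm = 1.0 / (2.0 * sigma * math.sqrt(math.pi))  # integral of f(y)^2 dy
    return 2.0 * gaussian_pdf(y, mu, sigma) - squared_norm


# A sharp forecast centred on the observation versus a shifted one.
print(gaussian_log_score(0.0, 0.0, 1.0), gaussian_log_score(0.0, 2.0, 1.0))
print(gaussian_quadratic_score(0.0, 0.0, 1.0), gaussian_quadratic_score(0.0, 2.0, 1.0))
```

Note the opposite orientations: the well-centred forecast obtains the lower logarithmic score and the higher quadratic score.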
Continuous ranked probability score The continuous ranked probability score (CRPS) is a strictly proper scoring rule much used in meteorology. It is closely related to the one-dimensional
energy distance, and is defined as :CRPS(F,y)=\int_\mathbb{R} ( F(x) - H(x - y) ) ^2 dx where H is the
Heaviside step function and y \in \mathbb R is the observation. For distributions with finite first
moment, the continuous ranked probability score can be written as: :CRPS(F, y) = \mathbb{E}_{X \sim F}|X-y| - \frac{1}{2}\mathbb{E}_{X,X' \sim F}|X-X'| where X and X' are independent random variables, both sampled from the distribution F. This is the
energy form of CRPS and opens the door to estimating the CRPS via
Monte Carlo sampling (by approximating the expectations with sample averages). Furthermore, when the cumulative distribution function F is continuous, the continuous ranked probability score can also be written as :CRPS(F, y) = \mathbb{E}_{X \sim F}|X-y| + \mathbb{E}_{X \sim F}[X] - 2 \mathbb{E}_{X \sim F}[X \cdot F(X)] The continuous ranked probability score can be seen both as a continuous extension of the ranked probability score and as an extension of
quantile regression. The continuous ranked probability score over the
empirical distribution \hat F_q of an ordered set of points q_1 \leq \ldots \leq q_n (i.e. every point has 1/n probability of occurring) is equal to twice the mean
quantile loss applied on those points with evenly spread quantiles (\tau_1, \ldots, \tau_n) = (1/(2n), 3/(2n), \ldots, (2n-1)/(2n)): :CRPS\left(\hat F_q, y\right) = \frac{2}{n} \sum_{i=1}^n \left[ \tau_i (y - q_i)_+ + (1 - \tau_i) (q_i - y)_+ \right] where (u)_+ = \max(u, 0). For many popular families of distributions,
closed-form expressions for the continuous ranked probability score have been derived. The continuous ranked probability score has been used as a loss function for
artificial neural networks, in which weather forecasts are postprocessed to a
Gaussian probability distribution. CRPS was also adapted to
survival analysis to cover censored events. The CRPS can be thought of as the generalization of the
mean absolute error (MAE) to probabilistic forecasts; for a deterministic (point) forecast it reduces to the absolute error. Another way to think of it is as the integral, over all thresholds x, of the
Brier/quadratic score of the forecast probability F(x) for the binary event \{Y \leq x\}. CRPS is a special case of the
Cramér distance (or
Cramér's distance) and can be seen as an improvement of
Wasserstein distance often used in machine learning. Cramér distance performed better in
ordinal regression than
KL divergence or the Wasserstein metric. CRPS is widely used for evaluating probabilistic forecasts and is often compared against other scoring rules. It also has some critical theoretical limitations. It has been shown that CRPS can produce systematically misleading evaluations by favoring probabilistic forecasts whose medians are close to the observed outcome, regardless of the actual probability assigned to that region; this can result in more favorable scores for forecasts that allocate negligible (or even zero) probability mass to the true outcome. Furthermore, CRPS is not invariant under smooth transformations of the forecast variable, and its ranking of forecast systems may reverse under such transformations, raising concerns about its consistency for evaluation purposes.
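The energy form and the quantile-loss form of the CRPS can be compared on an empirical distribution with a short sketch. The function names are illustrative; `crps_energy` evaluates the energy form exactly for the empirical distribution of the given points (which is also the sample-average estimate one would use with Monte Carlo draws), and `crps_quantile` implements the quantile-loss expression with \tau_i = (2i-1)/(2n).

```python
import random


def crps_energy(points, y):
    """Energy form E|X - y| - 0.5 * E|X - X'| for the empirical distribution of `points`."""
    n = len(points)
    to_obs = sum(abs(x - y) for x in points) / n
    spread = sum(abs(a - b) for a in points for b in points) / (n * n)
    return to_obs - 0.5 * spread


def crps_quantile(points, y):
    """Twice the mean quantile (pinball) loss at levels tau_i = (2i - 1) / (2n)."""
    q = sorted(points)
    n = len(q)
    total = 0.0
    for i, qi in enumerate(q, start=1):
        tau = (2 * i - 1) / (2 * n)
        total += tau * max(y - qi, 0.0) + (1 - tau) * max(qi - y, 0.0)
    return 2.0 * total / n


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(50)]
print(crps_energy(sample, 0.3), crps_quantile(sample, 0.3))
```

The two printed values agree up to floating-point error, and for a single point the score reduces to the absolute error between that point and the observation.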
Dawid-Sebastiani score The Dawid-Sebastiani score (DSS) only depends on the first and second moments, or equivalently, the mean \mu_F and standard deviation \sigma_F of the distribution F: :S(F, y) = \left(\frac{y-\mu_F}{\sigma_F}\right)^2 + \log\sigma_F^2
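A minimal sketch of the formula follows; the function name is an illustrative choice. With the sign convention above, lower values are better, and for a Gaussian forecast the score is an affine function of the negatively oriented logarithmic score: -\ln f(y) = \tfrac{1}{2} S(F, y) + \tfrac{1}{2}\ln(2\pi).

```python
import math


def dawid_sebastiani_score(mu, sigma, y):
    """DSS ((y - mu) / sigma)^2 + log(sigma^2); uses only the mean and standard deviation."""
    return ((y - mu) / sigma) ** 2 + math.log(sigma ** 2)


# Same mean, different spreads, observation two units from the mean.
for sigma in (0.5, 1.0, 2.0, 4.0):
    print(sigma, round(dawid_sebastiani_score(0.0, sigma, 2.0), 3))
```

In this example the score is smallest when the forecast spread matches the size of the error, which penalizes both overconfident and overly diffuse forecasts.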
Multivariate continuous variables The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are multivariate
continuous probability distributions, i.e. the predicted distributions are defined over a multivariate target variable X \in \mathbb{R}^n and have a
probability density function f: \mathbb{R}^n \to \mathbb{R}_+.
Multivariate logarithmic score The multivariate logarithmic score is similar to the univariate logarithmic score: :L(D,y) = - \ln(f_D(y)) where f_D denotes the probability density function of the predicted multivariate distribution D. It is a local, strictly proper scoring rule.
Hyvärinen scoring rule The Hyvärinen scoring function (of a density p, evaluated at an observation y) is defined by :s(p, y) = 2 \Delta_y \log p(y) + \|\nabla_y \log p(y)\|_2^2 where \Delta denotes the
Hessian trace and \nabla denotes the
gradient. This scoring rule can be used to computationally simplify parameter inference and address Bayesian model comparison with arbitrarily-vague priors. It was also used to introduce new information-theoretic quantities beyond the existing
information theory. The Hyvärinen scoring rule is local of order 2 (meaning it locally takes into account derivatives up to second order).
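For a concrete instance, the score can be written in closed form for a Gaussian forecast with independent components, since \nabla_y \log p(y) has entries -(y_i - \mu_i)/\sigma_i^2 and the trace of the Hessian is -\sum_i 1/\sigma_i^2. The sketch below makes that assumption (a diagonal Gaussian), and the function name is illustrative.

```python
def hyvarinen_score_diag_gaussian(mu, sigma, y):
    """Hyvarinen score 2 * trace(Hessian of log p) + ||grad log p||^2
    for a Gaussian with independent components (means mu, standard deviations sigma)."""
    hessian_trace = sum(-1.0 / s ** 2 for s in sigma)
    grad_norm_sq = sum(((yi - m) / s ** 2) ** 2 for yi, m, s in zip(y, mu, sigma))
    return 2.0 * hessian_trace + grad_norm_sq


print(hyvarinen_score_diag_gaussian([0.0], [1.0], [0.0]))  # -2.0
print(hyvarinen_score_diag_gaussian([0.0], [1.0], [2.0]))  # 2.0
```

Only derivatives of \log p enter the formula, so the normalizing constant of the density never has to be computed.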
Energy score The energy score is a multivariate extension of the continuous ranked probability score: :ES_\beta(D, Y) = \mathbb{E}_{X \sim D}[\lVert X - Y \rVert_2^\beta] - \frac{1}{2} \mathbb{E}_{X,X' \sim D}[\lVert X - X' \rVert_2^\beta] Here, \beta \in (0, 2), \lVert \cdot \rVert_2 denotes the n-dimensional
Euclidean norm and X, X' are independently sampled random variables from the probability distribution D. The energy score is strictly proper for distributions D for which \mathbb{E}_{X \sim D}[\lVert X \rVert_2^\beta] is finite. It has been suggested that the energy score is somewhat ineffective when evaluating the intervariable dependency structure of the forecasted multivariate distribution. Apart from a term that depends only on the distribution of the observation, the energy score is equal to twice the
energy distance between the predicted distribution and the empirical distribution of the observation.
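When only samples from D are available, the two expectations can be replaced by sample averages. The following is a minimal sketch under that assumption; the function name, the standard-normal forecast and the sample size are illustrative, and the pairwise term includes the zero-valued i = j pairs, so it is a slightly biased plug-in estimate.

```python
import math
import random


def energy_score(samples, y, beta=1.0):
    """Sample-based estimate of ES_beta from draws `samples` (sequences of floats) and observation y."""
    n = len(samples)
    to_obs = sum(math.dist(x, y) ** beta for x in samples) / n
    spread = sum(math.dist(a, b) ** beta for a in samples for b in samples) / (n * n)
    return to_obs - 0.5 * spread


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
draws = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
near = energy_score(draws, (0.0, 0.0))  # observation at the forecast centre
far = energy_score(draws, (3.0, 3.0))   # observation far from the forecast
print(near, far)
```

As with the CRPS, smaller values indicate a better forecast under this sign convention, and in one dimension with \beta = 1 the estimate coincides with the energy form of the CRPS.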
Variogram score The
variogram score of order p is given by: :VS_p(D, Y) = \sum_{i,j=1}^n w_{ij} (|Y_i - Y_j|^p - \mathbb{E}_{X \sim D}[|X_i - X_j|^p])^2 Here, w_{ij} are weights, often set to 1, and p > 0 can be chosen freely, with p = 0.5, 1 or 2 being common choices. X_{i} denotes the i'th
marginal random variable of X. The variogram score is proper for distributions for which the (2p)'th
moment is finite for all components, but is never strictly proper. Compared to the energy score, the variogram score is claimed to be more discriminative with respect to the predicted correlation structure.
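A sample-based sketch of the variogram score is given below, with the expectation replaced by an average over draws from D. The function name and the example values are illustrative assumptions, and unit weights are used when none are supplied.

```python
def variogram_score(samples, y, p=0.5, weights=None):
    """Variogram score of order p, estimating E|X_i - X_j|^p by a sample average."""
    dim = len(y)
    n = len(samples)
    total = 0.0
    for i in range(dim):
        for j in range(dim):
            w = 1.0 if weights is None else weights[i][j]
            observed = abs(y[i] - y[j]) ** p
            expected = sum(abs(x[i] - x[j]) ** p for x in samples) / n
            total += w * (observed - expected) ** 2
    return total


obs = (0.0, 1.0, 3.0)
forecast = [(0.1, 0.9, 3.2), (-0.2, 1.3, 2.7), (0.0, 1.0, 3.1)]
shifted = [tuple(v + 10.0 for v in x) for x in forecast]  # same forecast moved by a constant
print(variogram_score(forecast, obs), variogram_score(shifted, obs))
```

The score depends only on pairwise differences between components, so shifting every component of the forecast by the same constant leaves it unchanged; this is one way to see why the score cannot be strictly proper.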
Conditional continuous ranked probability score The conditional continuous ranked probability score (Conditional CRPS or CCRPS) is a family of (strictly) proper scoring rules. Conditional CRPS evaluates a forecasted multivariate distribution D by evaluation of CRPS over a prescribed set of univariate
conditional probability distributions of the predicted multivariate distribution: :CCRPS_{\mathcal{T}}(D,Y) = \sum_{i=1}^k CRPS(P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i), Y_{v_i}) Here, X_i is the i'th marginal variable of X \sim D, \mathcal{T} = (v_i, \mathcal{C}_i)_{i=1}^k is a set of tuples that defines a conditional specification (with v_i \in \{1, \ldots, n\} and \mathcal{C}_i \subseteq \{1, \ldots, n\} \setminus \{v_i\}), and P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i) denotes the conditional probability distribution for X_{v_i} given that all variables X_j for j \in \mathcal{C}_i are equal to their respective observations. In the case that P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i) is ill-defined (i.e. its conditional event has zero likelihood), CRPS scores over this distribution are defined as infinite. Conditional CRPS is strictly proper for distributions with finite first moment, if the
chain rule is included in the conditional specification, meaning that there exists a permutation \phi_1, \ldots, \phi_n of 1, \ldots, n such that for all 1 \leq i \leq n: (\phi_i, \{\phi_1, \ldots, \phi_{i-1}\}) \in \mathcal{T}.
Interpretation of proper scoring rules All proper scoring rules are equal to weighted sums (integral with a non-negative weighting functional) of the losses in a set of simple two-alternative decision problems that
use the probabilistic prediction, each such decision problem having a particular combination of associated cost parameters for
false positive and false negative decisions. A
strictly proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds. Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds; thus the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed, with for example the quadratic loss (or Brier) scoring rule corresponding to a uniform probability of the decision threshold being anywhere between zero and one. The
classification accuracy score (percent classified correctly), a single-threshold scoring rule which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule because it is optimized (in expectation) not only by predicting the true probability but by predicting
any probability on the same side of 0.5 as the true probability. == Examples of consistent scoring functions ==