From the confusion matrix of a multiclass model, we can determine whether the model does better than chance. Let K \geq 3 be the number of classes, \mathcal{O} a set of observations, \hat{y}: \mathcal{O} \to \{1, ..., K\} a model of the target variable y: \mathcal{O} \to \{1, ..., K\}, and n_{i,j} the number of observations in the set \{y=i\} \cap \{\hat{y}=j\}. We write n_{i.} = \sum_j n_{i,j}, n_{.j} = \sum_i n_{i,j}, n = \sum_j n_{.j} = \sum_i n_{i.}, \lambda_i = \frac{n_{i.}}{n} and \mu_j = \frac{n_{.j}}{n}. It is assumed that the confusion matrix (n_{i,j})_{i,j} contains at least one non-zero entry in each row, that is, \lambda_i > 0 for any i. Finally, we call "normalized confusion matrix" the matrix of conditional probabilities (\mathbb{P}(\hat{y}=j \mid y=i))_{i,j} = \left(\frac{n_{i,j}}{n_{i.}}\right)_{i,j}.
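These quantities can be computed directly from the confusion matrix. The following minimal sketch uses NumPy and a small, purely illustrative 3-class confusion matrix (the counts are made up):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical 3-class confusion matrix: cm[i, j] = n_{i,j}, the number of
# observations with y = i+1 and y_hat = j+1 (0-based indices in the code).
cm = np.array([[50, 10,  5],
               [ 8, 40, 12],
               [ 4,  6, 30]])

n_i_dot = cm.sum(axis=1)      # row totals n_{i.}
n_dot_j = cm.sum(axis=0)      # column totals n_{.j}
n = cm.sum()                  # total number of observations

lam = n_i_dot / n             # lambda_i = n_{i.} / n  (class priors)
mu = n_dot_j / n              # mu_j = n_{.j} / n      (prediction rates)

# Normalized confusion matrix: P(y_hat = j | y = i) = n_{i,j} / n_{i.}
# (this requires every row total to be non-zero, i.e. lambda_i > 0).
norm_cm = cm / n_i_dot[:, None]
print(norm_cm.sum(axis=1))    # each row sums to 1
</syntaxhighlight>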
==Intuitive explanation==
The lift is a way of measuring the deviation from independence of two events A and B: \mathrm{Lift}(A,B)= \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A) \mathbb{P}(B)} = \frac{\mathbb{P}(A \mid B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(B \mid A)}{\mathbb{P}(B)}. We have \mathrm{Lift}(A,B) > 1 if and only if the events A and B occur simultaneously with a greater probability than if they were independent; in other words, if one of the two events occurs, the probability of observing the other increases. A first condition for a model to do better than chance is therefore \mathrm{Lift}(y=i, \hat{y}=i) \geq 1 for any i. Moreover, the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is, if we multiply each row R_i of the confusion matrix by a constant c_i > 0. The second condition is therefore that the necessary and sufficient conditions for doing better than chance depend only on the normalized confusion matrix. The condition on lifts can be reformulated with One-versus-Rest binary models: for any i, we define the binary target variable y_i as the indicator of the event \{y=i\}, and the binary model \hat{y}_i of y_i as the indicator of the event \{\hat{y}=i\}. Each of the \hat{y}_i models is a "One versus Rest" model. \mathrm{Lift}(y=i, \hat{y}=i) depends only on the events \{y = i\} and \{\hat{y} = i\}, so merging or not merging the other classes does not change its value. We therefore have \mathrm{Lift}(y=i, \hat{y}=i) = \mathrm{Lift}(y_i=1, \hat{y}_i=1), and the first condition states that all binary One-versus-Rest models are better than chance.
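As an illustration, the sketch below (again with a made-up matrix) computes \mathrm{Lift}(y=i, \hat{y}=i) for each class and checks numerically that collapsing the other classes into a single "rest" class does not change this lift:

<syntaxhighlight lang="python">
import numpy as np

cm = np.array([[50, 10,  5],
               [ 8, 40, 12],
               [ 4,  6, 30]], dtype=float)

def lift(cm, i):
    """Lift(y=i, y_hat=i) = n_{i,i} * n / (n_{i.} * n_{.i})."""
    return cm[i, i] * cm.sum() / (cm[i, :].sum() * cm[:, i].sum())

def one_vs_rest(cm, i):
    """Merge every class other than i into a single 'rest' class."""
    rest = [k for k in range(cm.shape[0]) if k != i]
    return np.array([
        [cm[i, i],             cm[i, rest].sum()],
        [cm[rest][:, i].sum(), cm[rest][:, rest].sum()],
    ])

for i in range(cm.shape[0]):
    # The lift of class i is unchanged by the One-versus-Rest merging.
    assert np.isclose(lift(cm, i), lift(one_vs_rest(cm, i), 0))
    print(i, lift(cm, i))
</syntaxhighlight>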
===Example===
If K=2 and 2 is the class of interest, the normalized confusion matrix is \begin{pmatrix} \mathrm{specificity} & 1-\mathrm{specificity} \\ 1-\mathrm{sensitivity} & \mathrm{sensitivity} \end{pmatrix} and we have \mathrm{Lift}(y=1, \hat{y}=1)-1 = \frac{\mathbb{P}(y = \hat{y} = 1)}{\lambda_1 \mu_1}-1 = \frac{n_{1,1} n}{n_{1.} n_{.1}}-1 = \frac{n_{1,1} (n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2}) - (n_{1,1}+n_{1,2}) (n_{1,1}+n_{2,1})}{n_{1.} n_{.1}}= \frac{n_{1,1} n_{2,2} - n_{1,2} n_{2,1}}{n_{1.} n_{.1}}. Thus \mathrm{Lift}(y=1, \hat{y}=1)\geq 1 \iff n_{1,1} n_{2,2} - n_{1,2} n_{2,1}\geq 0. Similarly, by swapping the roles of 1 and 2, we find that \mathrm{Lift}(y=2, \hat{y}=2) \geq 1 \iff n_{1,1} n_{2,2} - n_{1,2} n_{2,1} \geq 0. Dividing by n_{1.} n_{2.}, we find that the necessary and sufficient condition on the normalized confusion matrix is \mathrm{sensitivity} \ \mathrm{specificity} - (1-\mathrm{sensitivity}) (1-\mathrm{specificity}) \geq 0 \iff \mathrm{sensitivity} + \mathrm{specificity} -1 \geq 0 \iff J \geq 0. This brings us back to the classical binary condition:
Youden's J must be positive (or zero for random models).
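In the binary case, this condition can be checked directly from the counts, since the sign of n_{1,1} n_{2,2} - n_{1,2} n_{2,1} is the sign of J. A short sketch with an arbitrary example matrix:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical binary confusion matrix, class 2 being the class of interest:
# rows = true class (1, 2), columns = predicted class (1, 2).
cm = np.array([[80, 20],
               [10, 40]], dtype=float)

specificity = cm[0, 0] / cm[0, :].sum()   # P(y_hat=1 | y=1)
sensitivity = cm[1, 1] / cm[1, :].sum()   # P(y_hat=2 | y=2)
J = sensitivity + specificity - 1         # Youden's J

det = cm[0, 0] * cm[1, 1] - cm[0, 1] * cm[1, 0]   # n11*n22 - n12*n21
print(J, det, np.sign(J) == np.sign(det))         # 0.6 3000.0 True
</syntaxhighlight>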
==Random models==
A random model is a model that is independent of the target variable. This property is easily reformulated in terms of the confusion matrix. {{Math theorem|name=Proposition|math_statement = The model \hat{y} of y is random if and only if the confusion matrix is of rank 1. }} {{Math proof|title=Proof|proof= \hat{y} is a random model of y if and only if we have \mathbb{P}(\{y = i\} \cap \{\hat{y} = j\})= \mathbb{P}(y = i) \mathbb{P}(\hat{y} = j) for any i and j, which is equivalent to n_{i,j}n = n_{i.} n_{.j} for any i and j. All the columns of the confusion matrix are then proportional to the non-zero vector (n_{i.})_i, which implies that the confusion matrix is of rank 1. Conversely, if this matrix is of rank 1, its non-zero columns are proportional to each other, and therefore proportional to their sum (n_{i.})_i. So there exists a family of numbers (\beta_j)_j such that n_{i,j} = n_{i.} \beta_j for any i and j. Summing these equations over i gives n_{.j} = \beta_j n, hence n_{i,j} n = n_{i.} n_{.j} for any i and j.}} This proposition shows that the model \hat{y} of y is uninformative if and only if there are two families of numbers (\alpha_i)_i and (\beta_j)_j such that \mathbb{P}(\{y = i\} \cap \{\hat{y} = j\}) = \alpha_i \beta_j for any i and j.
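The rank-1 characterization yields a simple numerical test: a confusion matrix is that of a random model exactly when it equals the outer product of its margins divided by n. A minimal sketch (the matrices are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def is_random_model(cm):
    """True iff n_{i,j} * n == n_{i.} * n_{.j} for all i, j, i.e. the
    confusion matrix has rank 1 and the model is independent of the target."""
    cm = np.asarray(cm, dtype=float)
    expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / cm.sum()
    return np.allclose(cm, expected)

# Every row is proportional to (1, 2, 1): rank 1, hence a random model.
print(is_random_model([[10, 20, 10],
                       [ 5, 10,  5],
                       [20, 40, 20]]))   # True

print(is_random_model([[50, 10,  5],
                       [ 8, 40, 12],
                       [ 4,  6, 30]]))   # False
</syntaxhighlight>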
==Multiclass likelihood ratios and diagnostic odds ratios==
We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i and j\not=i, let \mathrm{LR}_{i,j} = \frac{\mathbb{P}( \hat{y}=j \mid y=j)}{\mathbb{P}( \hat{y}=j \mid y=i)}. When K=2, if 2 is the class of interest, we find the classical
likelihood ratios \mathrm{LR}_{1,2}= \mathrm{LR}_+ and \mathrm{LR}_{2,1}= \frac{1}{\mathrm{LR}_-}. Multiclass
diagnostic odds ratios can also be defined using the formula \mathrm{DOR}_{i,j} = \mathrm{DOR}_{j,i} = \mathrm{LR}_{i,j} \mathrm{LR}_{j,i} = \frac{n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}= \frac{\mathbb{P}( \hat{y}=j \mid y=j)/ \mathbb{P}( \hat{y}=i \mid y=j)}{\mathbb{P}( \hat{y}=j \mid y=i) / \mathbb{P}( \hat{y}=i \mid y=i)}. {{Math theorem|name = Theorem|math_statement = For any j, \mathbb{P}( \hat{y}=j \mid y=j) - \mu_j = \sum_i \lambda_i(\mathbb{P}( \hat{y}=j \mid y=j) - \mathbb{P}( \hat{y}=j \mid y=i)). Equivalently, if all n_{i,j} are non-zero: \frac{1}{\mathrm{Lift}(y=j,\hat{y}=j)} = \sum_i \frac{\lambda_i }{\mathrm{LR}_{i,j}} }} {{Math proof|title=Proof|proof= \mathbb{P}( \hat{y}=j \mid y=j) - \mathbb{P}(\hat{y}=j) = \mathbb{P}( \hat{y}=j \mid y=j) - \sum_i \lambda_i \mathbb{P}( \hat{y}=j \mid y=i) = \sum_i \lambda_i(\mathbb{P}( \hat{y}=j \mid y=j) - \mathbb{P}( \hat{y}=j \mid y=i)). Dividing by \mathbb{P}( \hat{y}=j \mid y=j) and rearranging gives the second formulation. }} {{Math theorem|name = Corollary|math_statement = If all probabilities \mathbb{P}( \hat{y}=k \mid y=l) are fixed, for any i and j we have \lim_{\lambda_i \to 1}(\mathbb{P}(\hat{y} =j \mid y=j) - \mu_j) = \mathbb{P}(\hat{y} =j \mid y=j) - \mathbb{P}(\hat{y} =j \mid y=i). Equivalently, if all n_{i,j} are non-zero: \lim_{\lambda_i \to 1} \mathrm{Lift}(y=j,\hat{y}=j) = \mathrm{LR}_{i,j} }} We saw above that a better-than-chance model (or a random model) must satisfy \mathrm{Lift}(y=i, \hat{y}=i) \geq 1 for any i and any class distribution (\lambda_i)_i. According to the previous corollary, its likelihood ratios are thus greater than or equal to 1. Conversely, if all the likelihood ratios are greater than or equal to 1, the theorem shows that \mathrm{Lift}(y=j,\hat{y}=j) \geq 1 for any j and any (\lambda_i)_i, since \frac{1}{\mathrm{Lift}(y=j,\hat{y}=j)} = \sum_i \frac{\lambda_i}{\mathrm{LR}_{i,j}} \leq \sum_i \lambda_i = 1.
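These ratios and the identity in the theorem can be checked numerically on a confusion matrix whose entries are all non-zero; the sketch below uses an illustrative matrix:

<syntaxhighlight lang="python">
import numpy as np

cm = np.array([[50, 10,  5],
               [ 8, 40, 12],
               [ 4,  6, 30]], dtype=float)
lam = cm.sum(axis=1) / cm.sum()                 # lambda_i
mu = cm.sum(axis=0) / cm.sum()                  # mu_j
norm_cm = cm / cm.sum(axis=1, keepdims=True)    # P(y_hat=j | y=i)

# Generalized likelihood ratios LR[i, j] = P(y_hat=j | y=j) / P(y_hat=j | y=i)
LR = norm_cm.diagonal()[None, :] / norm_cm

# Multiclass diagnostic odds ratios DOR[i, j] = LR[i, j] * LR[j, i]
DOR = LR * LR.T

# Identity from the theorem: 1 / Lift(y=j, y_hat=j) = sum_i lambda_i / LR[i, j]
lift = norm_cm.diagonal() / mu
assert np.allclose(1 / lift, (lam[:, None] / LR).sum(axis=0))
print(LR)
print(DOR)
</syntaxhighlight>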
==Definition of better-than-chance multiclass models==
A model \hat{y} of y outperforms chance if the following conditions are met:
* For any j, we have \max_i \mathbb{P}( \hat{y}=j \mid y=i) = \mathbb{P}( \hat{y}=j \mid y=j).
* There are distinct i and j such that \mathbb{P}( \hat{y}=j \mid y=i) < \mathbb{P}( \hat{y}=j \mid y=j).
If all the entries of the confusion matrix are non-zero, this means that all the likelihood ratios are greater than or equal to 1, and at least one of these inequalities is strict. A model that satisfies the first condition but not the second is random, since we then have \mathbb{P}( \{\hat{y}=j\} \cap \{y=i\}) = \mathbb{P}(y=i) \mathbb{P}( \hat{y}=j \mid y=i) = \mathbb{P}(y=i) \mathbb{P}( \hat{y}=j \mid y=j) = \alpha_i \beta_j for any i and j. We can rewrite the first condition in a more familiar way, denoting by x the observed value of \hat{y}, by \theta the value of y to be estimated, and by \hat{\theta}(x) the set \operatorname{argmax}_\theta \mathbb{P}(x \mid \theta): for any x, we have x \in \hat{\theta}(x). We deduce that a model is better than chance or random if and only if it is a
maximum likelihood estimator of the target variable.
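The definition translates into a direct test on the normalized confusion matrix: every column maximum must be attained on the diagonal, with the random case corresponding to columns that are constant. A minimal sketch (the function name and return labels are chosen here for illustration):

<syntaxhighlight lang="python">
import numpy as np

def classify_model(cm):
    """Return 'better than chance', 'random' or 'worse than chance' for a
    confusion matrix cm, based on the column maxima of the normalized matrix."""
    cm = np.asarray(cm, dtype=float)
    norm_cm = cm / cm.sum(axis=1, keepdims=True)
    diag = norm_cm.diagonal()
    if not np.all(norm_cm.max(axis=0) == diag):
        return "worse than chance"       # some column maximum is off-diagonal
    if np.allclose(norm_cm, diag[None, :]):
        return "random"                  # every column is constant (rank 1)
    return "better than chance"

print(classify_model([[50, 10,  5],
                      [ 8, 40, 12],
                      [ 4,  6, 30]]))    # better than chance
print(classify_model([[0, 3, 0],
                      [1, 2, 0],
                      [0, 0, 3]]))       # worse than chance
</syntaxhighlight>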
==Applications==
===Multiclass balanced accuracy===
The performance of a better-than-chance model can be estimated using multiclass versions of metrics such as balanced accuracy or Youden's J. {{Math theorem|name = Definition|math_statement = \mathrm{Balanced \ accuracy} = \frac{1}{K}\sum_i \mathbb{P}(\hat{y}=i \mid y=i) and \mathrm{J} = \frac{1}{K-1} (K \, \mathrm{balanced \ accuracy} - 1) = \frac{1}{K-1}\sum_i \mu_i (\mathrm{Lift}(y=i,\hat{y}=i)-1) }} If \mathrm{balanced \ accuracy} = 1, in other words J=1, the model is perfect. And for any random model, we have \mathrm{balanced \ accuracy} = \frac{1}{K} (if, for example, we draw one of the K labels uniformly at random, we have exactly one chance in K of predicting the correct value of the target variable). On a balanced data set (\lambda_i = \frac{1}{K} for any i), balanced accuracy is equal to the rate of well-classified observations. On any data set, if a model does better than chance, we have J \geq 0 and \mathrm{balanced \ accuracy} \geq \frac{1}{K}. But the converse is not true when K>2, as the following example shows: the confusion matrix \begin{pmatrix} 0 & 3 & 0\\ 1 & 2 & 0\\ 0 & 0 & 3 \end{pmatrix} is that of a bad model (i.e., worse than chance) since \mathrm{LR}_{2,1}=0. However, 5 of the 9 observations are correctly classified. This also shows that poor model performance on one of the classes is not compensated for by good performance on the other classes.
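These metrics are straightforward to compute; the sketch below evaluates them on the counterexample above, for which balanced accuracy exceeds \frac{1}{K} even though the model is worse than chance:

<syntaxhighlight lang="python">
import numpy as np

def balanced_accuracy(cm):
    """Mean of the diagonal of the normalized confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    norm_cm = cm / cm.sum(axis=1, keepdims=True)
    return norm_cm.diagonal().mean()

def multiclass_J(cm):
    """J = (K * balanced_accuracy - 1) / (K - 1)."""
    K = len(cm)
    return (K * balanced_accuracy(cm) - 1) / (K - 1)

cm = [[0, 3, 0],
      [1, 2, 0],
      [0, 0, 3]]
print(balanced_accuracy(cm))   # 5/9, above 1/K = 1/3
print(multiclass_J(cm))        # 1/3, above 0, despite LR_{2,1} = 0
</syntaxhighlight>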
===ROC space===
The set of normalized confusion matrices is called the ROC space, a subset of \mathopen{[}0,1\mathclose{]}^{K^2}. If E denotes the subset of the ROC space made up of random models or models that do better than chance, one can show that the topological boundary of E is the set of elements of E for which at least one of the likelihood ratios is equal to 1; the random models are those whose likelihood ratios are all equal to 1. When K=2, the boundary between models that do better than chance and bad models is exactly the set of random models (see the article on the ROC curve for more details), but the boundary is strictly larger than the set of random models as soon as K>2. For K=3, one can compute the volume occupied by bad models in the ROC space: they occupy 90% of this space, whereas the proportion is only 50% when K=2.
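These proportions can be estimated by Monte Carlo, under the assumption (made here for illustration) that the volume refers to the uniform measure on the product of probability simplices, i.e. each row of the normalized confusion matrix is drawn independently and uniformly from the simplex:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def fraction_worse_than_chance(K, n_samples=200_000):
    """Estimate the fraction of the ROC space occupied by models that fail
    the better-than-chance criterion, i.e. some column maximum of the
    normalized confusion matrix is not attained on the diagonal."""
    # Dirichlet(1, ..., 1) is the uniform distribution on the simplex.
    rows = rng.dirichlet(np.ones(K), size=(n_samples, K))  # shape (n, K, K)
    diag = np.diagonal(rows, axis1=1, axis2=2)             # P(y_hat=j | y=j)
    good = (rows.max(axis=1) == diag).all(axis=1)          # column maxima on the diagonal
    return 1 - good.mean()

print(fraction_worse_than_chance(2))   # about 0.5
print(fraction_worse_than_chance(3))   # compare with the 90% figure quoted above
</syntaxhighlight>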
==General algorithmic strategies==