
Cohen's kappa

Cohen's kappa coefficient (κ) is a statistic used to measure inter-rater reliability for qualitative or categorical data. It is generally thought to be a more robust measure than a simple percent-agreement calculation, as κ takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen's kappa due to the difficulty in interpreting indices of agreement; some researchers have suggested that it is conceptually simpler to evaluate disagreement between items. Cohen's kappa coefficient ranges from -1 to 1.

History
The first mention of a kappa-like statistic is attributed to Galton in 1892. The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.
Definition
Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of \kappa is:

:\kappa \equiv \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e},

where p_o is the relative observed agreement among raters, and p_e is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly selecting each category. If the raters are in complete agreement then \kappa = 1. If there is no agreement among the raters other than what would be expected by chance (as given by p_e), \kappa = 0. It is possible for the statistic to be negative, which can occur by chance if there is no relationship between the ratings of the two raters, or it may reflect a real tendency of the raters to give differing ratings.

For C categories, N observations to categorize and n_{ki} the number of times rater i predicted category k:

:p_e = \frac{1}{N^2} \sum_k n_{k1} n_{k2}

This is derived from the following construction:

:p_e = \sum_k \widehat{p_{k12}} \overset{\text{ind.}}{=} \sum_k \widehat{p_{k1}}\widehat{p_{k2}} = \sum_k \frac{n_{k1}}{N}\frac{n_{k2}}{N} = \frac{1}{N^2} \sum_k n_{k1} n_{k2}

where \widehat{p_{k12}} is the estimated probability that both rater 1 and rater 2 will classify the same item as k, while \widehat{p_{k1}} is the estimated probability that rater 1 will classify an item as k (and similarly for rater 2). The relation \widehat{p_{k12}} = \widehat{p_{k1}}\widehat{p_{k2}} is based on the assumption that the ratings of the two raters are independent. The term \widehat{p_{k1}} is estimated using the number of items classified as k by rater 1 (n_{k1}) divided by the total number of items to classify (N): \widehat{p_{k1}} = \frac{n_{k1}}{N} (and similarly for rater 2).

Binary classification confusion matrix

In the traditional 2 × 2 confusion matrix employed in machine learning and statistics to evaluate binary classifications, Cohen's kappa formula can be written as:

:\kappa = \frac{2 \times (TP \times TN - FN \times FP)}{(TP + FP) \times (FP + TN) + (TP + FN) \times (FN + TN)}

where TP are the true positives, FP the false positives, TN the true negatives, and FN the false negatives. In this case, Cohen's kappa is equivalent to the Heidke skill score known in meteorology. The measure was first introduced by Myrick Haskell Doolittle in 1888.
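As a concrete illustration of the definitions above, the following short Python sketch computes p_o, p_e and \kappa directly from two raters' label lists. The function name and the sample data are illustrative only, not part of the original article.

<syntaxhighlight lang="python">
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa for two raters' categorical labels of the same N items."""
    assert len(ratings1) == len(ratings2)
    n = len(ratings1)

    # Relative observed agreement p_o: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n

    # Chance agreement p_e = (1 / N^2) * sum_k n_k1 * n_k2,
    # where n_ki is how often rater i used category k.
    counts1, counts2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters labeling six items as "A" or "B".
print(cohens_kappa(["A", "A", "B", "B", "A", "B"],
                   ["A", "B", "B", "B", "A", "A"]))
</syntaxhighlight>

With these sample labels both raters use each category half the time, so p_e = 0.5; the observed agreement of 4/6 then gives \kappa ≈ 0.33.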
Examples
Simple example

Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the disagreement count data were as follows, where A and B are readers, data on the main diagonal of the matrix (a and d) count the number of agreements and off-diagonal data (b and c) count the number of disagreements:

          B: Yes   B: No
A: Yes      20       5
A: No       10      15

The observed proportionate agreement is:

:p_o = \frac{a+d}{a+b+c+d} = \frac{20+15}{50} = 0.7

To calculate p_e (the probability of random agreement) we note that:
• Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
• Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.

So the expected probability that both would say "Yes" at random is:

:p_\text{Yes} = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d} = 0.5 \times 0.6 = 0.3

Similarly:

:p_\text{No} = \frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d} = 0.5 \times 0.4 = 0.2

The overall random agreement probability is the probability that they agreed on either "Yes" or "No", i.e.:

:p_e = p_\text{Yes} + p_\text{No} = 0.3 + 0.2 = 0.5

Applying our formula for Cohen's kappa we get:

:\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.7-0.5}{1-0.5} = 0.4

Same percentages but different numbers

A case sometimes considered to be a problem with Cohen's kappa occurs when comparing the kappa calculated for two pairs of raters, with the two raters in each pair having the same percentage agreement, but one pair giving a similar number of ratings in each class while the other pair gives a very different number of ratings in each class. (In the cases below, notice B has 70 yeses and 30 nos in the first case, but those numbers are reversed in the second.) In the following two cases there is equal agreement between A and B (60 out of 100 in both cases) in terms of agreement in each class, so we would expect the relative values of Cohen's kappa to reflect this:

Case 1:
          B: Yes   B: No
A: Yes      45      15
A: No       25      15

Case 2:
          B: Yes   B: No
A: Yes      25      35
A: No        5      35

However, calculating Cohen's kappa for each:

:\kappa = \frac{0.60-0.54}{1-0.54} = 0.1304

:\kappa = \frac{0.60-0.46}{1-0.46} = 0.2593

we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur "by chance" is significantly higher in the first case (0.54 compared to 0.46).
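To make the arithmetic in both examples easy to check, here is a small Python sketch (the helper name is illustrative) that computes kappa from the four cells of a 2 × 2 agreement table:

<syntaxhighlight lang="python">
def kappa_from_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 table: a, d = agreements; b, c = disagreements."""
    n = a + b + c + d
    p_o = (a + d) / n                          # observed agreement
    p_yes = ((a + b) / n) * ((a + c) / n)      # both say "Yes" by chance
    p_no = ((c + d) / n) * ((b + d) / n)       # both say "No" by chance
    p_e = p_yes + p_no                         # total chance agreement
    return (p_o - p_e) / (1 - p_e)

print(kappa_from_2x2(20, 5, 10, 15))   # grant example: 0.4
print(kappa_from_2x2(45, 15, 25, 15))  # case 1: ~0.1304
print(kappa_from_2x2(25, 35, 5, 35))   # case 2: ~0.2593
</syntaxhighlight>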
Properties
Hypothesis testing and confidence interval

The p-value for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators.

Some researchers have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can make it unreliable for measuring agreement in situations such as the diagnosis of rare diseases. In these situations, κ tends to underestimate the agreement on the rare category. For this reason, κ is considered an overly conservative measure of agreement. Others contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess, which is a very unrealistic scenario. Moreover, some works have shown how kappa statistics can lead to a wrong conclusion for unbalanced data.
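The prevalence issue described above can be seen numerically. In the sketch below (the figures are invented for illustration and the helper mirrors the 2 × 2 function shown earlier), two raters agree on 85% of items, yet kappa is low because one category dominates and most of that agreement is already expected by chance.

<syntaxhighlight lang="python">
def kappa_from_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 table: a, d = agreements; b, c = disagreements."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Rare "positive" category: the raters agree on 85 of 100 items,
# but nearly all of that agreement falls on the common "negative" class.
print(kappa_from_2x2(2, 8, 7, 83))
</syntaxhighlight>

Here p_o = 0.85 but p_e ≈ 0.83, so \kappa ≈ 0.13 despite the high raw agreement.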
Related statistics
Scott's Pi

A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how p_e is calculated.

Fleiss' kappa

Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa. Kappa is also used to compare performance in machine learning, but the directional version known as informedness or Youden's J statistic is argued to be more appropriate for supervised learning.

Weighted kappa

The weighted kappa allows disagreements to be weighted differently. The equation for weighted κ is:

:\kappa = 1 - \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} x_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} m_{ij}}

where k is the number of codes and w_{ij}, x_{ij}, and m_{ij} are elements in the weight, observed, and expected matrices, respectively. When diagonal cells contain weights of 0 and all off-diagonal cells contain weights of 1, this formula produces the same value of kappa as the calculation given above.
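The following Python sketch implements the weighted-kappa equation above. The function name, the NumPy-based layout, and the default 0/1 weight matrix are assumptions made for illustration, not part of the original article.

<syntaxhighlight lang="python">
import numpy as np

def weighted_kappa(ratings1, ratings2, categories, weights=None):
    """Weighted Cohen's kappa: kappa = 1 - sum(w * x) / sum(w * m)."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings1)

    # Observed matrix x: joint proportions of (rater 1, rater 2) labels.
    x = np.zeros((k, k))
    for r1, r2 in zip(ratings1, ratings2):
        x[index[r1], index[r2]] += 1 / n

    # Expected matrix m: outer product of the two raters' marginal proportions.
    m = np.outer(x.sum(axis=1), x.sum(axis=0))

    # Default weight matrix w: 0 on the diagonal, 1 off-diagonal,
    # which reproduces the unweighted kappa described earlier.
    if weights is None:
        weights = 1 - np.eye(k)

    return 1 - (weights * x).sum() / (weights * m).sum()

# With the default 0/1 weights this matches unweighted kappa for the same labels.
print(weighted_kappa(["A", "A", "B", "B", "A", "B"],
                     ["A", "B", "B", "B", "A", "A"],
                     categories=["A", "B"]))
</syntaxhighlight>

Supplying a weight matrix that grows with the distance between ordered codes makes large disagreements count more heavily than near-misses, which is the usual motivation for the weighted form.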