Scott's pi is similar to
Cohen's kappa in that both improve on simple observed agreement by factoring in the extent of agreement that might be expected by chance. However, the two statistics calculate this expected agreement slightly differently. Scott's pi uses as its baseline annotators who are not only independent but also share the same distribution of responses;
Cohen's kappa uses a baseline in which the annotators are assumed to be independent but to have their own, different distributions of responses. Thus, Scott's pi measures disagreement between the annotators relative to the level of agreement expected by pure chance if the annotators were independent and identically distributed, whereas Cohen's kappa measures only the disagreement over and above any systematic, average disagreement the annotators might have: it discounts that systematic, average disagreement before comparing the annotators, and so assesses only randomly varying disagreements, not systematic ones. Scott's pi is extended to more than two annotators by
Fleiss' kappa. The equation for Scott's pi, as in
Cohen's kappa, is:

:\pi = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)},

where Pr(a) is the observed proportion of agreement between the annotators. Pr(e), however, is calculated using squared "joint proportions", which are squared arithmetic means of the marginal proportions (whereas Cohen's kappa uses squared geometric means of them). Therefore, for C categories, N observations to categorize and n_{ki} the number of times rater i assigned category k:

:\Pr(e) = \sum_{k \in C} \left(\frac{n_{k1} + n_{k2}}{2N}\right)^2
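A minimal Python sketch of this calculation for two annotators is given below; the function name scotts_pi and the example labels are illustrative assumptions, not part of the original formulation.

<syntaxhighlight lang="python">
from collections import Counter

def scotts_pi(ratings_a, ratings_b):
    """Scott's pi for two annotators who have rated the same items."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both annotators must rate the same items")
    n = len(ratings_a)

    # Pr(a): observed proportion of items on which the two annotators agree.
    pr_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Pr(e): sum over categories of the squared joint proportion,
    # i.e. the squared arithmetic mean of the two marginal proportions.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    pr_e = sum(((counts_a[k] + counts_b[k]) / (2 * n)) ** 2 for k in categories)

    return (pr_a - pr_e) / (1 - pr_e)

# Illustrative data: two annotators labelling 10 items as "yes" or "no".
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"]
print(round(scotts_pi(a, b), 3))  # 0.394
</syntaxhighlight>

On these illustrative data the annotators agree on 7 of 10 items, so Pr(a) = 0.7; the joint proportions are 0.55 and 0.45, giving Pr(e) = 0.55^2 + 0.45^2 = 0.505 and pi = (0.7 - 0.505)/(1 - 0.505) ≈ 0.394.

== Worked example ==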