Cohen's kappa coefficient (κ, lowercase Greek kappa) is a statistic used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally considered a more robust measure than simple percent agreement, because κ takes into account the possibility of agreement occurring by chance. There is some controversy surrounding Cohen's kappa owing to the difficulty of interpreting indices of agreement, and some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.

The first mention of a kappa-like statistic is attributed to Galton in 1892. The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.

Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. It is defined as

\kappa = \frac{p_o - p_e}{1 - p_e},

where p_o is the relative observed agreement among raters, and p_e is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by p_e), then κ = 0. The statistic can also be negative, which may occur by chance when there is no relationship between the ratings of the two raters, or it may reflect a real tendency of the raters to give differing ratings.

For k categories, N observations to categorize, and n_{ki} the number of times rater i predicted category k,

p_e = \frac{1}{N^2} \sum_k n_{k1} n_{k2}.

This is derived from the following construction:

p_e = \sum_k \hat{p}_{k12} = \sum_k \hat{p}_{k1} \hat{p}_{k2} = \sum_k \frac{n_{k1}}{N} \cdot \frac{n_{k2}}{N} = \frac{1}{N^2} \sum_k n_{k1} n_{k2},

where \hat{p}_{k12} is the estimated probability that both rater 1 and rater 2 will classify the same item as k, while \hat{p}_{k1} is the estimated probability that rater 1 will classify an item as k (and similarly for rater 2). The relation \hat{p}_{k12} = \hat{p}_{k1} \hat{p}_{k2} rests on the assumption that the ratings of the two raters are independent.
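The definition above translates directly into a few lines of code. The following Python sketch is illustrative only: the function name cohens_kappa and the example ratings are invented for this demonstration, and the snippet simply computes p_o, p_e, and κ as defined above under the same independence assumption.

    from collections import Counter

    def cohens_kappa(ratings1, ratings2):
        """Cohen's kappa for two raters labelling the same N items."""
        assert len(ratings1) == len(ratings2)
        n = len(ratings1)

        # Relative observed agreement p_o: fraction of items both raters labelled identically.
        p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n

        # Chance agreement p_e = (1/N^2) * sum_k n_k1 * n_k2,
        # assuming the two raters assign labels independently.
        counts1 = Counter(ratings1)
        counts2 = Counter(ratings2)
        p_e = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)

        return (p_o - p_e) / (1 - p_e)

    # Example: two raters classifying 10 items into categories "A" and "B".
    rater1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
    rater2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "B"]
    print(cohens_kappa(rater1, rater2))  # p_o = 0.8, p_e = 0.5, so kappa = 0.6

In practice one would typically rely on an established implementation, such as cohen_kappa_score in scikit-learn's sklearn.metrics module, rather than hand-rolling the computation.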