In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, inter-coder reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.
Assessment tools that rely on ratings must exhibit good inter-rater reliability; otherwise, they are not valid tests.
There are a number of statistics that can be used to determine inter-rater reliability, and different statistics are appropriate for different types of measurement. Some options are the joint probability of agreement; chance-corrected measures such as Cohen's kappa, Scott's pi, and Fleiss' kappa; inter-rater correlation; the concordance correlation coefficient; the intra-class correlation; and Krippendorff's alpha.
There are several operational definitions of "inter-rater reliability," reflecting different viewpoints about what constitutes reliable agreement between raters. There are three operational definitions of agreement:
Reliable raters agree with the "official" rating of a performance.
Reliable raters agree with each other about the exact ratings to be awarded.
Reliable raters agree about which performance is better and which is worse.
These combine with two operational definitions of behavior:
Reliable raters are automatons, behaving like "rating machines". This category includes rating of essays by computer, and such behavior can be evaluated by generalizability theory.
Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly, and this behavior can be evaluated by the Rasch model.
The joint probability of agreement is the simplest and least robust measure. It is estimated as the percentage of the time the raters agree in a nominal or categorical rating system. It does not take into account that agreement may happen solely by chance. There is some question whether there is a need to 'correct' for chance agreement; some suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.
When the number of categories being used is small (e.g., 2 or 3), the likelihood of two raters agreeing by pure chance increases dramatically.
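As an illustration, here is a minimal Python sketch of the joint probability of agreement (percent agreement) for two raters; the function name and rating lists are hypothetical. Note that with only two categories, two raters guessing uniformly at random would already agree about 50% of the time, which is why chance correction is often discussed.

```python
def percent_agreement(ratings_a, ratings_b):
    """Fraction of items on which two raters assign the same category."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must rate the same items.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Hypothetical ratings of ten items into two categories.
rater_1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
print(percent_agreement(rater_1, rater_2))  # 0.8
```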
Scott's pi (named after William A. Scott) is a statistic for measuring inter-rater reliability for nominal data in communication studies. Textual entities are annotated with categories by different annotators, and various measures, Scott's pi among them, are used to assess the extent of agreement between the annotators. Because automatic text annotation is a popular problem in natural language processing, and the goal is for the program being developed to agree with human annotations, assessing the extent to which humans agree with each other is important for establishing a reasonable upper limit on computer performance.
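A minimal sketch of Scott's pi for two raters over nominal categories follows; the function name and inputs are illustrative. The defining feature is that expected agreement is computed from category proportions pooled over both raters.

```python
from collections import Counter

def scotts_pi(ratings_a, ratings_b):
    """Scott's pi: (p_o - p_e) / (1 - p_e), with p_e from pooled proportions."""
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement: squared category proportions, pooled over both raters.
    pooled = Counter(ratings_a) + Counter(ratings_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)
```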
In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures, it operates on data structured as groups rather than data structured as paired observations.
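The ICC comes in several variants; the sketch below, with illustrative names and data, computes the one-way random-effects version (often written ICC(1,1)) for equal-sized groups from the between-group and within-group mean squares of a one-way ANOVA.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for a list of equal-sized groups."""
    g = len(groups)       # number of groups
    k = len(groups[0])    # measurements per group
    grand_mean = sum(sum(grp) for grp in groups) / (g * k)
    group_means = [sum(grp) / k for grp in groups]
    # Between-group and within-group mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (g - 1)
    msw = sum((x - m) ** 2
              for grp, m in zip(groups, group_means)
              for x in grp) / (g * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical data: four subjects each measured by three raters.
print(icc_oneway([[9, 8, 9], [2, 3, 2], [5, 5, 6], [8, 7, 8]]))
```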
Cohen's kappa coefficient (κ, lowercase Greek kappa) is a statistic used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent-agreement calculation, as κ takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen's kappa due to the difficulty in interpreting indices of agreement.
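A minimal sketch of Cohen's kappa for two raters, with illustrative names: unlike Scott's pi above, the expected agreement here uses each rater's own marginal distribution rather than pooled proportions.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), with per-rater marginals."""
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the raters' marginal probabilities per category.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)
```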