Summary
The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples. It was developed independently by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations replace "coefficient" with "similarity coefficient" or "index", as in Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the -sen ending. Other names include the F1 score, Czekanowski's binary (non-quantitative) index, the measure of genetic similarity, and the Zijdenbos similarity index, the last referring to a 1994 paper by Zijdenbos et al.

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index therefore equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

When applied to Boolean data, using the definitions of true positive (TP), false positive (FP) and false negative (FN), it can be written as

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. DSC is the quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets.

As with the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$s_v = \frac{2\,|a \cdot b|}{|a|^2 + |b|^2}$$

This gives the same outcome over binary vectors and also yields a more general similarity measure over arbitrary vectors.
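The three formulations above (sets, Boolean counts, binary vectors) can be checked against one another with a short sketch. The function names `dice_sets`, `dice_counts` and `dice_vectors` are illustrative and not taken from any library, and returning 1.0 when both inputs are empty is an assumption, since the formula is undefined in that case.

```python
def dice_sets(x, y):
    """Dice coefficient of two collections, treated as sets."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(x & y) / total


def dice_counts(tp, fp, fn):
    """Dice coefficient from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 1.0  # assumption: nothing to compare counts as identical
    return 2 * tp / denom


def dice_vectors(a, b):
    """Vector form: 2|a.b| / (|a|^2 + |b|^2), for equal-length numeric sequences."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = sum(p * p for p in a) + sum(q * q for q in b)
    if norm == 0:
        return 1.0  # assumption: two zero vectors count as identical
    return 2 * abs(dot) / norm


# X = {1, 2, 3, 4} and Y = {3, 4, 5} share two elements: 2*2 / (4 + 3) = 4/7.
X, Y = {1, 2, 3, 4}, {3, 4, 5}
print(dice_sets(X, Y))

# Treating X as the reference and Y as the prediction: TP=2, FP=1, FN=2.
print(dice_counts(tp=2, fp=1, fn=2))

# Indicator vectors of X and Y over the universe 1..5.
print(dice_vectors([1, 1, 1, 1, 0], [0, 0, 1, 1, 1]))
```

All three calls evaluate the same ratio, 4/7, which illustrates that the formulations agree on binary data.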
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (the intersection) over the sum of the cardinalities:

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

$$s = \frac{2\,n_t}{n_x + n_y}$$

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y.
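The bigram formulation can be sketched as follows. The helper names `bigrams` and `dice_bigrams` are illustrative. Two choices here are assumptions rather than part of the definition above: repeated bigrams are counted as a multiset, and strings too short to contain any bigram are compared by plain equality.

```python
from collections import Counter


def bigrams(s):
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_bigrams(x, y):
    """Bigram-based Dice similarity: 2*n_t / (n_x + n_y)."""
    bx, by = bigrams(x), bigrams(y)
    nx, ny = sum(bx.values()), sum(by.values())
    if nx + ny == 0:
        # assumption: with no bigrams on either side, fall back to equality
        return 1.0 if x == y else 0.0
    nt = sum((bx & by).values())  # bigrams found in both strings
    return 2 * nt / (nx + ny)


# "night" -> ni, ig, gh, ht; "nacht" -> na, ac, ch, ht; only "ht" is shared.
print(dice_bigrams("night", "nacht"))  # 2*1 / (4 + 4) = 0.25
```

Counting bigrams as a plain set instead of a multiset would give the same result for this pair, but can differ for strings with repeated bigrams.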