The Sørensen–Dice coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending. Other names include: F1 score; Czekanowski's binary (non-quantitative) index; measure of genetic similarity; and Zijdenbos similarity index, referring to a 1994 paper of Zijdenbos et al.

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. DSC is the quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets.

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$s_v = \frac{2\,|a \cdot b|}{|a|^2 + |b|^2}$$

which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general.
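As an illustration of the three formulations described above (sets, Boolean counts, and vectors), here is a minimal Python sketch. The function names `dice_sets`, `dice_counts` and `dice_vectors` are hypothetical, and returning 1.0 when both inputs are empty is an assumption made here, since the text does not define that case.

```python
from typing import Hashable, Iterable, Sequence


def dice_sets(x: Iterable[Hashable], y: Iterable[Hashable]) -> float:
    """Twice the number of shared elements over the sum of the set sizes."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        return 1.0  # assumption: two empty sets are treated as identical
    return 2 * len(x & y) / total


def dice_counts(tp: int, fp: int, fn: int) -> float:
    """Boolean form: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 1.0  # assumption: nothing to compare counts as a perfect match
    return 2 * tp / denom


def dice_vectors(a: Sequence[float], b: Sequence[float]) -> float:
    """Vector form: 2*|a.b| / (|a|^2 + |b|^2)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm = sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    if norm == 0:
        return 1.0  # assumption: two zero vectors are treated as identical
    return 2 * abs(dot) / norm


X = {1, 2, 3, 4}
Y = {3, 4, 5, 6}
# Binary indicator vectors for X and Y over the universe 1..6
a = [1, 1, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1, 1]

print(dice_sets(X, Y))       # set form
print(dice_counts(2, 2, 2))  # TP = |X ∩ Y|, FP = |Y - X|, FN = |X - Y|
print(dice_vectors(a, b))    # vector form on the indicator vectors
```

For this example all three calls give 0.5: the sets share 2 elements and have 4 elements each, so the coefficient is 2·2 / (4 + 4).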
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities:

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

$$s = \frac{2\,n_t}{n_x + n_y}$$

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y.
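The bigram-based string similarity described above can be sketched in Python as follows. The names `bigrams` and `dice_bigrams` are hypothetical. Two choices here are assumptions not fixed by the text: repeated bigrams are counted with multiplicity (a multiset intersection), and strings too short to contain any bigram are scored 1.0 if equal and 0.0 otherwise.

```python
from collections import Counter
from typing import List


def bigrams(s: str) -> List[str]:
    """All overlapping two-character substrings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_bigrams(x: str, y: str) -> float:
    """Dice coefficient 2*n_t / (n_x + n_y) over character bigrams."""
    bx, by = Counter(bigrams(x)), Counter(bigrams(y))
    n_x, n_y = sum(bx.values()), sum(by.values())
    if n_x + n_y == 0:
        # assumption: no bigrams at all, fall back to exact comparison
        return 1.0 if x == y else 0.0
    # assumption: shared bigrams are counted with multiplicity
    n_t = sum((bx & by).values())
    return 2 * n_t / (n_x + n_y)


print(dice_bigrams("night", "nacht"))
```

For "night" and "nacht" the bigrams are ni, ig, gh, ht and na, ac, ch, ht. Only ht is shared, so n_t = 1 and n_x = n_y = 4, which gives 2·1 / (4 + 4) = 0.25.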