Summary
The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations use "similarity coefficient" or "index" in the name, as in Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending. Other names include the F1 score; Czekanowski's binary (non-quantitative) index; the measure of genetic similarity; and the Zijdenbos similarity index, referring to a 1994 paper of Zijdenbos et al.

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. DSC is the quotient of similarity and ranges from 0 to 1. It can be viewed as a similarity measure over sets.

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$s_v = \frac{2|a \cdot b|}{|a|^2 + |b|^2}$$

This gives the same outcome over binary vectors and also yields a more general similarity metric over arbitrary vectors.
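The three formulations above (sets, Boolean counts, and vectors) can be checked against each other with a small Python sketch. The function names are hypothetical, and returning 1.0 when both inputs are empty is an assumption made here, since the definition leaves the 0/0 case open.

```python
def dice_sets(x, y):
    """Dice coefficient of two collections treated as sets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        return 1.0  # assumption: two empty sets are treated as identical
    return 2 * len(x & y) / total


def dice_counts(tp, fp, fn):
    """Dice coefficient from Boolean counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 1.0  # assumption: no positives on either side counts as a perfect match
    return 2 * tp / denom


def dice_vectors(a, b):
    """Vector form: 2|a . b| / (|a|^2 + |b|^2), for equal-length numeric sequences."""
    dot = sum(p * q for p, q in zip(a, b))
    denom = sum(p * p for p in a) + sum(q * q for q in b)
    if denom == 0:
        return 1.0  # assumption: two all-zero vectors are treated as identical
    return 2 * abs(dot) / denom


# Illustrative data: X = {1, 2, 3, 4}, Y = {3, 4, 5} over the universe 1..5.
X, Y = {1, 2, 3, 4}, {3, 4, 5}
a = [1 if i in X else 0 for i in range(1, 6)]  # binary indicator vector of X
b = [1 if i in Y else 0 for i in range(1, 6)]  # binary indicator vector of Y

tp = len(X & Y)  # elements in both sets
fp = len(X - Y)  # elements only in X
fn = len(Y - X)  # elements only in Y

print(dice_sets(X, Y), dice_counts(tp, fp, fn), dice_vectors(a, b))
```

In this example the sets share 2 elements and have 4 and 3 elements respectively, so all three functions evaluate 4/7.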
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities:

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

$$s = \frac{2 n_t}{n_x + n_y}$$

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y.
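As a rough illustration of the bigram-based string measure, the Python sketch below counts distinct adjacent character pairs. The function names are hypothetical. Treating the bigrams as a set (so repeated bigrams collapse) and the handling of strings too short to have bigrams are both assumptions, not part of the definition above.

```python
def bigrams(s):
    """Return the set of adjacent character pairs in s (duplicates collapse)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigrams(x, y):
    """String similarity s = 2 * n_t / (n_x + n_y) over distinct character bigrams."""
    bx, by = bigrams(x), bigrams(y)
    denom = len(bx) + len(by)
    if denom == 0:
        # assumption: with no bigrams on either side, fall back to exact equality
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / denom


# "night" -> {ni, ig, gh, ht}; "nacht" -> {na, ac, ch, ht}; one shared bigram: ht
print(dice_bigrams("night", "nacht"))  # 2 * 1 / (4 + 4) = 0.25
```

A multiset of bigrams (for example with `collections.Counter`) would be the alternative when repeated bigrams should count more than once.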