Summary
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the Tanimoto index or Tanimoto coefficient are also used in some fields. However, they are identical in generally taking the ratio of Intersection over Union. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: Note that by design, If A intersection B is empty, then J(A,B) = 0. The Jaccard coefficient is widely used in computer science, ecology, genomics, and other sciences, where binary or binarized data are used. Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard coefficient. Jaccard similarity also applies to bags, i.e., Multisets. This has a similar formula, but the symbols mean bag intersection and bag sum (not union). The maximum value is 1/2. The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union: An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union. Jaccard distance is commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets. This distance is a metric on the collection of all finite sets. There is also a version of the Jaccard distance for measures, including probability measures.
About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related courses (4)
CS-423: Distributed information systems
This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.
MATH-448: Statistical analysis of network data
A first course in statistical network analysis and applications.
CS-401: Applied data analysis
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the dat
Show more
Related lectures (32)
Graph metrics: Statistical analysis
Explores graph metrics and statistical analysis in network clustering, including ERGMs application in sociology and asymptotics.
Data Summarization: Minhashing and Locality-Sensitive Hashing
Explores Jaccard similarity, minhashing, and locality-sensitive hashing for data summarization.
Privacy-preserving Data Management: Operations and Protocols
Explores privacy-preserving data management operations and summarization techniques for sensitive data protection.
Show more
Related publications (44)
Related concepts (6)
String metric
In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.
Similarity measure
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.
Sørensen–Dice coefficient
The Sørensen–Dice coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively. The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC).
Show more