In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.
The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.
String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, , evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection, and semantic knowledge integration.
Levenshtein distance, or its generalization edit distance
Damerau–Levenshtein distance
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Hamming distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunters metric (SFS)
Maximal matches
Grammar-based distance
TFIDF distance metric
There also exist functions which measure a dissimilarity between strings, but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense.
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Computer environments such as educational games, interactive simulations, and web services provide large amounts of data, which can be analyzed and serve as a basis for adaptation. This course will co
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the dat
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the Tanimoto index or Tanimoto coefficient are also used in some fields.
In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
Purpose: T1 Magnetization Prepared Two Rapid Acquisition Gradient Echo (MP2RAGE) with compress sensing (CS) has been proposed as an improvement of the standard MPRAGE sequence with multiple advantages including reduced acquisition time needed to provide a ...
A hash proof system (HPS) is a form of implicit proof of membership to a language. Out of the very few existing post-quantum HPS, most are based on languages of ciphertexts of code-based or lattice-based cryptosystems and inherently suffer from a gap cause ...
Functional connectomes (FCs) containing pairwise estimations of functional couplings between pairs of brain regions are commonly represented by correlation matrices. As symmetric positive definite matrices, FCs can be transformed via tangent space projecti ...