Summary
In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures the distance ("inverse similarity") between two text strings, for use in approximate string matching, comparison, and fuzzy string searching. Unlike string matching in general, a string metric must fulfill the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered close. A string metric returns a number that gives an algorithm-specific indication of distance. The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It takes two input strings and returns the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other. Beyond simple metrics such as Levenshtein distance, the field has expanded to include phonetic, token, grammatical, and character-based methods of statistical comparison. String metrics are used heavily in information integration and are currently applied in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection, and semantic knowledge integration.
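The Levenshtein distance described above can be computed with a standard dynamic-programming recurrence. The following is a minimal sketch, not a reference implementation: the function name `levenshtein` is chosen here for illustration, and every insertion, deletion, and substitution is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b` (unit costs assumed)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # current[0] = distance between a[:i] and the empty string.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]


print(levenshtein("Sam", "Samuel"))      # 3 (three insertions: "u", "e", "l")
print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the shorter string rather than with the product of both lengths.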
Examples of string metrics:
Levenshtein distance, or its generalization edit distance
Damerau–Levenshtein distance
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Hamming distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunter metric (SFS)
Maximal matches
Grammar-based distance
TF-IDF distance metric
There also exist functions that measure a dissimilarity between strings but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense.
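The difference between a true metric and a mere dissimilarity function can be checked on a small example. The sketch below makes several assumptions that are not in the text above: strings are compared as sets of their characters (real systems often use tokens or n-grams instead), and the names `jaccard_distance`, `dice_dissimilarity`, and `satisfies_triangle` are hypothetical helpers. It shows that one minus the Jaccard coefficient passes the triangle inequality on the chosen triple, while one minus the Sørensen–Dice coefficient fails it.

```python
from typing import Callable


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def dice_dissimilarity(a: str, b: str) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|) over the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def satisfies_triangle(
    d: Callable[[str, str], float], x: str, y: str, z: str
) -> bool:
    """Check d(x, z) <= d(x, y) + d(y, z), with a small float tolerance."""
    return d(x, z) <= d(x, y) + d(y, z) + 1e-12


# Going from "a" to "b" via "ab":
print(satisfies_triangle(jaccard_distance, "a", "ab", "b"))    # True:  1.0 <= 0.5 + 0.5
print(satisfies_triangle(dice_dissimilarity, "a", "ab", "b"))  # False: 1.0 >  1/3 + 1/3
```

A single passing triple does not prove that a function is a metric; it is the failing triple that settles the question for the Dice-based dissimilarity.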