In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T. Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance. Different types of edit distance allow different sets of string operations. For instance: The Levenshtein distance allows deletion, insertion and substitution. The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution. The Hamming distance allows only substitution, hence, it only applies to strings of the same length. The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition (swapping) of two adjacent characters. The Jaro distance allows only transposition. Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied. Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b.

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related courses (3)
COM-302: Principles of digital communications
This course is on the foundations of digital communication. The focus is on the transmission problem (rather than being on source coding).
DH-406: Machine learning for DH
This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and imple
COM-102: Advanced information, computation, communication II
Text, sound, and images are examples of information sources stored in our computers and/or communicated over the Internet. How do we measure, compress, and protect the informatin they contain?
Related lectures (27)
Error-Correcting Codes: Hamming Distance and Polynomial Multiplication
Covers error-correcting codes, Hamming distance, and polynomial multiplication in algebraic fields.
Convolutional Codes: Mapping Symbols and Paths
Covers Convolutional Codes, focusing on mapping symbols and paths.
E8 Lattice Optimality
Delves into the construction and optimality of the E8 lattice as the densest sphere packing in dimension eight.
Show more
Related publications (72)

Mapping Active Site Geometry to Activity in Immobilized Frustrated Lewis Pair Catalysts

Shubhajit Das, Rubén Laplaza Solanas, Jacob Terence Blaskovits

The immobilization of molecular catalysts imposes spatial constraints on their active site. We reveal that in bifunctional catalysis such constraints can also be utilized as an appealing handle to boost intrinsic activity through judicious control of the a ...
2022
Show more
Related concepts (6)
Longest common subsequence
A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics.
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
String metric
In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.
Show more

Graph Chatbot

Chat with Graph Search

Ask any question about EPFL courses, lectures, exercises, research, news, etc. or try the example questions below.

DISCLAIMER: The Graph Chatbot is not programmed to provide explicit or categorical answers to your questions. Rather, it transforms your questions into API requests that are distributed across the various IT services officially administered by EPFL. Its purpose is solely to collect and recommend relevant references to content that you can explore to help you answer your questions.