Concept: Similarity measure

Summary

In statistics and related fields, a similarity measure (also called a similarity function or similarity metric) is a real-valued function that quantifies the similarity between two objects. Although no single definition of similarity exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. More broadly, however, a similarity function may also satisfy metric axioms.
Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions.
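As a minimal sketch of the two functions named above, the following computes cosine similarity and an RBF kernel value for plain Python sequences. The function names and the `gamma` parameter name are illustrative choices, not taken from any particular library; the RBF form assumed here is exp(-gamma * ||u - v||^2).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real-valued vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel value, assuming the form exp(-gamma * ||u - v||^2)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Both functions behave like similarities in the sense described above: parallel vectors give a cosine similarity close to 1, orthogonal vectors give 0, and the RBF kernel equals 1 for identical points and decays towards 0 as the points move apart.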
Different similarity measures exist for different types of objects, and for each type of object there are several similarity formulas to choose from.
Similarity between two data points
There are many ways to measure the similarity between two data points, some of which combine other similarity methods. Common choices include Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance. The Euclidean distance is the straight-line distance between two points on a plane. Manhattan distance sums the absolute differences along each coordinate; it is commonly used in GPS applications, as it can be used to find the shortest route between two addresses on a grid-like street layout. Generalizing the Euclidean and Manhattan distance formulas yields the Minkowski distance formula, which can be used in a wide variety of applications. Chebyshev distance is the largest absolute difference along any single coordinate.
Euclidean distance
Manhattan distance
Minkowski distance
Chebyshev distance
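The four distances listed above can be sketched in a few lines of Python. This is an illustrative implementation for equal-length sequences of numbers, with hypothetical function names; it shows how Manhattan and Euclidean distance fall out of the Minkowski formula for p = 1 and p = 2, and computes Chebyshev distance directly as the maximum coordinate difference.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1) between two points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan_distance(x, y):
    """Minkowski distance with p = 1: sum of absolute differences."""
    return minkowski_distance(x, y, 1)


def euclidean_distance(x, y):
    """Minkowski distance with p = 2: straight-line distance."""
    return minkowski_distance(x, y, 2)


def chebyshev_distance(x, y):
    """Largest absolute difference along any single coordinate."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return max(abs(a - b) for a, b in zip(x, y))
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7, the Euclidean distance is 5, and the Chebyshev distance is 4.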
Similarity between strings
Several measures of string similarity can be used to compare strings. These include edit distance (of which Levenshtein distance is the best-known variant), Hamming distance, and Jaro distance.
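Two of the string measures mentioned above are sketched below: Levenshtein distance, computed with the standard dynamic-programming recurrence over insertions, deletions, and substitutions, and Hamming distance, which counts differing positions in equal-length strings. The function names are illustrative, and this is a plain reference sketch rather than an optimized implementation.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))
```

For example, `levenshtein_distance("kitten", "sitting")` is 3 and `hamming_distance("karolin", "kathrin")` is 3. Hamming distance is only defined for strings of the same length, which is why the sketch raises an error otherwise.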

Related courses (2)

Related publications (3)

CS-423: Distributed information systems

This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.

MGT-529: Data science and machine learning II

This class discusses advanced data science and machine learning (ML) topics: Recommender Systems, Graph Analytics, and Deep Learning, Big Data, Data Clouds, APIs, Clustering. The course uses the Wol

Related concepts (15)

Related lectures (28)

Recommender Systems: MovieLens Dataset (CS-423: Distributed information systems)

Covers implementing recommender systems using the MovieLens dataset and evaluating them with RMSE and MAE metrics.

Advanced Structure Discovery: Distance Metrics and Time Series Data (CS-421: Machine learning for behavioral data)

Explores clustering algorithms, distance metrics, and time series data analysis techniques.

Data Summarization: Minhashing and Locality-Sensitive Hashing (CS-422: Database systems)

Explores Jaccard similarity, minhashing, and locality-sensitive hashing for data summarization.

String metric

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.

Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and is now frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, who originally gave it the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the names Tanimoto index and Tanimoto coefficient are also used in some fields.
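For finite sets, the Jaccard index is the size of the intersection divided by the size of the union. The sketch below uses a hypothetical function name and assumes one common convention for the edge case of two empty sets, returning 1.0; other conventions exist.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two finite collections,
    treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)
```

For example, `jaccard_index({1, 2, 3}, {2, 3, 4})` is 0.5, since the sets share two of four distinct elements. The corresponding Jaccard distance is one minus this value.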

Microsoft Kinect, Google's Project Tango and Lytro's light field camera are all examples of 3D depth sensing reaching the consumer market. As this technology becomes more widespread, new signal processing techniques are needed to exploit this data. Recent techniques in 3D shape similarity assessment present exciting opportunities to develop new algorithms in this area. The aim of this project is to create a tutorial on these techniques using an IPython notebook. Particular emphasis will be placed on an excellent review paper [1] and book [2]. Seneca, the Roman philosopher, said “While we teach, we learn,” and creating a tutorial on these topics is an excellent exercise to understand a complex subject that is at the heart of many modern computer vision techniques. [1] S. Biasotti et al., "Recent Trends, Applications, and Perspectives in 3D Shape Similarity Assessment", 2015. [2] A. Bronstein, M. Bronstein, R. Kimmel, "Numerical Geometry of Non-Rigid Shapes", 2008.

2016

Bernard Moret, Shachi Shailesh Deshpande

Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful, but also most complex, models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases the computational costs are the same as in the general family-free case, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that differ in expressive power. We give polynomial-time algorithms for two of them, while we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, proving the applicability and performance of our newly defined similarity measures.

Pascal Frossard, Elif Vural, Ömer Sercan Arik

Efficient solutions for the classification of multi-view images can be built on graph-based algorithms when little information is known about the scene or cameras. Such methods typically require a pairwise similarity measure between images, where a common choice is the Euclidean distance. However, the accuracy of the Euclidean distance as a similarity measure is restricted to cases where images are captured from nearby viewpoints. In settings with large transformations and viewpoint changes, alignment of images is necessary prior to distance computation. We propose a method for the registration of uncalibrated images that capture the same 3D scene or object. We model the depth map of the scene as an algebraic surface, which yields a warp model in the form of a rational function between image pairs. The warp model is computed by minimizing the registration error, where the registered image is a weighted combination of two images generated with two different warp functions estimated from feature matches and image intensity functions in order to provide robust registration. We demonstrate the flexibility of our alignment method by experimentation on several wide-baseline image pairs with arbitrary scene geometries and texture levels. Moreover, the results on multi-view image classification suggest that the proposed alignment method can be effectively used in graph-based classification algorithms for the computation of pairwise distances where it achieves significant improvements over distance computation without prior alignment.

2011