Similarity measure

Summary

In statistics and related fields, a similarity measure (also called a similarity function or similarity metric) is a real-valued function that quantifies the similarity between two objects. Although no single definition of similarity exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. More broadly, though, a similarity function may also satisfy the metric axioms.
Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions.
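As a rough illustration of the two functions named above, the following sketch computes cosine similarity and an RBF kernel value for plain Python sequences. The function names and the gamma parameter are choices made for this example (gamma is the usual width parameter in the form exp(-gamma * ||u - v||^2)); this is a minimal sketch, not a reference implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two real-valued vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed free parameter."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Both functions return their largest value for identical inputs (1 in each case) and smaller values as the inputs diverge, which matches the informal description of a similarity measure given above. Cosine similarity can be negative for vectors pointing in opposite directions, while the RBF kernel stays strictly positive.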
Different similarity measures exist for different types of objects, and for each type of object there are several similarity formulas to choose from.
Similarity between two data points
There are many options for measuring the similarity between two data points, some of which combine other similarity methods. Common choices include Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance. The Euclidean distance is the straight-line distance between two points, for example on a plane. The Manhattan distance sums the absolute differences along each axis, which is why it is commonly used in GPS applications to estimate the shortest route between two addresses on a street grid. Generalizing the Euclidean and Manhattan distance formulas yields the Minkowski distance formula, which can be used in a wide variety of applications.
Figures (not shown): Euclidean distance, Manhattan distance, Minkowski distance, Chebyshev distance.
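The relationship between these four distances can be sketched in a few lines: the Minkowski distance with order p = 1 is the Manhattan distance, p = 2 is the Euclidean distance, and the limit p = infinity is the Chebyshev distance. The function names below are chosen for this example. The conversion 1 / (1 + d) at the end is only one common convention for turning a distance into a similarity score and is an assumption of this sketch, not something prescribed by the text.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        # Limit case: the largest coordinate difference (Chebyshev distance).
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def manhattan_distance(x, y):
    return minkowski_distance(x, y, 1)


def euclidean_distance(x, y):
    return minkowski_distance(x, y, 2)


def chebyshev_distance(x, y):
    return minkowski_distance(x, y, math.inf)


def distance_to_similarity(d):
    """Assumed convention: map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7, the Euclidean distance is 5, and the Chebyshev distance is 4, so the measured distance shrinks as p grows.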
Similarity between strings
For comparing strings, various measures of string similarity can be used. These include edit distances such as the Levenshtein distance, as well as the Hamming distance and the Jaro distance.
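Two of the string measures listed above are short enough to sketch directly. The Hamming distance counts the positions at which two equal-length strings differ, and the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function names are chosen for this example, and the Levenshtein version uses the standard row-by-row dynamic programming formulation.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            substitution_cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,                       # delete a from s
                current[j - 1] + 1,                    # insert b into s
                previous[j - 1] + substitution_cost,   # substitute or match
            ))
        previous = current
    return previous[-1]
```

Both functions return distances, so smaller values mean more similar strings. Only one row of the dynamic programming table is kept at a time, which keeps memory proportional to the length of the second string.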



String metric

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.

Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the Tanimoto index or Tanimoto coefficient are also used in some fields.
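For two finite sets, the Jaccard index is the size of their intersection divided by the size of their union. The sketch below assumes that definition. The function name and the choice to return 1.0 when both sets are empty are conventions adopted for this example, since the ratio is otherwise undefined in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)
```

For example, the sets {1, 2, 3} and {2, 3, 4} share two of four distinct elements, giving an index of 0.5. Subtracting the index from 1 gives the corresponding Jaccard distance.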

Similarity learning

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification. There are four common setups for similarity and metric distance learning. Regression similarity learning: in this setup, pairs of objects are given together with a measure of their similarity.

CS-423: Distributed information systems

This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.

MGT-529: Data science and machine learning II

This class discusses advanced data science and machine learning (ML) topics: Recommender Systems, Graph Analytics, and Deep Learning, Big Data, Data Clouds, APIs, Clustering. The course uses the Wol

Recommender Systems: MovieLens Dataset

Covers implementing recommender systems using the MovieLens dataset and evaluating them with RMSE and MAE metrics.

Advanced Structure Discovery: Distance Metrics and Time Series Data

Explores clustering algorithms, distance metrics, and time series data analysis techniques.

Data Summarization: Minhashing and Locality-Sensitive Hashing

Explores Jaccard similarity, minhashing, and locality-sensitive hashing for data summarization.

Tobias Kober, Tom Hilbert, Gian Franco Piredda

Purpose: T1 Magnetization Prepared Two Rapid Acquisition Gradient Echo (MP2RAGE) with compressed sensing (CS) has been proposed as an improvement of the standard MPRAGE sequence with multiple advantages including reduced acquisition time needed to provide a ...

Analysis of single-cell datasets generated from diverse organisms offers unprecedented opportunities to unravel fundamental evolutionary processes of conservation and diversification of cell types. However, interspecies genomic differences limit the joint ...

Friedrich Eisenbrand, Puck Elisabeth van Gerwen, Raimon Fabregat I De Aguilar-Amat

Supervised and unsupervised kernel-based algorithms widely used in the physical sciences depend upon the notion of similarity. Their reliance on pre-defined distance metrics (e.g. the Euclidean or Manhattan distance) is problematic especially when used in c ...