In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
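To make the bucketing idea concrete, here is a minimal Python sketch (not a reference implementation) of one classical LSH family: MinHash for sets under Jaccard similarity. The toy word sets and the use of Python's salted built-in hash as a stand-in for random hash functions are illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions, not a production implementation) of
# MinHash, an LSH family for sets under Jaccard similarity. For a random hash
# function, Pr[minhash(A) == minhash(B)] is roughly the Jaccard similarity of A
# and B, so similar sets land in the same bucket far more often than dissimilar ones.

def minhash(items, seed):
    """Bucket id: the minimum salted hash over the set's elements."""
    return min(hash((seed, x)) for x in items)

# Hypothetical toy data: two similar word sets and one unrelated set.
doc_a = {"the", "quick", "brown", "fox", "jumps"}
doc_b = {"the", "quick", "brown", "fox", "leaps"}      # Jaccard(a, b) = 4/6
doc_c = {"completely", "different", "words", "here"}   # Jaccard(a, c) = 0

trials = 10_000
same_ab = sum(minhash(doc_a, s) == minhash(doc_b, s) for s in range(trials))
same_ac = sum(minhash(doc_a, s) == minhash(doc_c, s) for s in range(trials))
print(f"a,b collide in {same_ab / trials:.2f} of trials (expected roughly 0.67)")
print(f"a,c collide in {same_ac / trials:.2f} of trials (expected roughly 0.00)")
```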
Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).
Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion.
An LSH family $\mathcal{F}$ is defined for a metric space $\mathcal{M} = (M, d)$, a threshold $r > 0$, an approximation factor $c > 1$, and probabilities $p_1$ and $p_2$. This family is a set of functions $h : M \to S$ that map elements of the metric space to buckets $s \in S$. An LSH family must satisfy the following conditions for any two points $p, q \in M$ and any hash function $h$ chosen uniformly at random from $\mathcal{F}$:
if $d(p, q) \le r$, then $h(p) = h(q)$ (i.e., $p$ and $q$ collide) with probability at least $p_1$,
if $d(p, q) \ge cr$, then $h(p) = h(q)$ with probability at most $p_2$.
A family is interesting when $p_1 > p_2$. Such a family is called $(r, cr, p_1, p_2)$-sensitive.
Alternatively, it can be defined with respect to a universe of items $U$ that have a similarity function $\phi : U \times U \to [0, 1]$. An LSH scheme is a family of hash functions $H$ coupled with a probability distribution $D$ over the functions such that a function $h \in H$ chosen according to $D$ satisfies $\Pr_{h}[h(a) = h(b)] = \phi(a, b)$ for any $a, b \in U$.
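As a concrete instance of this similarity-based definition, the sketch below uses the random-hyperplane family for angular (cosine) similarity, where a single hash collides with probability $1 - \theta(a, b)/\pi$. The specific vectors and trial count are arbitrary choices for illustration.

```python
# A rough sketch (assumptions: real-valued vectors, angular similarity) of the
# random-hyperplane LSH family. A hash function is a random Gaussian vector r,
# and h_r(x) = sign(<r, x>); two points collide with probability
# 1 - theta(x, y) / pi, which plays the role of the similarity function phi.
import math
import random

def random_hyperplane_hash(dim, seed):
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]           # random normal vector
    return lambda x: sum(ri * xi for ri, xi in zip(r, x)) >= 0.0  # side of the hyperplane

def angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

x = [1.0, 0.2, 0.0]
y = [0.9, 0.4, 0.1]   # close to x in angle
trials = 20_000
collisions = sum(random_hyperplane_hash(3, s)(x) == random_hyperplane_hash(3, s)(y)
                 for s in range(trials))
print(f"empirical collision rate: {collisions / trials:.3f}")
print(f"1 - angle/pi:             {1 - angle(x, y) / math.pi:.3f}")
```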
A first graduate course in algorithms, this course assumes minimal background, but moves rapidly. The objective is to learn the main techniques of algorithm analysis and design, while building a repertoire ...
This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students ...
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest-neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q. Donald Knuth, in vol. 3 of The Art of Computer Programming (1973), called it the post-office problem, referring to an application of assigning to a residence the nearest post office.
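The formal statement above translates directly into a linear-scan baseline. The following minimal sketch assumes Euclidean distance as the dissimilarity function; structures such as LSH exist precisely to avoid this exhaustive O(|S|) scan on large, high-dimensional sets.

```python
# A brute-force sketch of the NNS problem statement: scan the whole set S and
# return the point minimizing the dissimilarity to the query q.
# (Assumption: Euclidean distance as the dissimilarity function.)
import math

def nearest_neighbor(S, q):
    """Return the point in S closest to q under Euclidean distance (linear scan)."""
    return min(S, key=lambda p: math.dist(p, q))

S = [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0), (-1.0, 3.0)]
q = (0.9, 1.2)
print(nearest_neighbor(S, q))  # (1.0, 1.0)
```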
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable (hard to control or deal with).
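One simple way to carry out such a transformation, in the spirit of the Johnson-Lindenstrauss lemma mentioned below, is a random Gaussian projection, which approximately preserves pairwise distances. The dimensions and sample data in this sketch are arbitrary illustrative choices.

```python
# A hedged sketch of random Gaussian projection as a dimensionality-reduction step.
# Entries of the projection matrix are drawn N(0, 1/k), so squared norms (and hence
# pairwise distances) are preserved in expectation.
import math
import random

def random_projection(X, k, seed=0):
    """Project each d-dimensional row of X down to k dimensions."""
    d = len(X[0])
    rng = random.Random(seed)
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r[j] * x[j] for j in range(d)) for r in R] for x in X]

X = [[random.random() for _ in range(1000)] for _ in range(3)]  # 3 points in R^1000
Y = random_projection(X, k=200)

orig = math.dist(X[0], X[1])
proj = math.dist(Y[0], Y[1])
print(f"original distance {orig:.3f}, projected distance {proj:.3f}")  # roughly equal
```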
Covers the Nearest Neighbor search algorithm and the Johnson-Lindenstrauss lemma for dimensionality reduction, exploring preprocessing techniques and locality-sensitive hashing.
Operators from various industries have been pushing the adoption of wireless sensing nodes for industrial monitoring, and such efforts have produced sizeable condition monitoring datasets that can be used to build diagnosis algorithms capable of warning ma ...
In this thesis we will present and analyze randomized algorithms for numerical linear algebra problems. An important theme in this thesis is randomized low-rank approximation. In particular, we will study randomized low-rank approximation of matrix functio ...
EPFL, 2024
As it has become easier and cheaper to collect big datasets in the last few decades, designing efficient and low-cost algorithms for these datasets has attracted unprecedented attention. However, in most applications, even storing datasets as acquired has ...