Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine similarity between any two columns: values close to 1 represent very similar documents, while values close to 0 represent very dissimilar documents.
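As a concrete illustration of the comparison step, here is a minimal NumPy sketch of cosine similarity between document vectors; the three-dimensional vectors are invented for illustration and stand in for columns of a reduced LSA space:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document vectors in a reduced LSA space.
doc1 = np.array([0.9, 0.1, 0.0])
doc2 = np.array([0.8, 0.2, 0.1])  # similar topic mix
doc3 = np.array([0.0, 0.1, 0.9])  # different topic mix

print(cosine_similarity(doc1, doc2))  # ~0.98, very similar documents
print(cosine_similarity(doc1, doc3))  # ~0.01, very dissimilar documents
```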
An information retrieval technique using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).
LSA can use a document-term matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the weight of an element of the matrix is proportional to the number of times the term appears in each document, while rare terms are upweighted to reflect their relative importance.
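A hedged sketch of constructing such a matrix follows; scikit-learn's TfidfVectorizer is used here as one common implementation (an assumption, since no particular library is prescribed), with the result transposed to match the terms-by-documents orientation described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; any collection of raw text documents works the same way.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)
term_doc = X.T                      # transpose to (n_terms, n_docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (row labels)
print(term_doc.toarray().round(2))         # tf-idf weights, terms x documents
```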
This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used.
After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix.
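A minimal sketch of that low-rank step with NumPy on a toy count matrix; the matrix values are made up for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of X

# Each column of diag(s[:k]) @ Vt[:k, :] gives a document's coordinates
# on the k latent concepts; documents are compared by cosine in this space.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]
print(doc_coords.round(2))
```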
This course introduces the foundations of information retrieval, data mining and knowledge bases, which underpin today's Web-based distributed information systems.
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the data ...
The Human Language Technology (HLT) course introduces methods and applications for language processing and generation, using statistical learning and neural networks.
This course offers an introduction to the basic concepts of imperative programming, such as variables, expressions, control structures and functions/methods, illustrating them in the syntax ...
This course offers an introduction to the basic concepts of object-oriented programming, such as encapsulation and abstraction, classes/objects, attributes/methods, inheritance, polymorphism, ...
The purpose of this MOOC is to offer a complementary capstone project to our existing MOOCs in introduction to programming. This will offer students the possibility to both stabilize the already acquired ...
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing information supporting their meaning or describing their nature.
In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
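There are many tf–idf weighting variants; the sketch below uses one common form, tf(t, d) × log(N / df(t)), purely for illustration of how frequent-but-common words are offset:

```python
import math
from collections import Counter

# Toy tokenized corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats chase mice".split(),
]
N = len(docs)

def tf(term, doc):
    # Term frequency, normalized by document length.
    return Counter(doc)[term] / len(doc)

def idf(term):
    # Inverse document frequency: common terms get a small idf.
    df = sum(term in doc for doc in docs)  # documents containing the term
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cat", docs[0]))  # rare term (df = 1): relatively high weight
print(tfidf("the", docs[0]))  # frequent term, but df = 2 lowers its weight
```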
Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning. Gensim is implemented in Python and Cython for performance. Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.
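A minimal usage sketch, assuming Gensim is installed; LSA is exposed in Gensim as LsiModel, and the tiny corpus here is made up for illustration:

```python
from gensim import corpora, models, similarities

# Toy tokenized corpus.
texts = [
    ["human", "computer", "interaction"],
    ["graph", "minors", "trees"],
    ["human", "system", "computer"],
]

# Map tokens to ids and build a bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a 2-concept LSI (latent semantic indexing) model.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Fold a query into the latent space and rank documents by cosine similarity.
query = dictionary.doc2bow("human computer".split())
index = similarities.MatrixSimilarity(lsi[corpus])
print(sorted(enumerate(index[lsi[query]]), key=lambda x: -x[1]))
```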
In this thesis we will present and analyze randomized algorithms for numerical linear algebra problems. An important theme in this thesis is randomized low-rank approximation. In particular, we will study randomized low-rank approximation of matrix functions ...
Surrogate-based optimization is widely used for aerodynamic shape optimization, and its effectiveness depends on representative sampling of the design space. However, traditional sampling methods are hard-pressed to effectively sample high-dimensional design ...
Cardiac digital twins provide a physics- and physiology-informed framework to deliver personalized medicine. However, high-fidelity multi-scale cardiac models remain a barrier to adoption due to their extensive computational costs. Artificial Intelligence-based ...