Publication

Intégration de connaissances syntaxiques et sémantiques dans les représentations vectorielles de textes (Integration of syntactic and semantic knowledge into vector representations of texts)

Romaric Besançon
2002
EPFL thesis
Abstract

The notion of similarity between texts is fundamental to many applications of Natural Language Processing. It is particularly useful for applications that manage information in large textual databases, such as Information Retrieval and Automatic Text Structuring. Information Retrieval is the search for the documents most relevant to an information need expressed as a query, and can be implemented as a search for the documents most similar to the query. Automatic Text Structuring is often viewed as the clustering of documents according to their similarity.

The similarity between documents relies on their representation. The most widely used textual representation is the Vector Space model, in which each document is represented by a vector and the similarity between documents is computed with a measure defined on this space, for instance the cosine of the angle between the vectors representing the documents. We first present several vector space models used to compute similarities between documents. We then focus on the problem of integrating additional knowledge into the vector space representation, and on the impact of this integration on the results obtained for several tasks.

We first consider the integration of co-occurrences into the representation model, focusing on the DSIR model (Distributional Semantics based Information Retrieval). We show that this model has a probabilistic theoretical basis. We then consider the use of syntactic information to compute the co-occurrence frequencies. We also consider the integration of knowledge about compounds into the representation, taking into account morpho-syntactic and semantic variants of the compounds considered. We finally address the issue of word sense disambiguation, using synonymy relations to derive a vector space representation in which each dimension is associated with a meaning rather than with a term.

For all these methods, we propose several evaluations. We first consider a validation of the notion of similarity derived from a vector space representation, in a multilingual framework: the idea is to verify that the similarity between two documents in one language is close to the similarity between their translations in another language. We also evaluate the different models in a standard Information Retrieval evaluation framework. We finally evaluate the models on a Word Sense Disambiguation task.
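
As an illustration of the basic Vector Space model described in the abstract, the following is a minimal sketch of cosine similarity between bag-of-words vectors, used to rank documents against a query. It is not the thesis's implementation: the function names (term_vector, cosine_similarity, rank_documents), the whitespace tokenization, and the raw term-frequency weighting are all assumptions made for this example, and the sketch does not include the co-occurrence, syntactic, or word-sense extensions the thesis studies.

```python
import math
from collections import Counter


def term_vector(text):
    """Hypothetical helper: bag-of-words term-frequency vector.

    Assumes lowercasing and whitespace tokenization, which is far
    simpler than the linguistic processing a real system would use.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty text has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "vector space model for information retrieval",
        "word sense disambiguation with synonymy relations",
    ]
    for score, doc in rank_documents("vector space retrieval", docs):
        print(f"{score:.3f}  {doc}")
```

In this sketch each distinct term is one dimension of the space, so two documents that share no surface terms always get a similarity of zero, even if they are related in meaning. That limitation is the kind of problem the integration of co-occurrence and synonymy knowledge mentioned in the abstract is meant to address.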
