The notion of similarity between texts is fundamental for many applications of Natural Language Processing. For example, it is particularly useful for applications designed to manage information in large textual databases, such as Information Retrieval or Automatic Text Structuring. Information Retrieval is the search for the documents most relevant to an information need expressed by a query, and can be implemented as a search for the documents most similar to the query. Automatic Text Structuring is often viewed as the clustering of documents according to their similarities. The similarity between documents relies on their representation. The most widely used textual representation is the Vector Space model, in which each document is represented by a vector; the similarity between documents is then computed by a measure in this space, for instance the cosine of the vectors representing the documents.

We first present several vector space models used for the computation of similarities between documents. Then, we focus on the problem of integrating additional knowledge into the vector space representation, and the impact of this integration on the results obtained for several tasks. We first consider the integration of co-occurrences into the representation model, focusing on the DSIR model (Distributional Semantics based Information Retrieval). We show that this model has a probabilistic theoretical basis. We then consider the use of syntactic information to compute the co-occurrence frequencies. We also consider the integration of knowledge about compounds into the representation, taking into account morpho-syntactic and semantic variants of the considered compounds. We finally address the issue of word sense disambiguation, using synonymy relations to derive a vector space representation in which each dimension is associated with a meaning rather than a term.

For all these methods, we propose several evaluations: we first consider a validation of the notion of similarity derived from a vector space representation in a multi-lingual framework: the idea is to verify that the similarity between two documents in one language is close to the similarity between their translations in another language. We also propose an evaluation of the different models in a standard Information Retrieval evaluation framework. We finally consider the evaluation of the models on a Word Sense Disambiguation task.
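To make the baseline Vector Space setting concrete, below is a minimal sketch of representing documents as term vectors and ranking them against a query by cosine similarity. It uses plain term frequencies for illustration only; the models discussed in the abstract (DSIR, co-occurrence-enriched or sense-based representations) use richer weightings, and the function names (`vectorize`, `cosine`) and the toy documents are purely hypothetical.

```python
import math
from collections import Counter

def vectorize(tokens):
    """Represent a document as a sparse term-frequency vector (illustrative weighting)."""
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy usage: the query is treated as a short document and the collection
# is ranked by decreasing similarity to it, as in basic Information Retrieval.
docs = {
    "d1": "information retrieval searches the most relevant documents".split(),
    "d2": "text structuring clusters documents according to their similarity".split(),
}
query = vectorize("retrieval of relevant documents".split())
ranking = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
print(ranking)
```

The same similarity measure supports the clustering view of Automatic Text Structuring mentioned above; what changes across the models studied in the thesis is how the document vectors themselves are built.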