The notion of similarity between texts is fundamental to many applications of Natural Language Processing. It is particularly useful for applications that manage information in large textual databases, such as Information Retrieval or Automatic Text Structuring. Information Retrieval is the search for the documents most relevant to an information need expressed by a query, and can be implemented as a search for the documents most similar to the query. Automatic Text Structuring is often viewed as the clustering of documents according to their similarity. The similarity between documents relies on their representation. The most widely used textual representation is the Vector Space model, in which each document is represented by a vector, and the similarity between documents is then computed by a similarity measure in this space, for instance the cosine of the angle between the vectors representing the documents. We first present several vector space models used for the computation of similarities between documents. Then, we focus on the problem of integrating additional knowledge into the vector space representation, and on the impact of this integration on the results obtained for several tasks. We first consider the integration of co-occurrences in the representation model, and we focus on the DSIR model (Distributional Semantics based Information Retrieval). We show that this model has a probabilistic theoretical basis. We then consider the use of syntactic information to compute the co-occurrence frequencies. We also consider the integration of knowledge about compounds in the representation, taking into account morpho-syntactic and semantic variants of the compounds considered. We finally address the issue of word sense disambiguation, using synonymy relations to derive a vector space representation in which each dimension is associated with a meaning rather than with a term.
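As an illustration of the basic Vector Space model described above, the following is a minimal sketch, assuming raw term frequencies as vector components and whitespace tokenization. The function names `tf_vector`, `cosine` and `rank` are hypothetical and chosen for this example only; the sketch does not include the co-occurrence, compound or word-sense extensions studied in this work.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector, stored sparsely as a Counter.

    Assumption: lowercasing and whitespace splitting stand in for a real
    tokenizer; no stemming, stop-word removal or weighting is applied.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (similarity, document index) pairs, most similar first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), i) for i, d in enumerate(documents)]
    return sorted(scored, reverse=True)
```

Calling `rank("information retrieval", docs)` on a small list of strings returns the documents ordered by decreasing cosine with the query, which is the retrieval-as-similarity view mentioned above.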
For all these methods, we propose several evaluations. We first consider a validation of the notion of similarity derived from a vector space representation in a multilingual framework: the idea is to verify that the similarity between two documents in one language is close to the similarity between their translations in another language. We also propose an evaluation of the different models in a standard Information Retrieval evaluation framework. We finally consider the evaluation of the models on a Word Sense Disambiguation task.
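The multilingual validation idea can be sketched as follows: compute all pairwise document similarities in each language, then measure how well the two lists of similarities agree. This is only a sketch under stated assumptions. The use of a Pearson correlation as the agreement statistic is an assumption made here (the text does not say which statistic is used), and the names `pairwise_similarities`, `pearson` and `cross_lingual_agreement` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Term-frequency vector (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(docs):
    """Similarities for every unordered document pair, in a fixed order."""
    vecs = [tf_vector(d) for d in docs]
    return [cosine(vecs[i], vecs[j])
            for i, j in combinations(range(len(vecs)), 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def cross_lingual_agreement(docs_a, docs_b):
    """Agreement between similarity structures of aligned corpora.

    docs_a[i] and docs_b[i] are assumed to be translations of each other.
    """
    return pearson(pairwise_similarities(docs_a),
                   pairwise_similarities(docs_b))
```

A value close to 1 would indicate that document pairs judged similar in one language are also judged similar in the other, which is the property the validation is meant to check.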