Selection and aggregation of ranking criteria have become an important topic in information retrieval as search becomes more specialized and the volume of electronically available information grows. In this context, document ranking has shifted from purely content-based ranking criteria to combined ranking schemes that integrate additional features, such as document popularity or impact, or link-based techniques that were widely applied after the success of the PageRank algorithm. In this thesis we experiment with selection and aggregation of ranking criteria, aiming to increase the performance of a specialized scientific search engine. Building on theoretical foundations from several research fields, including social choice theory, information retrieval and digital libraries, we focus on ranking in databases of scientific publications in the field of High Energy Physics, identifying criteria that are pertinent for ranking scientific publications and selecting and aggregating them within a unified framework. The first issue that we address is thus the identification and selection of ranking criteria for scientific documents in High Energy Physics. The criteria include the traditional information retrieval relevance based on word similarity, but also document usage, citation counts and links. In this context we present a novel ranking criterion combining the Hirsch index and download counts, which we call the d-Hirsch index; it takes into account counts of document downloads and assigns the corresponding Hirsch index directly to a document. Criteria selection is then based on correlation analysis between ranked lists of documents. We propose that correlations of entire document listings be replaced by a measure of overlap on the top-k of the resulting lists, which should better reflect the independence of the criteria in terms of ranking. To this end we propose a new measure of overlap, the Mean Average Overlap (MeanO).
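As an illustration of the top-k overlap idea described above, a minimal Python sketch follows. It is not the definition of MeanO, which this abstract does not give. The function names, the averaging over depths 1 to k, and the document identifiers are all hypothetical and serve only to make the idea concrete.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of documents shared by the top-k of two ranked lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


def average_overlap(ranking_a, ranking_b, k):
    """Mean of overlap_at_k over depths 1..k.

    Assumed averaging scheme for illustration only; it is not taken
    from the thesis.
    """
    depths = range(1, k + 1)
    return sum(overlap_at_k(ranking_a, ranking_b, d) for d in depths) / k


# Hypothetical rankings of the same five documents under two criteria.
by_citations = ["d1", "d2", "d3", "d4", "d5"]
by_downloads = ["d2", "d1", "d5", "d3", "d4"]

print(overlap_at_k(by_citations, by_downloads, 3))
print(average_overlap(by_citations, by_downloads, 3))
```

Two criteria whose top-k lists overlap heavily carry largely redundant ranking information, whereas a low overlap suggests that the criteria are independent enough to be worth aggregating.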
The second issue that we address is the aggregation framework for ranked lists of documents, where we focus on applying linear combination and models trained with machine learning techniques based on logistic regression. As the individual scores used for ranking are not necessarily comparable to each other, we describe a unified model for normalizing ranking scores before their aggregation, based on statistical properties of the underlying ranking criteria. Another contribution of our work is the creation of a referential of relevance judgments for information retrieval experimentation in databases of scientific publications in the domain of High Energy Physics. Until now no such resource has been available for evaluating specialized information retrieval in this domain. We propose a method for automated generation of referentials, assuming that document relevance is determined by document usage. Our approach corresponds to a modification of the pooling method, wh