Êtes-vous un étudiant de l'EPFL à la recherche d'un projet de semestre?
Travaillez avec nous sur des projets en science des données et en visualisation, et déployez votre projet sous forme d'application sur Graph Search.
Methods of estimating the similarity between individual publications is an area of long-standing interest in the scientometrics community. Traditional methods have generally relied on references and other metadata, while text mining approaches based on title and abstract text have appeared more frequently in recent years. In principle, Topic Models have great potential in this domain. But in practice, they are often difficult to successfully employ and, in particular, are notoriously inconsistent as latent space dimension grows. That is, running the same model, with the same parameters, on the same data, but with a different random seed produces radically different similarity estimates as the number of topics increase. In this manuscript we develop a simple, but novel, methodology for evaluating the robustness of topic models. Employing that methodology, we find that the neural network based Doc2Vec approach seems capable of providing (statistically) robust estimates of document-document similarities, even for topic spaces far larger than prudent for the most common topic model approach: Latent Dirichlet Allocation. As this is a work in progress, we do not venture deeply into the question of whether these estimates also reflect reality, but do provide some preliminary evidence and future directions for those efforts.
Jean-Marc Odobez, Olivier Canévet, Michael Villamizar