Estimating the similarity between individual publications is an area of long-standing interest in the scientometrics community. Traditional methods have generally relied on references and other metadata, while text mining approaches based on title and abstract text have appeared more frequently in recent years. In principle, topic models have great potential in this domain. In practice, however, they are often difficult to employ successfully and, in particular, are notoriously inconsistent as the dimension of the latent space grows. That is, running the same model, with the same parameters, on the same data, but with a different random seed, produces radically different similarity estimates as the number of topics increases. In this manuscript we develop a simple but novel methodology for evaluating the robustness of topic models. Employing that methodology, we find that the neural-network-based Doc2Vec approach seems capable of providing (statistically) robust estimates of document-document similarities, even for topic spaces far larger than is prudent for the most common topic model approach, Latent Dirichlet Allocation. As this is a work in progress, we do not venture deeply into the question of whether these estimates also reflect reality, but we do provide some preliminary evidence and future directions for those efforts.
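The abstract does not spell out the evaluation procedure, so the following is only a minimal sketch of one way seed robustness could be checked, not the authors' methodology. It assumes each run of a model yields one vector per document, computes all document-document cosine similarities within each run, and then correlates the two sets of similarities. The function names (`pairwise_similarities`, `seed_agreement`) and the toy vectors are hypothetical and chosen purely for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairwise_similarities(doc_vectors):
    """Cosine similarity for every unordered pair of documents, in a fixed order."""
    return [
        cosine(doc_vectors[i], doc_vectors[j])
        for i, j in combinations(range(len(doc_vectors)), 2)
    ]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def seed_agreement(run_a, run_b):
    """Correlate the document-document similarities produced by two runs.

    Values near 1 mean both runs rank document pairs alike; lower values
    mean the similarity estimates depend on the random seed.
    """
    return pearson(pairwise_similarities(run_a), pairwise_similarities(run_b))


# Toy document-topic vectors (made-up numbers, 4 documents x 3 topics).
run_a = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
# A second "run" that only relabels the topics: similarities are unchanged.
run_b = [[row[2], row[0], row[1]] for row in run_a]
# A third "run" that places the documents differently in topic space.
run_c = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]

print(seed_agreement(run_a, run_b))
print(seed_agreement(run_a, run_c))
```

Comparing similarities rather than the topic vectors themselves is a deliberate choice in this sketch: topic indices are arbitrary and can be permuted between runs, whereas document-document similarities are unaffected by such relabeling.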
Jean-Marc Odobez, Olivier Canévet, Michael Villamizar