Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
This paper presents a methodology to analyze linguistic changes in a given textual corpus allowing to overcome two common problems related to corpus linguistics studies. One of these issues is the monotonic increase of the corpus size with time, and the other one is the presence of noise in the textual data. In addition, our method allows to better target the linguistic evolution of the corpus, instead of other aspects like noise fluctuation or topics evolution. A corpus formed by two newspapers “La Gazette de Lausanne” and “Le Journal de Genève” is used, providing 4 million articles from 200 years of archives. We first perform some classical measurements on this corpus in order to provide indicators and visualizations of linguistic evolution. We then define the concept of a lexical kernel and word resilience, to face the two challenges of noises and corpus size fluctuations. This paper ends with a discussion based on the comparison of results from linguistic change analysis and concludes with possible future works continuing in that direction.
Colin Neil Jones, Paul Scharnhorst, Rafael Eduardo Carrillo Rangel, Pierre-Jean Alet, Baptiste Schubnel
Frédéric Kaplan, Isabella Di Lenardo
Colin Neil Jones, Paul Scharnhorst, Rafael Eduardo Carrillo Rangel, Pierre-Jean Alet, Baptiste Schubnel