Publication

Analyse multi-échelle de n-grammes sur 200 années d'archives de presse

Résumé

The recent availability of large corpora of digitized texts over several centuries opens the way to new forms of studies on the evolution of languages. In this thesis, we study a corpus of 4 million press articles covering a period of 200 years. The thesis tries to measure the evolution of written French on this period at the level of words and expressions, but also in a more global way by attempting to define integrated measures of linguistic evolution. The methodological choice is to introduce a minimum of linguistic hypotheses in this study by developing new measures around the simple notion of n-gram, a sequence of n consecutive words. The thesis explores on this basis the potential of already known concepts as temporal frequency profiles and their diachronic correlations, but also introduces new abstractions such as the notion of resilient linguistic kernel or the decomposition of profiles into solidified expressions according to simple statistical models. Through the use of distributed computational techniques, it develops methods to test the relevance of these concepts on a large amount of textual data and thus allows to propose a virtual observatory of the diachronic evolutions associated with a given corpus. On this basis, the thesis explores more precisely the multi-scale dimension of linguistic phenomena by considering how standardized measures evolve when applied to increasingly long n-grams. The discrete and continuous scale from the isolated entities (n=1) to the increasingly complex and structured expressions (1 < n < 10) offers a transversal axis of study to the classical differentiations that ordinarily structure linguistics: syntax, semantics, pragmatics, and so on. The thesis explores the quantitative and qualitative diversity of phenomena at these different scales of language and develops a novel approach by proposing multi-scale measurements and formalizations, with the aim of characterizing more fundamental structural aspects of the studied phenomena.

À propos de ce résultat
Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Graph Chatbot

Chattez avec Graph Search

Posez n’importe quelle question sur les cours, conférences, exercices, recherches, actualités, etc. de l’EPFL ou essayez les exemples de questions ci-dessous.

AVERTISSEMENT : Le chatbot Graph n'est pas programmé pour fournir des réponses explicites ou catégoriques à vos questions. Il transforme plutôt vos questions en demandes API qui sont distribuées aux différents services informatiques officiellement administrés par l'EPFL. Son but est uniquement de collecter et de recommander des références pertinentes à des contenus que vous pouvez explorer pour vous aider à répondre à vos questions.