Text corpus | EPFL Graph Search

Related courses (4)

CS-431: Introduction to natural language processing

The objective of this course is to present the main models, formalisms and algorithms necessary for the development of applications in the field of natural language information processing. The concept

CS-423: Distributed information systems

This course introduces the foundations of information retrieval, data mining and knowledge bases, which underpin today's Web-based distributed information systems.

ENG-270: Computational methods and tools

This course prepares students to use modern computational methods and tools for solving problems in engineering and science.

Related lectures (31)

Latent Semantic Indexing: Concepts and Applications

Explores latent semantic indexing, vocabulary construction, document matrix creation, query transformation, and document retrieval using cosine similarity.

Neural Networks for NLP

Covers modern Neural Network approaches to NLP, focusing on word embeddings, Neural Networks for NLP tasks, and future Transfer Learning techniques.

Text Processing: Humanities Computing and Linguistics

Explores the processing of large digital texts, revealing hidden patterns and structures, and the convergence of Humanities Computing and Computational Linguistics.

Related publications (30)

Data-Driven Music Theory: Curating and Investigating Large Corpora of Digitally Encoded Music Analyses

Johannes Hentschel

This dissertation on data-driven music theory is centered around curatorial practices concerning the creation, publication, and evaluation of large, expert-annotated symbolic datasets. With its primary interest in the harmony of European tonal music from i ...

EPFL, 2024

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

Frédéric Kaplan, Maud Ehrmann, Matteo Romanello, Emanuela Boros, Sven-Nicolas Yoann Najem

The quality of automatic transcription of heritage documents, whether from printed, manuscript or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition ( ...

Association for Computational Linguistics, 2024

An Annotated Corpus of Tonal Piano Music from the Long 19th Century

Martin Alois Rohrmeier, Fabian Claude Moss, Johannes Hentschel, Markus Franz Josef Neuwirth

We present a dataset of 264 annotated piano pieces of nine composers, composed in the long 19th century (https://doi.org/10.5281/zenodo.7483349). Annotations adhere to the DCML harmony annotation standard and include Roman numerals, phrase boundaries, and ...

Ohio State Univ, Sch Music, 2023

Related units (1)

DHI - Administration

Related concepts (12)

Linguistics

Linguistics is the scientific study of language. The modern-day scientific study of linguistics takes all aspects of language into account — i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural. Linguistics is based on a theoretical as well as descriptive study of language, and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages.

Corpus linguistics

Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural corpora), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The text-corpus method uses the body of texts written in any natural language to derive the set of abstract rules which govern that language.

Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. The "tree" in the name reflects the fact that both syntactic and semantic structure are commonly represented compositionally as a tree structure.
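
As a rough illustration of the tree representation described above, the sketch below reads one sentence written in a bracketed notation (a convention used by many treebanks) into a nested structure and then recovers the words from it. The example sentence, its labels, and the helper names `parse_bracketed` and `leaves` are assumptions made for illustration; they are not taken from any particular treebank or library.

```python
def parse_bracketed(text):
    """Parse a bracketed tree such as '(S (NP she) (VP runs))'.

    Returns a (label, children) tuple, where each child is either a
    word (str) or another (label, children) tuple.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # constituent or part-of-speech label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word at the leaf
                pos += 1
        pos += 1                     # consume the closing ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the tree")
    return tree


def leaves(tree):
    """Collect the words of the sentence, left to right."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical annotated sentence, invented for this example.
example = "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"
tree = parse_bracketed(example)
print(tree[0])        # root label of the sentence
print(leaves(tree))   # the original words, recovered from the tree
```

A parsed corpus in this sense is a collection of such trees, one per sentence, which makes it possible to query structure (for example, every noun phrase) as well as the words themselves.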