In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics. Corpora are the main knowledge base in corpus linguistics. Other notable areas of application include: Language technology, natural language processing, computational linguistics The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes.

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related courses (9)
AR-679: IMAGES AND NUMBERS. 8th Les Rencontres de l'EDAR
The eighth edition of Les Rencontres de l'EDAR invites doctoral students to reflect on scientific visualisation, referring to their own experience as young scholars - whether related to their PhD diss
ENG-270: Computational methods and tools
This course prepares students to use modern computational methods and tools for solving problems in engineering and science.
CS-431: Introduction to natural language processing
The objective of this course is to present the main models, formalisms and algorithms necessary for the development of applications in the field of natural language information processing. The concept
Show more
Related lectures (30)
Latent Semantic Indexing: Concepts and Applications
Explores latent semantic indexing, vocabulary construction, document matrix creation, query transformation, and document retrieval using cosine similarity.
Taxonomy Induction: Relations Extraction and Graph Construction
Covers relations extraction and graph construction in taxonomy induction, emphasizing noise reduction for accurate graphs.
Handling Text: Document Retrieval, Classification, Sentiment Analysis
Explores document retrieval, classification, sentiment analysis, TF-IDF matrices, nearest-neighbor methods, matrix factorization, regularization, LDA, contextualized word vectors, and BERT.
Show more
Related publications (107)

Data-Driven Music Theory: Curating and Investigating Large Corpora of Digitally Encoded Music Analyses

Johannes Hentschel

This dissertation on data-driven music theory is centered around curatorial practices concerning the creation, publication, and evaluation of large, expert-annotated symbolic datasets. With its primary interest in the harmony of European tonal music from i ...
EPFL2024

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

Frédéric Kaplan, Maud Ehrmann, Matteo Romanello, Sven-Nicolas Yoann Najem, Emanuela Boros

The quality of automatic transcription of heritage documents, whether from printed, manuscripts or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition ( ...
The Association for Computational Linguistics2024

An Annotated Corpus of Tonal Piano Music from the Long 19th Century

Martin Alois Rohrmeier, Fabian Claude Moss, Markus Franz Josef Neuwirth, Johannes Hentschel

We present a dataset of 264 annotated piano pieces of nine composers, composed in the long 19th century (https://doi.org/10.5281/zenodo.7483349). Annotations adhere to the DCML harmony annotation standard and include Roman numerals, phrase boundaries, and ...
Ohio State Univ, Sch Music2023
Show more
Related concepts (17)
Linguistics
Linguistics is the scientific study of language. The modern-day scientific study of linguistics takes all aspects of language into account — i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural. Linguistics is based on a theoretical as well as descriptive study of language, and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages.
Corpus linguistics
Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural corpora), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The text-corpus method uses the body of texts written in any natural language to derive the set of abstract rules which govern that language.
Treebank
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure.
Show more

Graph Chatbot

Chat with Graph Search

Ask any question about EPFL courses, lectures, exercises, research, news, etc. or try the example questions below.

DISCLAIMER: The Graph Chatbot is not programmed to provide explicit or categorical answers to your questions. Rather, it transforms your questions into API requests that are distributed across the various IT services officially administered by EPFL. Its purpose is solely to collect and recommend relevant references to content that you can explore to help you answer your questions.