Publication

Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies

Related concepts (27)

Part-of-speech tagging

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms that associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags.
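
The idea of tagging a word from both its definition and its context can be illustrated with a very small sketch. The training sentences, the tag names (DET, NOUN, VERB, ADJ), and the functions `train` and `tag` below are hypothetical and chosen only for illustration; real taggers use far larger annotated corpora and more robust statistical or neural models.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: one list of (word, tag) pairs per sentence.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
]


def train(sentences):
    """Count tags per (previous tag, word) pair and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for sentence in sentences:
        prev_tag = "<s>"  # marker for the start of a sentence
        for word, tag_ in sentence:
            by_context[(prev_tag, word)][tag_] += 1
            by_word[word][tag_] += 1
            prev_tag = tag_
    return by_context, by_word


def tag(words, model, default="NOUN"):
    """Greedily tag words left to right.

    Prefer the tag seen most often for this word after the previous tag
    (context); fall back to the word's most frequent tag (definition),
    then to a default tag for unknown words.
    """
    by_context, by_word = model
    prev_tag = "<s>"
    tagged = []
    for word in words:
        word = word.lower()
        if (prev_tag, word) in by_context:
            chosen = by_context[(prev_tag, word)].most_common(1)[0][0]
        elif word in by_word:
            chosen = by_word[word].most_common(1)[0][0]
        else:
            chosen = default
        tagged.append((word, chosen))
        prev_tag = chosen
    return tagged


model = train(TRAIN)
noun_reading = tag(["the", "bark"], model)
verb_reading = tag(["dogs", "bark"], model)
```

In this sketch the same word, "bark", is tagged as a noun after a determiner and as a verb after a noun, which is the kind of context dependence the definition above describes.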

Natural language generation

Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information". While it is widely agreed that the output of any NLG process is text, there is some disagreement about whether the inputs of an NLG system need to be non-linguistic.

Natural-language user interface

A natural-language user interface (LUI or NLUI) is a type of computer-human interface in which linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications. In interface design, natural-language interfaces are sought after for their speed and ease of use, but most struggle to understand wide varieties of ambiguous input. Natural-language interfaces are an active area of study in the field of natural-language processing and computational linguistics.

Masoretic Text

The Masoretic Text (MT or M; Nūssāḥ Hammāsōrā, lit. 'Text of the Tradition') is the authoritative Hebrew and Aramaic text of the 24 books of the Hebrew Bible (Tanakh) in Rabbinic Judaism. The Masoretic Text defines the Jewish canon and its precise letter-text, with its vocalization and accentuation known as the mas'sora. Referring to the Masoretic Text, masorah specifically means the diacritic markings of the text of the Hebrew scriptures and the concise marginal notes in manuscripts (and later printings) of the Tanakh which note textual details, usually about the precise spelling of words.

Language model

A language model is a probabilistic model of a natural language that can assign probabilities to sequences of words, based on the text corpora in one or more languages on which it was trained. Large language models, the most advanced form, combine feedforward neural networks and transformers. They have superseded recurrent neural network-based models, which had previously superseded purely statistical models such as the word n-gram language model.
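
As a rough illustration of the purely statistical end of this spectrum, the following sketch estimates a word bigram model by maximum likelihood and uses it to score a word sequence. The two-sentence corpus and the function names `train_bigram` and `sequence_probability` are assumptions made for the example; practical n-gram models also apply smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and adjacent word pairs, with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return unigrams, bigrams


def sequence_probability(sentence, unigrams, bigrams):
    """Multiply the estimates P(cur | prev) = count(prev, cur) / count(prev)."""
    tokens = ["<s>"] + sentence + ["</s>"]
    probability = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # context never seen in training (no smoothing here)
        probability *= bigrams[(prev, cur)] / unigrams[prev]
    return probability


# Hypothetical toy corpus.
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
unigrams, bigrams = train_bigram(corpus)

p_seen = sequence_probability(["the", "dog", "barks"], unigrams, bigrams)
p_unseen = sequence_probability(["the", "dog", "sleeps"], unigrams, bigrams)
```

Here `p_seen` is 0.5, because "the" is followed by "dog" in one of its two training occurrences and every other transition in the sentence has probability 1, while `p_unseen` is 0 because the pair ("dog", "sleeps") never occurs in the corpus.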

Biblical manuscript

A biblical manuscript is any handwritten copy of a portion of the text of the Bible. Biblical manuscripts vary in size from tiny scrolls containing individual verses of the Jewish scriptures (see Tefillin) to huge polyglot codices (multi-lingual books) containing both the Hebrew Bible (Tanakh) and the New Testament, as well as extracanonical works. The study of biblical manuscripts is important because handwritten copies of books can contain errors.

Textual criticism

Textual criticism is a branch of textual scholarship, philology, and literary criticism that is concerned with the identification of textual variants, or different versions, of either manuscripts (mss) or of printed books. Such texts may range in dates from the earliest writing in cuneiform, impressed on clay, for example, to multiple unpublished versions of a 21st-century author's work. Historically, scribes who were paid to copy documents may have been literate, but many were simply copyists, mimicking the shapes of letters without necessarily understanding what they meant.