Linguistics
Linguistics is the scientific study of language. Modern linguistics takes all aspects of language into account: the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural. Linguistics rests on both theoretical and descriptive study of language, and is also interlinked with the applied fields of language studies and language learning, which entail the study of specific languages.
Text corpus
In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated. Annotated corpora have been used in corpus linguistics for statistical hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents being searched.
Lemma (morphology)
In morphology and lexicography, a lemma (pl.: lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of word forms. In English, for example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lexeme, in this context, refers to the set of all the inflected or alternating forms in the paradigm of a single word, and lemma refers to the particular form that is chosen by convention to represent the lexeme.
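The relation between word forms, lexeme, and lemma can be sketched as a simple lookup. This is a minimal illustration only: the table `LEMMA_TABLE` and the helper names `lemmatize` and `lexeme` are hypothetical, and the table holds just the five forms of break mentioned above.

```python
# Hypothetical lookup table: word form -> lemma (citation form).
# Only the forms of "break" from the example are included.
LEMMA_TABLE = {
    "break": "break",
    "breaks": "break",
    "broke": "break",
    "broken": "break",
    "breaking": "break",
}


def lemmatize(word_form):
    """Return the lemma for a known word form, or the form itself if unknown."""
    return LEMMA_TABLE.get(word_form.lower(), word_form)


def lexeme(lemma):
    """Return the set of all known word forms indexed under a lemma."""
    return {form for form, head in LEMMA_TABLE.items() if head == lemma}


if __name__ == "__main__":
    print(lemmatize("broke"))
    print(sorted(lexeme("break")))
```

A real lemmatizer would need a far larger lexicon or morphological rules; the sketch only shows that many forms map to one conventional representative.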
Concordance (publishing)
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.
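With a computer, a basic concordance takes only a few lines. The sketch below is one possible approach, not a standard implementation: the function name `build_concordance`, the `window` parameter, and the simple regular-expression tokenizer are all assumptions made for illustration, and it indexes every word rather than only the principal ones.

```python
import re
from collections import defaultdict


def build_concordance(text, window=3):
    """Map each word (lowercased) to its occurrences with surrounding context.

    Each occurrence is a (left context, word, right context) tuple, where the
    contexts contain up to `window` neighbouring words.
    """
    # Naive tokenizer (an assumption): runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    entries = defaultdict(list)
    for i, token in enumerate(tokens):
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        entries[token.lower()].append((left, token, right))
    # Sort the headwords alphabetically, as in a printed concordance.
    return dict(sorted(entries.items()))


if __name__ == "__main__":
    sample = "To be or not to be that is the question"
    for word, hits in build_concordance(sample, window=2).items():
        for left, hit, right in hits:
            print(f"{word:10} {left:>10} [{hit}] {right}")
```

A fuller version would typically skip function words such as "the" or "of" so that only the principal words are listed.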
Collocation
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated. There are about seven main types of collocations: adjective + noun, noun + noun (such as collective nouns), noun + verb, verb + noun, adverb + adjective, verb + prepositional phrase (phrasal verbs), and verb + adverb.
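One common way to quantify "more often than would be expected by chance" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the probability expected if the two words were independent. The sketch below is a minimal illustration under that assumption; the function name `bigram_pmi` is hypothetical, it considers only adjacent word pairs, and PMI is just one of several association measures in use.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI (in bits) for every adjacent word pair in `tokens`.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Positive values mean the pair occurs more often than chance predicts.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    words = "strong tea and strong tea and weak coffee".split()
    ranked = sorted(bigram_pmi(words).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 2))
```

On very small samples PMI overrates rare pairs, so practical work usually applies a minimum frequency threshold before ranking candidates.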
Part of speech
In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assigned to the same part of speech generally display similar syntactic behavior (they play similar roles within the grammatical structure of sentences), sometimes similar morphological behavior, in that they undergo inflection for similar properties, and even similar semantic behavior.
Part-of-speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags.
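The role of both definition and context can be shown with a toy tagger. This is a deliberately simplified sketch, not how production taggers work (those typically rely on statistical or neural models): the `LEXICON` entries, the tag names, and the single context rule are all assumptions chosen for illustration.

```python
# Hypothetical mini-lexicon: each word lists the tags its definition allows.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "dog": {"NOUN"},
    "they": {"PRON"},
    "long": {"ADJ"},
    "walk": {"NOUN", "VERB"},  # ambiguous without context
    "book": {"NOUN", "VERB"},  # ambiguous without context
}


def tag(tokens):
    """Tag each token using the lexicon, then context for ambiguous words."""
    tagged = []
    prev = None
    for token in tokens:
        # Unknown words default to NOUN (an arbitrary assumption).
        candidates = LEXICON.get(token.lower(), {"NOUN"})
        if len(candidates) == 1:
            choice = next(iter(candidates))
        elif prev in ("DET", "ADJ") and "NOUN" in candidates:
            # Context rule: after a determiner or adjective, prefer NOUN.
            choice = "NOUN"
        elif "VERB" in candidates:
            choice = "VERB"
        else:
            choice = sorted(candidates)[0]
        tagged.append((token, choice))
        prev = choice
    return tagged


if __name__ == "__main__":
    print(tag("they walk the dog".split()))
    print(tag("the long walk".split()))
```

The same word form, walk, receives a different tag in each sentence because the preceding tag differs, which is the essence of context-sensitive tagging.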
Dictionary
A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by consonantal root for Semitic languages or radical and stroke for logographic languages), which may include information on definitions, usage, etymologies, pronunciations, translation, etc. It is a lexicographical reference that shows inter-relationships among the data. A broad distinction is made between general and specialized dictionaries.
Computational linguistics
Since the 2020s, computational linguistics has become a near-synonym of either natural language processing or language technology, with deep learning approaches, such as large language models, outperforming the specific approaches previously used in the field. The field has overlapped with artificial intelligence since the efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English.
Machine translation
Machine translation is the use of either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches to the translation of text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
History of machine translation
The origins of machine translation can be traced back to the work of Al-Kindi, a ninth-century Arabic cryptographer who developed techniques for systemic language translation, including cryptanalysis, frequency analysis, and probability and statistics, which are used in modern machine translation.