Text mining
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by discerning patterns and trends through means such as statistical pattern learning.
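The pattern-discovery idea above can be sketched in miniature: count term frequencies across a small corpus and surface the recurring content words. The corpus, stopword list, and thresholds here are invented for illustration, not a real text-mining pipeline.

```python
# Minimal sketch of statistical pattern extraction from text:
# tokenize a small corpus, drop stopwords, and surface the terms
# that recur across documents (a crude hint at underlying topics).
from collections import Counter
import re

docs = [
    "The new phone has a great camera and battery life.",
    "Battery life on this phone is poor, but the camera is great.",
    "Great camera, average battery.",
]

# Tiny hand-picked stopword list (illustrative only).
stopwords = {"the", "a", "and", "on", "this", "is", "but", "has", "new"}

def tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

counts = Counter(w for d in docs for w in tokens(d))
print(counts.most_common(3))  # the most frequent terms hint at recurring themes
```

Real systems replace raw counts with weighting schemes such as TF-IDF and add statistical tests before calling a recurring term a "trend", but the counting step is the same in spirit.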
Textual criticism
Textual criticism is a branch of textual scholarship, philology, and literary criticism that is concerned with the identification of textual variants, or different versions, of either manuscripts (mss) or of printed books. Such texts may range in date from the earliest writing in cuneiform, impressed on clay, for example, to multiple unpublished versions of a 21st-century author's work. Historically, scribes who were paid to copy documents may have been literate, but many were simply copyists, mimicking the shapes of letters without necessarily understanding what they meant.
Content analysis
Content analysis is the study of documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner. One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers. Practices and philosophies of content analysis vary between academic disciplines.
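The "replicable and systematic" counting at the heart of quantitative content analysis can be sketched briefly: apply a fixed codebook of categories to a set of messages and tally how often each category appears. The messages, category names, and keyword lists below are invented for illustration.

```python
# Minimal sketch of quantitative content analysis: count how many
# messages mention each predefined coding category, using a fixed
# keyword codebook so the procedure is repeatable.
messages = [
    "Great service, very happy with the support team.",
    "Delivery was late and the packaging was damaged.",
    "Happy overall, though delivery could be faster.",
]

# Codebook mapping each category to its indicator keywords (illustrative).
codebook = {
    "positive": {"great", "happy"},
    "logistics": {"delivery", "packaging", "late"},
}

counts = {code: 0 for code in codebook}
for msg in messages:
    words = set(msg.lower().replace(",", " ").replace(".", " ").split())
    for code, keywords in codebook.items():
        if words & keywords:  # the message mentions at least one indicator
            counts[code] += 1

print(counts)
```

In practice the codebook is developed and validated by human coders (with inter-coder reliability checks); the mechanical tallying is what makes the analysis replicable.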
Document processing
Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor.
Qualitative research
Qualitative research is a type of research that aims to gather and analyse non-numerical (descriptive) data in order to gain an understanding of individuals' social reality, including their attitudes, beliefs, and motivations. This type of research typically involves in-depth interviews, focus groups, or observations in order to collect data that is rich in detail and context. Qualitative research is often used to explore complex phenomena or to gain insight into people's experiences and perspectives on a particular topic.
Historical criticism
Historical criticism, also known as the historical-critical method or higher criticism, is a branch of criticism that investigates the origins of ancient texts in order to understand "the world behind the text". While often discussed in terms of Jewish and Christian writings from ancient times, historical criticism has also been applied to other religious and secular writings from various parts of the world and periods of history.
Speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Masoretic Text
The Masoretic Text (MT or M; Nūssāḥ Hammāsōrā, lit. 'Text of the Tradition') is the authoritative Hebrew and Aramaic text of the 24 books of the Hebrew Bible (Tanakh) in Rabbinic Judaism. The Masoretic Text defines the Jewish canon and its precise letter-text, with its vocalization and accentuation known as the masorah. Referring to the Masoretic Text, masorah specifically means the diacritic markings of the text of the Hebrew scriptures and the concise marginal notes in manuscripts (and later printings) of the Tanakh which note textual details, usually about the precise spelling of words.
Automatic summarization
Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data. Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.
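"Locating the most informative sentences" can be illustrated with the simplest extractive approach: score each sentence by the average document-wide frequency of its words and keep the top scorer. The example text and the scoring formula are illustrative, not a production summarizer.

```python
# Minimal sketch of extractive summarization: rank sentences by how
# frequent their words are in the whole document, then keep the best one.
from collections import Counter
import re

text = (
    "Solar power capacity grew rapidly last year. "
    "Analysts attribute the growth of solar power to falling panel costs. "
    "Unrelated weather events also made headlines."
)

sentences = re.split(r"(?<=[.!?])\s+", text.strip())
freq = Counter(re.findall(r"[a-z]+", text.lower()))  # document-wide word counts

def score(sentence):
    """Average document frequency of the sentence's words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(freq[w] for w in words) / max(len(words), 1)

summary = max(sentences, key=score)
print(summary)
```

Modern systems instead use neural models (including abstractive ones that generate new sentences), but frequency-based sentence ranking of this kind was the historical starting point of the field.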
Optical character recognition
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast).