Lecture

Text Processing: Large Digital Text Collections Analysis

In course

DH-405: Foundations of digital humanities

This course gives an introduction to the fundamental concepts and methods of the Digital Humanities, both from a theoretical and applied point of view. The course introduces the Digital Humanities cir

Description

This lecture explores how large digital text collections are processed in the Digital Humanities. It covers the extraction of hidden regularities and structures from massive textual objects, the distinction between Humanities Computing and Computational Linguistics, the challenges posed by very large textual objects, and the use of text processing pipelines. It also discusses the significance of projects such as Project Gutenberg and Wikisource, the concept of text reuse, and the application of TF-IDF, Latent Semantic Analysis, and topic modeling to the analysis of text data.
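
As a rough illustration of the TF-IDF weighting named in the description, here is a minimal sketch in Python. It assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency. The function name `tf_idf` and the toy corpus are hypothetical and are not taken from the lecture.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (one common variant, not the only one):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already lowercased.
corpus = [
    "the whale hunts the ship".split(),
    "the ship sails at dawn".split(),
    "the whale sleeps".split(),
]
scores = tf_idf(corpus)

# Show the three highest-weighted terms of the first document.
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term}: {weight:.3f}")
```

In this variant, a term that occurs in every document (here "the") gets a weight of zero, while a term confined to one document scores highest. Library implementations often apply smoothing, so their values will differ from this sketch.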

Instructor

Frédéric Kaplan

Ontological neighbourhood

Information engineering

Natural language processing: Topics in natural language processing

Related lectures (31)

Vector Space Semantics (and Information Retrieval)

Explores the Vector Space model, Bag of Words, tf-idf, cosine similarity, Okapi BM25, and Precision and Recall in Information Retrieval.

Handling Text: Document Retrieval, Classification, Sentiment Analysis

Explores document retrieval, classification, sentiment analysis, TF-IDF matrices, nearest-neighbor methods, matrix factorization, regularization, LDA, contextualized word vectors, and BERT.

Document Retrieval and Classification

Covers document retrieval, classification, sentiment analysis, and topic detection using TF-IDF matrices and contextualized word vectors from models such as BERT.

Text Handling: Matrix, Documents, Topics

Explores text handling, focusing on matrices, documents, and topics, including challenges in document classification and advanced models like BERT.

Text Models: Word Embeddings and Topic Models

Explores word embeddings, topic models, Word2vec, Bayesian Networks, and inference methods like Gibbs sampling.