Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems overlap, however, and there is therefore interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year, etc.). In the rest of this article only subject classification is considered.

There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. For example, a common rule for classification in libraries is that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification, the criterion could be the number of times given words appear in a document.

Request-oriented classification (or -indexing) is classification in which the anticipated request from users influences how documents are classified. The classifier asks themselves: "Under which descriptors should this entity be found?" and "think of all the possible queries and decide for which ones the entity at hand is relevant" (Soergel, 1985, p. 230).
Request-oriented classification may be classification that is targeted towards a particular audience or user group.
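The content-based criterion mentioned above — counting how often class-indicative words appear — can be sketched in a few lines. This is a minimal illustration, not a production classifier; the class names and term lists are hypothetical:

```python
from collections import Counter
import re

# Hypothetical subject profiles: indicative terms per class.
CLASS_TERMS = {
    "astronomy": {"star", "planet", "telescope", "orbit"},
    "cooking": {"recipe", "oven", "flour", "simmer"},
}

def classify_by_content(text: str) -> str:
    """Assign the class whose indicative terms occur most often in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {cls: sum(words[t] for t in terms)
              for cls, terms in CLASS_TERMS.items()}
    return max(scores, key=scores.get)
```

A real system would learn the term weights from labeled data rather than fixing them by hand, but the principle — the document's dominant subject terms determine its class — is the same.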

Related courses (24)
EE-724: Human language technology: applications to information access
The Human Language Technology (HLT) course introduces methods and applications for language processing and generation, using statistical learning and neural networks.
CS-423: Distributed information systems
This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.
DH-406: Machine learning for DH
This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and implement ...
Related lectures (78)
Document Classification: Transformers and MLPs
Explores transformers and MLPs for document classification, emphasizing their benefits over traditional methods.
Naive Bayes Classifier
Introduces the Naive Bayes classifier, covering independence assumptions, conditional probabilities, and applications in document classification and medical diagnosis.
Document Classification: Overview
Explores document classification methods, including k-Nearest-Neighbors, Naïve Bayes Classifier, transformer models, and multi-head attention.
Related publications (226)

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

Frédéric Kaplan, Maud Ehrmann, Matteo Romanello, Sven-Nicolas Yoann Najem, Emanuela Boros

The quality of automatic transcription of heritage documents, whether from printed, manuscript, or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition ( ...
The Association for Computational Linguistics, 2024

Land Cover Mapping From Multiple Complementary Experts Under Heavy Class Imbalance

Devis Tuia, Valérie Zermatten, Javiera Francisca Castillo Navarro, Xiaolong Lu

Deep learning has emerged as a promising avenue for automatic mapping, demonstrating high efficacy in land cover categorization through various semantic segmentation models. Nonetheless, the practical deployment of these models encounters important challen ...
IEEE - Institute of Electrical and Electronics Engineers, 2024

From Archival Sources to Structured Historical Information: Annotating and Exploring the "Accordi dei Garzoni"

Frédéric Kaplan, Maud Ehrmann, Orlin Biserov Topalov

While automatic document processing techniques have achieved a certain maturity for present-day documents, the transformation of handwritten documents into well-represented, structured and connected data that can satisfactorily be used for historical study ...
Routledge, Taylor & Francis Group, 2023
Related concepts (14)
Bag-of-words model
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
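As a quick sketch, a bag-of-words representation can be built with a multiset (counter) of lowercased tokens; the example sentences below are illustrative only:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent a text as the multiset of its lowercased words,
    discarding grammar and word order but keeping multiplicity."""
    return Counter(text.lower().split())

# Word order is discarded: both sentences map to the same bag.
a = bag_of_words("the cat sat on the mat")
b = bag_of_words("on the mat the cat sat")
```

The resulting word counts are exactly the per-word frequency features that document classifiers are commonly trained on.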
Naive Bayes spam filtering
Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam.
Naive Bayes classifier
In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem.
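The combination described above — bag-of-words features, Bayes' theorem, and the naive independence assumption — can be sketched as a minimal multinomial naive Bayes classifier. This is an illustrative implementation with add-one (Laplace) smoothing; the training sentences in the usage example are made up:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes over bag-of-words features,
    with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # class -> word frequencies
        self.class_counts = Counter(labels)      # class -> number of docs
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        words = doc.lower().split()
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_lp = None, -math.inf
        for cls in self.class_counts:
            # log P(class) + sum over words of log P(word | class)
            lp = math.log(self.class_counts[cls] / n_docs)
            total = sum(self.word_counts[cls].values())
            for w in words:
                lp += math.log((self.word_counts[cls][w] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = cls, lp
        return best

# Toy usage with invented spam/ham training documents:
nb = NaiveBayes().fit(["free money now", "meeting at noon"], ["spam", "ham"])
```

The parameter count is linear in the vocabulary size (one smoothed frequency per word per class), which is what makes naive Bayes so scalable.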