Summary
Document processing is a field of research and a set of production processes aimed at making an analog document digital. It does not simply aim to photograph or scan a document to obtain a digital image, but also to make the document digitally intelligible. This includes extracting the structure or layout of the document and then its content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed relate to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also cover the phase of digitizing the document with a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or other technologies. It is applied in many industrial and scientific fields to optimize administrative processes, process mail, and digitize analog archives and historical documents.
Document processing was initially, and to some extent still is, a kind of production-line work dealing with the treatment of documents such as letters and parcels, with the aim of sorting them or extracting data from them at scale. This work could be performed in-house or through business process outsourcing. Document processing can also involve externalized manual labor, such as Mechanical Turk. As an example of manual document processing as recent as 2007, processing "millions of visa and citizenship applications" relied on "approximately 1,000 contract workers" working to "manage mail room and data entry." Although document processing involved data entry via keyboard well before the use of a computer mouse or a scanner, a 1990 article in The New York Times on what it called the "paperless office" stated that "document processing begins with the scanner".
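The layout-extraction step mentioned in the summary can be illustrated with one of the simplest traditional computer vision techniques: a horizontal projection profile, which counts ink pixels per row to locate bands of text. The sketch below is a minimal illustration and is not taken from the source. The function names `binarize` and `find_text_lines`, the threshold of 128, and the toy grayscale page are all assumptions chosen for the example; production systems typically use far more robust methods.

```python
from typing import List, Tuple

def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Mark pixels darker than `threshold` as ink (1), all others as background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def find_text_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) bands, end exclusive, of consecutive rows with ink."""
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in binary]
    bands: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text band begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom of the page
        bands.append((start, len(profile)))
    return bands

# Hypothetical 7x6 grayscale "page": 255 is white paper, low values are ink.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  20,  30, 255,  10, 255],
    [255,  25, 255, 255,  15, 255],
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [255,  40,  35,  30, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

lines = find_text_lines(binarize(page))
print(lines)  # [(1, 3), (5, 6)]
```

Each returned band could then be cropped and passed to a recognition step such as OCR or HTR. A fixed global threshold is used here only for brevity; scans with uneven lighting or skew would need adaptive binarization and deskewing first.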
Related courses (3)
DH-405: Foundations of digital humanities
This course gives an introduction to the fundamental concepts and methods of the Digital Humanities, both from a theoretical and applied point of view. The course introduces the Digital Humanities cir
CS-452: Foundations of software
The course introduces the foundations on which programs and programming languages are built. It introduces syntax, types and semantics as building blocks that together define the properties of a progr
PHYS-324: Classical electrodynamics
The goal of this course is the study of the physical and conceptual consequences of Maxwell equations.
Related lectures (11)
Mathematical Equation Rule
Covers the process of converting handwritten entries into text and sharing mathematical equation rules.
Mathematical Equation Rule
Covers mathematical equation rules, handwritten entries, drawings, and text conversions.
Lambda Calculus: Church Numerals
Explores Church numerals, Booleans, pairs, recursion, and behavioral equivalence in Lambda Calculus.
Related publications (15)

Lausanne Historical Censuses Dataset HTR 35k

Lucas Arnaud André Rappo, Rémi Guillaume Petitpierre, Marion Kramer

This training dataset includes a total of 34,913 manually transcribed text segments. It is dedicated to the handwritten text recognition (HTR) of historical sources, typically tabular records, such as censuses. This dataset is based on a sample of 83 pages ...
Zenodo, 2023
Related people (2)