Publication

Lausanne Historical Censuses Dataset HTR 35k

Lucas Arnaud André Rappo, Rémi Guillaume Petitpierre, Marion Kramer
2023
Dataset

Abstract

This training dataset contains 34,913 manually transcribed text segments. It is intended for the handwritten text recognition (HTR) of historical sources, typically tabular records such as censuses. The dataset is based on a sample of 83 pages from the 19th-century (1805-1898) censuses of Lausanne, Switzerland. The primary language of the documents is French, although many Germanic names and toponyms also occur.

The training data are formatted on the model of the Bentham dataset. The format consists of a list of JPEG images, one per text segment, and their corresponding transcriptions, stored in a txt file. The file naming convention is 'yyyy-ppp-n', where 'y' stands for the year of publication of the census and 'p' for the page number. The digitized documents are provided by the Archives of the City of Lausanne.

The annotation and extraction methodology, together with the complete performance evaluation, including the HTR benchmark and post-correction performance, is published in: Petitpierre R., Rappo L., Kramer M. (2023). An end-to-end pipeline for historical censuses processing. International Journal on Document Analysis and Recognition (IJDAR). doi: 10.1007/s10032-023-00428-9

The tabular data resulting from automatic extraction are also available on Zenodo: Petitpierre R., Rappo L., Kramer M., di Lenardo I. (2023). 1805-1898 Census Records of Lausanne : a Long Digital Dataset for Demographic History. Zenodo. doi: 10.5281/zenodo.7711640
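The 'yyyy-ppp-n' naming convention described in the abstract can be split into its parts with a few lines of Python. This is only a minimal sketch under stated assumptions: the abstract does not say what 'n' denotes, so it is treated here as a numeric segment index; the year is assumed to be exactly four digits and the page exactly three; and the example file name is hypothetical, not taken from the dataset.

```python
import re
from pathlib import PurePath
from typing import NamedTuple


class SegmentId(NamedTuple):
    year: int     # 'yyyy': year of publication of the census
    page: int     # 'ppp': page number
    segment: int  # 'n': ASSUMED to be a segment index (not stated in the abstract)


# Assumed layout: 4-digit year, 3-digit page, one or more digits for 'n'.
_NAME_PATTERN = re.compile(r"^(\d{4})-(\d{3})-(\d+)$")


def parse_segment_name(filename: str) -> SegmentId:
    """Split a 'yyyy-ppp-n' file name (with or without extension) into its parts."""
    stem = PurePath(filename).stem  # drops any directory part and a '.jpg' suffix
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not a 'yyyy-ppp-n' name: {filename!r}")
    year, page, segment = (int(group) for group in match.groups())
    return SegmentId(year, page, segment)


# Hypothetical file name, for illustration only.
example = parse_segment_name("1835-012-7.jpg")
```

Grouping the parsed identifiers by year and page is one way to rebuild page-level batches from the flat list of segment images, should that be needed for training or evaluation splits.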

Official source

https://infoscience.epfl.ch/record/301983?ln=en

About this result

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Ontological neighbourhood

Information engineering

Natural language processing: Topics in natural language processing

Related publications (36)

Subjective performance evaluation of bitrate allocation strategies for MPEG and JPEG Pleno point cloud compression

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

1805-1898 Census Records of Lausanne : a Long Digital Dataset for Demographic History
