This historical dataset stems from a project on the automatic extraction of 72 census records of Lausanne, Switzerland. The complete dataset covers a century of historical demography in Lausanne (1805-1898), amounting to 18,831 pages and nearly 6 million cells.

Content. The data published in this repository correspond to a first release, i.e. a diachronic slice of one register every 8 to 9 years. Unfortunately, the remaining data are currently under embargo. Their publication will take place as soon as possible, and at the latest by the end of 2023. In the meantime, the data presented here correspond to a large subset of 2,844 pages, which already makes it possible to investigate most research hypotheses.

The population censuses, digitized by the Archives of the city of Lausanne, continuously cover the evolution of the population in Lausanne throughout the 19th century, starting in 1805, with only one long interruption from 1814 to 1831. Highly detailed, they are an invaluable source for studying migration, economic and social history, and traces of cultural exchanges not only with Bern, but also with France and Italy. Indeed, the system of tracing family origin, specific to Switzerland, makes it possible to follow the migratory movements of families long before the censuses appeared. The bourgeoisie is also an essential economic tracer. In addition, the censuses extensively describe the organization of the social fabric into family nuclei, around which gravitate various boarders, workers, servants or apprentices, often living in the same apartment as the family.

Production. The structure and richness of the censuses have also provided an opportunity to develop automatic methods for processing structured documents. The processing of the censuses includes several steps, from the identification of text segments to the restructuring of information as digital tabular data, through Handwritten Text Recognition and the automatic segmentation of the structure using neural networks.
Please note that the detailed extraction methodology, as well as the complete evaluation of performance and reliability, are published in Petitpierre, Rappo and Kramer (2023); see References.

Data structure. The data are structured in rows and columns, with each row corresponding to a household. Multiple entries in the same column for a single household are separated by vertical bars 〈|〉. The center point 〈·〉 indicates an empty entry. For some columns (e.g., street name, house number, owner name), an empty entry indicates that the last non-empty value should be carried over. The page number is in the last column.

Liability. The data presented here are neither curated nor verified. They are the raw results of the extraction, the reliability of which was thoroughly assessed in the above-mentioned publication. We insist that any reuse of these data for research purposes requires an appropriate methodology. This may typically include string distance heuristics, or statistical methods to deal with noise and uncertainty.

References: Petitpierre R., Rappo L., Kramer M. (2023). An end-to-end pipeline for historical censuses processing. International Journal on Document Analysis and Recognition (IJDAR). doi: 10.1007/s10032-023-00428-9
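The conventions given under "Data structure" (vertical bars between multiple entries, a center point for an empty entry, and carry-over of the last non-empty value in some columns) could be handled along the lines of the following minimal sketch. It is not the authors' code. It assumes that each row has already been loaded as a dictionary of strings, and the column names (`street_name`, `house_number`, `owner_name`, `name`, `page`) and the sample rows are hypothetical placeholders, not the real headers or data.

```python
from typing import Dict, List, Optional

EMPTY = "·"      # center point: marks an empty entry
SEPARATOR = "|"  # vertical bar: separates multiple entries in one cell

# Hypothetical column names; the headers in the published files may differ.
CARRY_OVER_COLUMNS = {"street_name", "house_number", "owner_name"}


def parse_cell(cell: str) -> List[Optional[str]]:
    """Split a cell on vertical bars and map empty entries to None."""
    parts = [part.strip() for part in cell.split(SEPARATOR)]
    return [None if part in ("", EMPTY) else part for part in parts]


def parse_rows(rows: List[Dict[str, str]]) -> List[Dict[str, List[Optional[str]]]]:
    """Parse every cell of every household row.

    In the carry-over columns, a fully empty cell is replaced by the
    last non-empty value seen in that column.
    """
    last_seen: Dict[str, List[Optional[str]]] = {}
    parsed = []
    for row in rows:
        out = {}
        for column, cell in row.items():
            values = parse_cell(cell)
            if column in CARRY_OVER_COLUMNS:
                if all(value is None for value in values):
                    # Copy so that later edits to one row do not affect another.
                    values = list(last_seen.get(column, values))
                else:
                    last_seen[column] = values
            out[column] = values
        parsed.append(out)
    return parsed


# Made-up rows, for illustration only.
example_rows = [
    {"street_name": "Rue A", "name": "Jean|Marie", "page": "12"},
    {"street_name": "·", "name": "Louis|·", "page": "12"},
]
parsed_rows = parse_rows(example_rows)
```

Here the second household inherits the street name of the first, while the empty entry in its `name` column stays empty because that column is not treated as a carry-over column. Whether values should also be carried over across page or register boundaries is not stated in the description, so this sketch makes no attempt to reset them.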
Lucas Arnaud André Rappo, Rémi Guillaume Petitpierre, Marion Kramer