Automatic table detection and classification in large-scale newspaper archives

2022
Student project

Abstract

In recent decades, major efforts to digitize historical documents have led to the creation of large machine-readable corpora, including newspapers, which are waiting to be processed and analyzed. Newspapers are a valuable historical source, notably because of the plurality of subjects and points of view they cover; however, their heterogeneity, which stems from their diachronic properties and their visual richness, makes them difficult to process. Recurring elements such as tables add to this difficulty: tables are powerful layout objects because their logical visual arrangement easily conveys a large amount of information. This thesis focuses on automatic table processing in large-scale newspaper archives. Starting from a large corpus of Luxembourgish newspapers annotated with tables, we propose a statistical exploration of this dataset as well as strategies to address its annotation inconsistencies and to automatically bootstrap a training dataset for table classification. We also explore the ability of deep learning methods to detect and semantically classify tables. The performance of image segmentation models is compared in a series of experiments on their ability to learn under challenging conditions, while classifiers based on different combinations of data modalities are evaluated on the task of table classification. Results show that visual models are able to detect tables even when trained on an inconsistent ground truth, and that adding further modalities increases classification performance.

Official source

https://infoscience.epfl.ch/record/291895?ln=en



SAGTTA: SALIENCY GUIDED TEST TIME AUGMENTATION FOR MEDICAL IMAGE SEGMENTATION ACROSS VENDOR DOMAIN SHIFT

Breaking the Curse of Dimensionality in Deep Neural Networks by Learning Invariant Representations

Text Representation Learning for Low Cost Natural Language Understanding
