Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.
Early modern printed gazettes relied on a system of news exchange and text reuse largely based on handwritten sources. The reconstruction of this information exchange system is possible by detecting reused texts. We present a method to individuate text borrowings within noisy OCRed texts from printed gazettes based on string kernels and local text alignment. We apply our methods on a corpus of Italian gazettes for the year 1648. Beside unveiling substantial overlaps in news sources, we are able to assess the editorial policy of different gazettes and account for a multi-faceted system of text reuse.
Matthias Grossglauser, Aswin Suresh, Chi Hsuan Wu