Publication

A Scalable Approach to Harvest Modern Weblogs

Olivier Eric Paul Blanvillain
2015
Journal paper

Abstract

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
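The rule-generation idea in the abstract (match a value known from the web feed against the post's HTML, then reuse the location of the best match as an extraction rule) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `generate_rule`, `apply_rule` and `_TextCollector`, the "tag.class" path notation, and the use of `difflib.SequenceMatcher` as the string-similarity measure are all assumptions made for illustration, and the sketch assumes well-formed HTML.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Maps each element path (hypothetical 'tag.class' notation) to its own text."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []   # selectors of the currently open elements
        self.texts = {}   # element path -> text found directly inside it

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(tag + ("." + classes[0] if classes else ""))

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: every end tag closes the innermost element.
        if tag not in self.VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            path = " > ".join(self.stack)
            self.texts[path] = self.texts.get(path, "") + data.strip() + " "


def generate_rule(feed_value, html):
    """Return (path, score) of the element whose text best matches feed_value."""
    collector = _TextCollector()
    collector.feed(html)
    best_path, best_score = None, 0.0
    for path, text in collector.texts.items():
        score = SequenceMatcher(None, feed_value.lower(), text.strip().lower()).ratio()
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


def apply_rule(rule, html):
    """Extract the text found at the given element path in another page."""
    collector = _TextCollector()
    collector.feed(html)
    return collector.texts.get(rule, "").strip()


# Hypothetical pages from the same blog, sharing one template.
POST_A = (
    '<html><body><div class="header"><a>My Blog</a></div>'
    '<div class="post"><h1 class="entry-title">Hello World</h1>'
    '<span class="author">Alice</span>'
    '<div class="content">First post body.</div></div></body></html>'
)
POST_B = POST_A.replace("Hello World", "Second Post")

# The title "Hello World" is assumed to come from the blog's web feed.
title_rule, title_score = generate_rule("Hello World", POST_A)
extracted_title = apply_rule(title_rule, POST_B)
```

The point of the sketch is that the feed is only needed once per blog template: the rule learned from a post that appears in the feed can then be applied to older posts that the feed no longer lists.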

Official source

https://infoscience.epfl.ch/record/208256?ln=en
