Publication

Purely URL-based Topic Classification

Related publications (41)

Extracting Informative Textual Parts from Web Pages Containing User-Generated Content

Nikolaos Pappas

The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. The segmentation of web pages and noise (non-informative segment) removal are important pre-processing st ...

ACM, 2012

A Decentralized Recommender System for Effective Web Credibility Assessment

Karl Aberer, Alexandra Olteanu, Jean-Eudes Marie Ranvier

An overwhelming and growing amount of data is available online. The problem of untrustworthy online information is augmented by its high economic potential and its dynamic nature, e.g. transient domain names, dynamic content, etc. In this paper, we address ...

2012

An Agent-Based Focused Crawling Framework for Topic- and Genre-Related Web Document Discovery

Nikolaos Pappas

The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retri ...

IEEE, 2012

Automatic detection of conflicts in spoken conversations: ratings and analysis of broadcast political debates

Fabio Valente, Alessandro Vinciarelli

Automatic analysis of spoken conversations has recently searched for phenomena like agreement/disagreement in collaborative and non-conflictual discussions (e.g., meetings). This work adds a novel dimension investigating conflicts in spontaneous conversat ...

2012

As person names are non-unique, the same name on different Web pages might or might not refer to the same real-world person. This entity identification problem is one of the most challenging issues in realizing the Semantic Web or entity-oriented search. W ...

1st International Workshop on Data Engineering meets the Semantic Web (DESWeb'2010) (co-located with ICDE'2010), 2010

Towards better entity resolution techniques for Web document collections

Karl Aberer, Zoltán Miklós, Surender Reddy Yerva

IEEE, 2010

Microblogging sites are a unique and dynamic Web 2.0 communication medium. Understanding the information flow in these systems can not only provide better insights into the underlying sociology, but is also crucial for applications such as content ranking, ...

2010

Understanding the Web

Eda Baykan

The World Wide Web is one of the most widely used information resources. Understanding the web better will enable us to benefit more from it. In this thesis we develop techniques to learn the properties of the web pages like language and topic using only the ...

EPFL, 2009

A Comparison of Techniques for Sampling Web Pages

Monika Henzinger, Eda Baykan

As the World Wide Web is growing rapidly, it is getting increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively one has to resort to other techniques like sampling to determine the properties of the ...

2009

Modern web sites commonly interact with third-party domains to integrate advertisements and generate revenue from them. To improve the relevance of advertisement, online advertisers track user activities online with third-party cookies. However, excessive ...

2009