The drastic shift towards digital communication in our mediasphere has profoundly changed how information is produced and consumed, which in turn has substantial implications for the social and political landscape. Misinformation, a side effect of mass information diffusion, has become a fundamental problem for governments, platforms, and the general public in light of critical events such as elections, pandemics, and wars. In this thesis, we focus on the problem of online scientific misinformation. As a starting point, we survey the evolution of misinformation and present its main characteristics and the principal approaches against it, framing the high-level positioning of this thesis with respect to related literature. We then discuss the three major scientific contributions of this thesis: our methods for combating claim-based, article-based, and source-based scientific misinformation.

For combating claim-based scientific misinformation, we introduce SciClops, a method for detecting and contextualizing scientific claims to assist manual fact-checking. Our method involves three steps: (1) extracting scientific claims using a domain-specific, fine-tuned transformer model; (2) clustering similar claims together with related scientific literature, using a method that exploits both their content and the connections among them; and (3) highlighting check-worthy claims broadcast by popular yet unreliable sources. Our experiments show that SciClops effectively assists non-expert fact-checkers in verifying complex scientific claims, enabling them to outperform commercial fact-checking systems.

For combating article-based scientific misinformation, we introduce SciLens, a method for evaluating the quality of scientific news articles.
Our method involves a series of quality indicators for news articles derived from: (1) their content, including the use of attributed quotes; (2) their scientific context, including their semantic similarity and web proximity to the scientific literature; and (3) their social context, including their social media reach and stance. Our experiments show that these indicators help non-experts evaluate the quality of articles more accurately than non-experts who do not have access to them. Moreover, SciLens can also produce fully automated quality scores for articles, which agree with expert evaluators more closely than manual evaluations by non-experts do.

For combating source-based scientific misinformation, we introduce SciLander, a method for learning representations of news sources reporting on scientific topics. Our method involves heterogeneous source indicators that capture: (1) the copying of news stories between sources; (2) the semantic shift of terms across sources; (3) the usage of jargon; and (4) the stance towards specific citations. SciLander uses these indicators as signals of source agreement to train unsupervised source embeddings. Our experiments show that the learned source representations outperform state-of-the-art baselines on the task of news veracity classification, while also encoding information about the reliability, political leaning, and partisan bias of these sources.

In the last part of this thesis, we introduce NewsTeller, a real-time news analytics platform that runs operationally, handling thousands of news articles, social media reactions, and references daily.
Deep neural networks are often described as "black boxes". The Law of Parsimony states that "simpler solutions are more likely to be correct than complex ones". Since neural networks perform quite well in practice, a natural question to ask, then, is: in what way are they simple?
We propose that compression is the answer. Since good generalization requires invariance to irrelevant variations in the input, it is necessary for a network to discard this irrelevant information. As a result, semantically similar samples are mapped to similar representations in neural network deep feature space, where they form simple, low-dimensional structures.
Conversely, a network that overfits relies on memorizing individual samples. Such a network cannot discard information as easily.
In this thesis, we characterize the difference between such networks using the non-negative rank of their activation matrices. Relying on the non-negativity of rectified-linear unit (ReLU) activations, the non-negative rank of a matrix is the smallest inner dimension that admits an exact non-negative matrix factorization.
We derive an upper bound on the amount of memorization in terms of the non-negative rank, and show it is a natural complexity measure for rectified-linear units.
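Computing the exact non-negative rank is hard in general, but the idea can be illustrated with an approximate non-negative factorization of a ReLU activation matrix using scikit-learn's NMF. The layer stand-in, matrix sizes, and rank k below are illustrative assumptions, not values from the thesis:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stand-in for a ReLU activation matrix: n samples x d hidden units.
# ReLU guarantees entrywise non-negativity, which is what makes an
# exact non-negative factorization well-defined for such matrices.
A = np.maximum(rng.standard_normal((100, 64)), 0.0)

k = 10  # inner dimension of the factorization (illustrative choice)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)   # (100, k) non-negative coefficients
H = model.components_        # (k, 64) non-negative basis

assert W.min() >= 0 and H.min() >= 0
recon_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(W.shape, H.shape, f"relative error: {recon_err:.3f}")
```

If the residual were exactly zero, k would be an upper bound on the non-negative rank of A; in practice one inspects how the reconstruction error decays as k grows.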
With a focus on deep convolutional neural networks trained to perform object recognition, we show that the two non-negative factors derived from deep network layers decompose the information held therein in an interpretable way. The first of these factors provides heatmaps which highlight similarly encoded regions within an input image or image set. We find that these networks learn to detect semantic parts and form a hierarchy, such that parts are further broken down into sub-parts.
We quantitatively evaluate the semantic quality of these heatmaps by using them to perform semantic co-segmentation and co-localization. Although the convolutional network we use is trained solely with image-level labels, we achieve results comparable to or better than those of domain-specific state-of-the-art methods for these tasks.
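The heatmap factor can be sketched on a toy convolutional feature map. Treating each spatial location as one row of the activation matrix is an assumption about the layout for illustration; the layer shape and number of concepts are made up:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Toy ReLU feature map from a convolutional layer: height x width x channels.
h, w, c = 14, 14, 128
feats = np.maximum(rng.standard_normal((h, w, c)), 0.0)

# Flatten spatial locations into rows: (h*w, c).
A = feats.reshape(h * w, c)

k = 6  # number of parts/concepts to extract (illustrative)
model = NMF(n_components=k, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(A)       # (h*w, k): strength of each concept at each location
heatmaps = W.T.reshape(k, h, w)  # one non-negative heatmap per concept

# Each heatmap highlights spatial regions the network encodes similarly,
# which is the raw material for the part/sub-part hierarchy described above.
print(heatmaps.shape)
```

Upsampling such a heatmap to the input resolution is what yields the region highlights used for co-segmentation and co-localization.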
The second non-negative factor provides a bag-of-concepts representation for an image or image set. We use this representation to derive global image descriptors for images in a large collection. With these descriptors in hand, we perform two variations of content-based image retrieval, i.e., reverse image search. Using information from one of the non-negative matrix factors, we obtain descriptors suitable for finding semantically related images, i.e., images belonging to the same semantic category as the query image. Combining information from both non-negative factors, however, yields descriptors suitable for finding other images of the specific instance depicted in the query image, where we again achieve state-of-the-art performance.

Karl Aberer, Panagiotis Smeros