Publication

Probing the Limits of Social Data

Alexandra Olteanu
2016
EPFL thesis
Abstract

Online social data has been hailed as providing unprecedented insights into human phenomena due to its ability to capture human behavior at a scale and level of detail, both in breadth and depth, that is hard to achieve through conventional data collection techniques. This has led to numerous studies that leverage online social data to model or gain insights about real-world phenomena, as well as to inform system or method design for performance gains, or for providing personalized services. Alas, regardless of how large, detailed or varied the online social data is, there are limits to what can be discerned from it about real-world, or even media- or application-specific, phenomena. This thesis investigates four instances of such limits that are related to both the properties of the working data sets and of the methods used to acquire and leverage them, including: (1) online social media biases, (2) assessing and (3) reducing data collection biases, and (4) the sensitivity of methods to data biases and variability. For each of them, we conduct a separate case study that enables us to systematically devise and apply consistent methodologies to collect, process, compare or assess different data sets and dedicated methods.

The main contributions of this thesis are:
(i) To gain insights into media-specific biases, we run a comparative study juxtaposing social and mainstream media coverage of domain-specific news events for a period of 17 months. To this end, we introduce a generic methodology for comparing news agendas online based on a comparison of spikes of coverage. We expose significant differences in the type of events that are covered by the two media.
(ii) To assess possible biases across data collections, we run a transversal study that systematically assembles and examines 26 distinct data sets of social media posts during a variety of crisis events spanning a two-year period. While we find patterns and consistencies, we also uncover substantial variability across different event data sets, highlighting the pitfalls of generalizing findings from one data set to another.
(iii) To improve data collections, we introduce a method that increases the recall of social media samples, while preserving the original distribution of message types and sources. To locate and monitor domain-specific events, this method constructs and applies a domain-specific, yet generic lexicon, automatically learning event-specific terms and adapting the lexicon to the targeted event. The resulting improvements also show that only a fraction of the relevant data is currently mined.
(iv) To test the sensitivity of methods to data biases and variability, we run an empirical evaluation on 6 real-world data sets, dissecting the impact of user and item attributes on the performance of recommendation approaches that leverage distinct social cues: explicit social links vs. implicit interest affinity. We show performance variations not only across data sets, but also within each data set, across different classes of users or items, suggesting that global metrics are often unsuited for assessing the performance of recommendation systems.

The overarching goal of this thesis is to contribute a practical perspective to the body of research that aims to quantify biases, to devise better methods to collect and model social data, and to evaluate such methods in context.
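
The point about global metrics hiding variation across classes of users can be illustrated with a minimal sketch. The example below is not the thesis's evaluation code: the function name `hit_rate_by_group`, the two user classes, and the toy counts are all hypothetical, chosen only to show how a single aggregate hit rate can mask large differences between groups.

```python
from collections import defaultdict


def hit_rate_by_group(records):
    """Return (global_rate, {group: rate}) from (group, hit) pairs.

    `records` is an iterable of (group_label, hit) tuples, where `hit`
    is True when a recommendation was relevant for that user.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, hit in records:
        totals[group] += 1
        hits[group] += int(hit)
    global_rate = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return global_rate, per_group


# Made-up toy data (an assumption for illustration, not from the thesis):
# users with many social links are served well, users with few are not.
records = (
    [("many_links", True)] * 45 + [("many_links", False)] * 5
    + [("few_links", True)] * 5 + [("few_links", False)] * 45
)

global_rate, per_group = hit_rate_by_group(records)
print(f"global hit rate: {global_rate:.2f}")
for group, rate in sorted(per_group.items()):
    print(f"{group}: {rate:.2f}")
```

In this toy setting the aggregate hit rate sits halfway between the two classes, so reporting it alone would describe neither group well, which is the kind of effect a per-class breakdown is meant to surface.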

Related concepts (42)
Media bias
Media bias is the bias of journalists and news producers within the mass media in the selection of many events and stories that are reported and in how they are covered. The term "media bias" implies a pervasive or widespread bias contravening the standards of journalism, rather than the perspective of an individual journalist or article. The direction and degree of media bias in various countries is widely disputed.
Data
In common usage and statistics, data (US: /ˈdætə/; UK: /ˈdeɪtə/) is a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally. A datum is an individual value in a collection of data. Data is usually organized into structures such as tables that provide additional context and meaning, and which may themselves be used as data in larger structures.
Data collection
Data collection or data gathering is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. Data collection is a research component in all study fields, including physical and social sciences, humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same.
Related publications (68)

It’s All Relative: Learning Interpretable Models for Scoring Subjective Bias in Documents from Pairwise Comparisons

Matthias Grossglauser, Aswin Suresh, Chi Hsuan Wu

We propose an interpretable model to score the subjective bias present in documents, based only on their textual content. Our model is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. Although pr ...
2024

The Impact of Data Persistence Bias on Social Media Studies

Tugrulcan Elmas

Social media studies often collect data retrospectively to analyze public opinion. Social media data may decay over time and such decay may prevent the collection of the complete dataset. As a result, the collected dataset may differ from the complete data ...
New York, 2023

How Words Move Hearts: Interpretable Machine Learning Models of Bias, Engagement, and Influence in Socio-Political Systems

Aswin Suresh

We study socio-political systems in representative democracies. Motivated by problems that affect the proper functioning of the system, we build computational methods to answer research questions regarding the phenomena occurring in them. For each phenomen ...
EPFL, 2023
Related MOOCs (25)
Introduction to Geographic Information Systems (part 1)
Organized in two parts, this course presents the theoretical and practical foundations of geographic information systems and requires no prior knowledge of computer science. By following this ...
Humanitarian Action in the Digital Age
The first MOOC about responsible use of technology for humanitarians. Learn about technology and identify risks and opportunities when designing digital solutions.