
# Discrete-Choice Mining of Social Processes

## Abstract

Poor decisions and selfish behaviors give rise to seemingly intractable global problems, such as the lack of transparency in democratic processes, the spread of conspiracy theories, and the rise in greenhouse gas emissions. However, people are more predictable than we think, and with machine-learning algorithms and sufficiently large datasets, we can design accurate models of human behavior in a variety of settings. In this thesis, to gain insight into social processes, we develop highly interpretable probabilistic choice models. We draw from the econometrics literature on discrete-choice models and combine these models with matrix factorization methods, Bayesian statistics, and generalized linear models. These predictive models enable interpretability through their learned parameters and latent factors.

First, we study the social dynamics behind group collaborations for the collective creation of content, such as in Wikipedia, the Linux kernel, and the European Union law-making process. By combining the Bradley-Terry and Rasch models with matrix factorization and natural language processing, we develop a model of edit acceptance in peer-production systems. We discover controversial components (e.g., Wikipedia articles and European laws) and influential users (e.g., Wikipedia editors and parliamentarians), as well as features that correlate with a high probability of edit acceptance. The latent representations capture non-linear interactions between components and users, and they cluster well into different topics (e.g., historical figures and TV characters in Wikipedia, business and environment in European laws).
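To make the modeling idea concrete, here is a minimal sketch that combines a Rasch-style skill/difficulty term with a low-rank (matrix-factorization) interaction between latent vectors. The function name, dimensions, and numbers are hypothetical illustrations, not the model fitted in the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accept_prob(user_skill, item_difficulty, user_vec, item_vec):
    """Rasch-style acceptance probability: user skill minus component
    difficulty, plus a low-rank interaction between their latent
    vectors (the matrix-factorization part). Illustrative only."""
    return sigmoid(user_skill - item_difficulty + user_vec @ item_vec)

# A skilled user editing an easy, topically aligned component:
p = accept_prob(1.5, -0.5, np.array([0.3, 0.2]), np.array([0.4, 0.1]))
```

With zero skill, zero difficulty, and orthogonal latent vectors, the sketch reduces to a 50% acceptance probability, which is the natural baseline for such models.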

Second, we develop an algorithm for predicting the outcome of elections and referenda by combining matrix factorization and generalized linear models. Our algorithm learns representations of votes and regions, which capture ideological and cultural voting patterns (e.g., liberal/conservative, rural/urban), and it predicts the vote results in unobserved regions from partial observations. We test our model on voting data in Germany, Switzerland, and the US, and we deploy it on a Web platform to predict Swiss referendum votes in real time. On average, our predictions reach a mean absolute error of 1% after observing only 5% of the regions.
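As a hedged illustration of the prediction step, the sketch below infers a new vote's latent representation from partially observed regions by least squares on the logit scale (the GLM link) and predicts the remaining regions. The region representations and the simulation are hypothetical; the thesis's actual algorithm may differ, e.g., by including bias terms and uncertainty estimates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical learned region representations (4 regions, 2 factors).
U = rng.normal(size=(4, 2))

# True (unknown) representation of a new vote, used here only to
# simulate partial results in the first two regions.
v_true = np.array([0.8, -0.3])
observed = sigmoid(U[:2] @ v_true)

# Infer the new vote's representation from the observed regions by
# least squares on the logit scale.
logits = np.log(observed / (1 - observed))
v_hat, *_ = np.linalg.lstsq(U[:2], logits, rcond=None)

# Predict the outcome in the unobserved regions.
predictions = sigmoid(U[2:] @ v_hat)
```

In this noise-free toy example the two observed regions determine the two latent factors exactly; with real, noisy results the least-squares fit would only approximate them.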

Third, we study how people perceive the carbon footprint of their day-to-day actions. We cast this as a problem of comparing pairs of actions (e.g., the difference between flying across continents and using household appliances), and we develop a statistical model of relative comparisons reminiscent of the Thurstone model in psychometrics. The model learns the users' perception as the parameters of a Bayesian linear regression, which enables us to derive an active-learning algorithm to collect data efficiently. Our experiments show that users overestimate the emissions of low-footprint actions and underestimate those of high-footprint actions.
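The comparison idea can be sketched in the Thurstone style: the probability that one action is perceived as emitting more than another is a Gaussian CDF of the difference of their modeled scores. The weights and feature vectors below are hypothetical, not the thesis's learned perception parameters.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def compare_prob(w, x_i, x_j, noise=1.0):
    """Thurstone-style probability that action i is perceived as
    emitting more than action j, given perception weights w.
    Weights and features are illustrative."""
    return std_normal_cdf((w @ (x_i - x_j)) / noise)

w = np.array([0.9, 0.1])            # hypothetical perception weights
flight = np.array([3.0, 1.0])       # hypothetical feature vectors
appliances = np.array([0.5, 0.8])
p = compare_prob(w, flight, appliances)
```

Comparing an action with itself yields probability 0.5, the indifference point of the model.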

Finally, we design a probabilistic model of pairwise-comparison outcomes that captures a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models with continuous-time Gaussian processes. We also develop an efficient inference algorithm that computes, with only a few linear-time iterations over the data, an approximate Bayesian posterior distribution.
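A minimal sketch of replacing static parameters with continuous-time Gaussian processes: sample smoothly varying scores from a GP prior over observation times and turn score differences into win probabilities. The kernel, lengthscale, and logistic comparison function are illustrative choices, not necessarily those of the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(t, lengthscale=1.0):
    """Squared-exponential covariance over observation times."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Sample smoothly varying scores for two items from a GP prior.
times = np.linspace(0.0, 5.0, 6)
K = rbf_kernel(times) + 1e-9 * np.eye(len(times))
L = np.linalg.cholesky(K)
score_a = L @ rng.normal(size=len(times))
score_b = L @ rng.normal(size=len(times))

# Bradley-Terry-style win probability of a over b at each time.
p_win = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
```

Because the scores are GP samples rather than constants, the win probability drifts over time, which is the dynamic behavior the static models cannot express.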


## Related concepts

## Related MOOCs

Algebra (part 1)

A French-language MOOC on linear algebra, accessible to all, taught rigorously and requiring no prerequisites.

Algebra (part 2)

A French-language MOOC on linear algebra, accessible to all, taught rigorously and requiring no prerequisites.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Generalized linear models were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression.
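To illustrate the link-function idea, here is a small sketch that fits a logistic-regression GLM by iteratively reweighted least squares, the fitting scheme associated with Nelder and Wedderburn. The simulated data and coefficient values are hypothetical.

```python
import numpy as np

def fit_logistic_glm(X, y, n_iter=25):
    """Fit a logistic-regression GLM by iteratively reweighted least
    squares (IRLS). The logit link relates the mean of the response
    to the linear predictor; the weights implement the mean-dependent
    variance of the Bernoulli response."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # inverse link
        w = np.maximum(mu * (1.0 - mu), 1e-12)     # variance function
        z = eta + (y - mu) / w                     # working response
        XtW = X.T * w                              # X^T W (broadcast)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Simulated data with intercept -0.5 and slope 1.5 (illustrative).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
probs = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
y = (rng.uniform(size=200) < probs).astype(float)
beta_hat = fit_logistic_glm(X, y)
```

Swapping the inverse link and variance function for other choices (e.g., exponential mean and Poisson variance) turns the same loop into a fitter for other GLM families.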

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables. The goal is to obtain the posterior probability of the regression coefficients (as well as of other parameters describing the distribution of the regressand), which ultimately allows out-of-sample prediction of the regressand conditional on observed values of the regressors.
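Under a conjugate Gaussian prior and Gaussian noise, this posterior has a closed form. The sketch below computes it on simulated data; the hyperparameters, coefficients, and data are illustrative.

```python
import numpy as np

def posterior(X, y, noise_var=1.0, prior_var=10.0):
    """Closed-form posterior over regression coefficients under a
    Gaussian prior N(0, prior_var * I) and Gaussian observation
    noise with variance noise_var. Values are illustrative."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov

# Simulated regressors and regressand (coefficients are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)
mean, cov = posterior(X, y, noise_var=0.01)
```

The posterior covariance quantifies the remaining uncertainty about the coefficients, which is what an active-learning criterion can exploit to choose informative queries.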

The carbon footprint (or greenhouse gas footprint) serves as an indicator to compare the total amount of greenhouse gases emitted by an activity, product, company, or country. Carbon footprints are usually reported in tons of emissions (CO2-equivalent) per unit of comparison, such as per year, per person, per kg of protein, or per km travelled. For a product, the carbon footprint includes the emissions of the entire life cycle, from production along the supply chain to final consumption and disposal.
