
# Distribution-on-distribution regression via optimal transport maps

Abstract

We present a framework for performing regression when both covariate and response are probability distributions on a compact interval. Our regression model is based on the theory of optimal transportation, and links the conditional Fréchet mean of the response to the covariate via an optimal transport map. We define a Fréchet-least-squares estimator of this regression map, and establish its consistency and rate of convergence to the true map, under both full and partial observations of the regression pairs. Computation of the estimator is shown to reduce to a standard convex optimization problem, and thus our regression model can be implemented with ease. We illustrate our methodology using real and simulated data.
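For distributions on an interval, the optimal transport map from a distribution with CDF F onto one with quantile function G⁻¹ has the closed form T = G⁻¹ ∘ F (monotone rearrangement). A minimal empirical sketch of this building block, using only NumPy and hypothetical sample arrays (this illustrates the transport map itself, not the paper's regression estimator):

```python
import numpy as np

def empirical_ot_map(x_sample, y_sample):
    """Empirical 1-D optimal transport map T = G^{-1} o F:
    pushes the distribution of x_sample onto that of y_sample
    via monotone rearrangement (an increasing map)."""
    xs = np.sort(x_sample)
    ys = np.sort(y_sample)

    def T(x):
        # F(x): empirical CDF of the source sample
        u = np.searchsorted(xs, x, side="right") / len(xs)
        # G^{-1}(u): empirical quantile of the target sample
        idx = np.clip(np.ceil(u * len(ys)).astype(int) - 1, 0, len(ys) - 1)
        return ys[idx]

    return T

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=5000)   # source sample: Uniform(0, 1)
y = rng.beta(2.0, 5.0, size=5000)      # target sample: Beta(2, 5)
T = empirical_ot_map(x, y)
pushed = T(x)                          # pushed-forward sample, approx. Beta(2, 5)
```

Because the interval is one-dimensional, the map is simply the increasing function matching quantiles, which is what makes the convex-optimization formulation of the regression problem tractable.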



Related concepts (12)

Regression analysis

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable) and one or more independent variables (often called 'predictors', 'covariates', or 'explanatory variables').

Estimator

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished.

Linear regression

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).

Related publications (13)


Generalized linear models have become a commonly used tool of data analysis. Such models are used to fit regressions for univariate responses with normal, gamma, binomial or Poisson distributions, and maximum likelihood is generally applied as the fitting method. In the usual regression setting, the least-absolute-deviations estimator (L1-norm) is a popular alternative to least squares (L2-norm) because of its simplicity and its robustness properties. In the first part of this thesis we examine how much of these robustness features carry over to the setting of generalized linear models. We study a robust procedure based on the minimum absolute deviation estimator of Morgenthaler (1992), the Lq quasi-likelihood with q = 1. In particular, we investigate the influence function of these estimates and compare their sensitivity to that of the maximum likelihood estimate. We then explore the Lq quasi-likelihood estimates in binary regression. These estimates are difficult to compute, so we derive a simpler estimator of a similar form. The resulting estimating equation is a simple modification of the familiar maximum likelihood equation with weights w_q(μ). This is an improvement over other robust estimates discussed in the literature, whose weights typically depend on the pair (x_i, y_i) rather than on μ_i = h(x_i^T β) alone. Finally, we generalize this estimator to Poisson regression; the resulting estimating equation is a weighted maximum likelihood with weights that depend on μ only.
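The final estimating equation described above — a weighted maximum-likelihood score whose weights depend on μ alone — can be sketched numerically for Poisson regression. The weight function below is an illustrative placeholder that downweights large means; it is not the w_q(μ) derived in the thesis:

```python
import numpy as np

def robust_poisson_fit(X, y, weight_fn, n_iter=50):
    """Solve the weighted score equation
        sum_i w(mu_i) * (y_i - mu_i) * x_i = 0,  mu_i = exp(x_i' beta),
    by Fisher-scoring-style iterations (the dw/dmu term is
    ignored in the Jacobian, as in standard IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        w = weight_fn(mu)
        score = X.T @ (w * (y - mu))                # weighted ML score
        info = X.T @ (X * (w * mu)[:, None])        # expected negative Jacobian
        beta = beta + np.linalg.solve(info, score)
    return beta

# Illustrative weight depending on mu only (NOT the thesis's w_q).
weight = lambda mu: 1.0 / np.sqrt(1.0 + mu)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = robust_poisson_fit(X, y, weight)
```

Because the weights enter only through μ_i, the iteration is a small modification of the familiar IRLS loop for the Poisson GLM, which is the computational convenience the abstract alludes to.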

The increased accessibility of data that are geographically referenced and correlated increases the demand for techniques of spatial data analysis. The subset of such data composed of discrete counts exhibits particular difficulties, and the challenges increase further when a large proportion (typically 50% or more) of the counts are zero-valued. Such scenarios arise in many applications in numerous fields of research, and it is often desirable to infer on subtleties of the process despite a lack of substantive information about the underlying stochastic mechanism generating the data. An ecological example provides the impetus for the research in this thesis: when observations for a species are recorded over a spatial region, and many of the counts are zero-valued, are the abundant zeros due to bad luck, or are aspects of the region making it unsuitable for the survival of the species? In the framework of generalized linear models, we first develop a zero-inflated Poisson generalized linear regression model, which explains the variability of the responses given a set of measured covariates, and additionally allows for the distinction of two kinds of zeros: sampling ("bad luck" zeros), and structural (zeros that provide insight into the data-generating process). We then adapt this model to the spatial setting by incorporating dependence within the model via a general, leniently defined quasi-likelihood strategy, which provides consistent, efficient and asymptotically normal estimators, even under erroneous assumptions about the covariance structure. In addition to this robustness to dependence misspecification, our quasi-likelihood model overcomes the need for the complete specification of a probability model, rendering it very general and relevant to many settings. To complement the developed regression model, we further propose methods for the simulation of zero-inflated spatial stochastic processes. This is done by deconstructing the entire process into a mixed, marked spatial point process: we augment existing algorithms for the simulation of marked spatial point processes with a stochastic mechanism that generates zero-abundant marks (counts) at each location. We propose several such mechanisms, and consider interaction and dependence processes for random locations as well as over a lattice.
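A minimal version of this simulation idea — random locations marked with zero-inflated Poisson counts — can be sketched as follows. This is a homogeneous Poisson point process with independent marks; the interaction, dependence, and lattice variants developed in the thesis are not reproduced here:

```python
import numpy as np

def simulate_zip_point_process(intensity, p_zero, mean_count, rng):
    """Homogeneous Poisson point process on the unit square, each
    point marked with a zero-inflated Poisson count: with probability
    p_zero the mark is a structural zero; otherwise it is drawn from
    Poisson(mean_count), which may still yield a sampling zero."""
    n_points = rng.poisson(intensity)              # number of locations
    locs = rng.uniform(0.0, 1.0, size=(n_points, 2))
    structural = rng.random(n_points) < p_zero     # structural-zero indicator
    counts = np.where(structural, 0, rng.poisson(mean_count, n_points))
    return locs, counts

rng = np.random.default_rng(2)
locs, counts = simulate_zip_point_process(200.0, 0.4, 3.0, rng)
# Overall zero fraction mixes both kinds: p_zero + (1 - p_zero) * exp(-mean_count)
frac_zero = np.mean(counts == 0)
```

The two sources of zeros are visible in the last comment: structural zeros contribute p_zero, while sampling zeros contribute the Poisson mass at zero among the remaining points.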

Model specification is an integral part of any statistical inference problem. Several model selection techniques have been developed to determine which model is the best among a list of candidates. Another way to address this question is model averaging, and in particular the frequentist approach: an estimate of the parameters of interest is obtained as a weighted average of the estimates of these quantities under each candidate model. We develop compromise frequentist strategies for the estimation of regression parameters, as well as for the probabilistic clustering problem. In the regression context, we construct compromise strategies based on the Pitman estimators associated with various underlying error distributions. The weight given to each model is its profile likelihood, which gives a measure of goodness-of-fit. Asymptotic properties of both the Pitman estimators and the profile likelihood allow us to define a minimax strategy for choosing the distributions of the compromise, involving a notion of distance between distributions. The performance of these estimators is then compared to that of other standard and robust procedures. In the second part of the thesis, we develop compromise strategies in the probabilistic clustering context. Although this clustering method is based on mixtures of distributions, our compromise strategies are applied not directly to the parameter estimates but to the posterior probabilities of membership. Two types of compromise are presented, and the performance of the resulting classification rules is investigated.
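The compromise idea in the regression part — averaging each candidate model's estimate with weights given by a goodness-of-fit measure — can be illustrated with a toy location problem: the sample mean (normal errors) and the sample median (Laplace errors) are combined with weights proportional to each model's maximized likelihood. This is a deliberate simplification of the Pitman-estimator/profile-likelihood scheme, not its implementation:

```python
import numpy as np

def compromise_location(y):
    """Average two location estimates (mean under a normal model,
    median under a Laplace model), weighting each by its model's
    maximized likelihood as a goodness-of-fit measure."""
    n = len(y)
    # Normal model: MLE of location is the mean; plug in the MLE variance.
    mean = y.mean()
    sigma2 = np.mean((y - mean) ** 2)
    ll_norm = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    # Laplace model: MLE of location is the median; plug in the MLE scale.
    med = np.median(y)
    b = np.mean(np.abs(y - med))
    ll_lap = -n * (np.log(2.0 * b) + 1.0)
    # Normalized likelihoods give the compromise weights.
    ll = np.array([ll_norm, ll_lap])
    w = np.exp(ll - ll.max())
    w = w / w.sum()
    return w[0] * mean + w[1] * med, w

rng = np.random.default_rng(3)
y = rng.normal(10.0, 1.0, size=200)
est, w = compromise_location(y)
```

For clearly normal data the normal model's likelihood dominates, so the compromise leans toward the mean; for heavier-tailed data the weight shifts toward the median, which is the robustness motivation behind averaging over error distributions.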