Publication: A Quasi-Likelihood Approach to Zero-Inflated Spatial Count Data

Abstract

The increased accessibility of data that are geographically referenced and correlated increases the demand for techniques of spatial data analysis. The subset of such data consisting of discrete counts exhibits particular difficulties, and the challenges increase further when a large proportion (typically 50% or more) of the counts are zero-valued. Such scenarios arise in many applications in numerous fields of research, and it is often desirable to draw inferences about subtleties of the process, despite the lack of substantive information obscuring the underlying stochastic mechanism generating the data. An ecological example provides the impetus for the research in this thesis: when observations for a species are recorded over a spatial region, and many of the counts are zero-valued, are the abundant zeros due to bad luck, or do aspects of the region make it unsuitable for the survival of the species? In the framework of generalized linear models, we first develop a zero-inflated Poisson generalized linear regression model, which explains the variability of the responses given a set of measured covariates, and additionally allows for the distinction of two kinds of zeros: sampling zeros ("bad luck" zeros) and structural zeros (zeros that provide insight into the data-generating process). We then adapt this model to the spatial setting by incorporating dependence within the model via a general, leniently defined quasi-likelihood strategy, which provides consistent, efficient and asymptotically normal estimators, even under erroneous assumptions about the covariance structure. In addition to this advantage of robustness to dependence misspecification, our quasi-likelihood model overcomes the need for the complete specification of a probability model, thus rendering it very general and relevant to many settings. To complement the developed regression model, we further propose methods for the simulation of zero-inflated spatial stochastic processes. This is done by deconstructing the entire process into a mixed, marked spatial point process: we augment existing algorithms for the simulation of spatial marked point processes to include a stochastic mechanism that generates zero-abundant marks (counts) at each location. We propose several such mechanisms, and consider interaction and dependence processes for random locations as well as over a lattice.
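As a rough illustration of the two kinds of zeros, the sketch below writes down a standard zero-inflated Poisson mixture and attaches zero-abundant counts to random locations. It is not the thesis's simulation algorithm: the names `zip_pmf`, `simulate_zip_marks` and `prob_structural_given_zero`, the parameters `lam` (Poisson mean) and `pi` (structural-zero probability), and the choice of independent uniform locations on the unit square are all assumptions made for this example, and no spatial dependence or covariates are modelled.

```python
import math
import random


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson: a structural zero with
    probability pi, otherwise a Poisson(lam) count."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def poisson_draw(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method
    (adequate for small lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_zip_marks(n_sites, lam, pi, seed=0):
    """Return (x, y, count, is_structural) tuples. Locations are independent
    uniform points on the unit square (an assumption of this sketch)."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        x, y = rng.random(), rng.random()
        structural = rng.random() < pi
        count = 0 if structural else poisson_draw(lam, rng)
        sites.append((x, y, count, structural))
    return sites


def prob_structural_given_zero(lam, pi):
    """P(structural zero | observed count is 0), by Bayes' rule."""
    return pi / (pi + (1 - pi) * math.exp(-lam))
```

In a simulation the `is_structural` flag is known, whereas with real data only the count is observed; `prob_structural_given_zero` shows how the mixture would apportion an observed zero between the two sources under these assumed parameters.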

Related publications (41)

Related concepts (27)

Regression (statistics)

In mathematics, regression covers several methods of statistical analysis for approximating one variable from others that are correlated with it. By extension, the term is also used ...

Estimator (statistics)

In statistics, an estimator is a function used to estimate a moment of a probability distribution (such as its expectation or its variance). It can be used, for example, to estimate certain characteristics ...

Covariance

In probability theory and statistics, the covariance between two random variables is a number that quantifies their joint deviations from their respective expected values.

Generalized linear models have become a commonly used tool of data analysis. Such models are used to fit regressions for univariate responses with a normal, gamma, binomial or Poisson distribution. Maximum likelihood is generally applied as the fitting method. In the usual regression setting, the least absolute deviations estimator ($L_1$-norm) is a popular alternative to least squares ($L_2$-norm) because of its simplicity and its robustness properties. In the first part of this thesis we examine how many of these robustness features carry over to the setting of generalized linear models. We study a robust procedure based on the minimum absolute deviation estimator of Morgenthaler (1992), the $L_q$ quasi-likelihood when $q = 1$. In particular, we investigate the influence function of these estimates and compare their sensitivity to that of the maximum likelihood estimate. Furthermore, we explore the $L_q$ quasi-likelihood estimates in binary regression. These estimates are difficult to compute. We derive a simpler estimator, which is similar in form to the $L_q$ quasi-likelihood estimate. The resulting estimating equation is a simple modification of the familiar maximum likelihood equation with the weights $w_q(\mu)$. This is an improvement over other robust estimates discussed in the literature, which typically have weights that depend on the pair $(x_i, y_i)$ rather than on $\mu_i = h(x_i^T \beta)$ alone. Finally, we generalize this estimator to Poisson regression. The resulting estimating equation is a weighted maximum likelihood equation with weights that depend on $\mu$ only.
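To make the idea of a weighted maximum likelihood equation with weights depending on the mean only more concrete, here is a minimal sketch for Poisson regression with a log link, one covariate and an intercept. The abstract does not give the form of the weights, so the `weight` argument is a hypothetical placeholder (the default, constant 1, recovers ordinary maximum likelihood). The function names and the Fisher-scoring-style update, which ignores the derivative of the weight, are likewise assumptions of this example rather than the procedure of the thesis.

```python
import math


def estimating_function(b0, b1, xs, ys, weight=lambda mu: 1.0):
    """U(b) = sum_i w(mu_i) * (y_i - mu_i) * (1, x_i), mu_i = exp(b0 + b1*x_i)."""
    u0 = u1 = 0.0
    for x, y in zip(xs, ys):
        mu = math.exp(b0 + b1 * x)
        r = weight(mu) * (y - mu)
        u0 += r
        u1 += r * x
    return u0, u1


def fit_weighted_poisson(xs, ys, weight=lambda mu: 1.0, n_iter=200, tol=1e-10):
    """Solve U(b) = 0 by scoring-type iterations. The matrix A below is the
    expected negative Jacobian of U: sum_i w(mu_i) * mu_i * (1, x_i)(1, x_i)^T."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start at the intercept-only fit
    for _ in range(n_iter):
        u0, u1 = estimating_function(b0, b1, xs, ys, weight)
        a00 = a01 = a11 = 0.0
        for x in xs:
            mu = math.exp(b0 + b1 * x)
            h = weight(mu) * mu
            a00 += h
            a01 += h * x
            a11 += h * x * x
        det = a00 * a11 - a01 * a01
        d0 = (a11 * u0 - a01 * u1) / det  # step = A^{-1} U for the 2x2 system
        d1 = (a00 * u1 - a01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1
```

Because the weight depends on $\mu_i$ but not on $y_i$, each term of the estimating function still has expectation zero under a correctly specified mean, which is what makes this kind of reweighting possible without a bias correction.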

In geostatistics, the presence of outlying data is more the rule than the exception. Moreover, the statistical analysis of data contaminated by outliers requires caution, particularly when spatial dependence exists. To take these possible outliers into account when fitting the spatial process, a new modeling tool, called the substitutive errors model, is proposed. The optimal prediction in the least squares sense is derived and its properties are studied. Because of its complexity, this predictor must in practice be approximated numerically. An automated algorithm is proposed in this thesis. This method is based on an ordering of the observations with respect to the specified spatial process of interest, with the values least in agreement being included towards the end of the ordering. It proves to be useful in the case of masked multiple outliers or nonstationary clusters. Simulations are carried out to illustrate its performance and to compare it to other forecasts, robust or not. An application to real data is provided as an illustration of its practical usefulness. The second part of this work also deals with the presence of spatial heterogeneity. One could say that the proposed model offers a characterization of this heterogeneity rather than estimating the locations and sizes of outliers. It is based on the theory of bidimensional α-stable motion, which is a generalization of unidimensional Brownian motion. In particular, the stability parameter α can be seen as a measure of the distance between the observations and the hypothesis of a Gaussian distribution. A method for estimating the parameters of such a process is presented, based on a numerical constrained optimization of the likelihood. Its performance is illustrated by means of simulations. An application ends this second part.

This work is concerned with the estimation of the spreading potential of the disease in the initial stages of an epidemic. A speedy and accurate estimation is important for determining whether or not interventions are necessary to prevent a major outbreak. At the same time, the information available in the early stages is scarce and data collection imperfect. We consider an epidemic in a large susceptible population, and address the estimation based on temporally aggregated counts of new cases that are subject to unknown random under-reporting. We allow for an influence of the detection process on the evolution of the epidemic. While the proportion of infectious individuals in the population is small, the role of chance in the spread of the disease may be substantial. Therefore, stochastic epidemic models are applied. As these are difficult to analyse, the time evolution of the number of infectious individuals is approximated by branching processes. We study the estimation in a partially observed Galton–Watson process and in a partially observed linear birth and death process; and in each case focus on the parameter characterising the growth of the process. We aim at estimators that perform well in the asymptotic sense where a single trajectory is observed over a long period of time, and study the asymptotics conditionally on the eventual explosion of the process. The partially observed Galton–Watson process has been recently proposed in the literature as a model for the initial stages of an epidemic. Its probabilistic structure has been explored and estimation has been partially addressed, in that consistent estimators have been constructed. However, the estimation-related uncertainty has not been evaluated. We address this issue here by constructing estimators that are motivated from the asymptotic dependence structure of the process. 
We show that they are consistent and asymptotically normal, consistently estimate their asymptotic variances, and construct asymptotic confidence intervals. In addition, we evaluate their finite-sample performance in a simulation study and their practical performance on real data. The observation mechanism in the partially observed Galton–Watson process is inherently discrete. To allow for continuous-time dynamics, we incorporate partial observation in the linear birth and death process. In particular, we propose a model where the birth process is completely unobserved, while a random proportion of the death process is observed at discrete time points. We study the estimation in this model. Motivated by its counting process structure, we arrive at consistent and asymptotically normal estimators, consistently estimate their asymptotic variances, and construct asymptotic confidence intervals. We also evaluate the finite-sample and practical performance of the estimators in a simulation study and on real data.
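The partially observed Galton–Watson setting can be illustrated with a small simulation. This is a sketch under simplifying assumptions and does not reproduce the estimators of the thesis: each individual has either two offspring (with probability `q`) or none, each individual is reported independently with a fixed probability `p`, the detection process has no influence on the epidemic, and the growth parameter is estimated by a naive ratio of successive observed counts. All function and parameter names are invented for this example.

```python
import random


def simulate_partially_observed_gw(z0, q, p, n_gen, seed=0):
    """Simulate generation sizes of a Galton-Watson process and their
    under-reported counts. Offspring mean is 2*q; reporting probability is p."""
    rng = random.Random(seed)
    true_sizes, observed = [], []
    z = z0
    for _ in range(n_gen + 1):
        true_sizes.append(z)
        # Binomial thinning: each individual is observed with probability p.
        observed.append(sum(rng.random() < p for _ in range(z)))
        # Each individual leaves 2 offspring with probability q, else 0.
        z = sum(2 for _ in range(z) if rng.random() < q)
    return true_sizes, observed


def ratio_estimator(counts):
    """Naive growth estimate: sum of counts in generations 1..T divided by
    the sum in generations 0..T-1. Returns None if the denominator is 0."""
    numerator = sum(counts[1:])
    denominator = sum(counts[:-1])
    return numerator / denominator if denominator > 0 else None
```

For example, `ratio_estimator([1, 2, 4, 8])` gives 14 / 7 = 2.0. On a simulated trajectory that does not die out, applying `ratio_estimator` to the `observed` counts rather than to `true_sizes` shows how under-reporting adds noise to the growth estimate even though the reporting probability cancels on average.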