
Publication

# Multiple testing with test statistics following heavy-tailed distributions

Abstract

In multiple testing problems where the components come from a mixture model of noise and true effects, we seek first to test for the existence of non-zero components, and then to identify the true alternatives at a fixed significance level $\alpha$. Two parameters, the fraction of non-null components $\varepsilon$ and the size of the effects $\mu$, characterise the two-point mixture model under the global alternative. As the number of hypotheses $m$ goes to infinity, we work in an asymptotic framework where the fraction of non-null components vanishes and the true effects must be sizable to be detected. Donoho and Jin give an explicit form of the asymptotic detection boundary for the Gaussian mixture model under the classic calibration of the mixture parameters. We prove analogous results for the Cauchy mixture distribution as an exemplar of the heavy-tailed case. This requires a different formulation of the parameters, reflecting the added difficulties.
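For reference, the classic Gaussian calibration mentioned here is usually written as follows (this is the standard Donoho–Jin sparse-signal parameterisation; the exact form used in the thesis may differ in detail):

$$\varepsilon_m = m^{-\beta}, \qquad \mu_m = \sqrt{2r\log m}, \qquad \tfrac12 < \beta < 1,\quad 0 < r < 1,$$

with detection boundary

$$\rho^*(\beta) = \begin{cases} \beta - \tfrac12, & \tfrac12 < \beta \le \tfrac34,\\[2pt] \left(1-\sqrt{1-\beta}\right)^2, & \tfrac34 < \beta < 1, \end{cases}$$

so that detection is asymptotically possible when $r > \rho^*(\beta)$ and impossible when $r < \rho^*(\beta)$.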

We also propose a multiple testing procedure, based on a filtering approach, that can discover the true alternatives. Benjamini and Hochberg (BH) compare the observed $p$-values to a linear threshold curve, reject the null hypotheses from the minimum up to the last up-crossing, and prove that the false discovery rate (FDR) is controlled. However, there is an intrinsic difference in heavy-tailed settings: were we to use the BH procedure, we would obtain a highly variable positive false discovery rate (pFDR). In our study we analyse the distribution of the $p$-values and devise a new multiple testing procedure that handles both the usual and the heavy-tailed case, based on the empirical properties of the $p$-values. The filtering approach is designed to eliminate most $p$-values that are more likely to be uniform, while preserving most of the true alternatives. Based on the filtered $p$-values, we estimate the mode $\vartheta$ and define the rejection region $\mathscr{R}(\vartheta, \delta)=\left[ \vartheta -\delta/2,\ \vartheta +\delta/2 \right]$ so that the most informative $p$-values are included. The length $\delta$ is chosen by controlling a data-dependent estimate of the FDR at a desired level.
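The baseline BH step-up procedure that this abstract contrasts against can be sketched in a few lines (a minimal illustration of the standard algorithm, not the thesis's filtering procedure):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Compares the sorted p-values p_(k) to the linear thresholds k * alpha / m
    and rejects all hypotheses up to the last up-crossing, i.e. the largest k
    with p_(k) <= k * alpha / m.  Returns a boolean rejection mask.
    """
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # index of the last up-crossing
        reject[order[: k + 1]] = True
    return reject

# Toy mixture: 950 null (uniform) p-values plus 50 very small ones.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=950), rng.uniform(0, 1e-4, size=50)])
n_rejected = benjamini_hochberg(p).sum()
```

Under heavy-tailed test statistics, the abstract's point is that applying this linear-threshold rule directly gives a highly variable pFDR, motivating the filtering approach instead.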




Related concepts (20)

Test statistic

In statistics, a test statistic, also called a decision variable, is a random variable constructed from a statistical sample that is used to formulate a decision rule.

Null hypothesis

In statistics and econometrics, the null hypothesis (international symbol: H_0) is a hypothesis that posits equality between statistical parameters.

Heavy-tailed distribution

In probability theory, a heavy-tailed distribution is a probability distribution whose tails are not exponentially bounded.

Related publications (18)


Pal Christie Ryalen, Mats Julius Stensrud

The conventional nonparametric tests in survival analysis, such as the log-rank test, assess the null hypothesis that the hazards are equal at all times. However, hazards are hard to interpret causally, and other null hypotheses are more relevant in many scenarios with survival outcomes. To allow for a wider range of null hypotheses, we present a generic approach to define test statistics. This approach utilizes the fact that a wide range of common parameters in survival analysis can be expressed as solutions of differential equations. Thereby, we can test hypotheses based on survival parameters that solve differential equations driven by cumulative hazards, and it is easy to implement the tests on a computer. We present simulations, suggesting that our tests perform well for several hypotheses in a range of scenarios. As an illustration, we apply our tests to evaluate the effect of adjuvant chemotherapies in patients with colon cancer, using data from a randomized controlled trial.
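The cumulative hazards that drive these differential equations are typically estimated nonparametrically. As context for the approach described above, here is a minimal sketch of the standard Nelson–Aalen estimator of the cumulative hazard (not the authors' test itself, just the building block it relies on):

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen estimator of the cumulative hazard.

    times  : observed times (event or censoring)
    events : 1 if the observation is an event, 0 if right-censored
    Returns (unique event times, cumulative hazard at those times).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    t_event = np.unique(times[events == 1])
    increments = []
    for t in t_event:
        at_risk = np.sum(times >= t)               # size of the risk set just before t
        d = np.sum((times == t) & (events == 1))   # number of events at t
        increments.append(d / at_risk)             # hazard increment dN(t) / Y(t)
    return t_event, np.cumsum(increments)
```

For example, `nelson_aalen([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])` accumulates increments 1/5, 1/4 and 1/2 at the three event times. Test statistics comparing two groups can then be built from the difference of such estimates.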

2019

Anthony Christopher Davison, Thomas Lugrin, Jonathan A. Tawn

Statistical models for extreme values are generally derived from non-degenerate probabilistic limits that can be used to approximate the distribution of events that exceed a selected high threshold. If convergence to the limit distribution is slow, then the approximation may describe observed extremes poorly, and bias can only be reduced by choosing a very high threshold at the cost of unacceptably large variance in any subsequent tail inference. An alternative is to use sub-asymptotic extremal models, which introduce more parameters but can provide better fits for lower thresholds. We consider this problem in the context of the Heffernan–Tawn conditional tail model for multivariate extremes, which has found wide use due to its flexible handling of dependence in high-dimensional applications. Recent extensions of this model appear to improve joint tail inference. We seek a sub-asymptotic justification for why these extensions work and show that they can improve convergence rates by an order of magnitude for certain copulas. We also propose a class of extensions of them that may have wider value for statistical inference in multivariate extremes.
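As background, the Heffernan–Tawn conditional tail model referenced here is commonly stated in the following form (a standard formulation on Laplace margins; the extensions studied in the paper modify it in ways not shown here):

$$\Pr\!\left(\frac{Y-\alpha x}{x^{\beta}}\le z \,\middle|\, X=x\right)\longrightarrow G(z), \qquad x\to\infty,$$

with dependence parameters $\alpha\in[-1,1]$ and $\beta<1$, and $G$ a non-degenerate limit distribution. Sub-asymptotic extensions aim to improve the quality of this approximation at the finite thresholds used in practice.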

2021

This work is about time series of functional data (functional time series) and consists of three main parts. In the first part (Chapter 2), we develop a doubly spectral decomposition for functional time series that generalizes the Karhunen–Loève expansion. In the second part (Chapter 3), we develop the theory of estimation for the spectral density operators, which are the main tool involved in the doubly spectral decomposition. The third part (Chapter 4) is concerned with the problem of understanding and comparing the dynamics of DNA. It proposes a methodology for comparing the dynamics of DNA minicircles that are vibrating in solution, using tools developed in this thesis.

In the first part, we develop a doubly spectral representation of a stationary functional time series that generalizes the Karhunen–Loève expansion to the functional time series setting. The representation decomposes the time series into an integral of uncorrelated frequency components (Cramér representation), each of which is in turn expanded in a Karhunen–Loève series, thus yielding a Cramér–Karhunen–Loève decomposition of the series. The construction is based on the spectral density operators, whose Fourier coefficients are the lag-$t$ autocovariance operators and which characterise the second-order dynamics of the process. The spectral density operators are the functional analogues of the spectral density matrices; their eigenvalues and eigenfunctions at different frequencies provide the building blocks of the representation. By truncating the representation at a finite level, we obtain a harmonic principal component analysis of the time series, an optimal finite-dimensional reduction that captures both the temporal dynamics of the process and the within-curve dynamics, and dominates functional PCA. The proofs rely on the construction of a stochastic integral of operator-valued functions, whose construction is similar to that of the Itô integral.
In practice, the spectral density operators are unknown. In the second part, we therefore develop the basic theory of a frequency-domain framework for drawing statistical inferences on the spectral density operators of a stationary functional time series. Our main tool is the functional discrete Fourier transform (fDFT). We derive an asymptotic Gaussian representation of the fDFT, allowing the transformation of the original collection of dependent random functions into a collection of approximately independent complex-valued Gaussian random functions. These results are then employed to construct estimators of the spectral density operators based on smoothed versions of the periodogram kernel, the functional generalisation of the periodogram matrix. The consistency and asymptotic law of these estimators are studied in detail. As immediate consequences, we obtain central limit theorems for the mean and the long-run covariance operator of a stationary functional time series. Our results do not depend on structural modelling assumptions, but only on functional versions of classical cumulant mixing conditions. The effect of discrete noisy observations on the consistency of the estimators is studied in a framework general enough to apply to a wide range of smoothing techniques for converting discrete noisy observations into functional data. We also perform a simulation study to assess the finite-sample performance of our estimators, discuss the technical assumptions of our results and the cost at which our weak-dependence assumptions could be changed or weakened, and provide examples of processes satisfying the assumptions of our asymptotic results. As an application, we consider in the third part the problem of comparing the dynamics of the trajectories of two DNA minicircles vibrating in solution, obtained via Molecular Dynamics simulations.
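For curves observed on a discrete grid, the fDFT and the (unsmoothed) periodogram kernel described above can be sketched as follows. This is a discretised illustration only; the normalisation and names are assumptions, and a practical estimator would smooth the periodogram over neighbouring frequencies:

```python
import numpy as np

def fdft(X, omega):
    """Functional discrete Fourier transform at frequency omega.

    X : array of shape (T, n_grid); row t is the curve X_t evaluated on a grid.
    Returns a complex vector of length n_grid.
    """
    T = X.shape[0]
    t = np.arange(T)
    weights = np.exp(-1j * omega * t)[:, None]          # e^{-i t omega}, one per curve
    return (X * weights).sum(axis=0) / np.sqrt(2 * np.pi * T)

def periodogram_kernel(X, omega):
    """Periodogram operator at omega: the rank-one outer product of the fDFT."""
    f = fdft(X, omega)
    return np.outer(f, f.conj())

# Toy functional time series: 128 curves on a grid of 20 points.
rng = np.random.default_rng(1)
X = rng.standard_normal((128, 20))
P = periodogram_kernel(X, omega=0.5)                    # a 20 x 20 Hermitian operator
```

Averaging such periodogram operators over a window of nearby frequencies yields a consistent estimate of the spectral density operator at that frequency.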
The approach we take is to view and compare the dynamics through their spectral density operators, which contain the entire second-order structure of the trajectories. As a first step, we compare the spectral density operators of the two DNA minicircles using a new test we develop, which allows us to compare the spectral density operators at fixed frequencies. Using multiple testing procedures, we are able to localize in frequency the differences between the spectral density operators of the two DNA minicircles while controlling the type-I error, and we conduct numerical simulations to assess the performance of our method. We further investigate the differences between the two minicircles by comparing their spectral density operators within frequencies. This allows us to localize their differences both in frequency and along the minicircles, while controlling the false discovery rate averaged over the selected frequencies. Our methodology is general enough to apply to the comparison of the dynamics of any pair of stationary functional time series.