
# Statistical Applications of Random Matrix Theory: Comparison of Two Populations

Abstract

During the last twenty years, random matrix theory (RMT) has produced numerous results that allow a better understanding of large random matrices. These advances have enabled interesting applications in the domain of communications. Although the theory could contribute to many other domains, such as brain imaging or genetic research, it has rarely been applied there. The main barrier to the adoption of RMT may be the lack of concrete statistical results from probabilistic random matrix theory. Indeed, directly generalising classical multivariate theory to high-dimensional settings is often difficult, and the proposed procedures often place strong hypotheses on the data matrix, such as normality or overly restrictive independence conditions.
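As a numerical illustration of the high-dimensional behaviour that RMT describes (a minimal sketch of our own, not an example from the thesis), the eigenvalues of a sample covariance matrix no longer concentrate around those of the true covariance when the dimension p is proportional to the sample size n; instead they spread over the Marchenko-Pastur interval:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500  # p/n = 0.25: dimension proportional to sample size

# Sample covariance of i.i.d. standard normal data (true covariance = identity)
X = rng.standard_normal((n, p))
S = X.T @ X / n
eigs = np.linalg.eigvalsh(S)

# Marchenko-Pastur law: with c = p/n, the eigenvalues spread over
# [(1 - sqrt(c))^2, (1 + sqrt(c))^2] = [0.25, 2.25] here, far from 1
c = p / n
print(eigs.min(), eigs.max(), (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2)
```

Even though every true eigenvalue equals 1, the sample eigenvalues range from roughly 0.25 to 2.25, which is why classical multivariate procedures break down in this regime.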

This thesis proposes a statistical procedure for testing the equality of two independent estimated covariance matrices when the number of potentially dependent data vectors is large and proportional to the size of the vectors, that is, to the number of observed variables. Although the existing theory builds very good intuition about the behaviour of these matrices, it does not provide enough results to construct a test that is satisfactory in terms of both power and robustness. Hence, inspired by spike models, we define the residual spikes and prove many theorems describing the behaviour of statistics based on eigenvectors and eigenvalues in very general cases, notably the two central theorems of this thesis, the Invariant Angle Theorem and the Invariant Dot Product Theorem.
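To see why comparing two estimated covariance matrices is delicate in this regime, one can look at the eigenvalues of S1^{-1} S2 for two independent samples from the same population. The sketch below is an illustrative experiment under hypothetical parameter values (it is not the thesis's test statistic): even under the null hypothesis, the relative eigenvalues spread far from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 100  # sample size proportional to dimension, as in the thesis's setting

# Two independent samples from the same population (the null hypothesis holds)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, p))
S1 = X.T @ X / n
S2 = Y.T @ Y / n

# Eigenvalues of S1^{-1} S2: equal to 1 for the true covariances, but widely
# spread for the estimates when p/n is not small
eigs = np.linalg.eigvals(np.linalg.solve(S1, S2)).real
print(eigs.min(), eigs.max())
```

A naive test that rejects whenever some relative eigenvalue is far from 1 would therefore reject almost always; a usable procedure needs a precise description of this null behaviour, which is what the residual-spike results provide.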

Using numerous generalisations of the theory, this thesis finally proposes a description of the behaviour of a statistic under a null hypothesis. This statistic allows the user to test the equality of two populations, but also other null hypotheses, such as the independence of two sets of variables. Finally, the robustness of the procedure is demonstrated for different classes of models, and criteria for evaluating robustness are proposed.

The major contribution of this thesis is therefore a methodology that is both easy to apply and has good properties. In addition, a large number of theoretical results are proved that could readily be used to build other applications.



Related concepts (29)

Statistics

Statistics is the discipline that studies phenomena through the collection of data, their processing and analysis, the interpretation of the results and their presentation, in order to make the data…

Null hypothesis

In statistics and econometrics, the null hypothesis (international symbol: H_0) is a hypothesis postulating equality between statistical parameters (generally a mean or…

Central limit theorem

(Figure: the normal distribution, often called the "bell curve".)
The central limit theorem (in French, théorème central limite, also called théorème limite central, théorème de la limite centrale or théorème de la limite centrée) establishes the convergence in law of a suitably normalised sum of random variables towards the normal distribution…
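The theorem is easy to illustrate numerically; in this minimal sketch (our own example, with arbitrarily chosen sizes), standardized means of strongly skewed exponential variables are already close to standard normal:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 500, 5000  # sample size per mean, number of replicated means

# Means of n Exp(1) variables (mean 1, standard deviation 1), standardized
samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.mean(axis=1) - 1.0) * np.sqrt(n)

# If the CLT holds, z should have mean ~0 and standard deviation ~1
print(z.mean(), z.std())
```

Despite the skewness of the exponential distribution, the standardized means behave almost like draws from a standard normal distribution.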

Related publications (23)


This thesis is a contribution to financial statistics. One of the principal concerns of investors is the evaluation of portfolio risk. The notion of risk is vague, but in finance it is always linked to possible losses. In this thesis, we present some measures allowing the valuation of risk with the help of Bayesian methods. An exploratory analysis of data is presented to describe the sampling properties of financial time series. This analysis allows us to understand the origins of the daily returns studied in this thesis. Moreover, a discussion of different models is presented. These models make strong assumptions about investor behaviour, which are not always satisfied. This exploratory analysis shows some differences between the behaviour anticipated under equilibrium models and that of real data.

The Bayesian approach has been chosen because it allows one to incorporate all the variability, in particular that associated with model choice. The models studied in this thesis allow one to take heteroskedasticity into account, as well as particular shapes of the tails of returns. ARCH-type models and models based on extreme value theory are studied. One original aspect of this thesis is its use of Bayesian analysis to detect change points in financial time series. We suppose that a market has two phases, and that it switches from one state to the other at random. Another new contribution is a model integrating heteroskedasticity and time dependence of extreme values, by superposition of the model proposed by Bortot and Coles (2003) and a GARCH process.

This thesis uses simulation intensively for the estimation of risk measures. The drawback of simulation is the amount of time needed to obtain accurate estimates. However, simulation allows one to produce results when direct calculation is not feasible. For example, simulation allows one to compute risk estimates for time horizons greater than one day.

The methods presented in this thesis are illustrated on simulated data, and on real data from European and American markets. This thesis involved the construction of a library containing C and S code to perform risk analysis using GARCH and extreme value theory models. The results show that model uncertainty can be incorporated, and that risk measures for time horizons greater than one day can be obtained by simulation. The methods presented in this thesis have a natural representation involving conditioning; thus, they permit the computation of both conditional and unconditional risk estimates. Three methods are described: the GARCH method, the two-state GARCH method, and the HBC method. Unconditional risk estimation using the GARCH method is satisfactory on data which seem stationary, but not reliable on non-stationary data, such as data with change points. The two-state GARCH model does a little better, and gives very satisfactory results when the risk is estimated conditionally on time. The HBC method does not give satisfactory results.
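The simulation-based estimation of multi-day risk described above can be sketched as follows. This is a minimal illustration, not the thesis's library: the GARCH(1,1) parameter values are hypothetical placeholders, and the innovations are Gaussian rather than drawn from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical GARCH(1,1) parameters (not fitted values from the thesis)
omega, alpha, beta = 1e-5, 0.08, 0.9
horizon, n_sims = 10, 50_000  # 10-day horizon, number of simulated paths

# Iterate the GARCH recursion: sigma2_{t+1} = omega + alpha*r_t^2 + beta*sigma2_t
sigma2 = np.full(n_sims, omega / (1 - alpha - beta))  # start at stationary variance
cum_return = np.zeros(n_sims)
for _ in range(horizon):
    r = rng.standard_normal(n_sims) * np.sqrt(sigma2)
    cum_return += r
    sigma2 = omega + alpha * r**2 + beta * sigma2

# 99% Value-at-Risk over the horizon: the 1% quantile of simulated returns, as a loss
var99 = -np.quantile(cum_return, 0.01)
print(var99)
```

Direct calculation of the 10-day return distribution under a GARCH model is not feasible in closed form, which is precisely why simulation is used for horizons greater than one day.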

Gene expression profiles have been widely used in molecular classification, diagnosis and prediction, particularly in oncology, where accurate and early diagnosis is needed for appropriate treatment. Avoiding unnecessary under- or over-treatment can extend a patient's survival and prevent disease recurrence. These high-throughput assay technologies have generated terabytes of data, exploited extensively to provide insight into cancer biology and the mechanisms underlying disease progression. The ultimate goal is to identify tailored treatment and therapy for personalized medicine. Analysis of microarray data is constrained by the following characteristics: (i) it is noisy, owing to missing or erroneous values; (ii) it is high dimensional, with a large number of genes measured on only a small number of samples; and (iii) it is costly, because microarray experiments are expensive. Abundant microarray gene expression data should be processed by appropriate computational and statistical learning methodologies, such as machine learning techniques. These methods are robust to noisy data and have a great capacity to analyze high-dimensional data, though their power is limited by the sample size on which they are trained. Such algorithms have been widely applied to microarray gene expression data to identify a set of genes, known as a gene signature, whose expression is highly correlated with a target value or outcome such as disease status, tumor subtype, a patient's survival time, or the risk of mortality or cancer relapse. Prediction of survival time and of a patient's risk, which are unknown at diagnosis, presents a more challenging task for machine learning methods than classification of tumor subtype or disease status, which is already established by oncologists.

The properties of microarray data cited above, the limited number of samples from cancer patients, and the dependence of machine learning methods' performance on sample size justify joint analysis of microarray data to increase the number of samples. We applied joint analysis methods to breast and lung cancer data sets to improve survival prediction and risk assessment. Overall, no significant improvement or deterioration of predictive accuracy was obtained with joint analysis. However, increasing the sample size helped to identify robust and stable gene signatures predictive of survival time and risk. Our achievements and the lessons learned from joint analysis of microarray gene expression data can be used as a guideline for future research in classification and prediction.
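The "many genes, few samples" setting can be made concrete with a toy sketch. Everything below is simulated and hypothetical (the signature size, effect size, and the simple nearest-centroid rule are our choices, not the methods compared in the thesis); it only illustrates that a small signature of informative genes can drive classification even when p vastly exceeds n.

```python
import numpy as np

rng = np.random.default_rng(7)
n_per_class, p = 20, 2000  # few samples, many genes: a typical microarray shape

# Simulated expression: class B is shifted on a small "signature" of 50 genes
signature = rng.choice(p, size=50, replace=False)
shift = 3.0  # exaggerated effect size, chosen for a clear illustration
A = rng.standard_normal((n_per_class, p))
B = rng.standard_normal((n_per_class, p))
B[:, signature] += shift

# Classify a held-out class-B profile by its nearest class centroid
test_profile = rng.standard_normal(p)
test_profile[signature] += shift
d_A = np.linalg.norm(test_profile - A.mean(axis=0))
d_B = np.linalg.norm(test_profile - B.mean(axis=0))
print("B" if d_B < d_A else "A")
```

With only 20 samples per class against 2000 genes, the distances are dominated by noise in most coordinates, which is why real studies pool data sets (as in the joint analysis above) to stabilise the selected signatures.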

The saddlepoint approximation was introduced into statistics in 1954 by Henry E. Daniels. This basic result on approximating the density function of the sample mean has been generalized to many situations. The accuracy of this approximation is very good, particularly in the tails of the distribution and for small sample sizes, compared with normal or Edgeworth approximation methods. Before applying saddlepoint approximations to the bootstrap, this thesis focuses on saddlepoint approximations for the distribution of quadratic forms in normal variables and for the distribution of the waiting time in the coupon collector's problem. Both developments illustrate the modern art of statistics, relying on the computer and embodying both numeric and analytic approximations. Saddlepoint approximations are extremely accurate in both cases. This is underlined in the first development by means of an extensive study and several applications to nonparametric regression, and in the second by several examples, including the exhaustive bootstrap seen from a collector's point of view.

The remaining part of this thesis is devoted to the use of saddlepoint approximations to replace the computer-intensive bootstrap. The recent massive increase in computer power has led to an upsurge of interest in computer-intensive statistical methods. The bootstrap is the first computer-intensive method to have become widely known. It found an immediate place in statistical theory and, more slowly, in practice. The bootstrap seems to be gaining ground as the method of choice in a number of applied fields, where classical approaches are known to be unreliable, and there is sustained interest from theoreticians in its development. But it is known that, for accurate approximations in the tails, the nonparametric bootstrap requires a large number of replicates of the statistic. As this is time-intensive, other methods should be considered.

Saddlepoint methods can provide extremely accurate approximations to resampling distributions. As a first step, I develop fast saddlepoint approximations to bootstrap distributions that work in the presence of an outlier, using a saddlepoint mixture approximation. I then examine robust M-estimates of location, such as Huber's M-estimate of location and its initially MAD-scaled version. One peculiarity of the current literature is that saddlepoint methods are often used to approximate the density or distribution functions of bootstrap estimators, rather than related pivots, whereas it is the latter that are more relevant for inference. Hence the aim of the final part of this thesis is to apply saddlepoint approximations to the construction of studentized confidence intervals based on robust M-estimates. As examples, I consider the studentized versions of Huber's M-estimate of location, of its initially MAD-scaled version, and of Huber's proposal 2. To make robust inference about a location parameter, there are three types of robustness one would like to achieve: robustness of performance for the estimator of location, and robustness of validity and robustness of efficiency for the resulting confidence interval method. Hence, in the context of studentized bootstrap confidence intervals, I investigate these in more detail in order to give recommendations for practical use, underlined by an extensive simulation study.
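Daniels' original result, the saddlepoint approximation to the density of a sample mean, can be written down explicitly when the cumulant generating function is known. The sketch below is a minimal worked example of our own for Exp(1) variables, where the exact density of the mean (a Gamma density) is available for comparison: with K(t) = -log(1 - t), solving K'(t) = x gives the saddlepoint t = 1 - 1/x.

```python
from math import exp, gamma, log, pi, sqrt

n = 10  # sample size

def saddlepoint_density(x):
    """Saddlepoint approximation to the density of the mean of n Exp(1) variables."""
    t_hat = 1.0 - 1.0 / x            # solves K'(t) = 1/(1-t) = x
    K = -log(1.0 - t_hat)            # cumulant generating function at the saddlepoint
    K2 = 1.0 / (1.0 - t_hat) ** 2    # second derivative K''(t_hat)
    return sqrt(n / (2 * pi * K2)) * exp(n * (K - t_hat * x))

def exact_density(x):
    """Exact density: the mean of n Exp(1) variables is Gamma(n, rate=n)."""
    return n**n * x ** (n - 1) * exp(-n * x) / gamma(n)

for x in (0.5, 1.0, 2.0):
    print(x, saddlepoint_density(x) / exact_density(x))
```

Even for n = 10, the ratio of the two densities is a constant close to 1 at every x, including deep in the tails: the unnormalised saddlepoint approximation is exact for the Gamma family up to Stirling's correction, which illustrates the tail accuracy that makes these methods an attractive replacement for brute-force bootstrap resampling.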