# High-dimensional statistics

Summary

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.
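The most basic statistical model for the relationship between a covariate vector $x \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$ is the linear model
$$y = x^\top \beta + \epsilon,$$
where $\beta \in \mathbb{R}^p$ is an unknown parameter vector, and $\epsilon$ is random noise with mean zero and variance $\sigma^2$. Given independent responses $Y_1, \ldots, Y_n$, with corresponding covariates $x_1, \ldots, x_n$, from this model, we can form the response vector $Y = (Y_1, \ldots, Y_n)^\top$ and the design matrix $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$. When $n \geq p$ and the design matrix has full column rank (i.e. its columns are linearly independent), the ordinary least squares estimator of $\beta$ is
$$\hat{\beta} := (X^\top X)^{-1} X^\top Y.$$
When $\epsilon \sim N(0, \sigma^2)$, it is known that $\hat{\beta} \sim N_p\bigl(\beta, \sigma^2 (X^\top X)^{-1}\bigr)$. Thus, $\hat{\beta}$ is an unbiased estimator of $\beta$, and the Gauss-Markov theorem tells us that it is the Best Linear Unbiased Estimator.
However, overfitting is a concern when $p$ is of comparable magnitude to $n$: the matrix $X^\top X$ in the definition of $\hat{\beta}$ may become ill-conditioned, with a small minimum eigenvalue. In such circumstances $\mathbb{E}\bigl(\|\hat{\beta} - \beta\|^2\bigr) = \sigma^2 \operatorname{tr}\bigl((X^\top X)^{-1}\bigr)$ will be large, since the trace of a matrix is the sum of its eigenvalues and the eigenvalues of $(X^\top X)^{-1}$ are the reciprocals of those of $X^\top X$. Even worse, when $p > n$, the matrix $X^\top X$ is singular. (See Section 1.2 and Exercise 1.2 in .)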
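The ill-conditioning described above can be checked numerically. The following is a minimal sketch, assuming a design matrix with independent standard Gaussian entries and noise variance 1; the helper names `ols_estimate` and `risk_trace`, the sample size, and the dimensions are illustrative choices, not part of the source. It computes the least squares quantity σ² tr((XᵀX)⁻¹) for several values of the dimension p with the sample size n fixed, then checks that XᵀX is rank deficient when p exceeds n.

```python
import numpy as np


def ols_estimate(X, Y):
    """Ordinary least squares estimate (X^T X)^{-1} X^T Y.

    Assumes X has full column rank, so that X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)


def risk_trace(X, sigma=1.0):
    """sigma^2 * tr((X^T X)^{-1}), computed from the eigenvalues of X^T X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)  # ascending order
    return sigma**2 * np.sum(1.0 / eigvals)


rng = np.random.default_rng(0)
n = 200  # illustrative sample size

# As p approaches n, the smallest eigenvalue of X^T X shrinks and the risk grows.
for p in (5, 50, 150, 195):
    X = rng.standard_normal((n, p))
    min_eig = np.linalg.eigvalsh(X.T @ X)[0]
    print(f"p={p:3d}  min eigenvalue={min_eig:9.3f}  risk={risk_trace(X):8.3f}")

# When p > n, X^T X is a p x p matrix of rank at most n, hence singular.
X_wide = rng.standard_normal((50, 80))
rank = np.linalg.matrix_rank(X_wide.T @ X_wide)
print("rank of X^T X:", rank, "out of", 80)
```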
It is important to note that the deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the curse of dimensionality, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure.
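To illustrate how low-dimensional structure can help, the sketch below assumes one particular form of structure, sparsity: only a few coefficients of the parameter vector are nonzero. It compares least squares on all covariates with an idealised "oracle" fit restricted to the true nonzero coordinates. The oracle is a hypothetical device for illustration (the support is unknown in practice), and the function name, sizes, and coefficient values are assumptions, not taken from the source.

```python
import numpy as np


def estimation_errors(n=100, p=90, s=5, sigma=1.0, seed=1):
    """Squared estimation errors of full OLS and oracle OLS on a sparse model.

    The true parameter has s nonzero entries (the first s coordinates).
    The oracle fit uses only those coordinates, which is an idealisation.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:s] = 3.0
    X = rng.standard_normal((n, p))
    Y = X @ beta + sigma * rng.standard_normal(n)

    # Least squares using all p covariates.
    beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

    # Least squares using only the s covariates in the true support.
    beta_oracle = np.zeros(p)
    beta_oracle[:s] = np.linalg.lstsq(X[:, :s], Y, rcond=None)[0]

    err_full = np.sum((beta_full - beta) ** 2)
    err_oracle = np.sum((beta_oracle - beta) ** 2)
    return err_full, err_oracle


err_full, err_oracle = estimation_errors()
print(f"full OLS error:   {err_full:.3f}")
print(f"oracle OLS error: {err_oracle:.3f}")
```

Under these assumptions the oracle fit estimates only s parameters from n observations, so its error is typically far smaller than that of the full fit, which must estimate all p.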

Related courses

PHYS-467: Machine learning for physicists

Machine learning and data analysis are becoming increasingly central in sciences including physics. In this course, fundamental principles and methods of machine learning will be introduced and practi

EE-556: Mathematics of data: from theory to computation

This course provides an overview of key advances in continuous optimization and statistical analysis for machine learning. We review recent learning formulations and models as well as their guarantees

MATH-408: Regression methods

General graduate course on regression methods

Related lectures

Generalization Error in Learning with Random Features

Explores generalization error in learning theory, emphasizing structured data and the replica method.

Probabilistic Models for Linear Regression

Covers the probabilistic model for linear regression and its applications in nuclear magnetic resonance and X-ray imaging.

Generalised Additive Models: Tackling Dimensionality Issue

Explores Generalised Additive Models and methods to tackle dimensionality issues efficiently.

Related concepts

Linear regression

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

Related publications

We present a combination technique based on mixed differences of both spatial approximations and quadrature formulae for the stochastic variables to solve efficiently a class of optimal control problems (OCPs) constrained by random partial differential equ ...

(2024) A key challenge across many disciplines is to extract meaningful information from data which is often obscured by noise. These datasets are typically represented as large matrices. Given the current trend of ever-increasing data volumes, with datasets grow ...

In the rapidly evolving landscape of machine learning research, neural networks stand out with their ever-expanding number of parameters and reliance on increasingly large datasets. The financial cost and computational resources required for the training p ...