In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.
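The most basic statistical model for the relationship between a covariate vector $x \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$ is the linear model
$$y = x^\top \beta + \epsilon,$$
where $\beta \in \mathbb{R}^p$ is an unknown parameter vector, and $\epsilon$ is random noise with mean zero and variance $\sigma^2$. Given independent responses $Y_1, \ldots, Y_n$, with corresponding covariates $x_1, \ldots, x_n$, from this model, we can form the response vector $Y = (Y_1, \ldots, Y_n)^\top$, and design matrix $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$. When $n \geq p$ and the design matrix has full column rank (i.e. its columns are linearly independent), the ordinary least squares estimator of $\beta$ is
$$\hat{\beta} := (X^\top X)^{-1} X^\top Y.$$
When $\epsilon \sim N(0, \sigma^2)$, it is known that $\hat{\beta} \sim N_p\bigl(\beta, \sigma^2 (X^\top X)^{-1}\bigr)$. Thus, $\hat{\beta}$ is an unbiased estimator of $\beta$, and the Gauss-Markov theorem tells us that it is the Best Linear Unbiased Estimator.
However, overfitting is a concern when $p$ is of comparable magnitude to $n$: the matrix $X^\top X$ in the definition of $\hat{\beta}$ may become ill-conditioned, with a small minimum eigenvalue. In such circumstances $\mathbb{E}\bigl(\|\hat{\beta} - \beta\|^2\bigr) = \sigma^2 \operatorname{tr}\bigl((X^\top X)^{-1}\bigr)$ will be large (since the trace of a matrix is the sum of its eigenvalues). Even worse, when $p > n$, the matrix $X^\top X$ is singular. (See Section 1.2 and Exercise 1.2 in .)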
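The growth of the least squares risk as the dimension approaches the sample size can be illustrated numerically. The following is a minimal sketch, not part of the cited treatment: it assumes a design matrix with independent standard normal entries, and the helper name `ols_risk` is hypothetical. It evaluates the trace of the inverse Gram matrix, scaled by the noise variance, through the eigenvalues of the Gram matrix, and then checks that the Gram matrix is rank-deficient once the dimension exceeds the sample size.

```python
import numpy as np


def ols_risk(n, p, sigma=1.0, seed=0):
    """Return sigma^2 * tr((X^T X)^{-1}) for a random n-by-p Gaussian design.

    The Gaussian design is an illustrative assumption; any full-column-rank
    design with n >= p could be used instead.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # Eigenvalues of the symmetric Gram matrix X^T X (ascending order).
    eigvals = np.linalg.eigvalsh(X.T @ X)
    # The trace of the inverse is the sum of the reciprocal eigenvalues.
    return sigma**2 * np.sum(1.0 / eigvals)


n = 100
# Risk of ordinary least squares for increasing dimension p <= n.
risks = {p: ols_risk(n, p) for p in (5, 50, 95)}
for p, risk in risks.items():
    print(f"n={n}, p={p}: sigma^2 * tr((X^T X)^-1) = {risk:.3f}")

# When p > n, the p-by-p Gram matrix has rank at most n, so it is singular.
p_wide = 150
X_wide = np.random.default_rng(0).standard_normal((n, p_wide))
gram_rank = np.linalg.matrix_rank(X_wide.T @ X_wide)
print(f"n={n}, p={p_wide}: rank of X^T X = {gram_rank} < {p_wide}")
```

Under these assumptions the printed risk increases sharply as the dimension nears the sample size, reflecting the small minimum eigenvalue of the Gram matrix. In the last case the Gram matrix cannot be inverted, so the least squares estimator is not uniquely defined.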
The deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the curse of dimensionality, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure.