Concept

# Canonical correlation

Summary
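In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors $X = (X_1, \dots, X_n)$ and $Y = (Y_1, \dots, Y_m)$ of random variables, and there are correlations among the variables, then canonical-correlation analysis will find linear combinations of $X$ and $Y$ that have maximum correlation with each other. T. R. Knapp notes that "virtually all of the commonly encountered parametric tests of significance can be treated as special cases of canonical-correlation analysis, which is the general procedure for investigating the relationships between two sets of variables." The method was first introduced by Harold Hotelling in 1936, although in the context of angles between flats the mathematical concept was published by Jordan in 1875.

Given two column vectors $X = (X_1, \dots, X_n)^\top$ and $Y = (Y_1, \dots, Y_m)^\top$ of random variables with finite second moments, one may define the cross-covariance $\Sigma_{XY} = \operatorname{cov}(X, Y)$ to be the $n \times m$ matrix whose $(i, j)$ entry is the covariance $\operatorname{cov}(X_i, Y_j)$. In practice, we would estimate the covariance matrix based on sampled data from $X$ and $Y$ (i.e. from a pair of data matrices).

Canonical-correlation analysis seeks vectors $a$ ($a \in \mathbb{R}^n$) and $b$ ($b \in \mathbb{R}^m$) such that the random variables $a^\top X$ and $b^\top Y$ maximize the correlation $\rho = \operatorname{corr}(a^\top X, b^\top Y)$. The (scalar) random variables $U = a^\top X$ and $V = b^\top Y$ are the first pair of canonical variables. Then one seeks vectors maximizing the same correlation subject to the constraint that they are to be uncorrelated with the first pair of canonical variables; this gives the second pair of canonical variables. This procedure may be continued up to $\min\{m, n\}$ times.

Let $\Sigma_{XY}$ be the cross-covariance matrix for any pair of (vector-shaped) random variables $X$ and $Y$, with $\Sigma_{YX} = \Sigma_{XY}^\top$. The target function to maximize is

$$\rho = \frac{a^\top \Sigma_{XY} b}{\sqrt{a^\top \Sigma_{XX} a}\,\sqrt{b^\top \Sigma_{YY} b}}.$$

The first step is to define a change of basis and define

$$c = \Sigma_{XX}^{1/2} a, \qquad d = \Sigma_{YY}^{1/2} b,$$

where $\Sigma_{XX}^{1/2}$ and $\Sigma_{YY}^{1/2}$ can be obtained from the eigen-decomposition (or by diagonalization):

$$\Sigma_{XX}^{1/2} = V_X D_X^{1/2} V_X^\top, \qquad \Sigma_{XX} = V_X D_X V_X^\top,$$

and

$$\Sigma_{YY}^{1/2} = V_Y D_Y^{1/2} V_Y^\top, \qquad \Sigma_{YY} = V_Y D_Y V_Y^\top.$$

And thus we have

$$\rho = \frac{c^\top \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d}{\sqrt{c^\top c}\,\sqrt{d^\top d}}.$$

By the Cauchy–Schwarz inequality, we have

$$\left(c^\top \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}\right) d \le \left(c^\top \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} \Sigma_{YY}^{-1/2} \Sigma_{YX} \Sigma_{XX}^{-1/2} c\right)^{1/2} \left(d^\top d\right)^{1/2},$$

$$\rho \le \frac{\left(c^\top \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1/2} c\right)^{1/2}}{\left(c^\top c\right)^{1/2}}.$$

There is equality if the vectors $d$ and $\Sigma_{YY}^{-1/2} \Sigma_{YX} \Sigma_{XX}^{-1/2} c$ are collinear. In addition, the maximum of correlation is attained if $c$ is the eigenvector with the maximum eigenvalue for the matrix $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1/2}$ (see Rayleigh quotient).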
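The derivation above can be sketched numerically. The following is a minimal illustration in Python with NumPy, not a reference implementation: it writes Σ_XX and Σ_YY for the covariance matrices of X and Y and Σ_XY for their cross-covariance, estimates all three from a pair of data matrices, and recovers the first pair of canonical variables from the top eigenvector of Σ_XX^(-1/2) Σ_XY Σ_YY^(-1) Σ_YX Σ_XX^(-1/2). The function names `inv_sqrt_spd` and `first_canonical_pair` and the synthetic data are assumptions made for this example. The sketch also assumes that both sample covariance matrices are positive definite and that the leading canonical correlation is nonzero.

```python
import numpy as np


def inv_sqrt_spd(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix.

    Uses the eigen-decomposition S = V diag(w) V^T, so that
    S^(-1/2) = V diag(w^(-1/2)) V^T. Assumes all eigenvalues are > 0.
    """
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T


def first_canonical_pair(X, Y):
    """Return (rho, a, b) for data matrices X (N x n) and Y (N x m).

    Hypothetical helper for illustration: rho is the first canonical
    correlation, a and b are the corresponding weight vectors.
    """
    n = X.shape[1]
    # Sample covariance of the stacked variables, split into blocks.
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:n, :n], S[:n, n:], S[n:, n:]

    Sxx_is = inv_sqrt_spd(Sxx)
    Syy_is = inv_sqrt_spd(Syy)

    # K = Sxx^(-1/2) Sxy Syy^(-1/2); K K^T is the matrix whose top
    # eigenvector gives c in the change of basis c = Sxx^(1/2) a.
    K = Sxx_is @ Sxy @ Syy_is
    w, V = np.linalg.eigh(K @ K.T)  # eigenvalues in ascending order
    c = V[:, -1]

    # Equality in Cauchy-Schwarz requires d to be collinear with K^T c.
    d = K.T @ c
    d = d / np.linalg.norm(d)  # assumes the top eigenvalue is nonzero

    # Undo the change of basis.
    a = Sxx_is @ c
    b = Syy_is @ d
    rho = np.sqrt(max(w[-1], 0.0))
    return rho, a, b


# Synthetic data (an assumption for this demo): one shared latent signal z
# appears in the first column of X and the second column of Y.
rng = np.random.default_rng(0)
N = 2000
z = rng.normal(size=N)
X = np.column_stack([
    z + 0.5 * rng.normal(size=N),
    rng.normal(size=N),
    rng.normal(size=N),
])
Y = np.column_stack([
    rng.normal(size=N),
    z + 0.5 * rng.normal(size=N),
])

rho, a, b = first_canonical_pair(X, Y)
u, v = X @ a, Y @ b  # first pair of canonical variables
print("first canonical correlation:", rho)
print("sample correlation of u, v: ", np.corrcoef(u, v)[0, 1])
```

The two printed values should agree up to floating-point error, since the sample correlation of the projected data equals the square root of the largest eigenvalue. Further canonical pairs could be read off the remaining eigenvectors in the same way. For ill-conditioned covariance matrices, a regularized or SVD-based formulation would likely be more robust than this direct approach.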