In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.

Goodness of fit

One measure of goodness of fit is the coefficient of determination, $R^2$, which in ordinary least squares with an intercept ranges between 0 and 1. However, an $R^2$ close to 1 does not guarantee that the model fits the data well: as Anscombe's quartet shows, a high $R^2$ can occur in the presence of misspecification of the functional form of a relationship or in the presence of outliers that distort the true relationship.

One problem with $R^2$ as a measure of model validity is that it never decreases when more variables are added to the model. It strictly increases unless the additional variables are, in the data sample being used, exactly uncorrelated with the part of the dependent variable left unexplained by the regressors already in the model, which is unlikely in practice. This problem can be addressed by an F-test of the statistical significance of the increase in $R^2$, or by using the adjusted $R^2$ instead.

Residual analysis

The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding predictions of the response computed using the regression function. Mathematically, the residual for the $i$th observation in the data set is

$e_i = y_i - f(x_i; \hat{\beta})$,

with $y_i$ denoting the $i$th response in the data set, $x_i$ the vector of explanatory variables, each set at the corresponding values found in the $i$th observation, $f$ the regression function, and $\hat{\beta}$ the estimated parameters.
If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship.
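The quantities discussed above can be illustrated with a minimal sketch in plain Python. It fits a straight line by ordinary least squares to the first two data sets of Anscombe's quartet, then computes the residuals, $R^2$, and adjusted $R^2$ for each. The function names are hypothetical, and the data values are reproduced here on the assumption that they match the published quartet. The two fits give nearly the same $R^2$, yet the residuals of the second set are negative at both ends of the range of x and positive in the middle, which points to a misspecified functional form.

```python
# Illustrative sketch only: simple linear regression with an intercept.
# Data are assumed to match sets I and II of Anscombe's quartet.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed response minus fitted response, one value per observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def r_squared(y, res):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum(e ** 2 for e in res)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p regressors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def summarize(x, y):
    """Fit the line and collect the validation quantities in a dict."""
    intercept, slope = fit_line(x, y)
    res = residuals(x, y, intercept, slope)
    r2 = r_squared(y, res)
    return {
        "intercept": intercept,
        "slope": slope,
        "residuals": res,
        "r2": r2,
        "adj_r2": adjusted_r_squared(r2, len(x), 1),
    }


set1 = summarize(X, Y1)
set2 = summarize(X, Y2)

for name, s in (("set I", set1), ("set II", set2)):
    print(name, "R^2 =", round(s["r2"], 3), "adjusted R^2 =", round(s["adj_r2"], 3))
    # Print residuals ordered by x so that any systematic pattern is visible.
    for xi, e in sorted(zip(X, s["residuals"])):
        print("  x =", xi, "residual =", round(e, 2))
```

Ordering the residuals by the explanatory variable is a design choice made for readability. In practice the same check is usually done graphically, by plotting the residuals against the fitted values or against each explanatory variable.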