# Regression validation

Summary

In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.
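The third check, out-of-sample performance, can be illustrated with a minimal sketch. The data, the saturating curve used to generate them, the split point, and the helper names `fit_line` and `mse` are all assumptions made for illustration; none of them comes from a particular library. A straight line is estimated on the first eight observations and then scored on four held-out ones.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def mse(xs, ys, a, b):
    """Mean squared prediction error of the line a + b*x on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a saturating curve, so a straight line is misspecified.
xs = list(range(1, 13))
ys = [100.0 * (1.0 - math.exp(-x / 4.0)) for x in xs]

# Estimate on the first 8 points only; hold out the last 4.
train_x, train_y = xs[:8], ys[:8]
test_x, test_y = xs[8:], ys[8:]

a, b = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, a, b)
test_mse = mse(test_x, test_y, a, b)

print(f"in-sample MSE:     {train_mse:.1f}")
print(f"out-of-sample MSE: {test_mse:.1f}")
```

In this sketch the held-out error is much larger than the in-sample error, which is the kind of deterioration the validation step is meant to detect. With real data a random split or cross-validation is the more common choice; the ordered split here is only a convenient way to make the effect visible.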
Goodness of fit
One measure of goodness of fit is the coefficient of determination, $R^2$, which in ordinary least squares with an intercept ranges between 0 and 1. However, an $R^2$ close to 1 does not guarantee that the model fits the data well: as Anscombe's quartet shows, a high $R^2$ can occur even when the functional form of the relationship is misspecified or when outliers distort the true relationship.
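A minimal sketch of how a high coefficient of determination can coexist with a misspecified functional form. The data are synthetic (an exact quadratic, chosen for illustration) and the helper `r_squared_of_line` is a hypothetical name, not a library function.

```python
def r_squared_of_line(xs, ys):
    """Fit y = a + b*x by OLS and return (a, b, R^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic data: the true relationship is y = x^2, with no noise at all.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

a, b, r2 = r_squared_of_line(xs, ys)
print(f"fitted line: y = {a:.1f} + {b:.1f} x, R^2 = {r2:.3f}")
# prints: fitted line: y = -22.0 + 11.0 x, R^2 = 0.950
```

The straight line accounts for about 95% of the variance even though the relationship is not linear, so the statistic alone does not reveal the misspecification.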
One problem with $R^2$ as a measure of model validity is that it never decreases when more variables are added to the model, and it strictly increases except in the unlikely event that the additional variables have exactly zero partial correlation with the dependent variable, given the regressors already in the model, in the data sample being used. This problem can be avoided by doing an F-test of the statistical significance of the increase in $R^2$, or by instead using the adjusted $R^2$, which penalizes the number of regressors.
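The effect of adding an irrelevant regressor can be sketched as follows. Everything here is illustrative: the simulated data, the helper names `solve`, `r_squared` and `adjusted_r_squared`, and the small normal-equations solver are assumptions, not a reference implementation. The sketch uses the usual formulas, adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$ for $k$ regressors besides the intercept, and $F = \frac{(R^2_{\text{big}} - R^2_{\text{small}})/q}{(1 - R^2_{\text{big}})/(n - k_{\text{big}} - 1)}$ for $q$ added regressors.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on an intercept plus the given regressor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    y_bar = sum(y) / n
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k regressors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Simulated data (an assumption): y depends on x only; `junk` is pure noise.
random.seed(0)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
junk = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0, 1) for xi in x]

r2_small = r_squared([x], y)
r2_big = r_squared([x, junk], y)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 2)

# F statistic for the q = 1 added regressor.
q, k_big = 1, 2
f_stat = ((r2_big - r2_small) / q) / ((1.0 - r2_big) / (n - k_big - 1))

print(f"R^2:          {r2_small:.4f} -> {r2_big:.4f}")
print(f"adjusted R^2: {adj_small:.4f} -> {adj_big:.4f}")
print(f"F statistic for the added variable: {f_stat:.3f}")
```

The plain $R^2$ cannot go down when `junk` is added, whereas the adjusted version rises only if the F statistic for the single added variable exceeds 1. Turning `f_stat` into a p-value needs the F distribution with $(q, n - k_{\text{big}} - 1)$ degrees of freedom, which is left out here to keep the sketch dependency-free.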
Residual analysis
The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding prediction of the response computed using the regression function. Mathematically, the residual for the $i$th observation in the data set is written

$$e_i = y_i - f(x_i; \hat{\beta}),$$

with $y_i$ denoting the $i$th response in the data set, $x_i$ the vector of explanatory variables, each set at the corresponding values found in the $i$th observation, and $f(\cdot\,; \hat{\beta})$ the fitted regression function evaluated at the estimated parameters $\hat{\beta}$.
If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship.
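One crude way to look for non-random structure in the residuals is to count runs of same-signed residuals when the observations are ordered by the explanatory variable. The sketch below is illustrative only: the two simulated data sets and the helper names `line_residuals` and `count_sign_runs` are assumptions, and counting runs is just one of several possible diagnostics.

```python
import math
import random

def line_residuals(xs, ys):
    """Residuals e_i = y_i - (a + b*x_i) from an OLS straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def count_sign_runs(residuals):
    """Number of maximal runs of consecutive residuals sharing the same sign."""
    signs = [r > 0 for r in residuals]
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)

xs = list(range(1, 21))
random.seed(1)
# Case 1 (assumed): a straight line is the right model, plus Gaussian noise.
linear_ys = [1.0 + 2.0 * x + random.gauss(0, 1) for x in xs]
# Case 2 (assumed): the truth is exponential, so a straight line is wrong.
curved_ys = [math.exp(0.2 * x) for x in xs]

res_linear = line_residuals(xs, linear_ys)
res_curved = line_residuals(xs, curved_ys)

print("sign runs, correctly specified:", count_sign_runs(res_linear))
print("sign runs, misspecified:       ", count_sign_runs(res_curved))
```

For the exponential data the residuals fall into just three runs (positive, negative, positive), a systematic pattern that a straight line leaves behind on any strictly convex curve. Residuals that behave like random errors typically change sign much more often; the Wald-Wolfowitz runs test is a formal version of this idea.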

Related concepts

Statistical model validation

In statistics, model validation is the task of evaluating whether a chosen statistical model is appropriate or not. In statistical inference, conclusions drawn from models that appear to fit their data may be flukes, leading researchers to misjudge the actual relevance of their model. To guard against this, model validation tests whether a statistical model holds up under permutations of the data.

Statistical model specification

In statistics, model specification is part of the process of building a statistical model: specification consists of selecting an appropriate functional form for the model and choosing which variables to include. For example, given personal income $y$ together with years of schooling $s$ and on-the-job experience $x$, we might specify a functional relationship of the form $y = f(s, x) + \varepsilon$, where $\varepsilon$ is the unexplained error term that is supposed to comprise independent and identically distributed Gaussian variables.

Related courses

ME-443: Hydroacoustique pour aménagements hydroélectriques

Introduction to propagation phenomena in hydraulic circuits, water hammer calculations, transient behavior of hydroelectric installations, 1D numerical simulation of the behavior

DH-406: Machine learning for DH

This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and imple

CS-433: Machine learning

Machine learning methods are becoming increasingly central in many sciences and applications. In this course, fundamental principles and methods of machine learning will be introduced, analyzed and pr

Related lectures

Linear Regression and Gradient Descent (PHYS-231: Data analysis for Physics)

Covers linear regression, gradient descent, overfitting, and ridge regression, among other concepts.

Hydroacoustics for Hydroelectric Installations (ME-443: Hydroacoustique pour aménagements hydroélectriques)

Explores valve closing laws, water hammer, and surge tanks in hydroelectric installations.

Bias-Variance Trade-Off

Explores underfitting, overfitting, and the bias-variance trade-off in machine learning models.