Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

The simple descriptive statistics are nearly the same for all four datasets, but the four scatter plots look very different. The first scatter plot (top left) appears to show a simple linear relationship, corresponding to two correlated variables where y could be modelled as Gaussian with mean linearly dependent on x. In the second graph (top right), a relationship between the two variables is obvious, but it is not linear, and the Pearson correlation coefficient is not relevant; a more general regression and the corresponding coefficient of determination would be more appropriate. In the third graph (bottom left), the modelled relationship is linear but should have a different regression line (a robust regression would have been called for): the calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth graph (bottom right) shows an example in which one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. The x values are the same for the first three datasets.
It is not known how Anscombe created his datasets.
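The claim that the four datasets share nearly identical statistics can be checked numerically. The following is a minimal Python sketch, not part of the original text. The data values are the commonly published values of the quartet and are an assumption here (verify them against Anscombe's 1973 article), and the names `QUARTET` and `summarize` are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

# Commonly published values of Anscombe's quartet (assumed here; verify
# against the 1973 article before relying on them).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return summary statistics and the least-squares line for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(ys),
        "r": sxy / sqrt(sxx * syy),  # Pearson correlation coefficient
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in stats.items()})

# Dataset III again, without its single outlier, the point (13, 12.74).
xs3, ys3 = QUARTET["III"]
kept = [(x, y) for x, y in zip(xs3, ys3) if (x, y) != (13, 12.74)]
r_without_outlier = summarize(*zip(*kept))["r"]
print("III without outlier: r =", round(r_without_outlier, 3))
```

Under these assumed values, the printed statistics agree closely across the four datasets, and the correlation for the third dataset moves very close to 1 once the outlier is dropped, which matches the description of the third graph above.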

Related courses (1)
MATH-710: Data Analysis for Science and Engineering
An overview course intended for scientists and engineers who need to use statistical methods as part of their research, who have already attended a course at the second-year EPFL undergraduate level,
Related lectures (7)
Linear Regression: Fundamentals and Goodness of Fit
Explores linear regression fundamentals, non-linear regression issues, and R-squared goodness of fit, with examples like Anscombe's quartet and the Datasaurus dataset.
Linear Regression: Beyond the Basics
Explores advanced concepts in linear regression models, including multicollinearity, hypothesis testing, and handling outliers.
Describing Data: Statistics and Hypothesis Testing
Covers descriptive statistics, hypothesis testing, and correlation analysis with various probability distributions and robust statistics.
Related concepts (10)
Linear regression
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.
Regression validation
In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.
Leverage (statistics)
In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in R^p space, where p is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation.
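As a small illustration of leverage, the sketch below computes it for a simple linear regression with an intercept, where the leverage of observation i is 1/n + (x_i - mean(x))^2 / Sxx. This example is not part of the original text. The function name `leverages` is hypothetical, and the x values are the commonly published ones for the fourth dataset of Anscombe's quartet, assumed here.

```python
def leverages(xs):
    """Leverage of each observation in a simple linear regression with intercept."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    # h_i = 1/n + (x_i - mean)^2 / Sxx: grows with the distance of x_i from the mean
    return [1 / n + (x - mx) ** 2 / sxx for x in xs]


# x values of the quartet's fourth dataset as commonly published (assumed).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
h = leverages(x4)
print([round(value, 3) for value in h])
```

With these values, the single observation at x = 19 carries almost all of the leverage, so the fitted line is forced to pass through it, while the ten observations at x = 8 each carry very little.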