Exploratory data analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling, and it thereby contrasts with traditional hypothesis testing. John Tukey promoted exploratory data analysis from 1970 onward to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.

EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models.

Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the two quartiles. The median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation. Moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than the traditional summaries (the mean and standard deviation).
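
The five-number summary described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `five_number_summary` and the sample data are made up for the example, and the "inclusive" quartile rule from the standard `statistics` module is only one of several quartile conventions in use.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # Quartile conventions differ between textbooks and packages; the
    # "inclusive" method used here is an assumption, not the only choice.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Made-up sample with a single extreme value.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)
mean = statistics.mean(sample)

print("five-number summary:", summary)
print("mean:", mean)
```

On this sample the single extreme value pulls the mean well above most of the observations, while the median and quartiles stay inside the bulk of the data, which is the robustness property the paragraph refers to.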

Official source

https://en.wikipedia.org/wiki/Exploratory_data_analysis

Related courses (32)

ENV-444: Exploratory data analysis in environmental health

This course teaches how to apply exploratory spatial data analysis to health information. Teaching focuses on the role of GIS and spatial statistics in spatial epidemiology. It proposes a context to i

MATH-493: Applied biostatistics

This course covers topics in applied biostatistics, with an emphasis on practical aspects of data analysis using R statistical software. Topics include types of studies and their design and analysis,

MATH-131: Probability and statistics

The course presents the basic notions of probability theory and statistical inference. The emphasis is on the main concepts and the most commonly used methods.

Related lectures (32)

Structuring exploratory spatial data analysis

Explores the structured approach to exploratory spatial data analysis, emphasizing the importance of analytical frameworks and the Visual Seeking Mantra.

Cognitive processes for data exploration

Explores cognitive processes in data analysis, focusing on visual thinking and simplification to extract insights from data.

Apprentice Origins: Venice's Garzoni Project

Delves into the lives of young 'garzoni' apprentices in Venice during the 16th to 18th centuries.

Related publications (30)

Unlabeled Principal Component Analysis and Matrix Completion

Yunzhen Yao, Liangzu Peng

We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-de ...

Microtome Publishing, 2024

Autorepression of yeast Hsp70 cochaperones by intramolecular interactions involving their J-domains

Paolo De Los Rios, Satyam Tiwari, Pierre Goloubinoff, Bruno Claude Daniel Fauvet, Mathieu Rebeaud, Adélaïde Alice Mohr

The 70 kDa heat shock protein (Hsp70) chaperones control protein homeostasis in all ATP-containing cellular compartments. J-domain proteins (JDPs) coevolved with Hsp70s to trigger ATP hydrolysis and catalytically upload various substrate polypeptides in ne ...

Elsevier Science Inc, 2024

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

Frédéric Kaplan, Maud Ehrmann, Matteo Romanello, Emanuela Boros, Sven-Nicolas Yoann Najem

The quality of automatic transcription of heritage documents, whether from printed, manuscripts or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition ( ...

Association for Computational Linguistics, 2024

Related concepts (11)

Data and information visualization

Data and information visualization (data viz or info viz) is the practice of designing and creating easy-to-communicate and easy-to-understand graphic or visual representations of a large amount of complex quantitative and qualitative data and information with the help of static, dynamic or interactive visual items.

Multidimensional scaling

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of objects or individuals" into a configuration of points mapped into an abstract Cartesian space. More technically, MDS refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction.
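
As a rough illustration of turning a distance matrix into a configuration of points, the sketch below implements classical (Torgerson) MDS with NumPy. Classical MDS is only one member of the MDS family and is chosen here because it is short. The function name `classical_mds` and the four example points are assumptions made for the example.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed objects in n_components dimensions from pairwise distances."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances: B = -1/2 * J D^2 J.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # eigh handles symmetric matrices; keep the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(b)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding error.
    top_values = np.clip(eigenvalues[order], 0.0, None)
    return eigenvectors[:, order] * np.sqrt(top_values)


# Made-up example: corners of a 4-by-3 rectangle.
points = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

embedding = classical_mds(distances, n_components=2)
recovered = np.linalg.norm(
    embedding[:, None, :] - embedding[None, :, :], axis=-1
)
print(np.allclose(recovered, distances))
```

The embedding is determined only up to rotation and reflection, so the check compares pairwise distances rather than the coordinates themselves.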

Box plot

In descriptive statistics, a box plot or boxplot is a method for graphically showing the locality, spread and skewness of groups of numerical data through their quartiles. In addition to the box, a box plot can have lines (called whiskers) extending from the box to indicate variability outside the upper and lower quartiles; for this reason the plot is also called a box-and-whisker plot or box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers.
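
The quantities a box plot draws can be computed without any plotting library. The sketch below assumes the common convention of placing the whiskers at the most extreme observations within 1.5 times the interquartile range of the box. That convention is not stated in the text above, and the function name `box_plot_stats`, the `whisker_factor` parameter and the sample data are made up for the example.

```python
import statistics


def box_plot_stats(values, whisker_factor=1.5):
    """Compute the box, whisker ends and outliers for one group of data."""
    data = sorted(values)
    # The "inclusive" quartile rule is an assumption; other rules exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up sample with a single extreme value.
stats = box_plot_stats([2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Observations returned in `outliers` are the ones a box plot would draw as individual points beyond the whiskers.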