Concept

Data transformation (statistics)

In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set: each data point z_i is replaced with the transformed value y_i = f(z_i), where f is a function. Transforms are usually applied so that the data appear to meet the assumptions of a statistical inference procedure more closely, or to improve the interpretability or appearance of graphs. The transforming function is nearly always invertible, and generally continuous. The transformation is usually applied to a collection of comparable measurements. For example, when working with data on people's incomes in some currency unit, it is common to transform each income value with the logarithm.

Guidance on how data should be transformed, or on whether a transformation should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard errors. However, the constant factor 2 used here is particular to the normal distribution and is applicable only if the sample mean varies approximately normally. The central limit theorem states that in many situations the sample mean does vary normally if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data toward a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be mapped back to the original scale using the inverse of the transformation applied to the data.
Data can also be transformed to make them easier to visualize.
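The transform-then-back-transform recipe described above can be sketched as follows. This is a minimal illustration using NumPy with simulated (hypothetical) income data; note that the back-transformed interval covers the population median of the skewed distribution, not its mean.

```python
# Sketch: approximate 95% CI on the log scale, mapped back with the inverse
# transform. Data are simulated log-normal "incomes" (an assumption for
# illustration, not real data).
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10.0, sigma=1.0, size=200)  # strongly skewed

logs = np.log(incomes)                          # y_i = f(z_i) with f = log
m = logs.mean()
se = logs.std(ddof=1) / np.sqrt(logs.size)
lo, hi = m - 2 * se, m + 2 * se                 # ~95% CI on the log scale

# Back-transform with the inverse function, exp. Because exp is monotone,
# the interval endpoints map directly; the result is a CI for the median
# (geometric mean) of the original skewed population.
ci = (np.exp(lo), np.exp(hi))
```

Because the logarithm is invertible and monotone, transforming the interval endpoints preserves the coverage statement made on the log scale.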

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related courses (5)
CS-421: Machine learning for behavioral data
Computer environments such as educational games, interactive simulations, and web services provide large amounts of data, which can be analyzed and serve as a basis for adaptation. This course will co
MATH-413: Statistics for data science
Statistics lies at the foundation of data science, providing a unifying theoretical and methodological backbone for the diverse tasks encountered in this emerging field. This course rigorously develops
MATH-408: Regression methods
General graduate course on regression methods
Related lectures (31)
Supervised Learning: Classification and Regression
Covers supervised learning, classification, regression, decision boundaries, overfitting, Perceptron, SVM, and logistic regression.
Statistics essentials: ANOVA
Covers the essentials of ANOVA, explaining its concept, calculations, assumptions, and interpretation of results.
Linear Applications: Matrices and Transformations
Covers linear applications, matrices, transformations, and the principle of superposition.
Related publications (40)

Roles of Clinical Features and Chest CT in Predicting the Outcomes of Hospitalized Patients with COVID-19 Developing AKI

Ali Falsafi, Shekoofeh Yaghmaei

This research aimed to evaluate the clinical features and computed tomography (CT) scans associated with poor outcomes in COVID-19 patients with acute kidney injury (AKI). A total of 351 COVID-19 patients (100 AKI, 251 non-AKI) hospitalized at Imam Hossein ...
IRANIAN SOC NEPHROLOGY, 2023

Last iterate convergence of SGD for Least-Squares in the Interpolation regime

Nicolas Henri Bernard Flammarion, Aditya Vardhan Varre, Loucas Pillaud-Vivien

Motivated by the recent successes of neural networks that have the ability to fit the data perfectly and generalize well, we study the noiseless model in the fundamental least-squares setup. We assume that an optimum predictor fits perfectly inputs ...
2021

Real-Time Multi-Ion-Monitoring Front-End With Interference Compensation by Multi-Output Support Vector Regressor

Giovanni De Micheli, Sandro Carrara, Mandresy Ivan Ny Hanitra, Francesca Criscuolo

Ion-sensors play a major role in physiology and healthcare monitoring since they are capable of continuously collecting biological data from body fluids. Nevertheless, ion interference from background electrolytes present in the sample is a paramount chall ...
2021
Related concepts (10)
Poisson distribution
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. It is named after French mathematician Siméon Denis Poisson (ˈpwɑːsɒn; pwasɔ̃). The Poisson distribution can also be used for the number of events in other specified interval types such as distance, area, or volume.
Homoscedasticity and heteroscedasticity
In statistics, a sequence (or a vector) of random variables is homoscedastic (ˌhoʊmoʊskəˈdæstɪk) if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance. The spellings homoskedasticity and heteroskedasticity are also frequently used.
Model selection
Model selection is the task of selecting the best model from among various candidates on the basis of a performance criterion. In the context of learning, this may be the selection of a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection.
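The simplest case described above, choosing among candidate models on a pre-existing data set, can be sketched with held-out validation error as the performance criterion. The polynomial-degree candidates and simulated data below are assumptions for illustration only.

```python
# Sketch: model selection among polynomial degrees by validation MSE,
# on simulated data (hypothetical setup, not a prescribed method).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Random train/validation split of the pre-existing data set.
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]

def val_error(degree: int) -> float:
    """Fit a degree-`degree` polynomial on the training split and
    return its mean squared error on the validation split."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[val])
    return float(np.mean((pred - y[val]) ** 2))

# The selected model is the candidate minimizing the criterion.
best_degree = min(range(1, 8), key=val_error)
```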