Data preparation is the act of manipulating (or pre-processing) raw data, which may come from disparate data sources, into a form that can be readily and accurately analysed, e.g. for business purposes. It is the first step in data analytics projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery.

The issues to be dealt with fall into two main categories: systematic errors involving large numbers of data records, probably because they have come from different sources; and individual errors affecting small numbers of data records, probably due to errors in the original data entry.

The first step is to set out a full and detailed specification of the format of each data field and what the entries mean. This should take careful account of: most importantly, consultation with the users of the data; any available specification of the system which will use the data to perform the analysis; and a full understanding of the information available, and any gaps, in the source data. See also data definition specification.

Suppose there is a two-character alphabetic field that indicates geographical location. It is possible that in one data source the code "EE" means "Europe" and in another data source the same code means "Estonia". One would need to devise an unambiguous set of codes and amend the code in one set of records accordingly. Furthermore, the "geographical area" might refer to any of, e.g., delivery address, billing address, address from which goods are supplied, billing currency, or applicable national regulations. All these matters must be covered in the specification.

There could be some records with "X" or "555" in that field. Clearly, this is invalid data, as it does not conform to the specification. If there are only small numbers of such records, one would either correct them manually or, if precision is not important, simply delete those records from the file. Another possibility would be to create a "not known" category.
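
The geographical-code example above can be sketched in Python. This is only an illustration under stated assumptions: the source names "source_a" and "source_b", the replacement code "EU" for Europe, the "ZZ" code for the "not known" category, and the helper names are all hypothetical and not part of any real specification. The sketch assumes the specification requires exactly two uppercase ASCII letters.

```python
# Hypothetical recoding table: (source name, raw code) -> unambiguous code.
# Assumption: in "source_a" the code "EE" means Europe, so it is amended to
# "EU"; in "source_b" the code "EE" already means Estonia and is left as-is.
RECODE = {("source_a", "EE"): "EU"}

# Hypothetical code for the "not known" category.
UNKNOWN = "ZZ"


def is_valid_code(value):
    """Check the assumed field specification: two uppercase ASCII letters."""
    return (
        isinstance(value, str)
        and len(value) == 2
        and value.isascii()
        and value.isalpha()
        and value.isupper()
    )


def prepare(records, drop_invalid=False):
    """Return cleaned copies of the records.

    Invalid location codes are either dropped (drop_invalid=True) or
    replaced by the "not known" category. Valid codes are recoded
    according to the source they came from.
    """
    cleaned = []
    for rec in records:
        code = rec["location"]
        if not is_valid_code(code):
            if drop_invalid:
                continue  # delete the record when precision is not important
            code = UNKNOWN
        else:
            code = RECODE.get((rec["source"], code), code)
        cleaned.append({**rec, "location": code})
    return cleaned


records = [
    {"source": "source_a", "location": "EE"},   # Europe in source_a
    {"source": "source_b", "location": "EE"},   # Estonia in source_b
    {"source": "source_a", "location": "X"},    # invalid: too short
    {"source": "source_b", "location": "555"},  # invalid: not alphabetic
]

print([r["location"] for r in prepare(records)])
print([r["location"] for r in prepare(records, drop_invalid=True)])
```

Manual correction of individual records, the third option mentioned above, is not shown, since it depends on information outside the data itself.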
