Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall. After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry, and is performed at the time of entry rather than on batches of data.

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or rely on fuzzy or approximate string matching (such as correcting records that partially match existing, known records). Some data cleansing solutions clean data by cross-checking it against a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information, for example, appending addresses with any phone numbers related to that address. Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing together data of "varying file formats, naming conventions, and columns" and transforming it into one cohesive data set; a simple example is the expansion of abbreviations ("st, rd, etc." to "street, road, etcetera").

Administratively, incorrect or inconsistent data can lead to false conclusions and misdirect investments on both public and private scales.
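
As a concrete, non-authoritative illustration of those steps, the short Python sketch below applies strict postal-code validation, fuzzy correction of city names against a known list, and expansion of street abbreviations to a few hypothetical address records. The field names, reference list, and abbreviation map are assumptions made for the example, not part of any specific tool.

```python
import re
from difflib import get_close_matches

KNOWN_CITIES = ["Lausanne", "Geneva", "Zurich", "Bern"]          # validated reference list (assumed)
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}  # harmonization map (assumed)

def clean_record(record):
    """Validate, correct, and harmonize one record; return None to reject it."""
    # Strict validation: reject any record whose postal code is not four digits.
    if not re.fullmatch(r"\d{4}", record.get("postal_code", "")):
        return None
    # Fuzzy matching: correct a city name that only partially matches a known entity.
    match = get_close_matches(record["city"].title(), KNOWN_CITIES, n=1, cutoff=0.8)
    if match:
        record["city"] = match[0]
    # Harmonization: expand abbreviations so all records share one naming convention.
    words = [ABBREVIATIONS.get(w.lower().rstrip("."), w) for w in record["street"].split()]
    record["street"] = " ".join(words)
    return record

dirty = [
    {"city": "Lausane", "street": "12 Station St.", "postal_code": "1003"},
    {"city": "Geneva",  "street": "5 Lake Rd",      "postal_code": "12"},   # rejected: bad code
]
cleaned = [r for r in (clean_record(dict(d)) for d in dirty) if r is not None]
print(cleaned)  # [{'city': 'Lausanne', 'street': '12 Station street', 'postal_code': '1003'}]
```

In practice the validation rules, reference lists, and matching thresholds would come from the system's own data dictionary rather than being hard-coded as above.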

Related courses (20)
FIN-525: Financial big data
The course introduces modern methods to acquire, clean, and analyze large quantities of financial data efficiently. The second part expands on how to apply these techniques and robust statistics to fi ...
CS-401: Applied data analysis
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the dat ...
COM-490: Large-scale data science for real-world data
This hands-on course teaches the tools & methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes the students to the entire data science pipe ...
Related lectures (62)
Data Structuring: Intrarecord and Interrecord Techniques
Covers data structuring techniques, error detection, and functional dependencies within records.
Data Accuracy: Assessing Faithfulness and Error Detection
Explores data accuracy through faithfulness assessment, error detection, outlier handling, correlations, functional dependencies, violation detection, denial constraints, and data repairing techniques.
Temporality and Entity Resolution
Explores challenges in data temporality and techniques for entity resolution.
Related publications (161)

Extensions of Peer Prediction Incentive Mechanisms

Adam Julian Richardson

As large, data-driven artificial intelligence models become ubiquitous, guaranteeing high data quality is imperative for constructing models. Crowdsourcing, community sensing, and data filtering have long been the standard approaches to guaranteeing or imp ...
EPFL, 2024

Continuous corrected particle number concentration data in 10 sec resolution measured in the Swiss aerosol container using a whole air inlet during MOSAiC 2019/2020

Julia Schmale, Ivo Fabio Beck

This dataset contains corrected particle number concentration data measured during the year-long MOSAiC expedition from October 2019 to September 2020. Some periods of the measurements were affected by repeated step changes in the particle number concentra ...
EPFL Infoscience, 2023

Genetic features and genomic targets of human KRAB-zinc finger proteins

Didier Trono, Jacques Fellay, Priscilla Turelli, Christian Axel Wandall Thorball, Evaristo Jose Planet Letschert, Julien Léonard Duc, Romain Forey, Bara Khubieh, Sandra Eloise Kjeldsen, Alexandre Coudray, Michaël Imbeault, Cyril David Son-Tuyên Pulver, Jonas Caspar De Tribolet-Hardy

Krüppel-associated box (KRAB) domain-containing zinc finger proteins (KZFPs) are one of the largest groups of transcription factors encoded by tetrapods, with 378 members in human alone. KZFP genes are often grouped in clusters reflecting amplification by ...
2023
Related concepts (7)
Data Preprocessing
Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, amongst other issues. Analyzing data that has not been carefully screened for such problems can produce misleading results.
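
As a rough illustration (not tied to any particular course or tool), the pandas sketch below flags the three issues named above in a small, made-up table; the column names, thresholds, and data are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw table with typical collection problems.
df = pd.DataFrame({
    "age":        [34, -2, 51, None],         # -2 is out of range, None is missing
    "birth_year": [1990, 2025, 1973, 1980],   # 2025 is after the observation year
    "obs_year":   [2024, 2024, 2024, 2024],
})

out_of_range = df["age"].lt(0) | df["age"].gt(120)    # implausible ages
impossible   = df["birth_year"].gt(df["obs_year"])    # impossible combination of fields
missing      = df["age"].isna()

# Keep only rows that pass every check; dropped rows would be inspected or imputed.
screened = df[~(out_of_range | impossible | missing)]
print(screened)
```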
Data quality
Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose.
Data transformation (computing)
In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration. Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.
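
A minimal, standard-library sketch of such a transformation is given below: flat CSV text is reshaped into a nested JSON document grouped by country. The field names, grouping key, and sample values are assumptions made for the example.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical source data in one format/structure (flat CSV rows).
source_csv = """name,country,revenue
Acme,CH,120
Globex,CH,95
Initech,FR,80
"""

# Target structure: a nested mapping from country to a list of typed records.
grouped = defaultdict(list)
for row in csv.DictReader(io.StringIO(source_csv)):
    grouped[row["country"]].append({"name": row["name"], "revenue": int(row["revenue"])})

print(json.dumps(grouped, indent=2))
```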