Summary
Data preparation is the act of manipulating (or pre-processing) raw data, which may come from disparate sources, into a form that can be analysed readily and accurately, e.g. for business purposes. It is the first step in a data analytics project and can include many discrete tasks, such as data loading (ingestion), data fusion, data cleaning, data augmentation, and data delivery.

The issues to be dealt with fall into two main categories: systematic errors involving large numbers of records, probably because the records have come from different sources; and individual errors affecting small numbers of records, probably due to mistakes in the original data entry.

The first step is to set out a full and detailed specification of the format of each data field and of what its entries mean. This should take careful account of: (a) most importantly, consultation with the users of the data; (b) any available specification of the system that will use the data to perform the analysis; and (c) a full understanding of the information available in the source data, and of any gaps in it. See also data definition specification.

Suppose, for example, there is a two-character alphabetic field that indicates geographical location. In one data source the code "EE" might mean "Europe", while in another the same code means "Estonia". One would need to devise an unambiguous set of codes and amend the codes in one set of records accordingly. Furthermore, the "geographical area" might refer to any of, e.g., the delivery address, the billing address, the address from which goods are supplied, the billing currency, or the applicable national regulations. All these matters must be covered in the specification.

There could also be some records with "X" or "555" in that field. This is clearly invalid data, as it does not conform to the specification. If there are only a few such records, one would either correct them manually or, if precision is not important, simply delete them from the file. Another possibility would be to create a "not known" category.
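
The "EE" example and the handling of invalid entries such as "X" or "555" can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the source: the source names `source_a` and `source_b`, the unified code set, the region code "EU", and the "not known" marker "ZZ". A real project would take all of these from its own field specification.

```python
# Hypothetical unified code set agreed in the field specification.
# "EU" (Europe as a region) and "ZZ" (not known) are illustrative choices.
VALID_CODES = {"EE", "FR", "DE", "EU", "ZZ"}
NOT_KNOWN = "ZZ"

# Per-source recoding tables (hypothetical). In source_b the code "EE"
# meant "Europe", so it is rewritten to the unified region code "EU".
RECODE = {
    "source_a": {},             # "EE" already means Estonia here
    "source_b": {"EE": "EU"},   # "EE" meant Europe here
}


def harmonise(code, source):
    """Translate a source-specific code into the unified code set."""
    return RECODE[source].get(code, code)


def clean_record(record, drop_invalid=False):
    """Return a cleaned copy of the record, or None if it is dropped.

    Codes that are not in the unified set are either mapped to the
    "not known" category or, if drop_invalid is True, discarded.
    """
    code = harmonise(record["location"], record["source"])
    if code not in VALID_CODES:
        if drop_invalid:
            return None
        code = NOT_KNOWN
    return {**record, "location": code}


records = [
    {"id": 1, "source": "source_a", "location": "EE"},   # Estonia
    {"id": 2, "source": "source_b", "location": "EE"},   # Europe
    {"id": 3, "source": "source_a", "location": "X"},    # invalid
    {"id": 4, "source": "source_b", "location": "555"},  # invalid
]

# Option 1: keep every record, marking invalid codes as "not known".
cleaned = [clean_record(r) for r in records]

# Option 2: delete records whose code does not conform to the specification.
candidates = (clean_record(r, drop_invalid=True) for r in records)
kept = [c for c in candidates if c is not None]
```

The sketch covers only two of the options mentioned above (deleting invalid records and assigning a "not known" category). Manual correction would normally be a separate, human-reviewed step, and whether deletion is acceptable depends on how much precision the analysis needs.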