Data Preprocessing

Data preprocessing is the manipulation or dropping of data before it is used, in order to ensure or enhance performance, and it is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, among other issues. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data must be ensured before running any analysis. Often, data preprocessing is the most important phase of a machine learning project, especially in computational biology. If a high proportion of the data is irrelevant, redundant, noisy, or unreliable, then knowledge discovery during the training phase may be more difficult. Data preparation and filtering steps can take a considerable amount of processing time.

Examples of methods used in data preprocessing include cleaning, instance selection, normalization, one-hot encoding, data transformation, feature extraction and feature selection.

The origins of data preprocessing lie in data mining, where the idea was to aggregate existing information and search its content. Later it was recognized that machine learning and neural networks need a data preprocessing step too, so it has become a universal technique used in computing in general. Data cleaning removes unwanted data during preprocessing, which leaves the user with a dataset that contains more valuable information for data manipulation later in the data mining process.
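Three of the methods named above (cleaning, normalization and one-hot encoding) can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the records, the field meanings, the valid age range of 0 to 120, and all function names are hypothetical choices made for the example.

```python
# Hypothetical raw records: (age in years, blood type); None marks a missing value.
RAW = [(34, "A"), (None, "B"), (-5, "O"), (51, "B"), (68, "A")]


def clean(records, lo=0, hi=120):
    """Drop records with a missing field or an out-of-range age (assumed range)."""
    return [
        (age, label)
        for age, label in records
        if age is not None and label is not None and lo <= age <= hi
    ]


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    if not values:
        return []
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [(v - vmin) / span if span else 0.0 for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted distinct labels."""
    categories = sorted(set(labels))
    return categories, [[1 if lab == c else 0 for c in categories] for lab in labels]


def preprocess(records):
    """Clean the records, then return (categories, feature rows)."""
    kept = clean(records)
    ages = min_max([age for age, _ in kept])
    categories, encoded = one_hot([label for _, label in kept])
    # Each row is the normalized age followed by the one-hot label vector.
    return categories, [[age] + row for age, row in zip(ages, encoded)]


categories, rows = preprocess(RAW)
```

Here the record with a missing age and the record with an age of -5 are dropped, so `rows` holds three feature vectors. Categories are computed after cleaning, which means a label that only appears in dropped records (such as "O" in this sample) gets no column; whether that is desirable depends on the application.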
Editing such a dataset to correct either data corruption or human error is a crucial step toward obtaining accurate quantifiers such as the true positives, true negatives, false positives and false negatives of a confusion matrix, which are commonly used in medical diagnosis.
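To illustrate why label quality matters for these quantifiers, the sketch below counts the four confusion-matrix cells for a binary classifier and compares a correct label set with one containing a single data-entry error. The patient data and the function name are hypothetical, and labels are assumed to be 1 for positive and 0 for negative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


# Hypothetical diagnoses for six patients.
PREDICTED = [1, 0, 0, 1, 1, 0]
TRUE_LABELS = [1, 0, 1, 1, 0, 0]
# Same labels, but the last record was entered incorrectly (0 recorded as 1).
CORRUPTED_LABELS = [1, 0, 1, 1, 0, 1]

clean_counts = confusion_counts(TRUE_LABELS, PREDICTED)
corrupted_counts = confusion_counts(CORRUPTED_LABELS, PREDICTED)
```

With the correct labels the counts are 2 true positives, 2 true negatives, 1 false positive and 1 false negative. The single corrupted record turns one true negative into a false negative, which would make the same classifier look less sensitive than it is.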

Data and scripts for "Unraveling secondary ice production in winter orographic clouds through a synergy of in-situ observations, remote sensing and modeling"

Athanasios Nenes, Alexis Berne, Satoshi Takahama, Georgia Sotiropoulou, Paraskevi Georgakaki, Romanos Foskinis, Kunfeng Gao, Anne-Claire Marie Billault--Roux

This repository contains field observations and processed data from the Weather Research and Forecasting (WRF) model simulations and the Cloud Resolving Model Radar Simulator (CR-SIM), alongside scripts designed to reproduce the figures presented in the pa ...

Zenodo, 2024

Data and scripts for the RaFSIP scheme

Athanasios Nenes, Paraskevi Georgakaki

This repository contains microphysics routines, scripts, and processed data from the Weather Research and Forecasting (WRF) model simulations presented in the paper "RaFSIP: Parameterizing ice multiplication in models using a machine learning approach", by ...

Zenodo, 2024

Robust machine learning for neuroscientific inference
