**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Concept# Training, validation, and test data sets

Summary

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and test sets.
The model is initially fit on a training data set, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training data set often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training data set and produces a result, which is then compared with the target, for each input vector in the training data set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation.
Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. The validation data set provides an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters (e.g. the number of hidden units—layers and layer widths—in a neural network). Validation datasets can be used for regularization by early stopping (stopping training when the error on the validation data set increases, as this is a sign of over-fitting to the training data set).

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts (16)

Related people (51)

Related units (5)

Related courses (32)

Related publications (455)

Related lectures (203)

Statistical learning theory

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics. The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning.

Bias–variance tradeoff

In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set: The bias error is an error from erroneous assumptions in the learning algorithm.

Random forest

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.

DH-406: Machine learning for DH

This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and imple

ME-443: Hydroacoustique pour aménagements hydroélectriques

Introduction aux phénomènes propagatifs dans les circuits hydrauliques, calculs de coups de béliers, comportement transitoire d'aménagements hydroélectriques, simulation numériques 1D du comportement

PHYS-467: Machine learning for physicists

Machine learning and data analysis are becoming increasingly central in sciences including physics. In this course, fundamental principles and methods of machine learning will be introduced and practi

Neural Networks: Deep Neural Networks

Explores the basics of neural networks, with a focus on deep neural networks and their architecture and training.

Linear Regression and Gradient Descent

Covers linear regression, gradient descent, overfitting, and ridge regression among other concepts.

Introduction to Machine Learning

Covers the basics of machine learning for physicists and chemists, focusing on image classification and dataset labeling.

Lijing Xin, Yan Li, Yubo Zhao, Yan Lin, Wei Ye

Metabolic changes precede malignant histology. However, it remains unclear whether detectable characteristic metabolome exists in esophageal squamous cell carcinoma (ESCC) tissues and biofluids for early diagnosis. Here, we conduct NMR- and MS-based metabo ...

Supervised machine learning models are receiving increasing attention in electricity theft detection due to their high detection accuracy. However, their performance depends on a massive amount of labeled training data, which comes from time-consuming and ...

David Atienza Alonso, Alireza Amirshahi, Jonathan Dan, Adriano Bernini, William Cappelletti, Luca Benini, Una Pale

The need for high-quality automated seizure detection algorithms based on electroencephalography (EEG) becomes ever more pressing with the increasing use of ambulatory and long-term EEG monitoring. Heterogeneity in validation methods of these algorithms in ...

2024