Metric learning for kernel ridge regression: assessment of molecular similarity

Friedrich Eisenbrand, Raimon Fabregat I De Aguilar-Amat, Puck Elisabeth van Gerwen

*IOP Publishing Ltd*, 2022

Journal paper

Abstract

Supervised and unsupervised kernel-based algorithms widely used in the physical sciences depend upon the notion of similarity. Their reliance on pre-defined distance metrics (e.g. the Euclidean or Manhattan distance) is problematic, especially when used in combination with high-dimensional feature vectors for which the similarity measure does not reflect well the differences in the target property. Metric learning is an elegant approach to surmount this shortcoming and find a property-informed transformation of the feature space. We propose a new algorithm for metric learning specifically adapted for kernel ridge regression (KRR): metric learning for kernel ridge regression (MLKRR). It is based on the Metric Learning for Kernel Regression framework using the Nadaraya-Watson estimator, which we show to be inferior to the KRR estimator for typical physics-based machine learning tasks. The MLKRR algorithm allows for superior predictive performance on the benchmark regression task of atomisation energies of QM9 molecules, as well as generating more meaningful low-dimensional projections of the modified feature space.
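
To make the two estimators named in the abstract concrete, here is a minimal sketch that evaluates both on features transformed by a linear map `A`. It assumes a Gaussian kernel on the transformed distance, k(x, x') = exp(-||A(x - x')||² / 2σ²), and it does not learn `A`: the matrix is picked by hand to show why a property-informed transformation can help. The function names, `sigma`, `lam` and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(XA, XB, A, sigma=1.0):
    """Gaussian kernel between rows of XA and XB after the linear map x -> A x."""
    ZA, ZB = XA @ A.T, XB @ A.T
    sq = (ZA**2).sum(1)[:, None] + (ZB**2).sum(1)[None, :] - 2.0 * ZA @ ZB.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def nadaraya_watson_predict(X_train, y_train, X_test, A, sigma=1.0):
    """Nadaraya-Watson: kernel-weighted average of the training targets."""
    K = gaussian_kernel(X_test, X_train, A, sigma)
    return (K @ y_train) / K.sum(axis=1)


def krr_predict(X_train, y_train, X_test, A, sigma=1.0, lam=1e-3):
    """Kernel ridge regression: solve (K + lam I) alpha = y, then predict K_test alpha."""
    K = gaussian_kernel(X_train, X_train, A, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return gaussian_kernel(X_test, X_train, A, sigma) @ alpha


if __name__ == "__main__":
    # Toy data (an assumption for illustration): the target ignores feature 1.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 2))
    y = np.sin(X[:, 0])
    X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
    metrics = [("identity", np.eye(2)), ("hand-picked", np.diag([1.0, 0.0]))]
    for name, A in metrics:
        for est in (nadaraya_watson_predict, krr_predict):
            mae = np.abs(est(X_tr, y_tr, X_te, A) - y_te).mean()
            print(f"{name:12s} {est.__name__:24s} MAE = {mae:.4f}")
```

In a metric-learning setting, `A` would be obtained by minimising a regression loss with respect to its entries rather than being chosen by hand as in this sketch.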

Related concepts (15)

Machine learning

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machin

Algorithm

In mathematics and computer science, an algorithm (ˈælɡərɪðəm) is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algo

Nonparametric regression

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. That is, no p

Related publications

In a society which produces and consumes an ever-increasing amount of information, methods which can make sense of all this data become crucially important. Machine learning tries to develop models which can make the information load accessible. Three important questions one can ask when constructing such models are:

- What is the structure of the data? This is especially relevant for high-dimensional data which can no longer be visualized.
- Which features are most characteristic?
- How to predict whether a pattern belongs to one class or to another?

This thesis investigates these three questions by trying to construct complex models from simple ones. The decomposition into simpler parts can also be found in the methods used for estimating the parameter values of these models. The algorithms for the simple models constitute the core of the algorithms for the complex ones. The above questions are addressed in three stages.

Unsupervised learning: This part deals with the problem of probability density estimation, with the goal of finding a good probabilistic representation of the data. One of the most popular density estimation methods is the Gaussian mixture model (GMM). Promising alternatives to GMMs are the recently proposed mixtures of latent variable models. Examples of the latter are principal component analysis (PCA) and factor analysis. The advantage of these models is that they are capable of representing the covariance structure with fewer parameters by choosing the dimension of a subspace in a suitable way. An empirical evaluation on a large number of data sets shows that mixtures of latent variable models almost always outperform GMMs. To avoid having to choose a value for the dimension of the subspace by a computationally expensive search technique such as cross-validation, a Bayesian treatment of mixtures of latent variable models is proposed. This framework makes it possible to determine the appropriate dimension during training, and experiments illustrate its viability.

Feature extraction: PCA is also (and foremost) a classic method for feature extraction. However, PCA is limited to linear feature extraction by a projection onto a subspace. Kernel PCA is a recent method which allows non-linear feature extraction. Applying kernel PCA to a data set with N patterns requires finding the eigenvectors of an N×N matrix. An Expectation-Maximization (EM) algorithm for PCA which does not need to store this matrix is adapted to kernel PCA and applied to large data sets with more than 10,000 examples. The experiments confirm that this approach is feasible and that the extracted features lead to good performance when used as pre-processed data for a linear classifier. A new on-line variant of the EM algorithm for PCA is presented and shown to speed up the learning process.

Supervised learning: This part illustrates two ways of constructing complex models from simple ones for classification problems. The first approach is inspired by unsupervised mixture models and extends them to supervised learning. The resulting model, called a mixture of experts, tries to decompose a complex problem into subproblems treated by several simpler models. The division of the data space is carried out by an input-dependent gating network. After a review of the model and existing training methods, different possible gating networks are proposed and compared. Unsupervised mixture models are one of the evaluated options. The experiments show that a standard mixture of experts with a neural network gate gives the best results. The second approach is a constructive algorithm called boosting, which creates a committee of simple models by emphasizing patterns which have been frequently misclassified by the preceding classifiers. A new model has been developed which lies between a mixture of experts and a boosted committee. It adds an input-dependent combiner (like a gating network) to standard boosting. This has the advantage that a considerably smaller committee obtains results comparable to those of boosting.

Finally, some of the investigated models have been evaluated on two problems of machine vision. The results confirm the potential of mixtures of latent variable models, which lead to good performance when incorporated in a Bayes classifier.
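
The idea of an EM algorithm for PCA that never forms the full covariance matrix can be sketched as follows. This is the textbook alternating least-squares formulation for plain (not kernel) PCA, given here as an assumption about the kind of iteration meant; the thesis's exact variant, its kernel adaptation and its on-line version are not reproduced. The name `em_pca` and its parameters are hypothetical.

```python
import numpy as np


def em_pca(Y, k, n_iter=200, seed=0):
    """EM iteration for PCA.

    Y is a centred data matrix of shape (d, n), one pattern per column.
    Returns an orthonormal basis of shape (d, k) for the principal subspace.
    Only (d, k), (k, n) and (k, k) arrays are formed, never the (d, d) covariance.
    """
    rng = np.random.default_rng(seed)
    d, _ = Y.shape
    C = rng.standard_normal((d, k))  # random initial basis
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: new basis C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # Orthonormalise the converged basis; the spanned subspace is unchanged.
    Q, _ = np.linalg.qr(C)
    return Q
```

The returned columns span the leading principal subspace but are not individually the principal components; if those are needed, a small k-dimensional PCA of `Q.T @ Y` recovers them.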

Humans have the ability to learn. Having seen an object, we can recognise it later. We can do this because our nervous system combines efficient and robust visual processing with the capability to learn from sensory input. Designing algorithms that learn from visual data, on the other hand, is a difficult task. More than fifty years ago, Rosenblatt proposed the perceptron algorithm. From data examples, the perceptron learns a linear separation that categorises the data into two classes. The algorithm served as a simple model of neuronal learning. Two further important ideas were added to the perceptron: first, to look for a maximal margin of separation; second, to separate the data in a possibly high-dimensional feature space, related nonlinearly to the initial space of the data and allowing nonlinear separations. Importantly, learning in the feature space can be performed implicitly, and hence efficiently, with the use of a kernel, a measure of similarity between two data points. The combination of these ideas led to the support vector machine, an efficient algorithm with high performance.

In this thesis, we design an algorithm to learn the categorisation of data into multiple classes. This algorithm is applied to a real-time vision task, the recognition of human faces. Our algorithm can be seen as a generalisation of the support vector machine to multiple classes. It is shown how the algorithm can be efficiently implemented. To avoid a large number of small but time-consuming updates of the variables, limited-accuracy computations are used. We prove a bound on the accuracy needed to find a solution. The proof motivates the use of a heuristic, which further increases efficiency. We derive a second implementation using a stochastic gradient descent method. This implementation is appealing as it has a direct interpretation and can be used in an online setting. Conceptually, our approach differs from standard support vector approaches because examples can be rejected and are not necessarily attributed to one of the categories. This is natural in the context of a vision task: at any time, the sensory input can be something unseen before and hence cannot be recognised.

Our visual data are images acquired with the recently developed adaptive vision sensor from CSEM. The vision sensor has two important features. First, like the human retina, it is locally adaptive to light intensity. Hence, the sensor has a high dynamic range. Second, the image gradient is computed on the sensor chip and is thus available directly from the sensor in real time. The sensor output is time encoded: the information about a strong local contrast is transmitted first and the weakest contrast information at the end.

To recognise faces, possibly moving in front of the camera, the sensor images have to be processed in a robust way. Representing images to exhibit local invariances is a common yet unsolved problem in computer vision. We develop the following representation of the sensor output. The image gradient information is decomposed into local histograms over contrast intensity. The histograms are local in position and direction of the gradient. Hence, the representation has local invariance properties to translation, rotation, and scaling. The histograms can be efficiently computed because the sensor output is already ordered with respect to the local contrast. Our support vector approach for multicategorical data uses the local histogram features to learn the recognition of faces. As recognition is time consuming, a face detection stage is used beforehand. We learn the detection features in an unsupervised manner using a specially designed optimisation procedure. The combined system to detect and recognise faces of a small group of individuals is efficient, robust, and reliable.
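
As a reminder of the starting point described in this abstract, the classic two-class perceptron can be sketched in a few lines. This is the standard mistake-driven update rule only; the thesis's multi-class algorithm with rejection, its margin maximisation and its kernel version are not shown. The function name and the toy data are illustrative assumptions.

```python
import numpy as np


def train_perceptron(X, y, n_epochs=100):
    """Learn a linear separation w.x + b for labels y in {-1, +1}.

    Each misclassified example pulls the separating hyperplane towards
    classifying it correctly; training stops after an epoch with no mistakes.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # wrong side of (or on) the hyperplane
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


# Toy, linearly separable data (an assumption for illustration).
X_demo = np.array([[2.0, 2.0], [1.0, 3.0], [3.0, 1.0],
                   [-2.0, -2.0], [-1.0, -3.0], [-3.0, -1.0]])
y_demo = np.array([1, 1, 1, -1, -1, -1])
w_demo, b_demo = train_perceptron(X_demo, y_demo)
```

On linearly separable data such as this toy set, the loop terminates with every training example on the correct side of the learned hyperplane.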
