
# Landscape and training regimes in deep learning

Abstract

Deep learning algorithms are responsible for a technological revolution in a variety of tasks, including image recognition and Go playing. Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension, a feat generically impossible due to the geometry of high-dimensional space and the associated curse of dimensionality. Understanding what kind of structure, symmetry or invariance makes data such as images learnable is a fundamental challenge. Other puzzles include that (i) learning corresponds to minimizing a loss in high dimension, which is in general not convex and could well get stuck in bad minima; (ii) the predictive power of deep learning increases with the number of fitting parameters, even in a regime where data are perfectly fitted. In this manuscript, we review recent results elucidating (i, ii) and the perspective they offer on the (still unexplained) curse-of-dimensionality paradox. We base our theoretical discussion on the (h, α) plane, where h controls the number of parameters and α the scale of the output of the network at initialization, and provide new systematic measures of performance in that plane for two common image classification datasets. We argue that different learning regimes can be organized into a phase diagram. A line of critical points sharply delimits an under-parametrized phase from an over-parametrized one. In over-parametrized nets, learning can operate in two regimes separated by a smooth cross-over. At large initialization, it corresponds to a kernel method, whereas for small initializations features can be learnt, together with invariants in the data. We review the properties of these different phases, of the transition separating them, and some open questions. Our treatment emphasizes analogies with physical systems, scaling arguments and the development of numerical observables to quantitatively test these results empirically. Practical implications are also discussed, including the benefit of averaging nets with distinct initial weights, or the choice of parameters (h, α) optimizing performance.
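The role of the output scale α can be illustrated numerically. The following is a minimal toy sketch (our own setup, not the paper's experiments): a one-hidden-layer network whose output is rescaled by α, with the output at initialization subtracted and the step size rescaled by 1/α², as is standard in studies of the lazy regime. After training, the relative weight change shrinks as α grows, the signature of the kernel (lazy) regime, while at small α the weights move substantially and features can evolve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a one-hidden-layer net of width h whose output is scaled by alpha.
h, n = 100, 20
x = rng.normal(size=n)          # 1-d inputs
y = np.sin(x)                   # targets
w0 = rng.normal(size=h)         # input weights at initialization
a0 = rng.normal(size=h)         # output weights at initialization

def f(w, a):
    return np.tanh(np.outer(x, w)) @ a / np.sqrt(h)

def relative_weight_change(alpha, steps=200, lr=0.1):
    w, a = w0.copy(), a0.copy()
    f_init = f(w0, a0)          # subtract the output at init, as in lazy-training setups
    eta = lr / alpha**2         # standard 1/alpha^2 rescaling of the step size
    for _ in range(steps):
        t = np.tanh(np.outer(x, w))                      # shape (n, h)
        err = alpha * (t @ a / np.sqrt(h) - f_init) - y  # d(0.5*MSE)/d(prediction)
        grad_a = alpha * (t.T @ err) / (np.sqrt(h) * n)
        grad_w = alpha * a * ((err * x) @ (1 - t**2)) / (np.sqrt(h) * n)
        a -= eta * grad_a
        w -= eta * grad_w
    return np.linalg.norm(w - w0) / np.linalg.norm(w0)

moved_small = relative_weight_change(alpha=0.1)    # feature-learning side
moved_large = relative_weight_change(alpha=100.0)  # kernel (lazy) side
assert moved_large < moved_small  # large alpha: weights barely move
```

The comparison, not the absolute numbers, is the point: at large α the parameters stay close to initialization and the dynamics are well described by a fixed kernel, whereas small α lets the hidden-layer weights travel.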



Related concepts (10)

Data

In common usage and statistics, data (US: /ˈdætə/; UK: /ˈdeɪtə/) is a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, or other basic units of meaning.

Parameter

A parameter, generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when identifying the system or evaluating its performance.

Dimension

In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coordinate is needed to specify a point on it.

Related publications (27)


One of the main goals of Artificial Intelligence is to develop models capable of providing valuable predictions in real-world environments. In particular, Machine Learning (ML) seeks to design such models by learning from examples coming from this same environment. However, the real world is most of the time not static, and the environment in which the model will be used can differ from the one in which it is trained. It is hence desirable to design models that are robust to changes of environment. This encapsulates a large family of topics in ML, such as adversarial robustness, meta-learning, domain adaptation and others, depending on the way the environment is perturbed.

In this dissertation, we focus on methods for training models whose performance does not drastically degrade when applied to environments differing from the one the model has been trained in. Various types of environmental changes will be treated, differing in their structure or magnitude. Each setup defines a certain kind of robustness to certain environmental changes, and leads to a certain optimization problem to be solved. We consider three different setups, and propose algorithms for solving each associated problem using three different types of methods, namely, min-max optimization (Chapter 2), regularization (Chapter 3) and variable selection (Chapter 4).

Leveraging the framework of distributionally robust optimization, which phrases the problem of robust training as a min-max optimization problem, we first aim to train robust models by directly solving the associated min-max problem. This is done by exploiting recent work on game theory as well as first-order sampling algorithms based on Langevin dynamics. Using this approach, we propose a method for training robust agents in the scope of Reinforcement Learning.

We then treat the case of adversarial robustness, i.e., robustness to small arbitrary perturbations of the model's input. It is known that neural networks trained using classical optimization methods are particularly sensitive to this type of perturbation. The adversarial robustness of a model is tightly connected to its smoothness, which is quantified by its so-called Lipschitz constant. This constant measures how much the model's output changes upon any bounded input perturbation. We hence develop a method to estimate an upper bound on the Lipschitz constant of neural networks via polynomial optimization, which can serve as a robustness certificate against adversarial attacks. We then propose to penalize the Lipschitz constant during training by minimizing the 1-path-norm of the neural network, and we develop an algorithm for solving the resulting regularized problem by efficiently computing the proximal operator of the 1-path-norm term, which is non-smooth and non-convex.

Finally, we consider a scenario where the environmental changes can be arbitrarily large (as opposed to adversarial robustness) but must preserve a certain causal structure. Recent works have demonstrated interesting connections between robustness and the use of causal variables. Assuming that certain mechanisms remain invariant under some change of the environment, it has been shown that knowing the underlying causal structure of the data at hand allows training models that are invariant to such changes. Unfortunately, in many cases, the causal structure is unknown. We thus propose an algorithm for causal discovery from observational data in the case of non-linear additive models.
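The polynomial-optimization bound developed in the thesis is not reproduced here; as a simpler point of reference, the classical (looser) Lipschitz certificate multiplies the spectral norms of the layers, since tanh is 1-Lipschitz. A minimal sketch with made-up weights, also computing the 1-path-norm mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-layer network x -> W2 @ tanh(W1 @ x); tanh is 1-Lipschitz.
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(1, 16))

def net(x):
    return W2 @ np.tanh(W1 @ x)

# Classical (loose) Lipschitz upper bound: product of spectral norms.
lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

# 1-path-norm: sum over input-output paths of products of absolute weights;
# for this net it upper-bounds the Lipschitz constant w.r.t. the l-inf norm.
path_norm_1 = np.sum(np.abs(W2) @ np.abs(W1))

# Empirical lower estimate: largest slope over random input pairs.
xs = rng.normal(size=(1000, 4))
ys = rng.normal(size=(1000, 4))
slopes = [np.linalg.norm(net(a) - net(b)) / np.linalg.norm(a - b)
          for a, b in zip(xs, ys)]
emp = max(slopes)
assert emp <= lip_bound  # the certificate always dominates the empirical slope
```

Any valid upper bound of this kind can serve as a robustness certificate: if the bound times the perturbation budget is smaller than the classification margin, no attack within that budget can flip the prediction.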

Neural networks (NNs) have been very successful in a variety of tasks ranging from machine translation to image classification. Despite their success, the reasons for their performance are still not well understood. This thesis explores two main themes: loss landscapes and symmetries present in data.

Machine learning consists of training models on data by optimizing the model parameters. This optimization is done by minimizing a loss function. NNs, a family of machine learning models, are created by composing functions, called layers. Informally, they can be visualized as a set of interconnected neurons.

Ten years ago, NNs became the most popular models of machine learning. With their success come many open questions. For example, neural networks and glassy systems both have many degrees of freedom and highly non-convex objective or energy functions, respectively. However, glassy systems get stuck in local minima near where they are initialized, whereas neural networks avoid this even when they have hundreds of times more parameters than the number of data points used to train them. This raises two questions: (i) What drives this difference in behavior? (ii) How is it then that NNs do not become too specialized to the training data (overfitting)?

In the first part of this thesis, we show that in classification tasks, NNs undergo a jamming transition dependent on the number of parameters, $N$. This answers (i): with a sufficiently high $N$ above a critical number $N^*$, local minima are avoided. Then, we establish a "double-descent" behavior in the test error of classification tasks: it decreases twice as a function of $N$, before $N^*$ but also after, until infinity, where it converges to its minimum. We answer (ii) by explaining the origins of this double descent. Finally, we introduce a phase diagram that describes the landscape of the loss function and unifies the two limits in which a neural network can converge when sending $N$ to infinity.

In the second part of this thesis, we explore the issue of the curse of dimensionality (CD): sampling a $d$-dimensional space requires an exponential number of points $P$. However, NNs perform well even for $P \ll \exp(d)$. Symmetries in the data play a role in this conundrum. For example, to process images we use convolutional NNs (CNNs), which have the property of being locally connected and equivariant with respect to translations, i.e., a translation in the input leads to a corresponding translation in the output. Although empirical experience suggests that locality and equivariance contribute to the success of CNNs, it is difficult to understand how. Indeed, equivariance reduces the dimensionality of the data only slightly. Stability toward diffeomorphisms, however, might be the key to CD. We studied how NNs are affected by images distorted by diffeomorphisms. Our results suggest that locality and equivariance allow NNs, during learning, to develop stability towards diffeomorphisms *relative* to other generic transformations. Following this intuition, we have created new architectures by extending CNN properties to 3D rotations.

Our work contributes to the current understanding of the behavior of neural networks empirically observed by machine learning practitioners. Moreover, the architectures developed for 3D rotation problems are currently being applied to a wide range of domains.

Segmenting images is a significant challenge that has drawn a lot of attention from different fields of artificial intelligence and has many practical applications. One such challenge addressed in this thesis is the segmentation of electron microscope (EM) imaging of neural tissue. EM microscopy is one of the key tools used to analyze neural tissue and understand the brain, but the huge amounts of data it produces make automated analysis necessary.

In addition to the challenges specific to EM data, the common problems encountered in image segmentation must also be addressed. These problems include extracting discriminative features from the data and constructing a statistical model using ground-truth data. Although complex models appear to be more attractive because they allow for more expressiveness, they also lead to a higher computational complexity. On the other hand, simple models come with a lower complexity but express the real world less faithfully. Therefore, one of the most challenging tasks in image segmentation is constructing models that are expressive enough while remaining tractable.

In this work, we propose several automated graph-partitioning approaches that address these issues. These methods reduce the computational complexity by operating on supervoxels instead of voxels, incorporating features capable of describing the 3D shape of the target objects, and using structured models to account for correlation in output variables. One of the non-trivial issues with such models is that their parameters must be carefully chosen for optimal performance. A popular approach to learning model parameters is a maximum-margin approach called Structured SVM (SSVM) that provides optimality guarantees but also suffers from two main drawbacks. First, SSVM-based approaches are usually limited to linear kernels, since more powerful nonlinear kernels cause the learning to become prohibitively expensive. In this thesis, we introduce an approach to “kernelize” the features so that a linear SSVM framework can leverage the power of nonlinear kernels without incurring their high computational cost. Second, the optimality guarantees are violated for complex models with strong inter-relations between the output variables. We propose a new subgradient-based method that is more robust and leads to improved convergence properties and increased reliability.

The different approaches presented in this thesis are applicable to both natural and medical images. They are able to segment mitochondria at a performance level close to that of a human annotator, and outperform state-of-the-art segmentation techniques while still benefiting from a low learning time.
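The thesis's exact kernelization procedure is not detailed here; one standard way to let a linear learner exploit a nonlinear kernel at low cost is an explicit low-rank feature map such as the Nyström approximation. A hedged sketch (kernel choice, landmark count and all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, gamma=0.05):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = rng.normal(size=(200, 5))

# Nystrom approximation: k(x, y) ~= phi(x) @ phi(y) using m landmark points,
# so any linear learner applied to phi behaves like a kernel machine.
m = 50
landmarks = X[rng.choice(len(X), size=m, replace=False)]
vals, U = np.linalg.eigh(rbf(landmarks, landmarks))
inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-8))) @ U.T
phi = rbf(X, landmarks) @ inv_sqrt      # explicit features, shape (200, m)

K_exact = rbf(X, X)
rel_err = np.linalg.norm(K_exact - phi @ phi.T) / np.linalg.norm(K_exact)
```

Training a linear max-margin model on `phi` then costs what a linear method costs, while its decision function approximates that of the corresponding kernel machine, which is the general idea behind making nonlinear kernels affordable in a linear SSVM framework.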