Publication

Understanding generalization and robustness in modern deep learning

Maksym Andriushchenko
2024
EPFL thesis
Abstract

In this thesis, we study two closely related directions: robustness and generalization in modern deep learning. Deep learning models based on empirical risk minimization are often non-robust to small, worst-case perturbations known as adversarial examples, which can easily fool state-of-the-art deep neural networks into making wrong predictions. Their existence can be seen as a generalization problem: despite impressive average-case performance, deep learning models tend to learn non-robust features that can be exploited for adversarial manipulation. We delve into a range of questions related to robustness and generalization, such as how to accurately evaluate robustness, how to make robust training more efficient, and why some optimization algorithms lead to better generalization and learn qualitatively different features.

We begin the first direction by exploring computationally efficient methods for adversarial training and its failure mode, catastrophic overfitting, in which the model suddenly loses its robustness at some point during training. We then improve the understanding of robustness evaluation and of progress in the field by proposing new query-efficient black-box adversarial attacks based on random search, which do not rely on gradient information and can thus complement a typical robustness evaluation based on gradient-based methods. Finally, toward the same goal, we propose RobustBench, a new community-driven robustness benchmark that aims to systematically track progress in the field in a standardized way.

We begin the second direction by investigating the reasons behind the success of sharpness-aware minimization, a recent algorithm that increases robustness in the parameter space during training and improves generalization for deep networks. We then discuss why overparameterized models trained with stochastic gradient descent tend to generalize surprisingly well even without any explicit regularization, and we study the implicit regularization induced by stochastic gradient descent with large step sizes and its effect on the features learned by the model. Finally, we rigorously study the relationship between the sharpness of minima (i.e., robustness in the parameter space) and generalization, which prior works observed to correlate with each other. Our study suggests that, contrary to common belief, sharpness is not a good indicator of generalization: it tends to correlate with hyperparameters such as the learning rate rather than with generalization itself.
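To make the first direction concrete, here is a minimal sketch of single-step (FGSM) adversarial training with a random start, the kind of computationally efficient robust training the abstract refers to. It assumes PyTorch; model, loader, optimizer, epsilon, and alpha are illustrative names and values, not the thesis's exact algorithm.

    import torch
    import torch.nn.functional as F

    def fgsm_train_epoch(model, loader, optimizer, epsilon=8/255, alpha=10/255):
        model.train()
        for x, y in loader:
            # Random start inside the L-inf ball, then one FGSM step on the input.
            delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
            delta.requires_grad_(True)
            loss = F.cross_entropy(model(x + delta), y)
            grad = torch.autograd.grad(loss, delta)[0]
            delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()
            # Train on the perturbed example (clamped to the valid pixel range).
            optimizer.zero_grad()
            F.cross_entropy(model((x + delta).clamp(0, 1)), y).backward()
            optimizer.step()

With an overly aggressive inner step size, exactly this kind of single-step training can catastrophically overfit: robustness to multi-step attacks suddenly collapses even while accuracy against the single-step attack keeps improving.

For the second direction, the following is a minimal sketch of one sharpness-aware minimization (SAM) step under the same caveats: the weights are first perturbed toward a nearby high-loss point, and the descent step then uses the gradient taken there. The perturbation radius rho is an illustrative hyperparameter.

    import torch

    def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
        # First forward/backward pass: gradient at the current weights w.
        loss_fn(model(x), y).backward()
        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        # Ascend to w + rho * g / ||g||, an approximate worst-case point nearby.
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.add_(e)
        optimizer.zero_grad()
        # Second forward/backward pass: gradient taken at the perturbed weights.
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)  # restore the original weights
        optimizer.step()  # descend using the sharpness-aware gradient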

Related concepts (34)
Deep learning
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Learning can be supervised, semi-supervised, or unsupervised.
Stochastic gradient descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
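As a minimal illustration of this idea (assuming nothing beyond NumPy; the data and step size are made up), the following sketch runs SGD on a least-squares problem, using the gradient of a single randomly drawn example in place of the full gradient at every step:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    w = np.zeros(5)
    step = 0.01
    for t in range(10_000):
        i = rng.integers(len(X))          # pick one example at random
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i . w - y_i)^2
        w -= step * grad                  # noisy descent step
    print(np.linalg.norm(w - w_true))     # small: w approaches the true weights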
Gradient descent
In mathematics, gradient descent (also often called steepest descent) is an iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.
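As a minimal sketch of the update rule, here is gradient descent on the toy quadratic f(x, y) = x^2 + 10*y^2 (the function and step size are illustrative):

    import numpy as np

    def grad_f(p):
        # Gradient of f(x, y) = x^2 + 10 * y^2.
        return np.array([2 * p[0], 20 * p[1]])

    p = np.array([5.0, 2.0])
    step = 0.05
    for _ in range(200):
        p = p - step * grad_f(p)  # step opposite the gradient
    print(p)                      # close to the minimum at [0, 0]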
Related publications (125)

Optimization Algorithms for Decentralized, Distributed and Collaborative Machine Learning

Anastasiia Koloskova

Distributed learning is key to enabling the training of modern large-scale machine learning models by parallelising the learning process. Collaborative learning is essential for learning from privacy-sensitive data that is distributed across various ...
EPFL, 2024

Efficient local linearity regularization to overcome catastrophic overfitting

Volkan Cevher, Grigorios Chrysos, Fanghui Liu, Elias Abad Rocamora

Catastrophic overfitting (CO) in single-step adversarial training (AT) results in abrupt drops in the adversarial test accuracy (even down to 0%). For models trained with multi-step AT, it has been observed that the loss function behaves locally linearly w ...
2024
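The local linearity mentioned in this abstract can be probed, for instance, by comparing input gradients at a clean point and at a randomly perturbed point inside the L-inf ball; such gradient alignment is known to drop sharply when catastrophic overfitting occurs. The sketch below illustrates that measurement and is not the paper's regularizer:

    import torch
    import torch.nn.functional as F

    def input_grad(model, x, y):
        # Gradient of the loss with respect to the input.
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        return torch.autograd.grad(loss, x)[0]

    def grad_alignment(model, x, y, epsilon=8/255):
        g_clean = input_grad(model, x, y)
        delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
        g_pert = input_grad(model, x + delta, y)
        # Average cosine similarity between the two gradients over the batch.
        return F.cosine_similarity(g_clean.flatten(1), g_pert.flatten(1)).mean()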

Towards Trustworthy Deep Learning for Image Reconstruction

Alexis Marie Frederic Goujon

The remarkable ability of deep learning (DL) models to approximate high-dimensional functions from samples has sparked a revolution, whose impact cannot be overemphasized, across numerous scientific and industrial domains. In sensitive applications, the good perform ...
EPFL, 2024
Related MOOCs (9)
Introduction to optimization on smooth manifolds: first order methods
Learn to optimize on smooth, nonlinear spaces: join us to build your foundations (starting at "what is a manifold?") and confidently implement your first algorithm (Riemannian gradient descent).