Extrapolation for Large-batch Training in Deep Learning

Martin Jaggi, Sebastian Urban Stich, Tao Lin, Lingjing Kong
2020
Conference paper

Abstract

Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock when increasing the batch size to a substantial fraction of the training data in order to reduce training time is the persistent degradation in performance (the generalization gap). To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients, and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood; it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we propose to use computationally efficient extrapolation (extragradient) instead, to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
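To make the extrapolation idea concrete, here is a minimal sketch of a generic extragradient-style SGD loop in plain Python. It is not the paper's algorithm or its unified framework: the toy loss (mean squared distance to scalar data points), the function names `grad` and `extragradient_sgd`, and all hyperparameter values are assumptions chosen for illustration. Each iteration first takes a trial step to a lookahead point, then updates the original iterate with a gradient evaluated at that lookahead point.

```python
import random


def grad(w, batch):
    """Mini-batch gradient of the toy loss 0.5 * mean((w - x)^2)."""
    return sum(w - x for x in batch) / len(batch)


def extragradient_sgd(data, w0=0.0, lr=0.1, batch_size=64, steps=200, seed=0):
    """Generic extragradient-style SGD on a scalar parameter (illustrative only)."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        # Extrapolation: trial step from w to a lookahead point.
        w_look = w - lr * grad(w, rng.sample(data, batch_size))
        # Update: move the original iterate w using the gradient
        # evaluated at the lookahead point, on a fresh mini-batch.
        w = w - lr * grad(w_look, rng.sample(data, batch_size))
    return w


if __name__ == "__main__":
    data = [float(i) for i in range(100)]  # toy data; the minimizer is the mean, 49.5
    print(extragradient_sgd(data))
```

On this toy problem the iterate settles near the data mean; with `batch_size=len(data)` the gradients are exact and the loop reduces to the deterministic extragradient method. A real large-batch setup would replace `grad` with a backward pass over a neural network.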

Official source

https://infoscience.epfl.ch/record/286859?ln=fr



Understanding generalization and robustness in modern deep learning

On the Generalization of Stochastic Gradient Descent with Momentum

Optimization Algorithms for Decentralized, Distributed and Collaborative Machine Learning
