Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s, stochastic gradient descent has become an important optimization method in machine learning. M-estimation Estimating equation Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: where the parameter that minimizes is to be estimated. Each summand function is typically associated with the -th observation in the data set (used for training). In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations). The sum-minimization problem also arises for empirical risk minimization. In this is the value of the loss function at -th example, and is the empirical risk. When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations: where is a step size (sometimes called the learning rate in machine learning).

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Nicolas Henri Bernard Flammarion, Hristo Georgiev Papazov, Scott William Pesme

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size

\gamma

and momentum parameter

\beta

that allows u ...

2024

Graph Chatbot

Chat with Graph Search

On the Generalization of Stochastic Gradient Descent with Momentum

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Understanding generalization and robustness in modern deep learning

On the Generalization of Stochastic Gradient Descent with Momentum

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Understanding generalization and robustness in modern deep learning