
Stochastic gradient descent

Summary

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.
While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s, stochastic gradient descent has become an important optimization method in machine learning.
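The idea described above can be made concrete with a minimal sketch: least-squares linear regression trained one randomly chosen sample at a time. All names and data here (`X`, `y`, `lr`, the epoch count) are invented for the example, not taken from this page:

```python
import numpy as np

# Minimal illustrative SGD for least-squares linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.05
for epoch in range(50):
    for i in rng.permutation(len(y)):          # one random sample per step
        grad_i = (X[i] @ w - y[i]) * X[i]      # gradient of the i-th summand
        w -= lr * grad_i                       # noisy descent step

print(np.round(w, 2))
```

Each step uses the gradient of a single summand rather than the full-data gradient, which is exactly the cheap-but-noisy trade-off the summary describes.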
Background
M-estimation
Estimating equation
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum, where each summand is typically associated with one observation in the data set.
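In standard notation (the symbols below are the conventional ones for SGD, not taken from the truncated text above), the sum-structured objective and the SGD update read:

```latex
Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),
\qquad
w_{t+1} = w_t - \eta \, \nabla Q_{i_t}(w_t),
\quad i_t \sim \mathrm{Uniform}\{1, \dots, n\}
```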




Related concepts (13)

Gradient descent

In mathematics, gradient descent (also often called steepest descent) is an iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.

Machine learning

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines discover their own algorithms from data.

Logistic regression

In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables.

Related publications (100)


Related lectures (196)

Finding convergence rates for numerical optimization algorithms is an important task, because it justifies their use in solving practical problems while also providing a way to compare their efficiency. This is especially useful in an asynchronous environment, because such algorithms are often shown empirically to be more efficient than their synchronous counterparts, but they lack the theory that justifies this property. Furthermore, by analyzing the issues that can arise when inconsistency is taken into consideration, one obtains a clearer picture of the downsides of inexact implementations. This work addresses the problem of finding an expected convergence rate for the asynchronous version of the widely used stochastic gradient descent algorithm, applied to the common class of problems whose cost function has a sum structure. It follows an approach similar to the one suggested by R. Leblond, F. Pedregosa and S. Lacoste-Julien in "ASAGA: Asynchronous Parallel SAGA" (2016) [RLLJ16], also borrowing their formalization of asynchronicity. The main achievement of this work is a bound on the constant step size that guarantees convergence in expectation of the algorithm; the corresponding convergence rate is also obtained. The result is partially validated by sequential models of an asynchronous environment. We hope that this can serve as a basis for future applications of the same approach to more specific algorithms, and that numerical experiments on real multiprocessor architectures can further validate the convergence rates.
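The setting this abstract studies can be mimicked with a sequential simulation of asynchronous SGD, where each update uses a gradient computed at a stale copy of the parameters. Everything below (the delay model, data, constant step size) is an illustrative assumption, not the paper's actual construction:

```python
import numpy as np

# Sequential simulation of asynchronous SGD with stale (delayed) gradients.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
true_w = rng.normal(size=4)
y = X @ true_w                    # noiseless targets

max_delay = 5                     # worst-case staleness of a worker's read
lr = 0.02                         # constant step size
w = np.zeros(4)
history = [w.copy()]              # past iterates, to read stale parameters from

for t in range(3000):
    delay = int(rng.integers(0, max_delay + 1))
    w_stale = history[max(0, len(history) - 1 - delay)]
    i = int(rng.integers(len(y)))
    grad = (X[i] @ w_stale - y[i]) * X[i]   # gradient evaluated at stale iterate
    w = w - lr * grad
    history.append(w.copy())

print(float(np.mean((X @ w - y) ** 2)))
```

Despite the stale reads, a small enough constant step size still drives the loss down, which is the qualitative behavior the step-size bound in the abstract formalizes.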

2017

Our brain continuously self-organizes to construct and maintain an internal representation of the world based on the information arriving through sensory stimuli. Remarkably, cortical areas related to different sensory modalities appear to share the same functional unit, the neuron, and develop through the same learning mechanism, synaptic plasticity. This motivates the conjecture of a unifying theory to explain cortical representational learning across sensory modalities. In this thesis we present theories and computational models of learning and optimization in neural networks, postulating functional properties of synaptic plasticity that support the apparent universal learning capacity of cortical networks.

In the past decades, a variety of theories and models have been proposed to describe receptive field formation in sensory areas, including normative models such as sparse coding and bottom-up models such as spike-timing dependent plasticity. We bring together candidate explanations by demonstrating that a single principle is sufficient to explain receptive field development. First, we show that many representative models of sensory development in fact implement variations of a common principle: nonlinear Hebbian learning. Second, we reveal that nonlinear Hebbian learning is sufficient for receptive field formation through sensory inputs. A surprising result is that our findings are independent of specific details and allow for robust predictions of the learned receptive fields. Thus nonlinear Hebbian learning and natural statistics can account for many aspects of receptive field formation across models and sensory modalities.

The Hebbian learning theory substantiates that synaptic plasticity can be interpreted as an optimization procedure implementing stochastic gradient descent, in which inputs arrive sequentially, as in sensory streams. However, individual data samples carry very little information about the correct learning signal, and it becomes a fundamental problem to know how many samples are required for reliable synaptic changes. Through estimation theory, we develop a novel adaptive learning rate model that adapts the magnitude of synaptic changes based on the statistics of the learning signal, enabling an optimal use of data samples. Our model has a simple implementation and demonstrates improved learning speed, making it a promising candidate for large artificial neural network applications. The model also makes predictions on how cortical plasticity may modulate synaptic plasticity for optimal learning.

The optimal sampling size for reliable learning allows us to estimate optimal learning times for a given model. We apply this theory to derive analytical bounds on times for the optimization of synaptic connections. First, we show this optimization problem to have exponentially many saddle nodes, which lead to small gradients and slow learning. Second, we show that the number of input synapses to a neuron modulates the magnitude of the initial gradient, determining the duration of learning. Our final result reveals that the learning duration increases supra-linearly with the number of synapses, suggesting an effective limit on synaptic connections and receptive field sizes in developing neural networks.
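A generic signal-to-noise heuristic of the flavor this abstract describes (emphatically not the thesis's actual estimator) can be sketched as follows: track running estimates of the stochastic gradient's mean and variance, and shrink the step size when the variance dominates the signal:

```python
import numpy as np

# Adaptive step size driven by the statistics of a noisy scalar gradient.
# The decay beta, base_lr, and noise level are illustrative assumptions.
rng = np.random.default_rng(3)
w, target = 0.0, 2.0
m, v = 0.0, 1.0                  # running mean and variance of the gradient
beta, base_lr = 0.9, 0.5

for _ in range(500):
    g = (w - target) + rng.normal(scale=0.5)   # noisy gradient sample
    m = beta * m + (1 - beta) * g              # update running mean
    v = beta * v + (1 - beta) * (g - m) ** 2   # update running variance
    lr = base_lr * m**2 / (m**2 + v + 1e-8)    # shrink step when signal is noisy
    w -= lr * g

print(round(w, 1))
```

When the gradient estimate is reliable (mean large relative to variance) the step approaches `base_lr`; when individual samples are mostly noise, updates are automatically damped, mirroring the idea of waiting for enough samples before committing to a synaptic change.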

This paper considers the Byzantine fault-tolerance problem in the distributed stochastic gradient descent (D-SGD) method, a popular algorithm for distributed multi-agent machine learning. In this problem, each agent samples data points independently from a certain data-generating distribution. In the fault-free case, the D-SGD method allows all the agents to learn a mathematical model that best fits the data collectively sampled by all agents. We consider the case when a fraction of agents may be Byzantine faulty. Such faulty agents may not follow a prescribed algorithm correctly and may render the traditional D-SGD method ineffective by sharing arbitrary incorrect stochastic gradients. We propose a norm-based gradient-filter, named comparative gradient elimination (CGE), that robustifies the D-SGD method against Byzantine agents. We show that the CGE gradient-filter guarantees fault-tolerance against a bounded fraction of Byzantine agents under standard stochastic assumptions, and is computationally simpler than many existing gradient-filters such as multi-KRUM, geometric median-of-means, and the spectral filters. We empirically show, by simulating distributed learning on neural networks, that the fault-tolerance of CGE is comparable to that of existing gradient-filters. We also empirically show that exponential averaging of stochastic gradients improves the fault-tolerance of a generic gradient-filter.
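The norm-based filter described above is simple enough to sketch: sort the received gradients by Euclidean norm, drop the `f` largest, and average the rest. The agent counts and data below are invented for illustration:

```python
import numpy as np

def cge_aggregate(gradients, f):
    """Average all but the f largest-norm gradient vectors (CGE-style rule)."""
    norms = [np.linalg.norm(g) for g in gradients]
    keep = np.argsort(norms)[: len(gradients) - f]   # indices of smallest n - f norms
    return np.mean([gradients[k] for k in keep], axis=0)

# 8 honest agents report similar gradients; 2 Byzantine agents report
# arbitrarily large ones.
rng = np.random.default_rng(2)
honest = [np.ones(3) + 0.1 * rng.normal(size=3) for _ in range(8)]
byzantine = [1e6 * rng.normal(size=3) for _ in range(2)]
agg = cge_aggregate(honest + byzantine, f=2)
print(np.round(agg, 1))
```

Because the rule only needs a sort on scalar norms, it is cheaper per round than filters that compare gradients pairwise, which is the computational advantage the abstract claims.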

Related courses (58)

This course provides an overview of key advances in continuous optimization and statistical analysis for machine learning. We review recent learning formulations and models as well as their guarantees, describe scalable solution techniques and algorithms, and illustrate the trade-offs involved.

Since 2010, approaches in deep learning have revolutionized fields as diverse as computer vision, machine learning, and artificial intelligence. This course gives a systematic introduction to influential models of deep artificial neural networks, with a focus on Reinforcement Learning.

Students understand basic concepts and methods of machine learning. They can describe them in mathematical terms and can apply them to data using a high-level programming language (julia/python/R).

Related units (21)