**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# Sparsified SGD with Memory

Martin Jaggi, Sebastian Urban Stich, Jean-Baptiste Francis Marie Juliette Cordonnier

2018

Conference paper

2018

Conference paper

Abstract

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related MOOCs (11)

Related concepts (32)

Related publications (34)

Introduction to optimization on smooth manifolds: first order methods

Learn to optimize on smooth, nonlinear spaces: Join us to build your foundations (starting at "what is a manifold?") and confidently implement your first algorithm (Riemannian gradient descent).

Hilbert scheme

In algebraic geometry, a branch of mathematics, a Hilbert scheme is a scheme that is the parameter space for the closed subschemes of some projective space (or a more general projective scheme), refining the Chow variety. The Hilbert scheme is a disjoint union of projective subschemes corresponding to Hilbert polynomials. The basic theory of Hilbert schemes was developed by . Hironaka's example shows that non-projective varieties need not have Hilbert schemes.

Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).

Scheme (mathematics)

In mathematics, a scheme is a mathematical structure that enlarges the notion of algebraic variety in several ways, such as taking account of multiplicities (the equations x = 0 and x2 = 0 define the same algebraic variety but different schemes) and allowing "varieties" defined over any commutative ring (for example, Fermat curves are defined over the integers). Scheme theory was introduced by Alexander Grothendieck in 1960 in his treatise "Éléments de géométrie algébrique"; one of its aims was developing the formalism needed to solve deep problems of algebraic geometry, such as the Weil conjectures (the last of which was proved by Pierre Deligne).

Within the context of contemporary machine learning problems, efficiency of optimization process depends on the properties of the model and the nature of the data available, which poses a significant problem as the complexity of either increases ad infinit ...

Michel Bierlaire, Nicola Marco Ortelli, Matthieu Marie Cochon de Lapparent

In the field of choice modeling, the availability of ever-larger datasets has the potential to significantly expand our understanding of human behavior, but this prospect is limited by the poor scalability of discrete choice models (DCMs): as sample sizes ...

2023Distributed learning is the key for enabling training of modern large-scale machine learning models, through parallelising the learning process. Collaborative learning is essential for learning from privacy-sensitive data that is distributed across various ...