In this thesis, we explore techniques for addressing the communication bottleneck in data-parallel distributed training of deep learning models. We investigate algorithms that either reduce the size of the messages exchanged between workers, or reduce the number of messages sent and received.

To reduce the size of messages, we propose an algorithm for lossy compression of gradients. This algorithm is compatible with existing high-performance training pipelines based on the all-reduce primitive and leverages the natural approximate low-rank structure in the gradients of neural network layers to obtain high compression rates.

To reduce the number of messages, we study the decentralized learning paradigm, in which workers do not average their model updates all-to-all in each step of Stochastic Gradient Descent, but only communicate with a small subset of their peers. We extend the aforementioned compression algorithm to operate in this setting. We also study the influence of the communication topology on the performance of decentralized learning, highlighting shortcomings of the typical 'spectral gap' metric for measuring the quality of communication topologies and proposing a new framework for evaluating them. Finally, we propose an alternative communication paradigm for distributed learning over sparse topologies. This paradigm, based on the concept of 'relaying' updates over spanning trees of the communication topology, shows benefits over the typical gossip-based approach, especially when workers have very heterogeneous data distributions.
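To make the two communication-reduction ideas above more concrete, the sketches below illustrate them in simplified form. They are not the algorithms developed in the thesis; the function names, the rank parameter, and the example topology are illustrative assumptions.

A minimal sketch of low-rank gradient compression: a layer's gradient matrix is approximated by the product of two thin factors obtained from a single power-iteration step, and only these small factors would need to be communicated between workers.

    import numpy as np

    def lowrank_compress(grad_matrix, rank=2, rng=None):
        """Approximate grad_matrix by p @ q.T with thin factors p and q."""
        rng = np.random.default_rng() if rng is None else rng
        m, n = grad_matrix.shape
        q = rng.standard_normal((n, rank))   # random start for power iteration
        p = grad_matrix @ q                  # m x rank
        p, _ = np.linalg.qr(p)               # orthonormalise the columns of p
        q = grad_matrix.T @ p                # n x rank
        return p, q                          # decompress with p @ q.T

    # toy example: a 256 x 128 gradient compressed to rank 2
    g = np.random.default_rng(0).standard_normal((256, 128))
    p, q = lowrank_compress(g, rank=2)
    print(p.size + q.size, "values communicated instead of", g.size)

Similarly, one gossip-averaging step in decentralized learning, contrasted above with all-to-all averaging, can be sketched as multiplication by a sparse, doubly stochastic mixing matrix whose non-zero pattern matches the communication topology (the ring below is only an example).

    import numpy as np

    def gossip_average(local_params, mixing_matrix):
        """One gossip step: each worker averages with its neighbours.

        local_params: (num_workers, dim) array, row i = worker i's parameters.
        mixing_matrix: doubly stochastic, sparse like the topology.
        """
        return mixing_matrix @ local_params

    # ring topology over 4 workers: each mixes with its two neighbours
    W = np.array([
        [0.5, 0.25, 0.0, 0.25],
        [0.25, 0.5, 0.25, 0.0],
        [0.0, 0.25, 0.5, 0.25],
        [0.25, 0.0, 0.25, 0.5],
    ])
    params = np.arange(4, dtype=float).reshape(4, 1)   # each worker holds a scalar
    print(gossip_average(params, W).ravel())           # moves toward the mean 1.5

Repeated mixing steps drive the workers toward the global average; how quickly this happens depends on the topology, which is the aspect the spectral-gap discussion above concerns.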