Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent

Byzantine-resilient Stochastic Gradient Descent (SGD) aims to shield model training from Byzantine faults, be they mislabeled training datapoints, exploited software/hardware vulnerabilities, or malicious worker nodes in a distributed setting. Two recent attacks, however, have challenged state-of-the-art defenses, often preventing the model from even fitting the training set. The main weakness identified in current defenses is that they require a sufficiently low variance-norm ratio for the stochastic gradients. We propose a practical method which, despite increasing the variance, reduces the variance-norm ratio, mitigating the identified weakness. We assess the effectiveness of our method over 736 different training configurations, comprising the 2 state-of-the-art attacks and 6 defenses. For confidence and reproducibility purposes, each configuration is run 5 times with specified seeds (1 to 5), totalling 3680 runs. In our experiments, when the attack is effective enough to decrease the highest observed top-1 cross-accuracy by at least 20% compared to the unattacked run, our technique systematically raises the highest observed accuracy again, recovering at least 20% in more than 60% of the cases.
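The abstract does not spell out the method, so the following is only a toy sketch under stated assumptions: that "distributed momentum" means each worker keeps its own momentum buffer and sends that buffer (rather than the raw gradient) to the server, and that the variance-norm ratio can be measured as the variance across workers divided by the squared norm of their mean. The coordinate-wise median is a stand-in robust aggregator, not necessarily one of the 6 defenses evaluated, and all function names are hypothetical. With a constant true gradient and independent noise, the sketch illustrates how momentum can raise the variance while lowering the ratio.

```python
import numpy as np


def worker_momentum(momentum, gradient, beta):
    """One local momentum update on a worker (hypothetical helper)."""
    return beta * momentum + gradient


def coordinate_median(vectors):
    """Coordinate-wise median, used here as a stand-in robust aggregator."""
    return np.median(np.stack(vectors), axis=0)


def variance_norm_ratio(vectors):
    """Return (variance, variance / squared norm of the mean) across workers.

    This definition of the ratio is an assumption made for illustration.
    """
    stacked = np.stack(vectors)
    mean = stacked.mean(axis=0)
    variance = np.mean(np.sum((stacked - mean) ** 2, axis=1))
    return variance, variance / np.sum(mean ** 2)


def simulate(n_honest=50, dim=10, steps=50, beta=0.9, seed=1):
    """Toy run: noisy gradients around a constant true gradient."""
    rng = np.random.default_rng(seed)
    true_gradient = np.ones(dim)  # toy assumption: the true gradient is constant
    momenta = [np.zeros(dim) for _ in range(n_honest)]
    for _ in range(steps):
        # Each honest worker draws a noisy gradient and updates its own buffer.
        gradients = [true_gradient + rng.normal(size=dim) for _ in range(n_honest)]
        momenta = [worker_momentum(m, g, beta) for m, g in zip(momenta, gradients)]
    byzantine = np.full(dim, 1e6)  # one faulty worker sends an arbitrary vector
    return {
        "raw": variance_norm_ratio(gradients),      # last-step raw gradients
        "momentum": variance_norm_ratio(momenta),   # worker-side momentum buffers
        "aggregate": coordinate_median(momenta + [byzantine]),
    }


if __name__ == "__main__":
    result = simulate()
    print("raw gradients      (variance, ratio):", result["raw"])
    print("worker momentum    (variance, ratio):", result["momentum"])
    print("median aggregate with 1 faulty worker:", result["aggregate"])
```

In this toy setting the momentum buffers accumulate the shared signal faster than the independent noise, which is why the ratio shrinks even though the absolute variance grows; real training gradients drift over time, so the effect there would be weaker than in this idealized run.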

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Nicolas Henri Bernard Flammarion, Hristo Georgiev Papazov, Scott William Pesme

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows u ...

2024

On the Generalization of Stochastic Gradient Descent with Momentum

Byzantine Fault-Tolerance in Federated Local SGD Under 2f-Redundancy
