Adaptive first-order methods in optimization are prominent in machine learning and data science owing to their ability to automatically adapt to the landscape of the function being optimized. However, their convergence guarantees are typically stated in terms of vanishing gradient norms, which leaves open the issue of converging to undesirable saddle points (or even local maximizers). In this paper, we focus on the ADAGRAD family of algorithms – with scalar, diagonal or full-matrix preconditioning – and we examine the question of whether the method’s trajectories avoid saddle points. A major challenge that arises here is that ADAGRAD’s step-size (or, more accurately, the method’s preconditioner) evolves over time in a filtration-dependent way, i.e., as a function of all gradients observed in earlier iterations; as a result, avoidance results for methods with a constant or vanishing step-size do not apply. We resolve this challenge by combining a series of step-size stabilization arguments with a recursive representation of the ADAGRAD preconditioner that allows us to employ stable manifold techniques and ultimately show that the induced trajectories avoid saddle points from almost any initial condition.
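For context, the sketch below illustrates the standard diagonal member of the ADAGRAD family on a toy saddle-point objective; it is a minimal illustration, not the paper's precise setting, and the names (grad_fn, eta, eps, n_steps) are hypothetical. It shows how the preconditioner accumulates all past squared gradients, which is the filtration-dependent behavior the abstract highlights.

```python
import numpy as np

def adagrad_diagonal(grad_fn, x0, eta=0.1, eps=1e-8, n_steps=1000):
    """Minimal sketch of ADAGRAD with diagonal preconditioning.

    The preconditioner G accumulates the squared gradients of all earlier
    iterations, so the effective step-size eta / (sqrt(G) + eps) depends on
    the entire gradient history (it is filtration-dependent).
    """
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)              # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(x)
        G += g ** 2                   # diagonal preconditioner update
        x -= eta * g / (np.sqrt(G) + eps)
    return x

# Toy example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
# A trajectory started slightly off the stable manifold (the x-axis)
# drifts away from the saddle along the y-direction.
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
print(adagrad_diagonal(grad, x0=[1.0, 1e-3]))
```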
Giancarlo Ferrari Trecate, Luca Furieri, Clara Lucía Galimberti, Liang Xu