**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# Stochastic Zeroth-Order Optimisation Algorithms with Variance Reduction

Abstract

Introduction of optimisation problems in which the objective function is black box or obtaining the gradient is infeasible, has recently raised interest in zeroth-order optimisation methods. As an example finding adversarial examples for Deep Learning models (Chen et al. (2017); Moosavi-Dezfooli et al. (2016)) is one of the most common applications in which zeroth-order methods could be used. These optimisation methods use only function values at certain points to estimate the gradient. Most current approaches iteratively sample a random search direction along which they compute an estimation of the gradient (Nesterov and Spokoiny (2017); Conn et al. (2009); Wibisono et al. (2012)). However, due to the high variance in the search direction, these methods usually need d times more iterations than the standard gradient methods, where d is the dimensionality of the problem. So it seems that the main effort for improving the zeroth-order methods should be in reducing the variance of the gradient estimate. In this work we will analyse the gradient-free oracle which uses random directions sampled form a Gaussian distribution. Our analysis shows that in smooth and strongly convex setting, we have a convergence rate of O( d/T) which clearly shows the dependency to the dimension of the problem. Furthermore we propose some variance reduction methods to make the zeroth-order optimisation faster. We experiment our proposed methods in Python to compare their convergence in stochastic and non-stochastic setting. Our empirical results show that in a setting that number of allowed function evaluation is fixed, using a variance reduction method (e.g. momentum) can make the convergence of zeroth-order methods happen faster.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts

Loading

Related publications

Loading

Related concepts (23)

Black box

In science, computing, and engineering, a black box is a system which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings. I

Stochastic

Stochastic (stəˈkæstɪk; ) refers to the property of being well described by a random probability distribution. Although stochasticity and randomness are distinct in that

Normal distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function

Related publications (57)

Loading

Loading

Loading

Nicolas Boumal, Christopher Arnold Criscitiello

We describe the first gradient methods on Riemannian manifolds to achieve accelerated rates in the non-convex case. Under Lipschitz assumptions on the Riemannian gradient and Hessian of the cost function, these methods find approximate first-order critical points faster than regular gradient descent. A randomized version also finds approximate second-order critical points. Both the algorithms and their analyses build extensively on existing work in the Euclidean case. The basic operation consists in running the Euclidean accelerated gradient descent method (appropriately safe-guarded against non-convexity) in the current tangent space, then moving back to the manifold and repeating. This requires lifting the cost function from the manifold to the tangent space, which can be done for example through the Riemannian exponential map. For this approach to succeed, the lifted cost function (called the pullback) must retain certain Lipschitz properties. As a contribution of independent interest, we prove precise claims to that effect, with explicit constants. Those claims are affected by the Riemannian curvature of the manifold, which in turn affects the worst-case complexity bounds for our optimization algorithms.

We have developed a new derivative-free algorithm based on Radial Basis Functions (RBFs). Derivative-free optimization is an active field of research and several algorithms have been proposed recently. Problems of this nature in the industrial setting are quite frequent. The reason is that in a number of applications the optimization process contains simulation packages which are treated as black boxes. The development of our own algorithm was originally motivated by an application in biomedical imaging: the medical image registration problem. The particular characteristics of this problem have incited us to develop a new optimization algorithm based on trust-region methods. However it has been designed to be generic and to be applied to a wide range of problems. The main originality of our approach is the use of RBFs to build the models. In particular we have adapted the existing theory based on quadratic models to our own models and developed new procedures especially designed for models based on RBFs. We have tested our algorithm called BOOSTERS against state-of-the-art methods (UOBYQA, NEWUOA, DFO). On the medical image registration problem, BOOSTERS appears to be the method of choice. The tests on problems from the CUTEr collection show that BOOSTERS is comparable to, but not better than other methods on small problems (size 2-20). It is performing very well for medium size problems (20-80). Moreover, it is able to solve problems of dimension 200, which is considered very large in derivative-free optimization. We have also developed a new class of algorithms combining the robustness of derivative-free algorithms with the faster rate of convergence characterizing Newtonlike-methods. In fact, they define a new class of algorithms lying between derivative-free optimization and quasi-Newton methods. These algorithms are built on the skeleton of our derivative-free algorithm but they can incorporate the gradient when it is available. They can be interpreted as a way of doping derivative-free algorithms with derivatives. If the derivatives are available at each iteration, then our method can be seen as an alternative to quasi-Newton methods. At the opposite, if the derivatives are never evaluated, then the algorithm is totally similar to BOOSTERS. It is a very interesting alternative to existing methods for problems whose objective function is expensive to evaluate and when the derivatives are not available. In this situation, the gradient can be approximated by finite differences and its costs corresponds to n additional function evaluations assuming that Rn is the domain of definition of the objective function. We have compared our method with CFSQP and BTRA, two gradient-based algorithms, and the results show that our doped method performs best. We have also a theoretical analysis of the medical image registration problem based on maximization of mutual information. Most of the current research in this field is concentrated on registration based on nonlinear image transformation. However, little attention has been paid to the theoretical properties of the optimization problem. In our analysis, we focus on the continuity and the differentiability of the objective function. We show in particular that performing a registration without extension of the reference image may lead to discontinuities in the objective function. But we demonstrate that, under some mild assumptions, the function is differentiable almost everywhere. Our analysis is important from an optimization point of view and conditions the choice of a solver. The usual practice is to use generic optimization packages without worrying about the differentiability of the objective function. But the use of gradient-based methods when the objective function is not differentiable may result in poor performance or even in absence of convergence. One of our objectives with this analysis is also that practitioners become aware of these problems and to propose them new algorithms having a potential interest for their applications.

,

We consider distributed optimization over several devices, each sending incremental model updates to a central server. This setting is considered, for instance, in federated learning. Various schemes have been designed to compress the model updates in order to reduce the overall communication cost. However, existing methods suffer from a significant slowdown due to additional variance omega > 0 coming from the compression operator and as a result, only converge sublinearly. What is needed is a variance reduction technique for taming the variance introduced by compression. We propose the first methods that achieve linear convergence for arbitrary compression operators. For strongly convex functions with condition number kappa, distributed among n machines with a finite-sum structure, each worker having less than in components, we also (i) give analysis for the weakly convex and the non-convex cases and (ii) verify in experiments that our novel variance reduced schemes are more efficient than the baselines. Moreover, we show theoretically that as the number of devices increases, higher compression levels are possible without this affecting the overall number of communications in comparison with methods that do not perform any compression. This leads to a significant reduction in communication cost. Our general analysis allows to pick the most suitable compression for each problem, finding the rig ht balance between additional variance and communication savings. Finally, we also (iii) give analysis for arbitrary quantized updates.