# From Optimization to Sampling Through Gradient Flows

Communicated by *Notices* Associate Editor Reza Malek-Madani

Optimization and sampling algorithms play a central role in science and engineering as they enable finding optimal predictions, policies, and recommendations, as well as expected and equilibrium states of complex systems. The notion of “optimality” is formalized by the choice of an objective function, while the notion of an “expected” state is specified by a probabilistic model for the distribution of states. Optimizing rugged objective functions and sampling multimodal distributions is computationally challenging, especially in high-dimensional problems. For this reason, many optimization and sampling methods have been developed by researchers working in disparate fields such as Bayesian statistics, molecular dynamics, genetics, quantum chemistry, machine learning, weather forecasting, econometrics, and medical imaging.

State-of-the-art algorithms for optimization and sampling often rely on *ad-hoc* heuristics and empirical tuning, but some unifying principles have emerged that greatly facilitate the understanding of these methods and the communication of algorithmic innovations across scientific communities. This article is concerned with one such principle: the use of gradient flows, and discretizations thereof, to design and analyze optimization and sampling algorithms. The interplay between optimization, sampling, and gradient flows is an active research area and a thorough review of the extant literature is beyond the scope of this article.^{1} Our goal is to provide an accessible and lively introduction to some core ideas, emphasizing that gradient flows uncover the conceptual unity behind several existing algorithms and give a rich mathematical framework for their rigorous analysis.

^{1} AMS *Notices* limits the number of references per article; we refer to GTSA20, CLGL20, GIHLS20 for further pointers to the literature.

We present motivating applications in section 1. Section 2 is focused on the gradient descent approach to optimization and introduces fundamental ideas such as preconditioning, convergence analysis, and time discretization of gradient flows. Sampling is discussed in section 3 in the context of Langevin dynamics, viewed as a gradient flow of the Kullback–Leibler (KL) divergence with respect to (w.r.t.) the Wasserstein geometry. Some modern applications of gradient flows for sampling are discussed in section 4, followed by concluding remarks in section 5.

## 1. Motivating Applications

We outline two applications in Bayesian statistics and molecular dynamics that illustrate some important challenges in optimization and sampling.

### 1.1. Bayesian statistics

In Bayesian statistics GCSR95, an initial belief about an unknown parameter is updated as data becomes available. Let $\theta$ denote the unknown parameter of interest, belonging to the parameter space $\Theta$, and let $\pi_0$ denote the *prior* distribution reflecting our initial belief. Furthermore, let $y$ be the observed data, also belonging to an appropriate space. Then Bayes's rule identifies the distribution of $\theta$ conditioned on the data $y$:

$$\pi(\theta) \propto \pi_0(\theta)\, L(\theta; y), \tag{1.1}$$

where $\propto$ indicates that the right-hand side should be normalized to define a probability distribution. Here $L(\theta; y)$ is called the *likelihood* function and $\pi$ is called the *posterior* distribution. Bayesian inference on $\theta$ is based on the posterior, which blends the information in the prior and the data.

The choice of prior and likelihood is a modeling task which, perhaps surprisingly, is often not the most challenging aspect of Bayesian inference. The main challenge is to extract information from the posterior since (i) it typically does not belong to a standard family of distributions, except in the restrictive case of conjugate models GCSR95; (ii) the parameter $\theta$ can be high dimensional; and (iii) the normalizing constant in 1.1 (known as the *marginal likelihood*) is rarely available and can be expensive to compute. These practical hurdles inform the design of optimization and sampling algorithms to find posterior statistics.

Of particular importance is the posterior mode or *maximum a posteriori (MAP)* estimator

$$\theta_{\text{MAP}} \in \operatorname*{arg\,max}_{\theta \in \Theta} \pi(\theta).$$

Many optimization algorithms for MAP estimation start from an initial guess $\theta_0$ and produce iterates $\theta_m$ by discretizing a gradient flow $\theta_t$ with the property that $\theta_t \approx \theta_{\text{MAP}}$ for large $t$. Such gradient flows in parameter space will be discussed in section 2.

Computing MAP estimators is closely related to classic regularization techniques such as penalized least squares and Tikhonov regularization SAST18. In order to fully leverage the Bayesian framework, it is often desirable to consider other posterior statistics such as the mean, variance, credible intervals, and task-specific functional moments, which can be written in the form

$$\mathbb{E}_\pi[\phi] = \int_\Theta \phi(\theta)\, \pi(\theta)\, d\theta,$$

where $\phi$ is a suitable test function. Since $\theta$ is often high dimensional, the standard approach to compute these expectations is to use Monte Carlo Liu01: obtain samples $\theta^{(1)}, \ldots, \theta^{(N)}$ from the posterior $\pi$ and then approximate

$$\mathbb{E}_\pi[\phi] \approx \frac{1}{N} \sum_{n=1}^{N} \phi\bigl(\theta^{(n)}\bigr).$$

While Monte Carlo integration is in principle scalable to high dimensions, the task of generating posterior samples is still highly nontrivial. To that end, one may consider sampling $\theta^{(n)} \sim \rho_t$, where $\rho_t$ arises from discretizing a gradient flow with the property that $\rho_t \approx \pi$ for large $t$. Such gradient flows in the space of probability distributions will be discussed in sections 3 and 4.
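As a toy illustration of the Monte Carlo approach, the sketch below uses a standard Gaussian as a stand-in posterior (an assumption made purely so that exact samples are available; here $\mathbb{E}_\pi[\theta^2] = 1$) and estimates the expectation of the test function $\phi(\theta) = \theta^2$ by averaging over samples:

```python
import random

def monte_carlo_expectation(sampler, phi, n):
    """Approximate E_pi[phi] by averaging phi over n samples from pi."""
    return sum(phi(sampler()) for _ in range(n)) / n

random.seed(0)
# Stand-in "posterior": a standard Gaussian, for which E[theta^2] = 1.
gaussian_sampler = lambda: random.gauss(0.0, 1.0)
estimate = monte_carlo_expectation(gaussian_sampler, lambda t: t * t, 100_000)
```

The error of the estimate decays like $O(N^{-1/2})$ regardless of the dimension of $\theta$, which is what makes Monte Carlo attractive in high dimensions; the hard part, producing the samples in the first place, is the subject of sections 3 and 4.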

To relate the discussion above to subsequent developments, we note that, dropping the data $y$ from the notation, the posterior density can be written as

$$\pi(\theta) = \frac{1}{Z} e^{-f(\theta)}, \tag{1.2}$$

where $Z$ is the marginal likelihood and $f(\theta) = -\log\bigl(\pi_0(\theta)\, L(\theta; y)\bigr)$ is, up to an additive constant, the negative logarithm of the posterior density.

### 1.2. Molecular dynamics

Another important source of challenging optimization and sampling problems is statistical mechanics, and in particular the simulation of molecular dynamics (see chapter 9 in Liu01). According to Boltzmann and Gibbs, the positions $q$ and momenta $p$ of the atoms in a molecular system of constant size, occupying a constant volume, and in contact with a heat bath (at constant temperature), are distributed according to

$$\pi(q, p) = \frac{1}{Z} \exp\Bigl(-\beta \bigl[U(q) + K(p)\bigr]\Bigr), \tag{1.3}$$

where $Z$ is a normalizing constant known as the partition function, $\beta$ represents the inverse temperature, $U(q)$ is a potential energy describing the interaction of the particles in the system, and $K(p)$ represents the kinetic energy of the system. Letting $\theta = (q, p)$, we can write the Boltzmann–Gibbs distribution 1.3 in the form 1.2, with

$$f(\theta) = \beta \bigl(U(q) + K(p)\bigr). \tag{1.4}$$

As in Bayesian statistics, it is important to determine the most likely configuration of particles (i.e., the mode of $\pi$), along with expectations of certain test functions w.r.t. the Boltzmann distribution. These two tasks motivate the need for optimization and sampling algorithms that acknowledge that the potential $U$ is often a rough function with many local minima, that the dimension of $q$ and $p$ is large, and that finding the normalizing constant $Z$ is challenging.

## 2. Optimization

In this section we discuss gradient flows for the solution of the unconstrained minimization problem

$$\min_{\theta \in \mathbb{R}^d} f(\theta), \tag{2.1}$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a given objective function. Henceforth we take the parameter space to be $\mathbb{R}^d$ unless otherwise noted. As guiding examples, consider computing the mode of a posterior or Boltzmann distribution by minimizing $f$ given by 1.2 or 1.4. The methods described in this section are applicable beyond the specific problem of finding modes; however, this interpretation will be of particular interest in relating the material in this section to our discussion of sampling in section 3.

### 2.1. Gradient systems

One of the most standard approaches to solve 2.1 is *gradient descent*, an optimization scheme based on the discretization of the gradient system

$$\dot{\theta}_t = -\nabla f(\theta_t), \quad t > 0, \tag{2.2}$$

with user-defined initial value $\theta_0$. Throughout this article, $\nabla f(\theta)$ will denote the gradient of the function $f$ at the point $\theta$, which will be tacitly assumed to exist wherever needed.

While equation 2.2 is perhaps the most popular formulation of the continuous-time gradient descent dynamics, the equivalent integral form below reveals more transparently some of its properties:

$$f(\theta_t) + \frac{1}{2}\int_0^t |\dot{\theta}_s|^2\, ds + \frac{1}{2}\int_0^t |\nabla f(\theta_s)|^2\, ds = f(\theta_0), \tag{2.3}$$

for all $t \ge 0$. Indeed, notice that from 2.3 it is apparent that $f(\theta_t) \le f(\theta_s)$ for $t \ge s$, i.e., the value of the function $f$ decreases in time, and in all but a few trivial situations the decrease is strict. Another advantage of the reformulation 2.3 (or its inequality form 2.5 below) is that it can be adapted to more general settings with less mathematical structure than the one needed to make sense of 2.2. In particular, 2.3 can be used to motivate a notion of gradient flow in arbitrary metric spaces; see AGS08 for an in-depth discussion of this topic.
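To make the dissipation property concrete, here is a minimal sketch (not from the article) of the explicit Euler discretization of the gradient flow, $\theta_{m+1} = \theta_m - h\, \nabla f(\theta_m)$, applied to the assumed quadratic objective $f(\theta) = \frac{1}{2}|\theta|^2$; for step sizes $h$ small relative to the curvature of $f$, the objective decreases monotonically along the iterates:

```python
import numpy as np

def gradient_descent(grad_f, theta0, step=0.1, n_steps=100):
    """Explicit Euler discretization of the gradient flow:
    theta_{m+1} = theta_m - step * grad_f(theta_m)."""
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(n_steps):
        theta = theta - step * grad_f(theta)
        path.append(theta.copy())
    return path

# Quadratic objective f(theta) = 0.5 * |theta|^2, whose gradient is theta.
f = lambda th: 0.5 * float(np.dot(th, th))
path = gradient_descent(lambda th: th, [2.0, -1.0])
values = [f(th) for th in path]  # f decreases monotonically along the iterates
```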

Notice that the relationship 2.2 is only required in proving the *energy dissipation inequality*

$$f(\theta_t) + \frac{1}{2}\int_0^t |\dot{\theta}_s|^2\, ds + \frac{1}{2}\int_0^t |\nabla f(\theta_s)|^2\, ds \le f(\theta_0), \tag{2.5}$$

since the reverse inequality

$$f(\theta_t) + \frac{1}{2}\int_0^t |\dot{\theta}_s|^2\, ds + \frac{1}{2}\int_0^t |\nabla f(\theta_s)|^2\, ds \ge f(\theta_0) \tag{2.4}$$

is a consequence of the chain rule and the Cauchy–Schwarz inequality, and therefore holds along any sufficiently regular curve $\theta_t$. Notice further that 2.2 implies 2.4, but in general the converse statement is not true. For example, the ascent flow $\dot{\theta}_t = \nabla f(\theta_t)$ satisfies 2.4, but in general does not satisfy 2.2. Likewise, 2.2 implies $\frac{d}{dt} f(\theta_t) = -|\nabla f(\theta_t)|^2$ (which follows directly from the chain rule), but not conversely. Indeed, in $\mathbb{R}^2$ we may take $A$ to be the rotation by $\pi/2$ (an orthogonal matrix satisfying $\langle v, Av \rangle = 0$ for every $v$) and consider the flow $\dot{\theta}_t = -(I + A)\nabla f(\theta_t)$, so that $\frac{d}{dt} f(\theta_t) = -|\nabla f(\theta_t)|^2$, but 2.2 is not, in general, satisfied. This digression illustrates that equation 2.3 captures in one single identity of *scalar* quantities the *vectorial* identity 2.2, even if it is not as intuitive as other scalar relations.

### 2.2. A note on convergence

Despite the fact that gradient descent satisfies the energy dissipation property, it is in general not true that, as time goes to infinity, the dynamics 2.2 converge to a global minimizer of 2.1. This could happen for different reasons. First, the problem 2.1 may not have a minimizer (e.g., take $f(\theta) = e^{\theta}$ for $\theta \in \mathbb{R}$). Second, $f$ may have critical points associated with saddle points or local optima, as we illustrate in the next example.

Suitable assumptions on $f$ prevent the existence of local minimizers that are not global and also imply rates of convergence. One such assumption is the Polyak–Łojasiewicz (PL) condition KNS16:

$$\frac{1}{2} |\nabla f(\theta)|^2 \ge \mu \bigl(f(\theta) - f^\star\bigr), \qquad \theta \in \mathbb{R}^d, \tag{2.7}$$

where $f^\star := \inf_\theta f(\theta)$ and $\mu > 0$ is a constant. Note that the PL condition readily implies that any stationary point of $f$ is a global minimizer, since if $\nabla f(\theta) = 0$, then

$$0 \le f(\theta) - f^\star \le \frac{1}{2\mu} |\nabla f(\theta)|^2 = 0.$$

Under the PL condition we can easily obtain a convergence rate for continuous-time gradient descent.

**Proposition 2.3.** *Suppose that $f$ satisfies the PL condition 2.7 with constant $\mu > 0$. Then, for $\theta_t$ evolving according to 2.2,*

$$f(\theta_t) - f^\star \le e^{-2\mu t} \bigl(f(\theta_0) - f^\star\bigr). \tag{2.8}$$

Indeed, by the chain rule and 2.7, $\frac{d}{dt}\bigl(f(\theta_t) - f^\star\bigr) = -|\nabla f(\theta_t)|^2 \le -2\mu\bigl(f(\theta_t) - f^\star\bigr)$, and 2.8 follows from Grönwall's inequality.
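The exponential rate can be checked numerically. The sketch below (an illustration with an assumed quadratic objective) integrates the gradient flow 2.2 with a small Euler step for $f(\theta) = \frac{1}{2}\langle\theta, H\theta\rangle$ with $H = \operatorname{diag}(1, 1/4)$; this $f$ satisfies the PL condition with $\mu = 1/4$, the smallest eigenvalue of $H$, since $|\nabla f(\theta)|^2 = \langle\theta, H^2\theta\rangle \ge \lambda_{\min}\langle\theta, H\theta\rangle$:

```python
import numpy as np

# Quadratic objective with eigenvalues 1 and 0.25; PL holds with mu = 0.25
# and the minimum value is f* = 0.
H = np.diag([1.0, 0.25])
mu = 0.25
f = lambda th: 0.5 * float(th @ H @ th)

theta0 = np.array([1.0, 1.0])
dt, T = 1e-3, 5.0
theta = theta0.copy()
for _ in range(int(T / dt)):
    theta = theta - dt * (H @ theta)  # Euler step for the gradient flow

# Bound from the convergence rate: f(theta_T) <= exp(-2*mu*T) * f(theta_0).
bound = np.exp(-2 * mu * T) * f(theta0)
```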

One can verify the PL condition under various assumptions on the function $f$ KNS16. Here we present, as an important example, the case of strong convexity. For $\mu > 0$, one says that $f$ is $\mu$-*strongly convex* if it holds that

$$f\bigl(t\theta + (1 - t)\theta'\bigr) \le t f(\theta) + (1 - t) f(\theta') - \frac{\mu}{2}\, t (1 - t)\, |\theta - \theta'|^2$$

for all $\theta, \theta' \in \mathbb{R}^d$ and $t \in [0, 1]$. For differentiable $f$, this condition can be shown to be equivalent to

$$f(\theta') \ge f(\theta) + \langle \nabla f(\theta), \theta' - \theta \rangle + \frac{\mu}{2} |\theta' - \theta|^2, \tag{2.9}$$

from which we can see, after minimizing both sides w.r.t. $\theta'$, that

$$f^\star \ge f(\theta) - \frac{1}{2\mu} |\nabla f(\theta)|^2,$$

which is equivalent to 2.7. From this we conclude that $\mu$-strong convexity implies the PL condition with the same constant $\mu$.

Note that strong convexity is a stronger condition than the PL condition. For example, the function $f(\theta) = \theta^2 + 3\sin^2(\theta)$ (where $\theta \in \mathbb{R}$) satisfies the PL condition with $\mu = 1/32$, but it is not strongly convex; indeed, it is not even convex.
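A numerical spot check of this example (a grid search, so evidence rather than proof; the function and the constant $1/32$ are as reported in KNS16):

```python
import math

f = lambda x: x**2 + 3 * math.sin(x)**2      # minimum value f* = 0 at x = 0
df = lambda x: 2 * x + 3 * math.sin(2 * x)   # first derivative
d2f = lambda x: 2 + 6 * math.cos(2 * x)      # second derivative

mu = 1.0 / 32.0
grid = [i * 0.01 for i in range(-1000, 1001)]  # x in [-10, 10]

# PL condition 2.7: 0.5 * f'(x)^2 >= mu * (f(x) - f*), with f* = 0.
pl_holds = all(0.5 * df(x)**2 >= mu * f(x) for x in grid)
# Non-convexity: the second derivative is negative at x = pi/2.
not_convex = d2f(math.pi / 2) < 0
```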

### 2.3. Choice of the metric

Let us consider a function of the form $f(\theta) = \frac{1}{2}\langle \theta, H\theta \rangle$, where $H = \operatorname{diag}(\lambda_1, \lambda_2)$. Suppose that $\lambda_1 = 1$ and that $\lambda_2 > 0$ is very close to zero, as in Figure 2. We can now apply Proposition 2.3 with $\mu = \lambda_2$, but since we assumed $\lambda_2$ is small, we see that the right-hand side of 2.8 decreases very slowly. This suggests that gradient descent may take a long time to reach $f$'s global minimum when initialized arbitrarily.

The poor behavior of gradient descent described above arises whenever there are regions of points away from the minimizer at which the gradient of $f$ is very small. One approach to remedy this issue is to introduce a more general version of gradient descent that accelerates the dynamics in those regions where the gradient of $f$ is small. This is the goal of *preconditioning*. Let $B$ be a continuous field of positive definite matrices, i.e., a function that assigns to every point $\theta \in \mathbb{R}^d$ a positive definite matrix $B(\theta)$. The *preconditioned gradient descent* dynamics induced by $B$ is defined as:

$$\dot{\theta}_t = -B(\theta_t)^{-1} \nabla f(\theta_t). \tag{2.10}$$

Observe that 2.10 coincides with the original gradient descent dynamics 2.2 when $B$ is constant and equal to the identity matrix. On the other hand, when $f$ is convex and twice differentiable, choosing $B(\theta) = \nabla^2 f(\theta)$, the Hessian of $f$, results in a continuous-time analog of *Newton's algorithm*. In the example in Figure 2, we can directly compute $\nabla^2 f(\theta) = H$, i.e., the Hessian is a fixed matrix, since the potential $f$ is quadratic. Substituting this choice of $B$ in 2.10 for that example gives the dynamics $\dot{\theta}_t = -\theta_t$, a scheme that achieves a much faster convergence rate.
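The speedup can be seen in a short simulation (a sketch with an assumed value $\lambda = 10^{-3}$ for the small eigenvalue): Euler discretizations of the plain flow 2.2 and of the Hessian-preconditioned flow 2.10 are run side by side on the ill-conditioned quadratic from the example above.

```python
import numpy as np

lam = 1e-3                     # small eigenvalue, as in the Figure 2 example
H = np.diag([1.0, lam])        # Hessian of f(theta) = 0.5 * <theta, H theta>
f = lambda th: 0.5 * float(th @ H @ th)
H_inv = np.linalg.inv(H)

dt, n_steps = 1e-2, 2000
theta_gd = np.array([1.0, 1.0])      # plain gradient flow 2.2
theta_newton = np.array([1.0, 1.0])  # preconditioned flow 2.10 with B = H
for _ in range(n_steps):
    theta_gd = theta_gd - dt * (H @ theta_gd)
    theta_newton = theta_newton - dt * (H_inv @ H @ theta_newton)  # = -theta

# The Newton flow contracts every direction at the same rate, while plain
# gradient descent stalls along the flat direction with eigenvalue lam.
```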

The reader may wonder if we could have in fact chosen $B(\theta) = c^{-1} \nabla^2 f(\theta)$ for a large constant $c$ in order to induce a system that converges to equilibrium at a faster rate. However, as implied by our discussion in section 2.4 and, specifically, Remark 2.6, there is no benefit in doing so, as the cost of discretizing becomes correspondingly higher; rescaling the Hessian may be simply interpreted as a change of units. In general, there is a natural tension between accelerating continuous-time dynamics by changing the metric of the space and producing accurate time discretizations for the resulting flows; see GTSA20 for a related discussion in the context of sampling algorithms. In a similar vein, we notice that the superior convergence rate and affine invariance of Newton's algorithm come at the price of utilizing the Hessian matrix, which in many applications can be prohibitively costly to compute or store. To this end, constructing matrix fields $B(\theta)$ that are good proxies for the Hessian and that can be computed efficiently is the goal of preconditioning. Perhaps the most well-known family of such algorithms is the family of *quasi-Newton* algorithms NW99 and in particular the *Broyden–Fletcher–Goldfarb–Shanno* (BFGS) algorithm, which approximates the Hessian using gradients calculated at previous iterates.

#### 2.3.1. Geometric interpretation

As a step toward introducing the material in section 3, here we give a geometric interpretation of equation 2.10. Specifically, we will show that 2.10 can still be understood as a gradient descent equation, but w.r.t. a different metric on the parameter space. For this purpose it is convenient to recall that a *Riemannian manifold* is a manifold $\mathcal{M}$ endowed with a family of inner products (often referred to as the *metric*), one for each point on the manifold, which can be used to measure angles between vectors at every tangent plane of $\mathcal{M}$. We will use $T_\theta \mathcal{M}$ to denote the tangent plane at a given $\theta \in \mathcal{M}$. The standard example of a Riemannian manifold is $\mathbb{R}^d$ with the Euclidean inner product at every point. More general examples of Riemannian manifolds with $\mathcal{M} = \mathbb{R}^d$ can be generated from a field of positive definite matrices $B(\theta)$. Consider the family of inner products:

$$\langle u, v \rangle_{B(\theta)} := \langle u, B(\theta) v \rangle, \qquad u, v \in T_\theta \mathbb{R}^d \cong \mathbb{R}^d,$$

where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product. Notice that $\langle \cdot, \cdot \rangle_{B(\theta)}$ is indeed an inner product since $B(\theta)$ is positive definite. In what follows we often suppress the dependence of $B$ on $\theta$ for brevity.

We now proceed to define the notion of the gradient of a function defined over an arbitrary Riemannian manifold. Let $(\mathcal{M}, \langle \cdot, \cdot \rangle_\theta)$ be a Riemannian manifold and let $f : \mathcal{M} \to \mathbb{R}$ be a smooth enough function. The gradient of $f$ at the point $\theta$ relative to the metric, denoted $\nabla_{\mathcal{M}} f(\theta)$, is defined as the vector in $T_\theta \mathcal{M}$ for which the following identity holds:

$$\frac{d}{dt} f(\gamma_t) \Big|_{t=0} = \bigl\langle \nabla_{\mathcal{M}} f(\theta), \dot{\gamma}_0 \bigr\rangle_\theta$$

for any differentiable curve $\gamma_t$ on $\mathcal{M}$ with $\gamma_0 = \theta$.

In this light, 2.10 can be interpreted as a gradient descent algorithm, only that the gradient is taken w.r.t. a metric that is different from the standard Euclidean one: a direct computation shows that the gradient of $f$ w.r.t. the metric induced by $B$ is $B(\theta)^{-1} \nabla f(\theta)$, precisely the vector field driving 2.10. In section 3, where we discuss sampling, we will return to some of the insights that we have developed in this section. In particular, in order to define gradient descent dynamics of a functional over a manifold we need to specify two ingredients: 1) an energy functional to be minimized, and 2) a metric on the space over which the minimization takes place.

#### 2.3.2. Geodesic convexity

There are analogous conditions to the PL and strong convexity assumptions discussed in section 2.2 that guarantee the convergence of the flow 2.10 toward global minima of 2.1. Indeed, along 2.10 the chain rule gives

$$\frac{d}{dt} f(\theta_t) = -\bigl\langle \nabla_B f(\theta_t), \nabla_B f(\theta_t) \bigr\rangle_{B(\theta_t)},$$

where we have used $\nabla_B f(\theta) = B(\theta)^{-1} \nabla f(\theta)$ for the gradient of $f$ w.r.t. the metric induced by $B$. Exponential convergence of $f(\theta_t)$ toward $f^\star$ therefore follows, as in Proposition 2.3, from the condition

$$\frac{1}{2} \bigl\langle \nabla_B f(\theta), \nabla_B f(\theta) \bigr\rangle_{B(\theta)} \ge \mu \bigl(f(\theta) - f^\star\bigr),$$

which generalizes the PL condition in the Euclidean setting to general inner products and gradients.

To introduce an appropriate notion of convexity that allows us to generalize the results of section 2.2, we need to introduce a few more ideas from Riemannian geometry. Given a Riemannian manifold