
The Coupling/Minorization/Drift Approach to Markov Chain Convergence Rates

Yu Hang Jiang
Tong Liu
Zhiya Lou
Jeffrey S. Rosenthal
Shanshan Shangguan
Fei Wang
Zixuan Wu

Communicated by Notices Associate Editor Richard Levine


1. Introduction

Imagine there is a 3×3 grid of bushes, labeled 1, 2, …, 9, from top to bottom and left to right (the bunny example in the opening figure). There is a fluffy little bunny hiding in the middle bush #5, starving and ready to munch on some grass around it. Assume the bunny never gets full and the grass is never depleted. Once each minute, the bunny jumps from its current bush to one of the nearest other bushes (up, down, left, or right, not diagonal) or stays at its current location, each with equal probability. We can then ask about longer-term probabilities. For example, if the bunny starts at bush #5, the probability of jumping to the corner bush #1 after two steps (via either of the two adjacent bushes) is (1/5)(1/4) + (1/5)(1/4) = 1/10.
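These small probabilities are easy to check numerically. The sketch below (the matrix-building helper is an illustration of the grid walk described above, using 0-based indices in code) builds the transition matrix and computes the two-step probability from bush #5 to bush #1:

```python
import numpy as np

def bunny_P():
    """Transition matrix of the bunny chain on the 3x3 grid of bushes.

    From each bush the bunny moves up/down/left/right or stays put,
    each available option with equal probability (0-based indices)."""
    P = np.zeros((9, 9))
    for i in range(9):
        r, c = divmod(i, 3)
        options = [i]                      # staying put is always allowed
        if r > 0: options.append(i - 3)    # up
        if r < 2: options.append(i + 3)    # down
        if c > 0: options.append(i - 1)    # left
        if c < 2: options.append(i + 1)    # right
        P[i, options] = 1.0 / len(options)
    return P

P = bunny_P()
P2 = P @ P
print(P2[4, 0])  # two-step probability bush #5 -> bush #1, which is 1/10
```

The corner bushes have 3 options, the edge bushes 4, and the center bush 5, so each row of P is a different uniform distribution over its options.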

But what happens to the probabilities after three steps? ten steps? more?

This paper investigates the convergence of such probabilities as the number of steps gets larger. As we will discuss later, such bounds are not only an interesting topic in their own right, they are also very important for reliably using Markov chain Monte Carlo (MCMC) computer algorithms [1, 2, 3], which are very widely applied to numerous problems in statistics, finance, computer science, physics, combinatorics, and more. After reviewing the standard eigenvalue approach in Section 3, we will concentrate on the use of “coupling,” and specifically on the use of “minorization” (Section 4) and “drift” (Section 6) conditions. We note that coupling is a very broad topic with many different variations and applications (see, e.g., [9]), and has even inspired its own algorithms (such as “coupling from the past”). And, there are many other methods of bounding convergence of Markov chains, including continuous-time limits, different metrics, path coupling, non-Markovian couplings, spectral analysis, operator theory, and more, as well as numerous other related topics, which we are not able to cover here.

2. Markov Chains

The above bunny model is an example of a Markov chain (in discrete time and space). In general, a Markov chain is specified by three ingredients:

1. A state space 𝒳, which is a collection of all of the states the Markov chain might be at. In the bunny example, 𝒳 = {1, 2, …, 9}.

2. An initial distribution (probability measure) μ0 on 𝒳, where μ0(A) is the probability of starting within the subset A ⊆ 𝒳 at time 0. In the bunny example, μ0({5}) = 1, and μ0({x}) = 0 for x ≠ 5.

3. A collection of transition probability distributions P(x, ·) on 𝒳, one for each state x ∈ 𝒳. The distribution P(x, ·) represents the probabilities of the Markov chain going from x to the next state after one unit of time. In a discrete state space like the bunny example, the transition probabilities can be simply written as a matrix P = (p_ij), where p_ij = P(i, {j}) is the probability of jumping to j from i. Indeed, in the bunny example:

P =
[ 1/3  1/3   0   1/3   0    0    0    0    0  ]
[ 1/4  1/4  1/4   0   1/4   0    0    0    0  ]
[  0   1/3  1/3   0    0   1/3   0    0    0  ]
[ 1/4   0    0   1/4  1/4   0   1/4   0    0  ]
[  0   1/5   0   1/5  1/5  1/5   0   1/5   0  ]
[  0    0   1/4   0   1/4  1/4   0    0   1/4 ]
[  0    0    0   1/3   0    0   1/3  1/3   0  ]
[  0    0    0    0   1/4   0   1/4  1/4  1/4 ]
[  0    0    0    0    0   1/3   0   1/3  1/3 ]

For example, in the second row, p_21 = p_22 = p_23 = p_25 = 1/4, because from bush #2, the probabilities of jumping to each of bush #1, bush #3, bush #5, or staying at bush #2 are each 1/4.

We write μ_n for the probability distribution of the Markov chain after n steps, so that μ_n(A) is the probability that the chain is within the subset A after n steps. Given the initial distribution μ0 and transition probabilities P, we can compute μ_n inductively by

μ_n(A) = ∫_𝒳 P(x, A) μ_{n−1}(dx).

On a discrete space, this formula reduces to μ_n(j) = Σ_i μ_{n−1}(i) p_ij. In matrix form, regarding the μ_n as row-vectors, this means μ_n = μ_{n−1} P. It follows by induction that μ_n = μ0 P^n, where P^n is the nth matrix power of P, also called the n-step transition matrix. Here P^n(i, j) is the probability of jumping to j from i in n steps. Indeed, if we take μ0 = δ_i, so that μ0(i) = 1 with μ0(k) = 0 for all k ≠ i, then μ_n(j) = P^n(i, j). This makes sense since if we start at i, then P^n(i, j) is the probability of moving from i to j in n steps.
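The agreement between the inductive update μ_n = μ_{n−1} P and the closed form μ_n = μ0 P^n can be checked on any small chain; the 4-state matrix below is an arbitrary illustration, not the bunny example:

```python
import numpy as np

# An arbitrary 4-state stochastic matrix (rows sum to 1), just for illustration.
rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)

mu0 = np.array([1.0, 0.0, 0.0, 0.0])  # start in state 0 with certainty

mu = mu0
for _ in range(10):
    mu = mu @ P                        # the inductive step: mu_n = mu_{n-1} P

# The recursion reproduces mu_n = mu_0 P^n.
print(np.allclose(mu, mu0 @ np.linalg.matrix_power(P, 10)))
```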

One main question in Markov chain analysis is whether the probabilities will converge to a certain distribution, i.e., whether π := lim_{n→∞} μ_n exists. If it does, then letting n → ∞ in the relation μ_n = μ_{n−1} P indicates that π must be stationary, i.e., π = π P. On a finite state space, this means that π is a left eigenvector of the matrix P with corresponding eigenvalue 1.

In the bunny example, by solving the system of linear equations given by π = π P (together with Σ_i π(i) = 1), the stationary probability distribution can be computed to be the following vector:

π = (3/33, 4/33, 3/33, 4/33, 5/33, 4/33, 3/33, 4/33, 3/33).

In fact, the bunny example satisfies general theoretical properties called irreducibility and aperiodicity, which guarantee that the stationary distribution π is unique, and that μ_n converges to π as n → ∞ (see, e.g., [8]). However, in this paper we shall focus on quantitative convergence rates, i.e., how large n has to be to make μ_n sufficiently close to π.
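This stationary vector can be verified numerically. The sketch below (the matrix-building helper encodes the grid walk described above, with 0-based indices) checks that π P = π, and that μ_n approaches π when starting from the center bush; note that since the bunny moves to each of its d(i) neighbors or stays put with probability 1/(d(i)+1), detailed balance suggests π(i) ∝ d(i)+1, which gives the weights 3, 4, and 5 for corner, edge, and center bushes:

```python
import numpy as np

def bunny_P():
    # 3x3 grid of bushes; from bush i the bunny moves up/down/left/right
    # or stays put, each option with equal probability (0-based indices).
    P = np.zeros((9, 9))
    for i in range(9):
        r, c = divmod(i, 3)
        options = [i]
        if r > 0: options.append(i - 3)
        if r < 2: options.append(i + 3)
        if c > 0: options.append(i - 1)
        if c < 2: options.append(i + 1)
        P[i, options] = 1.0 / len(options)
    return P

P = bunny_P()
pi = np.array([3, 4, 3, 4, 5, 4, 3, 4, 3]) / 33.0

print(np.allclose(pi @ P, pi))   # pi is stationary: pi P = pi

# mu_n = mu_0 P^n gets close to pi when started at the center bush:
mu = np.eye(9)[4] @ np.linalg.matrix_power(P, 100)
print(np.max(np.abs(mu - pi)))   # small after 100 steps
```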

Figure 1.

3. Eigenvalue Analysis on Finite State Spaces

When the state space is finite and small, it is sometimes possible to obtain a quantitative bound on the convergence rate through direct matrix analysis (e.g., [6]). We require eigenvalues λ_i and left-eigenvectors v_i such that v_i P = λ_i v_i. For example, for the above bunny process, we compute (numerically, for simplicity) the eigenvalues and left-eigenvectors in Figure 1.

To be specific, assume that the bunny starts from the center bush #5, so μ0 = (0, 0, 0, 0, 1, 0, 0, 0, 0). We can express this μ0 in terms of the above eigenvector basis as the linear combination μ0 = Σ_i a_i v_i.

(Here the first term corresponds to the eigenvalue λ_1 = 1, with v_1 proportional to π.) Recalling that μ_n = μ0 P^n and that v_i P^n = λ_i^n v_i by definition, we compute that, e.g., μ_n(1) = Σ_i a_i λ_i^n v_i(1).

Since λ_1 = 1 and |λ_i| < 1 for all i ≥ 2, the triangle inequality implies that

|μ_n(1) − π(1)| ≤ Σ_{i≥2} |a_i| |λ_i|^n |v_i(1)|,

which decays geometrically in n.

This shows that μ_n(1) → π(1), and gives a strong bound on the difference between them. For example, the bound is below 0.01 whenever n ≥ 6, i.e., only six steps are required to make the bunny’s probability of being at bush #1 within 0.01 of its limiting (stationary) probability. Other states besides bush #1 can be handled similarly.
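The numeric eigenvalues from Figure 1 are not reproduced here, but the computation can be sketched as follows (same grid-walk matrix as above, rebuilt so the snippet is self-contained): the top eigenvalue is 1, and the second-largest magnitude governs the geometric convergence rate.

```python
import numpy as np

def bunny_P():
    # 3x3 grid walk with holding, as in the bunny example (0-based indices).
    P = np.zeros((9, 9))
    for i in range(9):
        r, c = divmod(i, 3)
        options = [i]
        if r > 0: options.append(i - 3)
        if r < 2: options.append(i + 3)
        if c > 0: options.append(i - 1)
        if c < 2: options.append(i + 1)
        P[i, options] = 1.0 / len(options)
    return P

P = bunny_P()
# Left eigenvectors of P are right eigenvectors of P transpose.
eigvals = np.linalg.eigvals(P.T)
mags = np.sort(np.abs(eigvals))[::-1]
print(mags[0])   # the top eigenvalue magnitude is 1 (eigenvector pi)
print(mags[1])   # all other magnitudes are < 1, giving geometric decay
```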

Unfortunately, such direct eigenvalue or spectral analysis becomes more and more challenging on larger and more complicated examples, especially on nonfinite state spaces. So, we next introduce a different technique which, while less tight, is more widely applicable.

4. Coupling and Minorization Conditions

The idea of coupling is to create two different copies of a random object, and compare them. Coupling has a long history in probability theory, with many different applications and approaches (see, e.g., [9]). A key idea is the coupling inequality. Suppose we have two random variables X and Y, each with its own distribution. Then for any subset A, we can write

P(X ∈ A) − P(Y ∈ A) = [P(X ∈ A, X = Y) + P(X ∈ A, X ≠ Y)] − [P(Y ∈ A, X = Y) + P(Y ∈ A, X ≠ Y)].

But here P(X ∈ A, X = Y) = P(Y ∈ A, X = Y), since they both refer to the same event, so those two terms cancel. Also, each of P(X ∈ A, X ≠ Y) and P(Y ∈ A, X ≠ Y) is between 0 and P(X ≠ Y), so their difference must be ≤ P(X ≠ Y) in absolute value. Hence,

|P(X ∈ A) − P(Y ∈ A)| ≤ P(X ≠ Y).

Since this upper bound is uniform over subsets A, we can even take a supremum over A, to also bound the total variation distance:

‖L(X) − L(Y)‖ := sup_A |P(X ∈ A) − P(Y ∈ A)| ≤ P(X ≠ Y).

That is, the total variation distance between the probability laws L(X) and L(Y) is bounded above by the probability that the random variables X and Y are not equal. To apply this fact to Markov chains, the following condition is very helpful.
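On a discrete space, this supremum works out to half the L¹ distance between the probability vectors. A quick illustration of the coupling inequality, with arbitrary three-point distributions and an independent coupling of X and Y:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])      # law of X (arbitrary illustrative numbers)
q = np.array([0.2, 0.3, 0.5])      # law of Y (arbitrary illustrative numbers)

# Total variation distance: sup_A |P(X in A) - P(Y in A)| = (1/2) sum |p - q|.
tv = 0.5 * np.abs(p - q).sum()

# For independent X and Y, P(X = Y) = sum_i p_i q_i.
p_equal = float(p @ q)
print(tv, 1.0 - p_equal)           # coupling inequality: tv <= P(X != Y)
```

The independent coupling is far from optimal here; a maximal coupling would achieve P(X ≠ Y) = tv exactly, but any coupling satisfies the inequality.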

Definition.

A Markov chain with state space 𝒳 and transition probabilities P(x, ·) satisfies a minorization condition if there exists a (measurable) subset C ⊆ 𝒳, a probability measure ν on 𝒳, a constant ε > 0, and a positive integer n0, such that

P^{n0}(x, A) ≥ ε ν(A) for all x ∈ C and all measurable A ⊆ 𝒳.

We call such a C a small set. In particular, if C = 𝒳 (the entire state space), then the Markov chain satisfies a uniform minorization condition, also referred to as Doeblin’s condition.

Figure 2.

The example satisfying the minorization condition.

For a concrete example, suppose the state space is the half-line 𝒳 = [0, ∞), with transition probabilities given by

(1)  P(x, ·) = (1/2) Exponential(2) + (1/2) |Normal(0, σ(x)²)|.

That is, from a state x, the chain moves to an equal mixture of an Exponential(2) distribution and a half-normal distribution with mean 0 and standard deviation σ(x). In this case, the Exponential(2) component alone gives P(x, A) ≥ (1/2) Exponential(2)(A) for all x ∈ 𝒳 (see Figure 2), so the chain satisfies a uniform minorization condition with ε = 1/2, ν = Exponential(2), and n0 = 1.
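This overlap is easy to check numerically: the mixture density dominates ε times the Exponential(2) density regardless of the half-normal scale (the scale values below are arbitrary choices for illustration):

```python
import numpy as np

def exp2_pdf(y):
    # Density of the Exponential(2) distribution on [0, infinity).
    return 2.0 * np.exp(-2.0 * y)

def halfnormal_pdf(y, sigma):
    # Density of |Normal(0, sigma^2)| on [0, infinity).
    return np.sqrt(2.0 / np.pi) / sigma * np.exp(-y**2 / (2.0 * sigma**2))

eps = 0.5
y = np.linspace(0.0, 10.0, 2001)
for sigma in [0.5, 1.0, 3.0]:      # arbitrary illustrative scales
    q = 0.5 * exp2_pdf(y) + 0.5 * halfnormal_pdf(y, sigma)
    # Mixture density >= eps * Exponential(2) density, i.e., P(x,.) >= eps nu(.)
    print(np.all(q >= eps * exp2_pdf(y) - 1e-12))
```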

The uniform minorization condition implies that there exists a common overlap of size ε between all of the transition probabilities P(x, ·). This allows us to formulate a coupling construction of two different copies {X_n} and {X′_n} of the Markov chain, as follows. Assume for now that n0 = 1. First, choose X_0 ~ μ0 and X′_0 ~ π independently. Then, inductively for n = 0, 1, 2, …:

1. If X_n = X′_n, choose X_{n+1} ~ P(X_n, ·) and let X′_{n+1} = X_{n+1}. The chains have already coupled, and they will remain equal forever.

2. If X_n ≠ X′_n, flip a coin whose probability of Heads is ε. If it shows Heads, choose X_{n+1} ~ ν and let X′_{n+1} = X_{n+1}. Otherwise, update X_{n+1} and X′_{n+1} independently with probabilities given by

P(X_{n+1} ∈ A) = [P(X_n, A) − ε ν(A)] / (1 − ε),  P(X′_{n+1} ∈ A) = [P(X′_n, A) − ε ν(A)] / (1 − ε).

(The minorization condition guarantees that these “residual” probabilities are nonnegative, and hence are probability measures since their total mass equals 1.) This construction ensures that overall X_{n+1} ~ P(X_n, ·) and X′_{n+1} ~ P(X′_n, ·) for any n: indeed, if the two chains are unequal at time n, then

P(X_{n+1} = X′_{n+1} | X_n ≠ X′_n) ≥ ε.

Figure 3.

If n0 > 1, then we can use the above construction for the times 0, n0, 2n0, …, with P replaced by P^{n0}, and with the chains {X_n} and {X′_n} replaced by {X_{k n0}} and {X′_{k n0}}. Then, if desired, we can later “fill in” the intermediate states X_m for m not a multiple of n0, from their appropriate conditional distributions given the already-constructed values of {X_{k n0}} and {X′_{k n0}}.
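The whole construction can be sketched in code. The 3-state chain below is an arbitrary illustration (not from this article): ε and ν are extracted from the common overlap min_x P(x, y), and the fraction of uncoupled pairs after n steps is compared to the bound (1 − ε)^n.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary 3-state chain with all-positive transition probabilities.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

overlap = P.min(axis=0)            # common overlap of all rows of P
eps = overlap.sum()                # here eps = 0.7
nu = overlap / eps                 # normalized overlap measure nu
R = (P - eps * nu) / (1.0 - eps)   # residual kernels, one row per state

# The residuals are valid probability distributions (nonnegative, sum to 1):
print(np.all(R >= -1e-12), np.allclose(R.sum(axis=1), 1.0))

def coupled_step(x, y):
    if x == y:                                  # already coupled: move together
        z = rng.choice(3, p=P[x]); return z, z
    if rng.random() < eps:                      # Heads: both draw from nu
        z = rng.choice(3, p=nu); return z, z
    return rng.choice(3, p=R[x]), rng.choice(3, p=R[y])  # Tails: residuals

# Estimate P(X_n != X'_n) and compare with the coupling bound (1 - eps)^n.
n, trials, unequal = 5, 20000, 0
for _ in range(trials):
    x, y = 0, 2
    for _ in range(n):
        x, y = coupled_step(x, y)
    unequal += (x != y)
print(unequal / trials, (1.0 - eps) ** n)
```

Each Tails step updates the two chains independently from their residuals, so marginally each chain still moves according to P, which is the whole point of the construction.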

Now, since X′_0 ~ π and π is a stationary distribution, therefore X′_n ~ π for all n. And, every n0 steps, the two chains have probability at least ε of coupling (i.e., of the coin showing Heads). So, P(X_n ≠ X′_n) ≤ (1 − ε)^⌊n/n0⌋, where ⌊·⌋ means floor. The coupling inequality then implies the following.

Theorem 1.

If {X_n} is a Markov chain on 𝒳, whose transition probabilities satisfy a uniform minorization condition P^{n0}(x, A) ≥ ε ν(A) for some ε > 0, then for any positive integer n and any initial distribution μ0,

‖μ_n − π‖ ≤ (1 − ε)^⌊n/n0⌋.

For the above Markov chain (1), we showed a uniform minorization condition with ε = 1/2 and n0 = 1. So, Theorem 1 immediately implies that ‖μ_n − π‖ ≤ (1/2)^n, which is ≤ 0.01 if n ≥ 7, i.e., this chain converges within seven steps.

Assume 𝒳 is finite, and for some n0 there is at least one state j0 such that the j0th column of P^{n0} is all positive, i.e., P^{n0}(i, j0) > 0 for all i ∈ 𝒳. Then we can set ε = min_{i∈𝒳} P^{n0}(i, j0), and ν = δ_{j0} (a point mass at j0), so that P^{n0}(i, A) ≥ ε ν(A) for all i, i.e., an n0-step uniform minorization condition is satisfied with that value of ε.

4.1. Application to bunny example

The bunny example does not satisfy a one-step minorization condition, since every column of P has some zeros, so we instead consider its two-step transition probabilities P² = P·P, shown in Figure 3.

In this two-step transition matrix, the fifth column only contains positive values, since no matter where the bunny starts, there will always be at least a 9/80 chance that it will jump to the center bush #5 in two steps. Thus, we can satisfy a two-step minorization condition by taking

ε = min_i P²(i, 5) = 9/80,  n0 = 2,

and ν = δ_5 as above. Then, we can apply Theorem 1, with ε = 9/80 and n0 = 2, to conclude that

‖μ_n − π‖ ≤ (1 − 9/80)^⌊n/2⌋ = (71/80)^⌊n/2⌋.
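This bound is easy to evaluate numerically (grid-walk matrix as above, with 0-based indices; the helper encodes the bunny chain as described earlier and is an illustration, not code from the article):

```python
import numpy as np

def bunny_P():
    # 3x3 grid walk with holding, as in the bunny example (0-based indices).
    P = np.zeros((9, 9))
    for i in range(9):
        r, c = divmod(i, 3)
        options = [i]
        if r > 0: options.append(i - 3)
        if r < 2: options.append(i + 3)
        if c > 0: options.append(i - 1)
        if c < 2: options.append(i + 1)
        P[i, options] = 1.0 / len(options)
    return P

P = bunny_P()
P2 = np.linalg.matrix_power(P, 2)

eps = P2[:, 4].min()     # smallest two-step probability of reaching bush 5
print(eps, 9 / 80)       # the minimum is attained from the edge bushes

# Smallest n with (1 - eps)^floor(n/2) <= 0.01:
n = 1
while (1 - eps) ** (n // 2) > 0.01:
    n += 1
print(n)
```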

For example, if we want the distribution of the bunny’s location to be within 0.01 of the stationary distribution