The Coupling/Minorization/Drift Approach to Markov Chain Convergence Rates
Yu Hang Jiang
Tong Liu
Zhiya Lou
Jeffrey S. Rosenthal
Shanshan Shangguan
Fei Wang
Zixuan Wu
Communicated by Notices Associate Editor Richard Levine
1. Introduction
Imagine there is a $3\times 3$ grid of bushes, labeled $G_1, G_2, \dots , G_9$, from top to bottom and left to right (the bunny example in the opening figure). There is a fluffy little bunny hiding in the middle bush, starving and ready to munch on some grass around it. Assume the bunny never gets full and the grass is never depleted. Once each minute, the bunny jumps from its current bush to one of the nearest other bushes (up, down, left, or right, not diagonal) or stays at its current location, each with equal probability. We can then ask about longer-term probabilities. For example, if the bunny starts at $G_5$, the probability of jumping to $G_7$ after two steps is
$$\begin{equation*} {\mathrm{Prob}}(G_5 \to G_4 \to G_7) + {\mathrm{Prob}}(G_5 \to G_8 \to G_7) \ = \ \frac{1}{5} \cdot \frac{1}{4} \ + \ \frac{1}{5} \cdot \frac{1}{4} \ = \ \frac{1}{10} \, . \end{equation*}$$
But what happens to the probabilities after three steps? ten steps? more?
This paper investigates the convergence of such probabilities as the number of steps gets larger. As we will discuss later, such bounds are not only an interesting topic in their own right, but are also very important for reliably using Markov chain Monte Carlo (MCMC) computer algorithms [1, 2, 3], which are very widely applied to numerous problems in statistics, finance, computer science, physics, combinatorics, and more. After reviewing the standard eigenvalue approach in Section 3, we will concentrate on the use of “coupling,” and specifically on the use of “minorization” (Section 4) and “drift” (Section 6) conditions. We note that coupling is a very broad topic with many different variations and applications (see, e.g., [9]), and has even inspired its own algorithms (such as “coupling from the past”). And, there are many other methods of bounding convergence of Markov chains, including continuous-time limits, different metrics, path coupling, non-Markovian couplings, spectral analysis, operator theory, and more, as well as numerous other related topics, which we are not able to cover here.
2. Markov Chains
The above bunny model is an example of a Markov chain (in discrete time and space). In general, a Markov chain is specified by three ingredients:
1. A state space $\mathcal{X}$, which is the collection of all of the states the Markov chain might be at. In the bunny example, $\mathcal{X} = \{G_1, G_2, \dots , G_9\}$.
2. An initial distribution (probability measure) $\mu _0(\cdot )$, where $\mu _0(A)$ is the probability of starting within $A \subset \mathcal{X}$ at time 0. In the bunny example, $\mu _0(G_5)=1$, and $\mu _0(G_i)=0\ \forall i\neq 5$.
3. A collection of transition probability distributions $P(x, \cdot )$ on $\mathcal{X}$ for each state $x \in \mathcal{X}$. The distribution $P(x,\cdot )$ represents the probabilities of the Markov chain going from $x$ to the next state after one unit of time. In a discrete state space like the bunny example, the transition probabilities can be simply written as $P = \{p_{ij}\}_{i,j\in \mathcal{X}}$, where $p_{ij}$ is the probability of jumping to $j$ from $i$. Indeed, in the bunny example:
$$\begin{equation*} P \ = \ \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & 0 & 0 & 0 \\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{4} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & \frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 & 0 & 0 \\ \frac{1}{4} & 0 & 0 & \frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{4} & 0 & 0 \\ 0 & \frac{1}{5} & 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & \frac{1}{5} & 0 \\ 0 & 0 & \frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{4} & 0 & 0 & \frac{1}{4} \\ 0 & 0 & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{3} & 0 \\ 0 & 0 & 0 & 0 & \frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ 0 & 0 & 0 & 0 & 0 & \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} \end{pmatrix} \, . \end{equation*}$$
For example, in the second row, $p_{21} = p_{22} = p_{23} = p_{25} = \frac{1}{4}$ because from $G_2$, the probabilities of jumping to each of $G_1$,$G_2$,$G_3$, or $G_5$ are each $\frac{1}{4}$.
We write $\mu _n(i)$ for the probability that the Markov chain is at state $i$ after $n$ steps. Given the initial distribution $\mu _0$ and transition probabilities $P(x, \cdot )$, we can compute $\mu _n$ inductively by
$$\begin{equation*} \mu _n(A) \ = \ \int _{\mathcal{X}} P(x, A) \, \mu _{n-1}(dx) \, , \qquad A \subseteq \mathcal{X} \, . \end{equation*}$$
On a discrete space, this formula reduces to $\mu _n(j) = \sum _{i \in \mathcal{X}} p_{ij} \, \mu _{n-1}(i)$. In matrix form, regarding the $\mu _n$ as row-vectors, this means $\mu _n = \mu _{n-1} \, P$. It follows by induction that $\mu _n = \mu _0 \, P^n$, where $P^n$ is the $n$th matrix power of $P$, also called the $n$-step transition matrix. Here $(P^n)_{ij}$ is the probability of jumping to $j$ from $i$ in $n$ steps. Indeed, if we take $\mu _0 = (0, 0, \dots , 1, \dots , 0)$, so $\mu _0(i)=1$ with $\mu _0(j)=0$ for all $j\not =i$, then $\mu _n(j) = \sum _{r \in \mathcal{X}} \mu _0(r) (P^n)_{rj} = (P^n)_{ij}$. This makes sense since if we start at $i$, then $\mu _n(j)$ is the probability of moving from $i$ to $j$ in $n$ steps.
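These matrix computations are easy to check numerically. The following minimal sketch (in Python with NumPy, which is not part of the original example; states are indexed 0 through 8 for $G_1, \dots , G_9$) rebuilds $P$ from the grid adjacency and computes $\mu _2 = \mu _0 P^2$ for the bunny starting at $G_5$:

```python
import numpy as np

# Build the bunny transition matrix: states 0..8 stand for G1..G9 on a
# 3x3 grid.  From each bush the bunny moves to one of its up/down/left/
# right neighbours, or stays put, all options equally likely.
P = np.zeros((9, 9))
for i in range(9):
    r, c = divmod(i, 3)
    opts = [i] + [3 * rr + cc for rr, cc in
                  [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                  if 0 <= rr < 3 and 0 <= cc < 3]
    P[i, opts] = 1 / len(opts)

mu0 = np.zeros(9)
mu0[4] = 1.0                              # start at the centre bush G5
mu2 = mu0 @ np.linalg.matrix_power(P, 2)  # mu_n = mu_0 P^n with n = 2
print(mu2[6])                             # prob. of being at G7: ~0.1
```

The printed value recovers the two-step probability $\frac{1}{10}$ of reaching $G_7$ from $G_5$ discussed in the introduction.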
One main question in Markov chain analysis is whether the probabilities $\mu _n$ will converge to a certain distribution, i.e., whether $\pi \coloneq \lim _{n\to \infty } \mu _n$ exists. If it does, then letting $n\to \infty$ in the relation $\mu _{n+1} = \mu _n P$ indicates that $\pi$ must be stationary, i.e., $\pi = \pi P$. On a finite state space, this means that $\pi$ is a left eigenvector of the matrix $P$ with corresponding eigenvalue 1.
In the bunny example, by solving the system of linear equations given by $\pi P = \pi$ (together with $\sum _i \pi (i) = 1$), the stationary probability distribution can be computed to be the following vector:
$$\begin{equation*} \pi \ = \ \left( \tfrac{3}{33},\ \tfrac{4}{33},\ \tfrac{3}{33},\ \tfrac{4}{33},\ \tfrac{5}{33},\ \tfrac{4}{33},\ \tfrac{3}{33},\ \tfrac{4}{33},\ \tfrac{3}{33} \right) \, . \end{equation*}$$
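The stationary distribution can also be verified numerically: the vector whose entries are proportional to each bush's number of equally likely options (3 for a corner, 4 for an edge, 5 for the center, normalized by the total 33) is fixed by $P$. A minimal sketch in Python/NumPy (not part of the original article):

```python
import numpy as np

# Rebuild the bunny transition matrix (states 0..8 = G1..G9).
P = np.zeros((9, 9))
for i in range(9):
    r, c = divmod(i, 3)
    opts = [i] + [3 * rr + cc for rr, cc in
                  [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                  if 0 <= rr < 3 and 0 <= cc < 3]
    P[i, opts] = 1 / len(opts)

# Candidate stationary vector: weight 3 for corners, 4 for edges,
# 5 for the centre, normalised by 33.
pi = np.array([3, 4, 3, 4, 5, 4, 3, 4, 3]) / 33.0
print(np.allclose(pi @ P, pi))  # True: pi P = pi, so pi is stationary
```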
In fact, the bunny example satisfies general theoretical properties called irreducibility and aperiodicity, which guarantee that the stationary distribution $\pi$ is unique, and that $\mu _n$ converges to $\pi$ as $n\to \infty$ (see, e.g., [8]). However, in this paper we shall focus on quantitative convergence rates, i.e., how large $n$ has to be to make $\mu _n$ sufficiently close to $\pi$.
3. Eigenvalue Analysis
When the state space is finite and small, it is sometimes possible to obtain a quantitative bound on the convergence rate through direct matrix analysis (e.g., [6]). We require eigenvalues $\lambda _i$ and left-eigenvectors $v_i$ such that $v_i P = \lambda _i v_i$. For example, for the above bunny process, we compute (numerically, for simplicity) the eigenvalues and left-eigenvectors in Figure 1.
To be specific, assume that the bunny starts from the center bush $G_5$, so $\mu _0 = (0,0,0,0,1,0,0,0,0)$. We can express this $\mu _0$ in terms of the above eigenvector basis as the linear combination:
(Here $v_0 = \pi$, corresponding to the eigenvalue $\lambda _0=1$.) Recalling that $\mu _n = \mu _0 P^n$ and that $v_i P = \lambda _i \, v_i$ by definition, we compute that, e.g.,
This shows that $\mu _n(G_5) \to \pi (G_5)$, and gives a strong bound on the difference between them. For example, $|\mu _n(G_5)-\pi (G_5)| < 0.01$ whenever $n \ge 6$, i.e., only six steps are required to make the bunny’s probability of being at $G_5$ within 0.01 of its limiting (stationary) probability. Other states besides $G_5$ can be handled similarly.
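The six-step claim can be double-checked numerically, without redoing the eigenvector algebra, by computing $\mu _6 = \mu _0 P^6$ directly. A minimal sketch in Python/NumPy (states indexed 0 through 8 for $G_1, \dots , G_9$; the stationary value $\pi (G_5) = 5/33$ comes from solving $\pi P = \pi$):

```python
import numpy as np

# Rebuild the bunny transition matrix (states 0..8 = G1..G9).
P = np.zeros((9, 9))
for i in range(9):
    r, c = divmod(i, 3)
    opts = [i] + [3 * rr + cc for rr, cc in
                  [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                  if 0 <= rr < 3 and 0 <= cc < 3]
    P[i, opts] = 1 / len(opts)

# Eigenvalues of P: the largest in magnitude is 1, and the remaining
# ones control how fast mu_n approaches pi.
lams = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
print(lams[0])  # ~1.0

# Direct check of the convergence claim: after six steps, the chance of
# being at G5 is within 0.01 of its stationary value 5/33.
mu0 = np.zeros(9)
mu0[4] = 1.0
mu6 = mu0 @ np.linalg.matrix_power(P, 6)
print(abs(mu6[4] - 5 / 33) < 0.01)  # True
```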
Unfortunately, such direct eigenvalue or spectral analysis becomes more and more challenging on larger and more complicated examples, especially on nonfinite state spaces. So, we next introduce a different technique which, while less tight, is more widely applicable.
4. Coupling and Minorization Conditions
The idea of coupling is to create two different copies of a random object, and compare them. Coupling has a long history in probability theory, with many different applications and approaches (see, e.g., 9). A key idea is the coupling inequality. Suppose we have two random variables $X$ and $Y$, each with its own distribution. Then for any subset $A$, we can write
$$\begin{multline*} \big |{\mathrm{Prob}}(X\in A) - {\mathrm{Prob}}(Y \in A)\big | \\ \ = \ \big |{\mathrm{Prob}}(X\in A, \ X=Y) + {\mathrm{Prob}}(X\in A, \ X\not =Y) \\ - {\mathrm{Prob}}(Y\in A, \ X=Y) - {\mathrm{Prob}}(Y\in A, \ X\not =Y)\big | \, . \end{multline*}$$
But here ${\mathrm{Prob}}(X\in A, \ X=Y) = {\mathrm{Prob}}(Y\in A, \ X=Y)$, since they both refer to the same event, so those two terms cancel. Also, each of ${\mathrm{Prob}}(X\in A, \ X\not =Y)$ and ${\mathrm{Prob}}(Y\in A, \ X\not =Y)$ are between 0 and ${\mathrm{Prob}}(X\not =Y)$, so their difference must be $\le {\mathrm{Prob}}(X\not =Y)$. Hence,
$$\begin{equation*} \left\lVert \mathcal{L}(X) - \mathcal{L}(Y) \right\rVert _{TV} \ \coloneq \ \sup _{A} \big |{\mathrm{Prob}}(X\in A) - {\mathrm{Prob}}(Y \in A)\big | \ \le \ {\mathrm{Prob}}(X \not = Y) \, . \end{equation*}$$
That is, the total variation distance between the probability laws $\mathcal{L}(X)$ and $\mathcal{L}(Y)$ is bounded above by the probability that the random variables $X$ and $Y$ are not equal. To apply this fact to Markov chains, the following condition is very helpful.
Definition.
A Markov chain with state space $\mathcal{X}$ and transition probabilities $P$ satisfies a minorization condition if there exists a (measurable) subset $C \subseteq \mathcal{X}$, a probability measure $\nu$ on $\mathcal{X}$, a constant $\epsilon > 0$, and a positive integer $n_0$, such that
$$\begin{equation*} P^{n_0}(x, \cdot ) \geq \epsilon \, \nu (\cdot ) \qquad \text{for all } x \in C \, . \end{equation*}$$
We call such $C$ a small set. In particular, if $C = \mathcal{X}$ (the entire state space), then the Markov chain satisfies a uniform minorization condition, also referred to as Doeblin’s condition.
Figure 2.
The example satisfying the minorization condition.
For a concrete example, suppose the state space is the half-line $\mathcal{X}=[0,\infty )$, with transition probabilities given by
$$\begin{equation*} P(x, dy) \ = \ \left[ \frac{1}{2} \cdot 2 e^{-2y} \ + \ \frac{1}{2} \cdot \frac{2}{\sqrt{2\pi } \, (x+1)} \, e^{-y^2 / 2(x+1)^2} \right] dy \, , \qquad y \ge 0 \, . \end{equation*}$$
That is, from a state $x$, the chain moves to an equal mixture of an Exponential(2) distribution and a half-normal distribution (the absolute value of a normal with mean 0 and standard deviation $x+1$). In this case, $P(x,dy) \ge e^{-2y} \, dy$ for all $x$ (see Figure 2), so the chain satisfies a uniform minorization condition with $n_0=1$, $\nu (dy) = 2e^{-2y} \, dy$, and $\epsilon = \frac{1}{2}$.
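This minorization is easy to spot-check numerically: the exponential half of the mixture alone contributes exactly $e^{-2y}$, so the full density can only be larger. A minimal sketch (the grid of $x$ and $y$ values is an arbitrary illustration, not from the article):

```python
import math

# Transition density of the half-line example: an equal mixture of an
# Exponential(2) density and a half-normal density with scale x + 1.
def p(x, y):
    expo = 2.0 * math.exp(-2.0 * y)
    halfnorm = (math.sqrt(2.0 / math.pi) / (x + 1.0)
                * math.exp(-y**2 / (2.0 * (x + 1.0)**2)))
    return 0.5 * expo + 0.5 * halfnorm

# Check P(x, dy) >= e^{-2y} dy over a grid of sample points: the
# exponential component of the mixture already supplies e^{-2y}.
ok = all(p(x, y) >= math.exp(-2.0 * y)
         for x in [0.0, 0.5, 1.0, 5.0, 100.0]
         for y in [0.0, 0.1, 1.0, 3.0, 10.0])
print(ok)  # True
```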
The uniform minorization condition implies that there exists a common overlap of size $\epsilon$ between all of the transition probabilities. This allows us to formulate a coupling construction of two different copies $\{X_n\}$ and $\{X'_n\}$ of a Markov chain, as follows. Assume for now that $n_0=1$. First, choose $X_0 \sim \mu _0(\cdot )$ and $X_0' \sim \pi (\cdot )$ independently. Then, inductively for $n=0,1,2,\ldots$:
1. If $X_n = X_n'$, choose $z \sim P(X_n, \cdot )$ and let $X'_{n + 1} = X_{n + 1} = z$. The chains have already coupled, and they will remain equal forever.
2. If $X_n \not = X_n'$, flip a coin whose probability of Heads is $\epsilon$. If it shows Heads, choose $z \sim \nu (\cdot )$ and let $X'_{n+1} = X_{n+1} = z$. Otherwise, update $X_{n+1}$ and $X'_{n+1}$ independently with probabilities given by
$$\begin{gather*} {\mathrm{Prob}}(X_{n + 1} \in A) = \frac{P(X_n, A) - \epsilon \nu (A)}{1 - \epsilon },\\ {\mathrm{Prob}}(X'_{n + 1} \in A) = \frac{P(X'_n, A) - \epsilon \nu (A)}{1 - \epsilon } \, . \end{gather*}$$
(The minorization condition guarantees that these “residual” probabilities are nonnegative, and hence are probability measures since their total mass equals ${P(X_n,\mathcal{X}) - \epsilon \nu (\mathcal{X}) \over 1-\epsilon } = {1-\epsilon \over 1-\epsilon } = 1$.) This construction ensures that overall, ${\mathrm{Prob}}(X_{n+1}\in A |X_n = x) = P(x,A)$ and ${\mathrm{Prob}}(X'_{n+1}\in A |X'_n = x)= P(x,A)$ for any $x \in \mathcal{X}$: indeed, if the two chains are unequal at time $n$, then
$$\begin{equation*} {\mathrm{Prob}}(X_{n+1} \in A \,|\, X_n = x) \ = \ \epsilon \, \nu (A) \ + \ (1 - \epsilon ) \, \frac{P(x, A) - \epsilon \, \nu (A)}{1 - \epsilon } \ = \ P(x, A) \, . \end{equation*}$$
If $n_0 > 1$, then we can use the above construction for the times $n=0,n_0,2n_0,\ldots$, with $n+1$ replaced by $n+n_0$, and with $P(\cdot ,\cdot )$ replaced by $P^{n_0}(\cdot ,\cdot )$. Then, if desired, we can later “fill in” the intermediate states $X_n$ for $jn_0 < n < (j+1)n_0$, from their appropriate conditional distributions given the already-constructed values of $X_{jn_0}$ and $X_{(j+1)n_0}$.
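The construction can be checked by direct enumeration on a small finite chain. The following sketch (the $3\times 3$ matrix is an arbitrary illustration, not the bunny chain) computes $\epsilon$, $\nu$, and the residual kernel, and verifies that the two-stage update "with probability $\epsilon$ draw from $\nu$, otherwise draw from the residual" reproduces $P$ exactly:

```python
import numpy as np

# A small illustrative chain satisfying a one-step uniform minorization:
# every column of P is positive, so eps = sum_j min_i P_ij > 0.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

m = P.min(axis=0)   # columnwise minima
eps = m.sum()       # overlap available for coupling
nu = m / eps        # common component nu

# Residual kernel used when the eps-coin lands Tails.
R = (P - eps * nu) / (1.0 - eps)

# The residual rows are genuine probability distributions ...
print(np.all(R >= 0), np.allclose(R.sum(axis=1), 1.0))  # True True
# ... and the two-stage update reproduces P exactly:
print(np.allclose(eps * nu + (1.0 - eps) * R, P))       # True
```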
Now, since $X'_0 \sim \pi (\cdot )$ and $\pi$ is a stationary distribution, we have $X'_n \sim \pi (\cdot )$ for all $n$. And, every $n_0$ steps, the two chains have probability at least $\epsilon$ of coupling (i.e., of the coin showing Heads). So, ${\mathrm{Prob}}(X_n \not = X_{n}') \le (1 - \epsilon )^{\lfloor n/ n_0 \rfloor }$, where $\lfloor \cdot \rfloor$ is the floor function. The coupling inequality then implies the following.
Theorem 1.
If $\{X_n\}$ is a Markov chain on $\mathcal{X}$, whose transition probabilities satisfy a uniform minorization condition for some $\epsilon > 0$, then for any positive integer $n$ and any $x\in \mathcal{X}$,
$$\begin{equation*} \left\lVert P^n(x, \cdot ) - \pi (\cdot ) \right\rVert _{TV} \ \le \ (1 - \epsilon )^{\lfloor n / n_0 \rfloor } \, . \end{equation*}$$
For the half-line Markov chain example above, we showed a uniform minorization condition with $n_0=1$ and $\epsilon =1/2$. So, Theorem 1 immediately implies that $\left\lVert \mathcal{L}(X_n) - \pi (\cdot ) \right\rVert _{TV} \le (1 - \epsilon )^{\lfloor n/ n_0 \rfloor } = (1 - (1/2))^n = 2^{-n}$, which is $< 0.01$ whenever $n \ge 7$, i.e., this chain converges within seven steps.
Assume $\mathcal{X}$ is finite, and for some $n_0\in \mathbb{N}$ there is at least one state $j\in \mathcal{X}$ such that the $j$th column of $P^{n_0}$ is all positive, i.e., $(P^{n_0})_{ij} > 0$ for all $i\in \mathcal{X}$. Then we can set $\epsilon = \sum _{j \in \mathcal{X}}\min _{i \in \mathcal{X}} (P^{n_0})_{ij} > 0$, and $\nu (j) = \epsilon ^{-1} \, \min _{i \in \mathcal{X}} (P^{n_0})_{ij}$, so that $(P^{n_0})_{ij} \geq \epsilon \, \nu (j)$ for all $i,j \in \mathcal{X}$, i.e., an $n_0$-step uniform minorization condition is satisfied with that value of $\epsilon$.
4.1. Application to bunny example
The bunny example does not satisfy a one-step minorization condition, since every column of $P$ has some zeros, so we instead consider its two-step transition probabilities, $P^2$, in Figure 3.
In this two-step transition matrix, the fifth column contains only positive values: no matter where the bunny starts, there is always at least a $\frac{9}{80}$ chance that it jumps to the center bush $(G_5)$ in two steps. Thus, we can satisfy a two-step minorization condition by taking
$$\begin{equation*} \epsilon \ = \ \sum _{j \in \mathcal{X}} \min _{i \in \mathcal{X}} (P^{2})_{ij} \ = \ \min _{i \in \mathcal{X}} (P^{2})_{i G_5} \ = \ \frac{9}{80} \, , \end{equation*}$$
and $\nu (j) = \epsilon ^{-1} \, \min _{i \in \mathcal{X}} (P^{n_0})_{ij}$ as above (here $\nu$ is simply a point mass at $G_5$, since every other column of $P^2$ contains a zero). Then, we can apply Theorem 1, with $n_0=2$ and $\epsilon = 9/80$, to conclude that
$$\begin{equation*} \left\lVert \mathcal{L}(X_n) - \pi (\cdot ) \right\rVert _{TV} \ \le \ \left( 1 - \frac{9}{80} \right) ^{\lfloor n/2 \rfloor } \ = \ \left( \frac{71}{80} \right) ^{\lfloor n/2 \rfloor } \, , \end{equation*}$$
which is $< 0.01$ whenever $n \ge 78$.
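These numbers can be reproduced numerically. The following sketch (Python/NumPy, states indexed 0 through 8 for $G_1, \dots , G_9$) recomputes $\epsilon$ from $P^2$ and then finds the smallest $n$ at which the resulting bound $(1-\epsilon )^{\lfloor n/2 \rfloor }$ falls below 0.01:

```python
import numpy as np

# Rebuild the bunny transition matrix (states 0..8 = G1..G9).
P = np.zeros((9, 9))
for i in range(9):
    r, c = divmod(i, 3)
    opts = [i] + [3 * rr + cc for rr, cc in
                  [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                  if 0 <= rr < 3 and 0 <= cc < 3]
    P[i, opts] = 1 / len(opts)

# Two-step minorization: only the G5 column of P^2 is everywhere
# positive, so eps = sum_j min_i (P^2)_ij reduces to that column's min.
P2 = np.linalg.matrix_power(P, 2)
eps = P2.min(axis=0).sum()
print(eps)  # 9/80 = 0.1125

# Smallest n for which the bound (1 - eps)^floor(n/2) drops below 0.01.
n = 0
while (1 - eps) ** (n // 2) >= 0.01:
    n += 1
print(n)  # 78
```

This bound of 78 steps is of course much weaker than the handful of steps suggested by the direct eigenvalue computation, the price paid for a technique that also applies to chains where no eigenvalue analysis is feasible.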