Mathematical Aspects of Mixing Times in Markov Chains

R. Montenegro and P. Tetali

February 3, 2006

Abstract

In the past few years we have seen a surge in the theory of finite Markov chains, by way of

new techniques for bounding the convergence to stationarity. This includes functional techniques

such as logarithmic Sobolev and Nash inequalities, refined spectral and entropy techniques, and

the evolving set methodology. We attempt to give a more or less self-contained treatment of some

of these modern techniques. There have been other important contributions to this theory such

as variants on coupling techniques and decomposition methods, which are not included here; our

choice was to keep the analytical methods as the theme of this presentation. We illustrate the

strength of the main techniques by way of simple examples as well as with a brief and improved

analysis of the Thorp shuffle.

Contents

Introduction

1 Basic Bounds on Mixing Times
1.1 Preliminaries: Distances and mixing times
1.2 Continuous Time
1.3 Discrete Time
1.4 Does Reversibility Matter?

2 Advanced Functional Techniques
2.1 Log-Sobolev and Nash Inequalities
2.2 Spectral profile
2.3 Comparison methods

3 Evolving Set Methods
3.1 Bounding Distances by Evolving Sets
3.2 Mixing Times
3.3 Conductance
3.4 Modified Conductance
3.5 Continuous Time

4 Lower Bounds on Mixing Times
4.1 A geometric lower bound
4.2 A spectral lower bound
4.3 A log-Sobolev lower bound

5 Examples
5.1 Sharpness of bounds
5.2 Discrete Logarithm
5.3 Comparing Mixing Times
5.4 New Cheeger Inequality
5.5 The Thorp Shuffle
5.5.1 Modeling the Thorp Shuffle
5.5.2 Spectral Profile
5.5.3 Evolving Sets
5.5.4 Bounding the l norm of functions

6 Miscellaneous
6.1 The Fastest Mixing Markov Process Problem
6.2 Alternative description of the spectral gap, and the entropy constant
6.3 Perturbation Bounds

Bibliography

Appendix

Introduction

While the basic theory of finite Markov chains has been well studied and understood, the advent of

theory of computing has renewed interest in this classical subject. There have already been some

excellent surveys focusing on combinatorial, computational and statistical physics applications of

finite Markov chains. We mention here a few sources, by way of survey articles, for the interested

reader. For a good overview of the basic techniques in estimating the mixing times of finite Markov

chains, see [31], [33], and [29]. The recent manuscript of Dyer et al. [24] describes several comparison

theorems for reversible as well as nonreversible Markov chains. However, much of the theory

surveyed in this article is rather recent theoretical (analytical) development and is so far unavailable

in a unified presentation. The significance of these new methods is as follows.

It is classical and elementary to show that the inverse spectral gap of a reversible Markov chain
captures the mixing time (in $L^1$ and $L^2$) up to a factor of $\log(1/\pi_*)$, where $\pi_* = \min_x \pi(x)$ denotes
the smallest entry in the stationary probability (vector) $\pi$ of the chain. While the logarithmic
Sobolev constant captures the $L^2$-mixing time up to a factor of $\log\log(1/\pi_*)$, it is typically much
harder to bound; to mention a specific example, the exact constant is open for the 3-point space
with arbitrary invariant measure. Also, in a few cases the log-Sobolev constant is known not to
give tight bounds on the $L^1$-mixing time. The main strength of the spectral profile techniques and
the evolving set methodology considered in this survey seems to be that of avoiding extra penalty
factors such as $\log\log(1/\pi_*)$. In the present write-up, the above is illustrated with a couple of

simple examples, and with the now-famous Thorp shuffle, for which an improved $O(d^{29})$ mixing
time is described, building on the proof of Morris that proved the first polynomial (in $d$) bound of
$O(d^{44})$. The approach to $L^2$-mixing time using the spectral profile has the additional advantage of

yielding known (upper) estimates on mixing time, under a log-Sobolev inequality and/or a Nash-

type inequality. Thus various functional analytic approaches to mixing times can be unified with the

approach of bounding the spectral profile. The one exception to this is the approach to stationarity

using relative entropy, and techniques for bounding the so-called entropy constant have been rather

limited.

A brief history of the above development can perhaps be summarized as follows. A fundamental

contribution, by way of initiating several subsequent works, was made by Lovász and Kannan in

[36] in which they introduced the notion of average conductance to bound the total variation mixing

time. This result was further strengthened and developed by Morris and Peres using the so-called

evolving sets, where they analyze a given chain by relating it to an auxiliary chain (a dual of sorts)

on subsets of the states of the original chain. While this was introduced in [45] in a (martingale-

based) probabilistic language, it turns out to be, retrospectively, an independent and alternative

view of the notion of a Doob transform introduced and investigated by Diaconis and Fill [19].

Further refinement and generalization of the evolving sets approach was done in detail by [42]. The

functional analog of some of this is done via the spectral profile, developed for the present context

of finite Markov chains, in [26], while having its origins in the developments by [4] and [15] in the

context of manifolds.

Besides summarizing much of the above recent developments in this exciting topic, we address

some classical aspects as well. In discrete-time, much of the literature uses laziness assumptions

to avoid annoying technical difficulties. While laziness is a convenient assumption, it slows down

the chain by a factor of 2, which may not be desirable in practice. We take a closer look at this

issue and report bounds which reflect the precise dependence on laziness. The notion of modified

conductance circumvents laziness altogether, and we discuss this aspect briefly and compare it

to bounds derived from the functional approach. Further details on the modified conductance

and its usefulness can be found in [43]. Another issue is that of the role of reversibility (a.k.a.

detailed balance conditions). We tried to pay particular attention to it, due to the current trend
in the direction of analyzing various nonreversible Markov chains. Although reversibility is often a convenient
assumption, we avoid it as much as possible. In particular, we include

a proof of the lower bound on the total variation mixing time in terms of the second eigenvalue

in the general case. Besides providing upper and lower bounds for the mixing time of reversible

and non-reversible chains, we report recent successes (with brief analysis) in the analysis of some

non-reversible chains; see for example, the discrete logarithm example and the Thorp shuffle.

In Chapter 1 we introduce notions of mixing times and prove the basic upper bounds on these

notions using Poincaré and logarithmic Sobolev type functional constants. In Chapter 2 we move

on to recent results using the spectral profile, as opposed to using simply the second eigenvalue.

In Chapter 3 we review the evolving set methods. Our treatment of lower bounds on mixing

times is provided in Chapter 4. We consider several examples for illustration in Chapter 5. In

the penultimate chapter, we gather a few recent results together. This includes recent results on

the so-called fastest mixing Markov chain problem, and a recent theorem [40] from perturbation

theory of finite Markov chains; this theorem relates the stability of a stochastic matrix (subject

to perturbations) to the rate of convergence to equilibrium of the matrix. We also recall here an

old but not so widely known characterization of the spectral gap, since there have been recent

results utilizing this formulation. The Appendix contains a discussion on the relations between the

distances considered in this paper, and others such as relative pointwise ($L^\infty$) distance.

Acknowledgments. We thank Pietro Caputo and Eric Carlen for several helpful discussions while

preparing this write-up.

Chapter 1

Basic Bounds on Mixing Times

1.1 Preliminaries: Distances and mixing times

Let (Ω,P,π) denote a transition probability matrix (or Markov kernel) of a finite Markov chain on

a finite state space Ω with a unique invariant measure π. That is

\[ P(x,y) \ge 0 \ \text{ for all } x,y \in \Omega, \qquad \sum_{y\in\Omega} P(x,y) = 1 \ \text{ for all } x \in \Omega, \]
\[ \sum_{x\in\Omega} \pi(x)P(x,y) = \pi(y) \ \text{ for all } y \in \Omega. \]

We assume throughout this paper that P is irreducible and that π has full support (Ω). For standard

definitions and introduction to finite Markov chains, we refer the reader to [47] or [1].

It is a classical fact that if $P$ is aperiodic then the measures $P^n(x,\cdot)$ approach $\pi$ as $n \to \infty$.
Alternatively, let $k_n^x(y) = P^n(x,y)/\pi(y)$ denote the density with respect to $\pi$ at time $n \ge 0$, or
simply $k_n(y)$ when the start state or the start distribution is unimportant or clear from the context.
Then the density $k_n^x(y)$ converges to 1 as $n \to \infty$. A proper quantitative statement may be stated
using any one of several norms. In terms of $L^p(\pi)$-distance
\[ \|k_n - 1\|_{p,\pi}^p = \sum_{y\in\Omega} |k_n(y) - 1|^p\,\pi(y), \qquad 1 \le p < +\infty, \]
it becomes twice the total variation norm $\|P^n(x,\cdot) - \pi\|_{TV}$ when $p = 1$. (Note that some authors
prefer to define the total variation norm as simply the $L^1$ norm.) Another important measure of
closeness (but not a norm) is the informational divergence,
\[ D(P^n(x,\cdot)\,\|\,\pi) = \mathrm{Ent}_\pi(k_n^x) = \int k_n^x \log k_n^x \, d\pi = \sum_{y\in\Omega} P^n(x,y) \log\frac{P^n(x,y)}{\pi(y)}, \]
where the entropy is $\mathrm{Ent}_\pi(f) = \mathrm{E}_\pi\, f \log\frac{f}{\mathrm{E}_\pi f}$.
Each of these distances is convex, in the sense that if $\mu$ and $\nu$ are two distributions, and
$s \in [0,1]$, then $\mathrm{dist}((1-s)\mu + s\nu, \pi) \le (1-s)\,\mathrm{dist}(\mu,\pi) + s\,\mathrm{dist}(\nu,\pi)$. For instance,
$D(\mu\,\|\,\pi) = \mathrm{Ent}_\pi(\mu/\pi) = \mathrm{E}_\pi \frac{\mu}{\pi}\log\frac{\mu}{\pi}$ is convex in $\mu$ because $f \log f$ is convex. A convex distance $\mathrm{dist}(\mu,\pi)$

satisfies the condition
\[ \mathrm{dist}(\sigma P^n, \pi) = \mathrm{dist}\Big(\sum_{x\in\Omega} \sigma(x) P^n(x,\cdot),\, \pi\Big) \le \sum_{x\in\Omega} \sigma(x)\,\mathrm{dist}(P^n(x,\cdot), \pi) \le \max_{x\in\Omega} \mathrm{dist}(P^n(x,\cdot), \pi), \tag{1.1} \]
and so distance is maximized when the initial distribution is concentrated at a point. To study the
rate of convergence it then suffices to study the rate when the initial distribution is a point mass $\delta_x$
(where $\delta_x$ is 1 at point $x \in \Omega$ and 0 elsewhere; likewise, let $1_A$ be one only on set $A \subset \Omega$).

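Definition 1.1. The total variation, relative entropy and $L^2$ mixing times are defined as follows.
\[ \tau(\epsilon) = \min\{ n : \forall x \in \Omega,\ \|P^n(x,\cdot) - \pi\|_{TV} = \tfrac12 \|k_n^x - 1\|_{1,\pi} \le \epsilon \} \]
\[ \tau_D(\epsilon) = \min\{ n : \forall x \in \Omega,\ \mathrm{Ent}_\pi(k_n^x) = D(P^n(x,\cdot)\,\|\,\pi) \le \epsilon \} \]
\[ \tau_2(\epsilon) = \min\{ n : \forall x \in \Omega,\ \sqrt{\mathrm{Var}_\pi(k_n^x)} = \|k_n^x - 1\|_{2,\pi} \le \epsilon \} \]
Some authors consider the chi-square ($\chi^2$) distance, which is just $\mathrm{Var}_\pi(k_n^x)$, in which case the
corresponding chi-square mixing time is $\tau_{\chi^2}(\epsilon) = \tau_2(\sqrt{\epsilon})$. In the Appendix it is seen that $\tau_2(\epsilon)$
usually gives a good bound on $L^\infty$ convergence, and so for most purposes nothing stronger than
$L^2$ mixing need be considered.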
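As a concrete illustration of Definition 1.1, the following Python sketch computes the total variation and $L^2$ mixing times of a small chain by brute force, iterating the distribution started from every point mass. The three-state lazy chain, the tolerance 0.02, and all function names are illustrative assumptions and do not come from the text.

```python
from math import sqrt

def step(dist, P):
    """One step of the chain: the row vector `dist` times the matrix P."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]

def tv_distance(mu, pi):
    """Total variation distance, i.e. half the L^1(pi) norm of mu/pi - 1."""
    return 0.5 * sum(abs(m - p) for m, p in zip(mu, pi))

def l2_distance(mu, pi):
    """L^2(pi) norm of the density mu/pi minus 1, i.e. sqrt of its variance."""
    return sqrt(sum((m / p - 1.0) ** 2 * p for m, p in zip(mu, pi)))

def mixing_time(P, pi, dist_fn, eps, max_steps=10_000):
    """Smallest n such that dist_fn(P^n(x, .), pi) <= eps for every start x."""
    n = len(P)
    rows = [[1.0 if y == x else 0.0 for y in range(n)] for x in range(n)]
    for t in range(max_steps + 1):
        if max(dist_fn(row, pi) for row in rows) <= eps:
            return t
        rows = [step(row, P) for row in rows]
    raise ValueError("chain did not mix within max_steps")

# Hypothetical example: a lazy chain on three states with uniform pi.
P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
pi = [1 / 3, 1 / 3, 1 / 3]

tau_tv = mixing_time(P, pi, tv_distance, 0.02)
tau_l2 = mixing_time(P, pi, l2_distance, 0.02)
print(tau_tv, tau_l2)
```

For this chain the $L^2$ mixing time is at least the total variation mixing time, as expected, since twice the total variation distance is the $L^1(\pi)$ norm of the density minus one, which is at most its $L^2(\pi)$ norm.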

An important concept in studying Markov chains is the notion of reversibility. The time-reversal
$P^*$ is defined by the identity $\pi(x)P^*(x,y) = \pi(y)P(y,x)$, $x,y \in \Omega$, and is the adjoint of $P$ in the
standard inner product for $L^2(\pi)$, that is $\langle f, Pg\rangle_\pi = \langle P^*f, g\rangle_\pi$ where
\[ \langle f, g\rangle_\pi = \sum_{x\in\Omega} \pi(x) f(x) g(x) \]
and a matrix $M$ acts on a function as
\[ Mf(x) = \sum_{y\in\Omega} M(x,y) f(y). \]
If $P^* = P$ then $P$ is said to be time-reversible, or to satisfy the detailed balance condition. Given
a Markov chain $P$, two natural reversible chains that can be constructed are the additive reversibilization
$\frac{P+P^*}{2}$ and the multiplicative reversibilization $PP^*$.

The Dirichlet form is an important bilinear form which can be used in a characterization of

eigenvalues of a reversible chain (see Lemma 1.19), and is also used to define the spectral gap and

the logarithmic Sobolev type inequalities:

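Definition 1.2. For $f,g : \Omega \to \mathbb{R}$, let $\mathcal{E}(f,g) = \mathcal{E}_P(f,g)$ denote the Dirichlet form,
\[ \mathcal{E}(f,g) = \langle f, (I-P)g\rangle_\pi = \sum_{x,y} f(x)\,(g(x) - g(y))\,P(x,y)\,\pi(x). \]
If $f = g$ then
\[ \mathcal{E}(f,f) = \frac12 \sum_{x,y\in\Omega} (f(x) - f(y))^2\, P(x,y)\,\pi(x), \]
and
\[ \mathcal{E}_P(f,f) = \mathcal{E}_{P^*}(f,f) = \mathcal{E}_{\frac{P+P^*}{2}}(f,f), \]
while if $P$ is reversible then also $\mathcal{E}(f,g) = \mathcal{E}(g,f)$.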
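The identities for the Dirichlet form can be checked numerically. The Python sketch below evaluates the form both as the inner product $\langle f,(I-P)g\rangle_\pi$ and as the sum of squared differences, and compares $P$ with its time-reversal $P^*$. The biased walk on a 3-cycle, the test functions, and the helper names are illustrative assumptions.

```python
def apply(M, f):
    """(M f)(x) = sum_y M(x, y) f(y)."""
    n = len(f)
    return [sum(M[x][y] * f[y] for y in range(n)) for x in range(n)]

def inner(f, g, pi):
    """Inner product <f, g>_pi = sum_x pi(x) f(x) g(x)."""
    return sum(p * a * b for p, a, b in zip(pi, f, g))

def dirichlet(f, g, P, pi):
    """Dirichlet form E(f, g) = <f, (I - P) g>_pi."""
    Pg = apply(P, g)
    return inner(f, [gx - pgx for gx, pgx in zip(g, Pg)], pi)

def dirichlet_squares(f, P, pi):
    """E(f, f) written as half the pi-weighted sum of squared differences."""
    n = len(f)
    return 0.5 * sum((f[x] - f[y]) ** 2 * P[x][y] * pi[x]
                     for x in range(n) for y in range(n))

def time_reversal(P, pi):
    """P*(x, y) = pi(y) P(y, x) / pi(x)."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]

# Hypothetical non-reversible example: biased walk on a 3-cycle (uniform pi).
P = [[0, 2 / 3, 1 / 3],
     [1 / 3, 0, 2 / 3],
     [2 / 3, 1 / 3, 0]]
pi = [1 / 3, 1 / 3, 1 / 3]
P_star = time_reversal(P, pi)
f = [1.0, 2.0, 4.0]
g = [1.0, 0.0, 0.0]

print(dirichlet(f, f, P, pi), dirichlet_squares(f, P, pi), dirichlet(f, f, P_star, pi))
print(dirichlet(f, g, P, pi), dirichlet(g, f, P, pi))  # differ: P is not reversible
```

The first line prints three equal values, while the second shows that symmetry in the two arguments can fail for a non-reversible chain.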

1.2 Continuous Time

Many mixing time results arise naturally in the continuous time setting, so we consider this case

first.

Let $L$ denote the (discrete) Laplacian operator given by $L = -(I-P)$. Then for $t \ge 0$, $H_t = e^{tL}$
represents the continuized chain [1] (or the heat kernel) corresponding to the discrete Markov kernel
$P$. The continuized chain simply represents a Markov process $\{X_t\}_{t\ge 0}$ in $\Omega$ with initial distribution
$\mu_0$ (say), and transition matrices
\[ H_t = e^{-t(I-P)} = \sum_{n=0}^{\infty} \frac{t^n L^n}{n!} = e^{-t} \sum_{n=0}^{\infty} \frac{t^n P^n}{n!}, \qquad t \ge 0, \]
with the generator $L = -(I-P)$. Thus $H_t(x,y)$ denotes the probability that the rate one continuous
Markov chain having started at $x$ is at $y$ at time $t$. Let $h_t^x(y) = H_t(x,y)/\pi(y)$, for each $y \in \Omega$,
denote its density with respect to $\pi$ at time $t \ge 0$, and $h_t(y)$ when the start state or the start
distribution is unimportant or clear from the context. Also, let
\[ H_t^* = e^{tL^*} = \sum_{n=0}^{\infty} \frac{t^n (L^*)^n}{n!} \]
be the semigroup associated to the dual $L^* = -(I - P^*)$. The following is an elementary and useful
technical fact.

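Lemma 1.3. For any $h_0$ and all $t \ge 0$, $h_t = H_t^* h_0$. Consequently, for any $x \in \Omega$,
\[ \frac{d h_t(x)}{dt} = L^* h_t(x). \]
Using Lemma 1.3, the following lemma is easy to establish.
Lemma 1.4.
\[ \frac{d}{dt}\mathrm{Var}(h_t) = -2\,\mathcal{E}(h_t,h_t) \tag{1.2} \]
\[ \frac{d}{dt}\mathrm{Ent}(h_t) = -\mathcal{E}(h_t,\log h_t) \tag{1.3} \]
Proof. Indeed,
\[ \frac{d}{dt}\mathrm{Var}(h_t) = \int \frac{d}{dt} h_t^2 \, d\pi = 2\int h_t\, L^* h_t \, d\pi = 2\int L(h_t)\, h_t \, d\pi = -2\,\mathcal{E}(h_t,h_t), \]
\[ \frac{d}{dt}\mathrm{Ent}(h_t) = \int \frac{d}{dt}\big(h_t \log h_t\big)\, d\pi = \int (\log h_t + 1)\, L^* h_t \, d\pi = \int L(\log h_t)\, h_t \, d\pi = -\mathcal{E}(h_t,\log h_t). \]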

The above motivates the following definitions of the spectral gap $\lambda$ and the entropy constant $\rho_0$.
Definition 1.5. Let $\lambda > 0$ and $\rho_0 > 0$ be the optimal constants in the inequalities:
\[ \lambda\,\mathrm{Var}_\pi f \le \mathcal{E}(f,f), \quad \text{for all } f : \Omega \to \mathbb{R}, \]
\[ \rho_0\,\mathrm{Ent}_\pi f \le \mathcal{E}(f,\log f), \quad \text{for all } f : \Omega \to \mathbb{R}_+. \tag{1.4} \]
Lemma 1.19 (Courant-Fischer theorem) shows that for a reversible Markov chain, the second
largest eigenvalue $\lambda_1$ (of $P$) satisfies the simple relation $1 - \lambda_1 = \lambda$. However, reversibility is not
needed for the following result.

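Corollary 1.6. Let $\pi_* = \min_{x\in\Omega} \pi(x)$. Then,
\[ \tau_2(\epsilon) = \min\{ t : \forall x \in \Omega,\ \sqrt{\mathrm{Var}(h_t^x)} = \|h_t^x - 1\|_2 \le \epsilon \} \le \frac{1}{\lambda}\left( \frac12 \log\frac{1-\pi_*}{\pi_*} + \log\frac{1}{\epsilon} \right) \tag{1.5} \]
\[ \tau_D(\epsilon) = \min\{ t : \forall x \in \Omega,\ \mathrm{Ent}(h_t^x) = D(H_t(x,\cdot)\,\|\,\pi) \le \epsilon \} \le \frac{1}{\rho_0}\left( \log\log\frac{1}{\pi_*} + \log\frac{1}{\epsilon} \right) \tag{1.6} \]
Proof. Simply solve the differential equations,
\[ \frac{d}{dt}\mathrm{Var}(h_t) = -2\,\mathcal{E}(h_t,h_t) \le -2\lambda\,\mathrm{Var}(h_t) \tag{1.7} \]
and
\[ \frac{d}{dt}\mathrm{Ent}(h_t) = -\mathcal{E}(h_t,\log h_t) \le -\rho_0\,\mathrm{Ent}(h_t) \tag{1.8} \]
and note that $\mathrm{Var}(h_0) \le \frac{1-\pi_*}{\pi_*}$ and $\mathrm{Ent}(h_0) \le \log\frac{1}{\pi_*}$ (e.g. by equation (1.1)).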
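The exponential decay of variance behind the bound (1.5) can be observed numerically. The Python sketch below builds the heat kernel from the truncated series $e^{-t}\sum_n t^n P^n/n!$ and compares the variance of the density at time $t$ with $e^{-2\lambda t}$ times its initial value. The three-state lazy chain, the value $\lambda = 3/4$ (which holds for this particular chain, since $I - P$ is $3/4$ times the projection onto mean-zero functions), the truncation length, and the helper names are all assumptions made for illustration.

```python
from math import exp

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def heat_kernel(P, t, terms=60):
    """H_t = e^{-t} sum_n t^n P^n / n!, truncated after `terms` terms."""
    n = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    H = [[0.0] * n for _ in range(n)]
    weight = exp(-t)  # equals e^{-t} t^k / k! for the current k
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                H[i][j] += weight * power[i][j]
        power = mat_mul(power, P)
        weight *= t / (k + 1)
    return H

def variance_of_density(H, pi, x):
    """Var_pi of the density h_t^x(y) = H_t(x, y) / pi(y)."""
    return sum((H[x][y] / pi[y] - 1.0) ** 2 * pi[y] for y in range(len(pi)))

# Hypothetical example: lazy chain on three states, uniform pi, gap 3/4.
P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
pi = [1 / 3, 1 / 3, 1 / 3]
lam = 0.75
t = 1.0

H = heat_kernel(P, t)
var_t = variance_of_density(H, pi, 0)
var_0 = (1 - min(pi)) / min(pi)       # variance of a point-mass density
bound = exp(-2 * lam * t) * var_0     # decay predicted by (1.7)
print(var_t, bound)
```

For this chain the two printed numbers agree, so the variance estimate is attained; in general one only expects `var_t <= bound`.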

It is worth noting here that the above functional constants $\lambda$ and $\rho_0$ indeed capture the rate of
decay of variance and relative entropy, respectively, of $H_t^* f$ for $t > 0$:
Proposition 1.7. If $c > 0$ then
(a) $\mathrm{Var}_\pi(H_t^* f) \le e^{-2ct}\,\mathrm{Var}_\pi f$, for all $f$ and $t > 0$, if and only if $\lambda \ge c$.
(b) $\mathrm{Ent}_\pi(H_t^* f) \le e^{-ct}\,\mathrm{Ent}_\pi f$, for all $f > 0$ and $t > 0$, if and only if $\rho_0 \ge c$.
Proof. The "if" part of the proofs follows from (1.7) and (1.8). The "only if" part is also rather elementary
and we bother only with that of part (b): Starting with the hypothesis, we may say, for every $f > 0$,
and for $t > 0$,
\[ \frac{1}{t}\left( \mathrm{Ent}_\pi(H_t^* f) - \mathrm{Ent}_\pi f \right) \le \frac{1}{t}\left( e^{-ct} - 1 \right)\mathrm{Ent}_\pi f. \]
Letting $t \downarrow 0$, we get $-\mathcal{E}(f,\log f) \le -c\,\mathrm{Ent}_\pi f$.

While there have been several techniques (linear-algebraic and functional-analytic) to help

bound the spectral gap, the analogous problem of getting good estimates on $\rho_0$ seems challenging.

The following inequality relating the two Dirichlet forms introduced above also motivates the study

of the classical logarithmic Sobolev inequality.

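Lemma 1.8. If $f \ge 0$ then
\[ 2\,\mathcal{E}(\sqrt{f},\sqrt{f}) \le \mathcal{E}(f,\log f). \]
Proof. Observe that
\[ a(\log a - \log b) = 2a \log\frac{\sqrt{a}}{\sqrt{b}} \ge 2a\left(1 - \frac{\sqrt{b}}{\sqrt{a}}\right) = 2\sqrt{a}\,(\sqrt{a} - \sqrt{b}) \]
by the relation $\log c \ge 1 - c^{-1}$. Then
\[ \mathcal{E}(f,\log f) = \sum_{x,y} f(x)\,(\log f(x) - \log f(y))\,P(x,y)\,\pi(x) \ge 2\sum_{x,y} f^{1/2}(x)\,(f^{1/2}(x) - f^{1/2}(y))\,P(x,y)\,\pi(x) = 2\,\mathcal{E}(\sqrt{f},\sqrt{f}). \]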

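Let $\rho(P) > 0$ denote the logarithmic Sobolev constant of $P$ defined as follows.
Definition 1.9.
\[ \rho = \rho(P) = \inf_{\mathrm{Ent}_\pi f^2 \ne 0} \frac{\mathcal{E}(f,f)}{\mathrm{Ent}_\pi f^2}. \]
Proposition 1.10. For every irreducible chain $P$,
\[ 2\rho \le \rho_0 \le 2\lambda. \]
Proof. The first inequality is immediate, using Lemma 1.8. The second follows from applying (1.4)
to functions $f = 1 + \epsilon g$, for $g \in L^\infty(\pi)$ with $\mathrm{E}_\pi g = 0$. Assume $\epsilon \ll 1$, so that $f \ge 0$. Then using the
Taylor approximation $\log(1 + \epsilon g) = \epsilon g - \frac12 \epsilon^2 g^2 + o(\epsilon^2)$, we may write
\[ \mathrm{Ent}_\pi(f) = \frac12 \epsilon^2\, \mathrm{E}_\pi(g^2) + o(\epsilon^2) \]
and
\[ \mathcal{E}(f,\log f) = -\epsilon\, \mathrm{E}_\pi\big((Lg)\log(1 + \epsilon g)\big) = \epsilon^2\, \mathcal{E}(g,g) + o(\epsilon^2). \]
Thus starting from (1.4), and applying to $f$ as above, we get
\[ \epsilon^2\, \mathcal{E}(g,g) \ge \frac{\rho_0}{2}\, \epsilon^2\, \mathrm{E}_\pi g^2 + o(\epsilon^2). \]
Canceling $\epsilon^2$ and letting $\epsilon \downarrow 0$ yields the second inequality of the proposition, since $\mathrm{E}_\pi g = 0$
and hence $\mathrm{E}_\pi g^2 = \mathrm{Var}_\pi g$.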

Remark 1.11. The relation $2\rho \le 2\lambda$ can be strengthened somewhat to $\rho \le \lambda/2$ by a direct
application of the method used above. Also, under the additional assumption of reversibility, the
inequality in Lemma 1.8 can be strengthened by a factor of 2 to match this, as explained in [21],
in turn improving the above proposition to $4\rho \le \rho_0$ for reversible chains.

1.3 Discrete Time

In discrete time we consider two approaches to mixing time, both of which are equivalent. The first

approach involves operator norms, and is perhaps the more intuitive of the two methods.

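Proposition 1.12. In discrete time,
\[ \tau_2(\epsilon) \le \left\lceil \frac{1}{1 - \|P^*\|} \log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil, \]
where
\[ \|P^*\| = \sup_{f:\Omega\to\mathbb{R},\ \mathrm{E}f = 0} \frac{\|P^* f\|_2}{\|f\|_2}. \]
This result has appeared in mixing time literature in many equivalent forms. A few can be
found in Remark 1.17 at the end of this section.
Proof. Since $k_{i+1} - 1 = P^*(k_i - 1)$ and $\mathrm{E}(k_i - 1) = 0$ for all $i$, then
\[ \mathrm{Var}(k_n) = \|k_n - 1\|_2^2 = \|P^{*n}(k_0 - 1)\|_2^2 \le \left( \|P^*\|^n\, \|k_0 - 1\|_2 \right)^2 = \|P^*\|^{2n}\, \mathrm{Var}(k_0). \]
Solving for when variance drops to $\epsilon^2$ and using the approximations $\log x \le -(1-x)$ and
$\mathrm{Var}(k_0) \le \frac{1-\pi_*}{\pi_*}$ gives the result.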

A good example in which this bound has been used in practice can be found in Section 5.2, in

which we discuss a recent proof that Pollard's Rho algorithm for discrete logarithm requires order
$\sqrt{n}\,\log^3 n$ steps to detect a collision, and likely determine the discrete log.

In Proposition 1.12 the mixing bound followed almost immediately from the definition. However,

there is an alternate approach to this problem which bears more of a resemblance to the continuous

time result and is more convenient for showing refined bounds.

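The discrete time analog of differentiation is to take the difference $\mathrm{Var}(P^* f) - \mathrm{Var}(f)$. The
analog of equation (1.2) is the following lemma of Mihail [38], as formulated by Fill [25].
Lemma 1.13. Given Markov chain $P$ and function $f : \Omega \to \mathbb{R}$, then
\[ \mathrm{Var}(P^* f) - \mathrm{Var}(f) = -\mathcal{E}_{PP^*}(f,f) \le -\mathrm{Var}(f)\,\lambda_{PP^*}. \]
Proof. Since $\mathrm{E}_\pi f = \mathrm{E}_\pi(Kf)$ for any transition probability matrix $K$ with stationary distribution $\pi$, then
\[ \mathrm{Var}(P^* f) - \mathrm{Var}(f) = \langle P^* f, P^* f\rangle_\pi - \langle f, f\rangle_\pi = -\langle f, (I - PP^*) f\rangle_\pi, \]
giving the equality. The inequality follows from the definition of $\lambda_{PP^*}$.
Observe that $1 - \lambda_{PP^*}$ is the square of the largest non-trivial singular value of $P$.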
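The identity in Lemma 1.13 is easy to test numerically. The Python sketch below forms the time-reversal $P^*$ and the multiplicative reversibilization $PP^*$ for a small non-reversible chain and compares the one-step change in variance with the Dirichlet form of $PP^*$. The biased 3-cycle walk, the test function, and the helper names are illustrative assumptions.

```python
def apply(M, f):
    """(M f)(x) = sum_y M(x, y) f(y)."""
    n = len(f)
    return [sum(M[x][y] * f[y] for y in range(n)) for x in range(n)]

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def time_reversal(P, pi):
    """P*(x, y) = pi(y) P(y, x) / pi(x)."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]

def variance(f, pi):
    """Var_pi(f) = E_pi f^2 - (E_pi f)^2."""
    mean = sum(p * v for p, v in zip(pi, f))
    return sum(p * v * v for p, v in zip(pi, f)) - mean * mean

def dirichlet(f, K, pi):
    """E_K(f, f) = 1/2 sum_{x,y} (f(x) - f(y))^2 K(x, y) pi(x)."""
    n = len(f)
    return 0.5 * sum((f[x] - f[y]) ** 2 * K[x][y] * pi[x]
                     for x in range(n) for y in range(n))

# Hypothetical non-reversible example: biased walk on a 3-cycle (uniform pi).
P = [[0, 2 / 3, 1 / 3],
     [1 / 3, 0, 2 / 3],
     [2 / 3, 1 / 3, 0]]
pi = [1 / 3, 1 / 3, 1 / 3]
f = [1.0, 2.0, 4.0]

P_star = time_reversal(P, pi)
PP_star = mat_mul(P, P_star)

change = variance(apply(P_star, f), pi) - variance(f, pi)
form = dirichlet(f, PP_star, pi)
print(change, -form)  # the two numbers should agree
```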

We now use this lemma to bound mixing time of a discrete time chain. It will be seen that

mixing in discrete time is related to the eigenvalue gap of the multiplicative reversibilization $PP^*$,
whereas in the previous section we found that mixing time in continuous time is related to the
eigenvalue gap of the additive reversibilization $\frac{P+P^*}{2}$ (since $\lambda = \lambda_P = \lambda_{\frac{P+P^*}{2}}$).

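Corollary 1.14. In discrete time a (non-reversible) Markov chain satisfies
\[ \tau_2(\epsilon) \le \left\lceil \frac{2}{\lambda_{PP^*}} \log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil \le \left\lceil \frac{1}{\alpha\lambda} \log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil \]
where $\alpha \in [0,1]$ is such that $\forall x \in \Omega : P(x,x) \ge \alpha$. For a reversible Markov chain
\[ \tau_2(\epsilon) \le \left\lceil \frac{1}{1 - \lambda_{\max}} \log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil \le \left\lceil \frac{1}{\min\{2\alpha, \lambda\}} \log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil \]
where $\lambda_{\max} = \max\{\lambda_1, |\lambda_{n-1}|\}$ when $\lambda_1 = 1 - \lambda$ is the largest non-trivial eigenvalue of $P$ and
$\lambda_{n-1} \ge -1$ is the smallest eigenvalue.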

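Proof. If $k_0$ is the initial density then $k_n = (P^*)^n k_0$, and so by Lemma 1.13,
$\mathrm{Var}(k_n) \le \mathrm{Var}(k_{n-1})(1 - \lambda_{PP^*})$. By induction
\[ \mathrm{Var}(k_n) \le \mathrm{Var}(k_0)\,(1 - \lambda_{PP^*})^n. \tag{1.9} \]
Solving for when variance drops to $\epsilon^2$ and using the approximation $\log(1 - \lambda_{PP^*}) \le -\lambda_{PP^*}$ gives
the first bound.
For the second bound, $\forall x \in \Omega : P^*(x,x) = P(x,x) \ge \alpha$ and so, for $x \ne y$,
\[ \pi(x)PP^*(x,y) \ge \pi(x)P(x,x)P^*(x,y) + \pi(x)P(x,y)P^*(y,y) \ge \alpha\,\pi(y)P(y,x) + \alpha\,\pi(x)P(x,y). \]
Then
\[ \mathcal{E}_{PP^*}(f,f) \ge \alpha\,\mathcal{E}_{P^*}(f,f) + \alpha\,\mathcal{E}_P(f,f) = 2\alpha\,\mathcal{E}(f,f) \]
and so $\lambda_{PP^*} \ge 2\alpha\lambda$.
For the reversible case, Lemma 1.18 shows that $P$ has an eigenbasis. If $\lambda_i$ is an eigenvalue of $P$
with corresponding right eigenvector $v_i$ then $PP^* v_i = P^2 v_i = \lambda_i^2 v_i$, and so the eigenvalues of $PP^*$
are just $\{\lambda_i^2\}$. By Lemma 1.19 (i.e. Courant-Fischer) it follows that
\[ \lambda_{PP^*} = \lambda_{P^2} = 1 - \max\{\lambda_1^2, \lambda_{n-1}^2\} = 1 - \lambda_{\max}^2. \]
Solving equation (1.9) then gives the first reversible bound.
Finally, if $P$ is reversible then $\frac{\lambda_{n-1} - \alpha}{1 - \alpha}$ is the smallest eigenvalue of the reversible Markov chain
$\frac{P - \alpha I}{1 - \alpha}$, so Lemma 1.18 shows that $\frac{\lambda_{n-1} - \alpha}{1 - \alpha} \ge -1$. Re-arranging the inequality gives the relation
$-\lambda_{n-1} \le 1 - 2\alpha$, and so $\lambda_{\max} = \max\{\lambda_1, -\lambda_{n-1}\} \le 1 - \min\{\lambda, 2\alpha\}$.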

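Remark 1.15. As mentioned at the beginning of this section, the two approaches to bounding
mixing in this section are equivalent:
\[ 1 - \lambda_{PP^*} = \sup_{f:\Omega\to\mathbb{R}} \frac{\mathrm{Var}(f) - \mathcal{E}_{PP^*}(f,f)}{\mathrm{Var}(f)} = \sup_{f:\Omega\to\mathbb{R},\ \mathrm{E}f=0} \frac{\langle f,f\rangle_\pi - \langle f,(I - PP^*)f\rangle_\pi}{\langle f,f\rangle_\pi} = \sup_{f:\Omega\to\mathbb{R},\ \mathrm{E}f=0} \frac{\langle P^*f, P^*f\rangle_\pi}{\langle f,f\rangle_\pi} = \|P^*\|^2. \]
The second equality is because the numerator and denominator are invariant under addition of a
constant to $f$, so it may be assumed that $\mathrm{E}f = 0$.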

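Our concluding remark will require knowledge of the $L^p \to L^q$ operator norm:
Definition 1.16. Suppose $T : \mathbb{R}^\Omega \to \mathbb{R}^\Omega$ is an operator taking functions $f : \Omega \to \mathbb{R}$ to other such
functions. Then, given $p,q \in [1,\infty]$, let $\|T\|_{p\to q}$ be the optimal constant in the inequality
\[ \|Tf\|_q \le \|T\|_{p\to q}\, \|f\|_p, \quad \text{for all } f : \Omega \to \mathbb{R}. \]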

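Remark 1.17. It has already been seen that $\|P^*\| = \sqrt{1 - \lambda_{PP^*}}$. We now consider a few other
equivalent forms which have appeared in mixing bounds equivalent to Proposition 1.12.
First, consider the operator norm. Let $E$ denote the expectation operator, that is, $E$ is a square
matrix with rows all equal to $\pi$. Then $PE = EP = E$, and likewise for $P^*$, so $(P^* - E)f = (P^* - E)(f - \mathrm{E}f)$. Since
also $\|f\|_2 \ge \|f - \mathrm{E}f\|_2$, the supremum in the definition of $\|P^* - E\|_{2\to 2}$ will occur with $\mathrm{E}f = 0$,
in which case also $(P^* - E)f = P^*f$. In short, $\|P^* - E\|_{2\to 2} = \|P^*\|$.
It may seem more intuitive to work with $P$ instead of $P^*$. In fact, both cases are the same:
\[ \|P^* - E\|_{2\to 2} = \sup_{\|f\|_2 = 1} \|(P^* - E)f\|_2 = \sup_{\|f\|_2 = 1}\ \sup_{\|g\|_2 = 1} |\langle (P^* - E)f, g\rangle_\pi| = \sup_{\|f\|_2 = 1}\ \sup_{\|g\|_2 = 1} |\langle f, (P - E)g\rangle_\pi| = \sup_{\|g\|_2 = 1}\ \sup_{\|f\|_2 = 1} |\langle (P - E)g, f\rangle_\pi| = \|P - E\|_{2\to 2}. \]
The second equality is a form of $L^p$ duality, because $\|f\|_p = \sup_{\|g\|_q = 1} |\langle f,g\rangle_\pi|$ when $1/p + 1/q = 1$,
or equivalently $\|f\|_p = \sup_{\|g\|_q = 1} \left| \int fg \, d\pi \right|$. This is really just an extension of the dot product
property that if $\|g\| = 1$ then $f \cdot g = \|f\|\,\|g\| \cos\theta$ is maximized by $g = f/\|f\|$.
Some authors have worked with complex valued functions. Note that if $f : \Omega \to \mathbb{C}$ and $T$ is a
real valued square matrix then
\[ \|Tf\|_2^2 = \|T(\mathrm{Re} f)\|_2^2 + \|T(\mathrm{Im} f)\|_2^2 \le \|T\|_{2\to 2}^2 \left( \|\mathrm{Re} f\|_2^2 + \|\mathrm{Im} f\|_2^2 \right) = \|T\|_{2\to 2}^2\, \|f\|_2^2, \]
that is, the worst choice of $f$ is a real function.
In summary,
\[ \|P^*\| = \|P^* - E\|_{2\to 2} = \|P - E\|_{2\to 2} = \sup_{f:\Omega\to\mathbb{C}} \frac{\|(P - E)f\|_2}{\|f\|_2} = \sup_{f:\Omega\to\mathbb{C},\ \mathrm{E}f = 0} \frac{\|Pf\|_2}{\|f\|_2}. \]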

1.4 Does Reversibility Matter?

Many mixing results were originally shown only in the context of a reversible Markov chain. In this

survey we are able to avoid this requirement in most cases. However, there are still circumstances

under which reversible and non-reversible chains behave differently, and not just as an artifact of

the analysis. In this section we discuss these differences, and also prove a few classical lemmas

about reversible chains which were used in the discrete time results given above, and which explain

why the reversibility assumption is helpful.

The difference between reversible and non-reversible results is most apparent when upper
and lower bounds on distances are given. Combining the above work, recalling that
$\lambda_{\max} = \max\{\lambda_1, |\lambda_{n-1}|\}$, and the lower bounds of Theorem 4.5, we have
\[ \text{if } P \text{ reversible:} \qquad \frac12 |\lambda_{\max}|^n \le d(n) \le \frac12 |\lambda_{\max}|^n \sqrt{\frac{1-\pi_*}{\pi_*}}, \]
\[ \text{if non-reversible:} \qquad \frac12 \max_{i>0} |\lambda_i|^n \le d(n) \le \frac12 \left( \sqrt{1 - \lambda_{PP^*}} \right)^n \sqrt{\frac{1-\pi_*}{\pi_*}}, \]
where
\[ d(n) = \max_{x\in\Omega} \|P^n(x,\cdot) - \pi(\cdot)\|_{TV} \]
denotes the worst variation distance after $n$ steps.
In particular, this shows that in the reversible case the variation distance is determined, up to
a multiplicative factor, by the size of the largest magnitude eigenvalue. The rapid mixing property
is then entirely characterized by whether $1 - |\lambda_{\max}|$ is polynomially large or not.

In contrast, Example 5.2 gives a convergent non-reversible chain with complex eigenvalues such
that $\max_{i>0} |\lambda_i| = 1/\sqrt{2}$ but $\lambda_{PP^*} = 0$. The non-reversible lower bound given above then converges
to zero with $n$, as it should, while the upper bound is constant and useless.

Another case in which reversibility will play a key role is comparison of mixing times. This

is a method by which the mixing time of a Markov chain $P$ can be bounded by instead studying
the mixing time of a similar, but easier to analyze, chain $\hat{P}$. For many Markov chains this is the only
way known to bound mixing time. In Theorem 5.3 we find good comparison is possible if $P$ and $\hat{P}$
are reversible, while if $P$ is non-reversible then there is a slight worsening, but if $\hat{P}$ is non-reversible

then the comparison is much worse. Example 5.4 shows that even for walks as simple as those on

a cycle Z/nZ each of these three cases are necessary, and not just artifacts of the method of proof.

The main reason why a reversible Markov chain is better behaved is that it has a complete real

valued spectral decomposition, and because the spectral gap λ is exactly related to eigenvalues of

the reversible chain. For the sake of completeness, we now show these classical properties.

Lemma 1.18. If $P$ is reversible and irreducible on state space of size $|\Omega| = n$, then it has a complete
spectrum of real eigenvalues with magnitudes at most one, that is
\[ 1 = \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{n-1} \ge -1. \]
A non-reversible chain may have complex valued eigenvalues, as in Example 5.2.

Proof. Let $(\sqrt{\pi}) = \mathrm{diag}(\sqrt{\pi(1)}, \sqrt{\pi(2)}, \ldots, \sqrt{\pi(n)})$ denote the diagonal matrix with entries drawn
from $\sqrt{\pi}$. The matrix $M = (\sqrt{\pi})\,P\,(\sqrt{\pi})^{-1}$ is a symmetric matrix because
\[ M(x,y) = \sqrt{\frac{\pi(x)}{\pi(y)}}\, P(x,y) = \frac{\pi(x)P(x,y)}{\sqrt{\pi(x)\pi(y)}} = \frac{\pi(y)P(y,x)}{\sqrt{\pi(x)\pi(y)}} = M(y,x). \]
Since $P$ is similar to the real valued symmetric matrix $M$ it follows from the spectral theorem that
$P$ has a complete spectrum of real eigenvalues and real eigenvectors.
Left eigenvector $v_i$ and right eigenvector $w_j$ are orthogonal if their eigenvalues $\lambda_i \ne \lambda_j$, as
\[ \lambda_i\, v_i w_j = (v_i P) w_j = v_i (P w_j) = \lambda_j\, v_i w_j. \tag{1.10} \]
In particular, eigenvalue 1 has right eigenvector $1$, and so if eigenvalue $\lambda_i \ne 1$ has left eigenvector
$v_i$ then $\sum_x v_i(x) = v_i 1 = 0$. Then $v_i$ has both positive and negative entries, and for $\epsilon$ sufficiently
small $\sigma = \pi + \epsilon v_i$ is a probability distribution. However, the $n$ step distribution is given by
\[ \sigma P^n = (\pi + \epsilon v_i) P^n = \pi + \epsilon \lambda_i^n v_i, \]
and since $v_i$ has a negative entry, if $|\lambda_i| > 1$ then $\sigma P^n$ will have a negative entry for sufficiently
large $n$, contradicting the fact that $\sigma P^n$ is a probability distribution.
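The symmetrization step in this proof can be illustrated numerically. The Python sketch below forms $M(x,y) = \sqrt{\pi(x)/\pi(y)}\,P(x,y)$ and checks whether it is symmetric, once for a reversible birth-and-death chain and once for a non-reversible biased walk on a 3-cycle. Both example chains and the function names are illustrative assumptions.

```python
from math import sqrt

def symmetrize(P, pi):
    """M = diag(sqrt(pi)) P diag(sqrt(pi))^{-1}, entrywise."""
    n = len(P)
    return [[sqrt(pi[x] / pi[y]) * P[x][y] for y in range(n)] for x in range(n)]

def is_symmetric(M, tol=1e-12):
    """True if M(x, y) and M(y, x) agree up to `tol` for all x, y."""
    n = len(M)
    return all(abs(M[x][y] - M[y][x]) <= tol for x in range(n) for y in range(n))

# Hypothetical reversible example: birth-and-death chain on {0, 1, 2}.
P_rev = [[0.50, 0.50, 0.00],
         [0.25, 0.50, 0.25],
         [0.00, 0.50, 0.50]]
pi_rev = [0.25, 0.50, 0.25]

# Hypothetical non-reversible example: biased walk on a 3-cycle.
P_cyc = [[0, 2 / 3, 1 / 3],
         [1 / 3, 0, 2 / 3],
         [2 / 3, 1 / 3, 0]]
pi_cyc = [1 / 3, 1 / 3, 1 / 3]

print(is_symmetric(symmetrize(P_rev, pi_rev)))  # reversible chain
print(is_symmetric(symmetrize(P_cyc, pi_cyc)))  # non-reversible chain
```

Only the reversible chain yields a symmetric matrix, which is what allows the spectral theorem to be applied.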

The Courant-Fischer theorem shows the connection between eigenvalues and Dirichlet forms for

a reversible Markov chain.

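Lemma 1.19. In a reversible Markov chain the eigenvalues satisfy
\[ 1 - \lambda_1 = \inf_{f:\Omega\to\mathbb{R},\ f \ne \mathrm{constant}} \frac{\mathcal{E}(f,f)}{\mathrm{Var}(f)}, \qquad 1 + \lambda_{n-1} = \inf_{f:\Omega\to\mathbb{R},\ f \ne \mathrm{constant}} \frac{\mathcal{F}(f,f)}{\mathrm{Var}(f)}, \]
where
\[ \mathcal{F}(f,f) = \langle f, (I + P)f\rangle_\pi = \frac12 \sum_{x,y\in\Omega} (f(x) + f(y))^2\, P(x,y)\,\pi(x). \]
In particular, $1 - \lambda_1 = \lambda$.
In Section 5.4 we will find that in the non-reversible case this becomes an inequality, with
$1 - \mathrm{Re}\,\lambda_i \ge \lambda$.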

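Proof. The numerator and denominator in the infimum are invariant under adding a constant to $f$,
so it may be assumed that $\mathrm{E}f = 0$, that is, $\langle f, 1\rangle_\pi = 0$.
Let $\{v_i\}$ be a set of right eigenvectors of $P$ forming an orthonormal eigenbasis for $\mathbb{R}^\Omega$, with
$v_0 = 1$. Given $f : \Omega \to \mathbb{R}$ with $\mathrm{E}f = 0$, then $f = \sum_{i>0} c_i v_i$ with $c_i = \langle f, v_i\rangle_\pi$, and so
\[ \mathcal{E}(f,f) = \langle f, (I - P)f\rangle_\pi = \sum_{i,j} c_i c_j \langle v_i, (I - P)v_j\rangle_\pi = \sum_{i>0} c_i^2 (1 - \lambda_i) \ge \sum_{i>0} c_i^2 (1 - \lambda_1), \]
with an equality when $f = v_1$. Also,
\[ \mathrm{Var}(f) = \langle f,f\rangle_\pi - \langle f,1\rangle_\pi^2 = \Big\langle \sum_i c_i v_i, \sum_i c_i v_i \Big\rangle_\pi - \Big\langle \sum_i c_i v_i, 1 \Big\rangle_\pi^2 = \sum_{i>0} c_i^2, \]
and so
\[ \frac{\mathcal{E}(f,f)}{\mathrm{Var}(f)} \ge 1 - \lambda_1, \]
with an equality when $f = v_1$. The result then follows.
The same argument, but with $I + P$ instead of $I - P$, gives the result for $\lambda_{n-1}$.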

Chapter 2

Advanced Functional Techniques

The relation between functional constants and mixing time bounds was studied in Chapter 1. In

this chapter it is shown that information on functions of large variance, or on functions with small

support, can be exploited to show better mixing time bounds.

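The basic argument is this. Recall that $\frac{d}{dt}\mathrm{Var}(h_t) = -2\,\mathcal{E}(h_t,h_t)$. If $\mathcal{E}(f,f) \ge G(\mathrm{Var}(f))$ for
some $G : \mathbb{R}_+ \to \mathbb{R}_+$ and for all $f : \Omega \to \mathbb{R}_+$ with $\mathrm{E}f = 1$, then it follows that
$\frac{d}{dt}\mathrm{Var}(h_t) = -2\,\mathcal{E}(h_t,h_t) \le -2\,G(\mathrm{Var}(h_t))$. Setting $I(t) = \mathrm{Var}(h_t)$, this implies
\[ \tau_2(\epsilon) = \int_0^{\tau_2(\epsilon)} 1\, dt \le \int_{\mathrm{Var}(h_0)}^{\epsilon^2} \frac{dI}{-2\,G(I)}. \tag{2.1} \]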
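As a worked instance of the argument leading to (2.1), take the simplest choice $G(I) = \lambda I$, which is the spectral gap inequality of Definition 1.5. The computation below is a sketch under that assumption; it recovers the bound (1.5) of Corollary 1.6 once the initial estimate $\mathrm{Var}(h_0) \le (1-\pi_*)/\pi_*$ is inserted.

```latex
\begin{align*}
\tau_2(\epsilon)
  &\le \int_{\mathrm{Var}(h_0)}^{\epsilon^2} \frac{dI}{-2\lambda I}
   = \frac{1}{2\lambda} \log\frac{\mathrm{Var}(h_0)}{\epsilon^2} \\
  &\le \frac{1}{\lambda}\left( \frac12 \log\frac{1-\pi_*}{\pi_*} + \log\frac{1}{\epsilon} \right).
\end{align*}
```

Any lower bound $G$ that grows faster than linearly for large variance shortens the first part of this integral, which is how the log-Sobolev and Nash inequalities enter.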

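The argument carries over to discrete time if $G$ is non-decreasing and $\alpha \in [0,1]$ is such that
$\forall x \in \Omega : P(x,x) \ge \alpha$. To see this, observe that by Lemma 1.13 and the bound
$\mathcal{E}_{PP^*}(f,f) \ge 2\alpha\,\mathcal{E}(f,f)$ from the proof of Corollary 1.14,
\[ \mathrm{Var}(k_{n+1}) - \mathrm{Var}(k_n) = -\mathcal{E}_{PP^*}(k_n,k_n) \le -2\alpha\,\mathcal{E}(k_n,k_n) \le -2\alpha\,G(\mathrm{Var}(k_n)). \]
Since $I(n) = \mathrm{Var}(k_n)$ is non-increasing and $G(x)$ is non-decreasing, the piecewise linear extension of $I(n)$ to
$t \in \mathbb{R}_+$ will satisfy
\[ \frac{dI}{dt} \le -2\alpha\,G(I). \]
At integer $t$, the derivative can be taken from either right or left. It follows that
\[ \tau_2(\epsilon) = \int_0^{\tau_2(\epsilon)} 1\, dt \le \left\lceil \int_{\mathrm{Var}(k_0)}^{\epsilon^2} \frac{dI}{-2\alpha\,G(I)} \right\rceil. \tag{2.2} \]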

If instead $\mathcal{E}_{PP^*}(f,f) \ge G(\mathrm{Var}(f))$ then simply drop the $2\alpha$ from this bound.
This idea will shortly be exploited to obtain fairly simple proofs relating the log-Sobolev constant
and Nash inequalities to continuous and discrete time $L^2$ mixing, and to generalize spectral gap
bounds on mixing.

2.1 Log-Sobolev and Nash Inequalities

Some of the best bounds on $L^2$ mixing times were shown by use of the log-Sobolev constant, a
method developed in the finite Markov chain setting by Diaconis and Saloff-Coste [21]. Equation
(2.1) will show a bound in terms of the log-Sobolev constant if we can show a relation between the
Dirichlet form $\mathcal{E}(h_t,h_t)$ and a function of the variance $\mathrm{Var}(h_t)$. The following lemma establishes
this connection.

Lemma 2.1. If $f$ is non-negative then

$$\mathrm{Ent}(f^2) \ge Ef^2 \,\log\frac{Ef^2}{(Ef)^2},$$

and in particular, if $Ef = 1$ then

$$\mathcal{E}(f,f) \ge \rho\,\mathrm{Ent}(f^2) \ge \rho\,(1 + \mathrm{Var}(f))\log(1 + \mathrm{Var}(f)).$$

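Proof. By definition

$$\mathrm{Ent}(f^2) = E\left[f^2 \log\frac{f^2}{Ef^2}\right] = 2\,E\left[f^2 \log\frac{f\,Ef}{Ef^2}\right] + Ef^2 \log\frac{Ef^2}{(Ef)^2}.$$

Now, apply the approximation $\log x \ge 1 - 1/x$:

$$E\left[f^2 \log\frac{f\,Ef}{Ef^2}\right] \ge E\left[f^2\left(1 - \frac{Ef^2}{f\,Ef}\right)\right] = Ef^2 - Ef^2 = 0.$$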

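Those familiar with cross-entropy might prefer to rewrite this proof as follows. Noting that the cross-entropy $H(f,g) = E\left[f \log\frac{f}{g}\right] \ge 0$ for densities $f$, $g$, the proof is equivalent to the statement

$$\mathrm{Ent}(f^2) = 2\,Ef^2\; H\!\left(\frac{f^2}{Ef^2},\,\frac{f}{Ef}\right) + Ef^2 \log\frac{Ef^2}{(Ef)^2} \ge Ef^2 \log\frac{Ef^2}{(Ef)^2}.$$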
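A quick numerical sanity check of the entropy inequality in Lemma 2.1 may help the reader. The sketch below is only an illustration: it draws random distributions $\pi$ and random non-negative functions $f$ on a six-point space (both assumptions of this example) and confirms that $\mathrm{Ent}(f^2) - Ef^2\log\frac{Ef^2}{(Ef)^2}$ is never negative. The helper names are hypothetical.

```python
import math
import random

def expect(g, pi):
    """E_pi g for a function g given as a list of values."""
    return sum(p * v for p, v in zip(pi, g))

def entropy_of_square(f, pi):
    """Ent(f^2) = E[f^2 log(f^2 / E f^2)], using the convention 0 log 0 = 0."""
    f2 = [v * v for v in f]
    m2 = expect(f2, pi)
    return sum(p * v * math.log(v / m2) for p, v in zip(pi, f2) if v > 0)

def lemma_2_1_lower_bound(f, pi):
    """E f^2 * log(E f^2 / (E f)^2), the right-hand side of Lemma 2.1."""
    m1 = expect(f, pi)
    m2 = expect([v * v for v in f], pi)
    return m2 * math.log(m2 / (m1 * m1))

def random_instance(rng, n):
    """Random distribution pi and random positive function f on n points."""
    w = [rng.random() + 0.01 for _ in range(n)]
    total = sum(w)
    pi = [v / total for v in w]
    f = [rng.random() * 5 + 0.01 for _ in range(n)]
    return f, pi

rng = random.Random(0)
gaps = []
for _ in range(1000):
    f, pi = random_instance(rng, 6)
    gaps.append(entropy_of_square(f, pi) - lemma_2_1_lower_bound(f, pi))
print(min(gaps) >= -1e-12)
```

When $f$ is rescaled so that $Ef = 1$, the right-hand side becomes $(1+\mathrm{Var}(f))\log(1+\mathrm{Var}(f))$, which is the form used with equation (2.1).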

Diaconis and Saloff-Coste also showed that the Nash inequality can be used to study $L^2$ mixing [22]. The Dirichlet form can also be lower bounded in terms of variance by using a Nash inequality.

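Lemma 2.2. Given a Nash inequality

$$\|f\|_2^{2+1/D} \le C\left[\mathcal{E}(f,f) + \frac{1}{T}\|f\|_2^2\right]\|f\|_1^{1/D}$$

which holds for every function $f : \Omega \to \mathbb{R}$ and some constants $C, D, T \in \mathbb{R}_+$, then whenever $f \ge 0$ and $Ef = 1$,

$$\mathcal{E}(f,f) \ge (1 + \mathrm{Var}(f))\left(\frac{(1 + \mathrm{Var}(f))^{1/D}}{C} - \frac{1}{T}\right).$$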

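Proof. The Nash inequality can be rewritten as

$$\mathcal{E}(f,f) \ge \|f\|_2^2\left(\frac{1}{C}\left(\frac{\|f\|_2}{\|f\|_1}\right)^{1/D} - \frac{1}{T}\right).$$

However, $\|f\|_1 = E|f| = 1$, and $\mathrm{Var}(f) = \|f\|_2^2 - 1$, giving the result.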


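Corollary 2.3. Given the spectral gap $\lambda$ and the log-Sobolev constant $\rho$ and/or a Nash inequality with $DC \ge T$ and $D \ge 2$, and given $\epsilon \le 2$, then the continuous time Markov chain satisfies

$$\tau_2(\epsilon) \le \frac{1}{2\rho}\log\log\frac{1}{\pi_*} + \frac{1}{\lambda}\left(\frac{1}{4} + \log\frac{1}{\epsilon}\right)$$

$$\tau_2(\epsilon) \le T + \frac{1}{\lambda}\left(\frac{D}{2}\log\frac{DC}{T} + \log\frac{1}{\epsilon}\right)$$

$$\tau_2(\epsilon) \le T + \frac{1}{2\rho}\log\log\left(\frac{DC}{T}\right)^{D} + \frac{1}{\lambda}\left(\frac{1}{4} + \log\frac{1}{\epsilon}\right)$$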

Upper bounds for the discrete time Markov chain are a factor of 2 larger when Nash, log-Sobolev and spectral gap are computed in terms of the chain $PP^*$, while when computed for $P$ they are a factor $\alpha^{-1}$ larger for $\alpha \in [0,1]$ satisfying $\forall x \in \Omega : P(x,x) \ge \alpha$.

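Proof. Apply equation (2.1) with the log-Sobolev bound of Lemma 2.1 when $\mathrm{Var}(h_t) \ge 4$, and the spectral gap bound $\mathcal{E}(f,f) \ge \lambda\,\mathrm{Var}(f)$ when $\mathrm{Var}(h_t) < 4$, to obtain

$$\tau_2(\epsilon) \le \int_{\mathrm{Var}(h_0)}^{4} \frac{dI}{-2\rho\,(1 + I)\log(1 + I)} + \int_{4}^{\epsilon^2} \frac{dI}{-2\lambda I} = \frac{1}{-2\rho}\Big(\log\log(1 + 4) - \log\log(1 + \mathrm{Var}(h_0))\Big) + \frac{1}{-2\lambda}\log\frac{\epsilon^2}{4}.$$

Simplify this by $\mathrm{Var}(h_0) \le \frac{1-\pi_*}{\pi_*}$, and apply $\rho \le \lambda/2$ to the $\log\log(5)$ term.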

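For the second mixing bound use the Nash bound of Lemma 2.2 when $\mathrm{Var}(h_t) \ge (DC/T)^{D} - 1$, and the spectral bound when $\mathrm{Var}(h_t) < (DC/T)^{D} - 1$. The Nash portion of the integral is then

$$\int_{\mathrm{Var}(h_0)}^{(DC/T)^{D}-1} \frac{dI}{-2(1 + I)\left(\frac{(1+I)^{1/D}}{C} - \frac{1}{T}\right)} = -\frac{DT}{2}\log\left(1 - \frac{C/T}{(1 + I)^{1/D}}\right)\Bigg|_{\mathrm{Var}(h_0)}^{(DC/T)^{D}-1} \le -\frac{DT}{2}\log(1 - 1/D) \le \frac{DT}{2(D - 1)} \le T.$$

The second inequality is because $\log(1 - 1/x) \ge -\frac{1}{x-1}$.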

For the third bound use the Nash bound when $\mathrm{Var}(h_t) \ge (DC/T)^{D} - 1$, the log-Sobolev bound for $(DC/T)^{D} - 1 > \mathrm{Var}(h_t) \ge 4$ and the spectral bound when $\mathrm{Var}(h_t) < 4$.

For the discrete time case proceed similarly, but with equation (2.2) instead of equation (2.1).

The continuous time log-Sobolev bound is comparable to a result of Diaconis and Saloff-Coste

[21], while the discrete time log-Sobolev bound is comparable to a bound of Miclo [37].

For a reversible, continuous time chain, hypercontractivity ideas can be used to improve the log-Sobolev portion of these bounds by a factor of two. A tedious differentiation and a few approximations (see around equation (3.2) of [21]) show that for any $t_0 \in \mathbb{R}$, a reversible chain will satisfy

$$\frac{d}{dt}\,\|h_t\|_{1+e^{4\rho(t-t_0)}} \le 0,$$

and so the norm is decreasing in $t$. If $t_0 = k + \frac{1}{4\rho}\log\left[\log(1 + \mathrm{Var}(h_k)) - 1\right]$ for some $k \in \mathbb{R}_+$ then

$$\|h_{t_0}\|_2 \le \|h_k\|_{1+e^{-4\rho(t_0-k)}} \le \|h_k\|_1^{\,1-2/\log(1+\mathrm{Var}(h_k))}\;\|h_k\|_2^{\,2/\log(1+\mathrm{Var}(h_k))} = e.$$


The second inequality was the relation $\|f\|_{q^*} \le \|f\|_1^{1-2/q}\,\|f\|_2^{2/q}$ when $q \ge 2$ and $q^* \in [1,2]$ is the conjugate exponent (i.e. $1/q + 1/q^* = 1$), see Chapter 8, Lemma 41 of [1]. The equality is because $\|h_k\|_1 = 1$ and $\|h_k\|_2^2 = 1 + \mathrm{Var}(h_k)$.

This shows that between times $k$ and $t_0$ the variance drops from $\mathrm{Var}(h_k)$ to $\mathrm{Var}(h_{t_0}) = \|h_{t_0}\|_2^2 - 1 \le e^2 - 1$. When combined with our earlier Nash and spectral work we obtain

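$$\tau_2(\epsilon) \le \int_{\mathrm{Var}(h_0)}^{(DC/T)^{D}-1} \frac{dI}{-2(1 + I)\left(\frac{(1+I)^{1/D}}{C} - \frac{1}{T}\right)} + \frac{1}{4\rho}\log\left[\log\left(\frac{DC}{T}\right)^{D} - 1\right] + \int_{e^2-1}^{\epsilon^2} \frac{dI}{-2\lambda I} < T + \frac{1}{4\rho}\log\log\left(\frac{DC}{T}\right)^{D} + \frac{1}{\lambda}\left(1 + \log\frac{1}{\epsilon}\right).$$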

The factor of two loss in the non-reversible case seems to be unavoidable, and is likely due to the fact that $\mathcal{E}_P(f,f) = \mathcal{E}_{P^*}(f,f) = \mathcal{E}_{\frac{P+P^*}{2}}(f,f)$ and so $\rho$ and $\lambda$ cannot capture the difference between the non-reversible chain $P$ and its additive reversibilization $\frac{P+P^*}{2}$. For instance, in Corollary 1.14 and Remark 1.11, reversibility also affected the mixing time by a factor of two.

2.2 Spectral profile

Faber-Krahn inequalities were developed by Grigor’yan, Coulhon and Pittet [15] (see also [27] and

[16]) to study the rate of decay of the heat kernel, and in the finite Markov setting by Goel,

Montenegro and Tetali [26]. The results of Chapter 1 will be improved by proving a better lower

bound on E(f,f) and then solving a subsequent differential equation.

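Definition 2.4. For a non-empty subset $S \subset \Omega$ the first Dirichlet eigenvalue on $S$ is given by

$$\lambda_1(S) = \inf_{f \in c_0^+(S)} \frac{\mathcal{E}(f,f)}{\mathrm{Var}(f)}$$

where $c_0^+(S) = \{f \ge 0 : \mathrm{supp}(f) \subset S\}$ is the set of non-negative functions supported on $S$. The spectral profile $\Lambda : [\pi_*,\infty) \to \mathbb{R}$ is given by $\Lambda(r) = \inf_{\pi_* \le \pi(S) \le r} \lambda_1(S)$.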

To utilize the spectral profile Λ(r) in studying mixing time we require a lemma relating the

Dirichlet form E(f,f) to Λ(r), improving on the basic bound E(f,f) ≥ λVar(f) used earlier.

Lemma 2.5. For every non-constant function $f : \Omega \to \mathbb{R}_+$,

$$\mathcal{E}(f,f) \ge \frac{1}{2}\,\Lambda\!\left(\frac{4(Ef)^2}{\mathrm{Var}\,f}\right)\mathrm{Var}(f).$$

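Proof. Given $a \in \mathbb{R}$ use the notation $a_+ = \max\{a,0\}$ to denote the positive part. For $c$ constant, $\mathcal{E}(f,f) = \mathcal{E}(f-c,f-c)$. Also, $\mathcal{E}(f-c,f-c) \ge \mathcal{E}((f-c)_+,(f-c)_+)$ because $\forall a,b \in \mathbb{R} : (a-b)^2 \ge (a_+ - b_+)^2$. It follows that when $0 \le c < \max f$ then

$$\mathcal{E}(f,f) \ge \mathcal{E}((f - c)_+,(f - c)_+) \ge \mathrm{Var}((f - c)_+)\inf_{u \in c_0^+(f>c)} \frac{\mathcal{E}(u,u)}{\mathrm{Var}(u)} \ge \mathrm{Var}((f - c)_+)\,\Lambda(\pi(f > c)).$$

The inequalities $\forall a,b \ge 0 : (a - b)_+^2 \ge a^2 - 2ba$ and $(a - b)_+ \le a$ show that

$$\mathrm{Var}((f - c)_+) = E(f - c)_+^2 - (E(f - c)_+)^2 \ge Ef^2 - 2c\,Ef - (Ef)^2.$$

Let $c = \mathrm{Var}(f)/4Ef$ and apply Markov's inequality $\pi(f > c) < (Ef)/c$,

$$\mathcal{E}(f,f) \ge (\mathrm{Var}(f) - 2c\,Ef)\,\Lambda(Ef/c) = \frac{1}{2}\,\mathrm{Var}(f)\,\Lambda\!\left(\frac{4(Ef)^2}{\mathrm{Var}\,f}\right).$$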

A mixing time theorem then follows easily.

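Theorem 2.6. In continuous time

$$\tau_2(\epsilon) \le \int_{4\pi_*}^{1/2} \frac{dr}{r\,\Lambda(r)} + \frac{1}{\lambda}\log\frac{2\sqrt{2}}{\epsilon}.$$

In discrete time

$$\tau_2(\epsilon) \le \left\lceil \int_{4\pi_*}^{1/2} \frac{2\,dr}{r\,\Lambda_{PP^*}(r)} + \frac{2}{\lambda_{PP^*}}\log\frac{2\sqrt{2}}{\epsilon} \right\rceil \le \left\lceil \int_{4\pi_*}^{1/2} \frac{dr}{\alpha\, r\,\Lambda(r)} + \frac{1}{\alpha\lambda}\log\frac{2\sqrt{2}}{\epsilon} \right\rceil$$

where $\alpha \in [0,1]$ is such that $\forall x \in \Omega : P(x,x) \ge \alpha$.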

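Proof. Apply equation (2.1) with the spectral profile bound of Lemma 2.5 when $\mathrm{Var}(h_t) \ge 8$, and the spectral gap bound when $\mathrm{Var}(h_t) < 8$, to obtain

$$\tau_2(\epsilon) \le \int_{\mathrm{Var}(h_0)}^{8} \frac{dI}{-I\,\Lambda(4/I)} + \int_{8}^{\epsilon^2} \frac{dI}{-2\lambda I}.$$

A change of variables to $r = 4/I(t)$ implies that

$$\tau_2(\epsilon) \le \int_{4/\mathrm{Var}(h_0)}^{1/2} \frac{dr}{r\,\Lambda(r)} + \frac{1}{2\lambda}\log\frac{8}{\epsilon^2},$$

and the final simplification $\mathrm{Var}(h_0) \le \frac{1}{\pi_*} - 1 < 1/\pi_*$ leads to the theorem.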

For the discrete time case use the remark after equation (2.2) instead. The second inequality follows from $\Lambda_{PP^*}(r) \ge 2\alpha\Lambda(r)$, an immediate consequence of the relation $\mathcal{E}_{PP^*}(f,f) \ge 2\alpha\,\mathcal{E}(f,f)$ shown in the proof of Corollary 1.14.

Observe that trivially $\Lambda(r) \ge \lambda$, and so Theorem 2.6 leads to the bound $\tau_2(\epsilon) \le \frac{1}{\lambda}\log\frac{1}{2\sqrt{2}\,\pi_*\,\epsilon}$, about a factor of two worse than the conventional spectral bound of equation (1.5). The same factor of two loss occurs in the discrete case. However, both theorems can be significantly better if $\Lambda(r) \gg \lambda$ for small values of $r$. As an example of this, in [26] the authors show that given the log-Sobolev constant and a Nash inequality, then

$$\Lambda(r) \ge \rho\,\frac{\log(1/r)}{1 - r} \qquad\text{and}\qquad \Lambda(r) \ge \frac{1}{C\,r^{1/2D}} - \frac{1}{T}.$$


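By applying the Nash bound on $\Lambda(r)$ for $r \le (T/2DC)^{2D}$ and the log-Sobolev bound when $(T/2DC)^{2D} \le r \le 1/2$, then integration in Theorem 2.6 establishes the bound

$$\tau_2(\epsilon) \le 2T + \frac{1}{\rho}\log\log\left(\frac{2DC}{T}\right)^{2D} + \frac{1}{\lambda}\log\frac{2\sqrt{2}}{\epsilon}. \tag{2.3}$$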

This is only a factor two weaker than that found with our more direct approach earlier. A further

example of spectral profile will be given in our study of the Thorp shuffle later.

2.3 Comparison methods

It sometimes happens that a Markov chain is difficult to study, but a related chain is more manageable. In this situation the comparison method has been widely used to bound spectral gap, log-Sobolev constant and Nash inequalities (see [49, 20, 21, 22]). The argument applies to the bounds in this chapter as well.

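Theorem 2.7. Consider two Markov chains $P$ and $\hat P$ on the same state space $\Omega$, and for every $x,y \in \Omega$ with $\hat P(x,y) > 0$ define a directed path $\gamma_{xy}$ from $x$ to $y$ along edges in $P$. Let $\Gamma$ denote the set of all such paths. Then

$$\mathcal{E}_P(f,f) \ge \frac{1}{A}\,\mathcal{E}_{\hat P}(f,f), \qquad \mathrm{Var}_\pi(f) \le M\,\mathrm{Var}_{\hat\pi}(f), \qquad \mathrm{Ent}_\pi(f^2) \le M\,\mathrm{Ent}_{\hat\pi}(f^2),$$

where $M = \max_x \frac{\pi(x)}{\hat\pi(x)}$ and

$$A = A(\Gamma) = \max_{a,b \in \Omega,\ P(a,b) \ne 0} \frac{1}{\pi(a)P(a,b)} \sum_{x,y \in \Omega,\ (a,b) \in \gamma_{xy}} \hat\pi(x)\hat P(x,y)\,|\gamma_{xy}|.$$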

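Proof. First, consider the Dirichlet forms:

$$\begin{aligned}
\mathcal{E}_{\hat P}(f,f) &= \frac{1}{2}\sum_{x,y} (f(x) - f(y))^2\,\hat\pi(x)\hat P(x,y) = \frac{1}{2}\sum_{x,y}\Big(\sum_{(a,b) \in \gamma_{xy}} (f(a) - f(b))\Big)^2 \hat\pi(x)\hat P(x,y) \\
&\le \frac{1}{2}\sum_{x,y}\ \sum_{(a,b) \in \gamma_{xy}} (f(a) - f(b))^2\,|\gamma_{xy}|\,\hat\pi(x)\hat P(x,y) \\
&= \frac{1}{2}\sum_{a,b} (f(a) - f(b))^2\,\pi(a)P(a,b)\ \frac{1}{\pi(a)P(a,b)}\sum_{x,y:\ (a,b) \in \gamma_{xy}} \hat\pi(x)\hat P(x,y)\,|\gamma_{xy}| \\
&\le \mathcal{E}_P(f,f)\,A.
\end{aligned}$$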

For variance we have

$$\mathrm{Var}_\pi(f) = \inf_{c \in \mathbb{R}} E_\pi (f(x) - c)^2 \le \inf_{c \in \mathbb{R}} M\,E_{\hat\pi}(f(x) - c)^2 = M\,\mathrm{Var}_{\hat\pi}(f).$$

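For entropy, observe that

$$\begin{aligned}
\mathrm{Ent}_\pi(f^2) &= \sum_{x \in \Omega} \pi(x)\left(f^2(x)\log\frac{f^2(x)}{E_\pi f^2} - f^2(x) + E_\pi f^2\right) \\
&= \inf_{c>0} \sum_{x \in \Omega} \pi(x)\left(f^2(x)\log\frac{f^2(x)}{c} - f^2(x) + c\right) \\
&\le \inf_{c>0} M \sum_{x \in \Omega} \hat\pi(x)\left(f^2(x)\log\frac{f^2(x)}{c} - f^2(x) + c\right) = M\,\mathrm{Ent}_{\hat\pi}(f^2).
\end{aligned}$$

The second equality follows from differentiating with respect to $c$ to see that the minimum occurs at $c = E_\pi f^2$, while the inequality required the fact that $a\log\frac{a}{b} - a + b \ge a\left(1 - \frac{b}{a}\right) - a + b = 0$ and so $f^2\log\frac{f^2}{c} - f^2 + c \ge 0$.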
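To make the constant $A$ of Theorem 2.7 concrete, the following sketch computes it for one hypothetical pair of chains and checks the Dirichlet form comparison on random test functions. The choices are assumptions of this example, not taken from the text: $\hat P$ is the chain on $\mathbb{Z}_6$ that jumps to a uniformly random state, $P$ is the walk on the 6-cycle with holding probability $1/2$, both have uniform stationary distribution (so $M = 1$), and each path $\gamma_{xy}$ runs clockwise from $x$ to $y$.

```python
import random

def dirichlet(f, P, pi):
    """Dirichlet form (1/2) sum_{x,y} (f(x) - f(y))^2 pi(x) P(x,y)."""
    n = len(f)
    return 0.5 * sum(pi[x] * P[x][y] * (f[x] - f[y]) ** 2
                     for x in range(n) for y in range(n))

def clockwise_path(x, y, n):
    """Edges of the path x -> x+1 -> ... -> y on the n-cycle (assumed path choice)."""
    edges = []
    while x != y:
        edges.append((x, (x + 1) % n))
        x = (x + 1) % n
    return edges

def comparison_constant(P, pi, P_hat, pi_hat, path):
    """A(Gamma): the largest weighted path load on an edge, divided by pi(a) P(a,b)."""
    n = len(pi)
    load = {}
    for x in range(n):
        for y in range(n):
            if x == y or P_hat[x][y] == 0:
                continue  # trivial pairs contribute nothing to the Dirichlet form
            edges = path(x, y, n)
            for e in edges:
                load[e] = load.get(e, 0.0) + pi_hat[x] * P_hat[x][y] * len(edges)
    return max(w / (pi[a] * P[a][b]) for (a, b), w in load.items())

n = 6
pi = [1.0 / n] * n
pi_hat = [1.0 / n] * n
P_hat = [[1.0 / n] * n for _ in range(n)]            # uniform jump chain
P = [[0.0] * n for _ in range(n)]                     # lazy walk on the cycle
for x in range(n):
    P[x][x] = 0.5
    P[x][(x + 1) % n] = 0.25
    P[x][(x - 1) % n] = 0.25

A = comparison_constant(P, pi, P_hat, pi_hat, clockwise_path)
M = max(p / q for p, q in zip(pi, pi_hat))

rng = random.Random(0)
ok = True
for _ in range(200):
    f = [rng.uniform(-1, 1) for _ in range(n)]
    ok = ok and dirichlet(f, P_hat, pi_hat) <= A * dirichlet(f, P, pi) + 1e-12
print(A, M, ok)
```

For this example every clockwise edge carries the same load, $\sum_{\ell=1}^{5}\ell^2/36 = 55/36$, and dividing by $\pi(a)P(a,b) = 1/24$ gives $A = 110/3$. Shorter paths (for instance, choosing the shorter of the two directions) would give a smaller $A$ and hence a sharper comparison.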

An easy consequence of this is that spectral gap, log-Sobolev and spectral profile bounds can

be compared.

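Corollary 2.8.

$$\lambda_P \ge \frac{1}{MA}\,\lambda_{\hat P}, \qquad \rho_P \ge \frac{1}{MA}\,\rho_{\hat P}, \qquad \Lambda_P(r) \ge \frac{1}{MA}\,\Lambda_{\hat P}(r).$$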

The log-Sobolev and spectral profile mixing time bounds of $P$ are thus at worst a factor $MA$ times larger than those of $\hat P$.

If the distribution $\pi = \hat\pi$ then a Nash inequality for $\hat P$, along with the relation $\mathcal{E}_P(f,f) \ge \frac{1}{A}\,\mathcal{E}_{\hat P}(f,f)$, immediately yields a Nash inequality for $P$. It is not immediately clear how to compare Nash inequality bounds if $\pi \ne \hat\pi$. However, one can compare the spectral profile bounds used to show equation (2.3), and so the mixing time of $P$ is at most $MA$ times the bound equation (2.3) gives for $\hat P$. Alternatively, one can compare $\mathcal{E}_P(f,f)$ to $\mathcal{E}_{\hat P}(f,f)$ and $\mathrm{Var}_\pi(f)$ to $\mathrm{Var}_{\hat\pi}(f)$ in the original proofs of the mixing times.

In the case of a reversible chain Diaconis and Saloff-Coste [20] observe that it is also possible to compare $\lambda_{n-1}$ if the paths are of odd length.

Theorem 2.9. Consider two Markov chains P andP on the same state space Ω, and for every

x ∈ Ω define a directed circuit γ

from x to itself along edges in P. Let Γ

∗

denote the set of all

such paths. Then

(f,f) ≥

∗

(f,f),

where M = max

π(x)

and

∗

= A

∗

(Γ

∗

) = max

a,b∈Ω,

P(a,b)=0

π(a)P(a,b)

∑

x,y∈Ω,

(a,b)∈γ

π(x)P(x,y)|γ

(a,b),

where r

(a,b) is the number of times the edge (a,b) appears in path γ


Proof. The proof is identical to that for comparison of $\mathcal{E}(f,f)$, except that if the path $\gamma_{xy}$ is given by $x = x_0, x_1, x_2, \ldots, x_m = y$ for $m$ odd then $f(x) + f(y)$ is rewritten as

$$f(x) + f(y) = (f(x) + f(x_1)) - (f(x_1) + f(x_2)) + \cdots - (f(x_{m-2}) + f(x_{m-1})) + (f(x_{m-1}) + f(y)).$$

In particular,

1 − λ

max

(P) ≥

∗

(1 − λ

max

(P)).


Chapter 3

Evolving Set Methods

Many mixing time results first estimate set expansion and then relate it to mixing time bounds. An early breakthrough in the study of mixing times was the conductance bound

$$\lambda \ge \Phi^2/2 \qquad\text{where}\qquad \Phi = \min_{A \subset \Omega} \frac{Q(A,A^c)}{\min\{\pi(A),\pi(A^c)\}}$$

(see [30, 34]). Essentially the same proof can be used (see [26]) to show a conductance profile bound, that

$$\Lambda(r) \ge \Phi^2(r)/2 \qquad\text{where}\qquad \Phi(r) = \min_{A \subset \Omega,\ \pi(A) \le r} \frac{Q(A,A^c)}{\min\{\pi(A),\pi(A^c)\}}.$$

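Given $\alpha \in [0,1]$ such that $\forall x \in \Omega : P(x,x) \ge \alpha$ this can be boosted to

$$\Lambda(r) = (1 - \alpha)\,\Lambda_{\frac{P-\alpha I}{1-\alpha}}(r) \ge \frac{1 - \alpha}{2}\,\Phi^2_{\frac{P-\alpha I}{1-\alpha}}(r) = \frac{1 - \alpha}{2}\left(\frac{\Phi(r)}{1 - \alpha}\right)^2 = \frac{\Phi^2(r)}{2(1 - \alpha)} \tag{3.1}$$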

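and so by Corollary 1.14 and Theorem 2.6 a discrete time chain mixes in time

$$\tau_2(\epsilon) \le \left\lceil \frac{2}{\frac{\alpha}{1-\alpha}\,\Phi^2}\,\log\frac{1}{\epsilon\sqrt{\pi_*}} \right\rceil \qquad\text{and}\qquad \tau_2(\epsilon) \le \left\lceil \int_{4\pi_*}^{4/\epsilon^2} \frac{2\,dr}{\frac{\alpha}{1-\alpha}\, r\,\Phi^2(r)} \right\rceil \tag{3.2}$$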

In the common setting of a reversible, lazy (i.e. α ≥ 1/2) chain Corollary 1.14 also implies the

slightly stronger bound

(ϵ) ≤ ⌈

log

ϵ√π

∗

⌉ .

(3.3)

In this section we develop a more direct method of proof. This can give stronger set bounds, bounds for distances other than $L^2$-distance, and also leads to an extension of conductance which applies even with no holding probability. All work will be done in discrete time, but carries over easily to continuous time, as discussed at the end of the chapter. The results and their proofs are based on the work of Morris and Peres [45] and Montenegro [42].
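For small state spaces the conductance $\Phi$ from the start of this chapter can be found by brute force over all subsets, which makes the bound $\lambda \ge \Phi^2/2(1-\alpha)$ easy to test. The sketch below is an illustration under stated assumptions: the chain is a walk on the 8-cycle with holding probability $\alpha = 1/2$ (a hypothetical example), and its spectral gap $(1-\alpha)(1-\cos(2\pi/n))$ is taken from the standard eigenvalues of a circulant matrix rather than computed numerically.

```python
import math
from itertools import combinations

def lazy_cycle(n, alpha):
    """Walk on the n-cycle holding with probability alpha (assumed example chain)."""
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        P[x][x] += alpha
        P[x][(x + 1) % n] += (1 - alpha) / 2
        P[x][(x - 1) % n] += (1 - alpha) / 2
    return P

def conductance(P, pi):
    """min over proper non-empty A of Q(A, A^c) / min(pi(A), pi(A^c)), by enumeration."""
    n = len(pi)
    best = float("inf")
    for size in range(1, n):
        for A in combinations(range(n), size):
            inside = set(A)
            flow = sum(pi[x] * P[x][y] for x in A for y in range(n) if y not in inside)
            mass = sum(pi[x] for x in A)
            best = min(best, flow / min(mass, 1 - mass))
    return best

n, alpha = 8, 0.5
P = lazy_cycle(n, alpha)
pi = [1.0 / n] * n
phi = conductance(P, pi)
gap = (1 - alpha) * (1 - math.cos(2 * math.pi / n))  # known gap of this circulant chain
print(phi, gap, gap >= phi ** 2 / (2 * (1 - alpha)))
```

The minimum is attained by an arc of four consecutive vertices, giving $\Phi = 1/8$. Enumeration costs $2^n$ subsets, so this is only practical for very small chains.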

3.1 Bounding Distances by Evolving Sets

Results in this section are found by working with a dual process. Given a Markov chain on $\Omega$ with transition matrix $P$, a dual process consists of a walk $\hat P$ on some state space $V$ and a link, or transition matrix, $\Lambda$ from $V$ to $\Omega$ such that

$$P\Lambda = \Lambda\hat P.$$

In particular, $P^n\Lambda = \Lambda\hat P^n$ and so the evolution of $P^n$ and $\hat P^n$ will be closely related. This relation is given visually by Figure 3.1.

[Figure 3.1: The dual walk $\hat P$ projects onto the original chain $P$.]

Diaconis and Fill [19] studied the use of dual Markov chains in bounding separation distance. Independently, Morris and Peres [45] proposed the same walk on sets and used it to bound $L^2$ distance. Montenegro [42] sharpened this technique and extended it to other distances.

In order to relate a property of sets (conductance) to a property of the original walk (mixing time) we construct a walk on sets that is a dual to the original Markov chain. A natural candidate to link a walk on sets to a walk on states is the projection $\Lambda(S,y) = \frac{\pi(y)}{\pi(S)}1_S(y)$. Diaconis and Fill [19] have shown that for certain classes of Markov chains the walk $\hat K$ below is the unique dual process with link $\Lambda$, so this is the walk on sets that should be considered. We use notation of Morris and Peres [45].

Definition 3.1. Given set $A \subset \Omega$ a step of the evolving set process is given by choosing $u \in [0,1]$ uniformly at random, and transitioning to the set

$$A_u = \{y \in \Omega : Q(A,y) \ge u\,\pi(y)\} = \{y \in \Omega : P^*(y,A) \ge u\}.$$

The walk is denoted by $S_0, S_1, S_2, \ldots, S_n$, with transition kernel $K^n(A,S) = \mathrm{Prob}(S_n = S \mid S_0 = A)$.

The Doob transform of this process is the Markov chain on sets given by $\hat K(S,S') = \frac{\pi(S')}{\pi(S)}K(S,S')$, with $n$-step transition probabilities $\hat K^n(S,S') = \frac{\pi(S')}{\pi(S)}K^n(S,S')$.

Heuristically, a step of the evolving set process consists of choosing a uniform value of $u$, and then $A_u$ is the set of vertices $y$ that get at least a $u$-fraction of their size $\pi(y)$ from the set $A$.

The Doob transform produces another Markov chain because of a martingale property.

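Lemma 3.2. If $A \subset \Omega$ then

$$\sum_{A' \subset \Omega} \pi(A')K(A,A') = \int_0^1 \pi(A_u)\,du = \pi(A).$$

Proof.

$$\int_0^1 \pi(A_u)\,du = \sum_{y \in \Omega} \pi(y)\,\mathrm{Prob}(y \in A_u) = \sum_{y \in \Omega} \pi(y)\,\frac{Q(A,y)}{\pi(y)} = \pi(A).$$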
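A single step of the evolving set process, and the martingale property of Lemma 3.2, can be sketched in a few lines. This is an illustration only: the chain (a walk on the 8-cycle with holding probability $1/2$), the starting set $A = \{0,1,2\}$, and all function names are assumptions of this example. The integral over $u$ is evaluated both in closed form and on a midpoint grid.

```python
def lazy_cycle(n, alpha):
    """Walk on the n-cycle holding with probability alpha (assumed example chain)."""
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        P[x][x] += alpha
        P[x][(x + 1) % n] += (1 - alpha) / 2
        P[x][(x - 1) % n] += (1 - alpha) / 2
    return P

def flow_into(A, y, P, pi):
    """Q(A, y) = sum over x in A of pi(x) P(x, y)."""
    return sum(pi[x] * P[x][y] for x in A)

def evolving_set_step(A, u, P, pi):
    """The set A_u = {y : Q(A, y) >= u * pi(y)} for a given threshold u."""
    return frozenset(y for y in range(len(pi))
                     if flow_into(A, y, P, pi) >= u * pi[y])

def expected_next_mass(A, P, pi):
    """Integral of pi(A_u) over u in [0,1]: y is in A_u with probability Q(A,y)/pi(y)."""
    return sum(pi[y] * min(1.0, flow_into(A, y, P, pi) / pi[y])
               for y in range(len(pi)))

n = 8
P = lazy_cycle(n, 0.5)
pi = [1.0 / n] * n
A = frozenset({0, 1, 2})

# Midpoint-rule approximation of the same integral, as an independent check.
N = 4000
grid_average = sum(sum(pi[y] for y in evolving_set_step(A, (i + 0.5) / N, P, pi))
                   for i in range(N)) / N

print(sorted(evolving_set_step(A, 0.2, P, pi)))
print(sorted(evolving_set_step(A, 0.9, P, pi)))
print(expected_next_mass(A, P, pi), grid_average, sum(pi[x] for x in A))
```

Small values of $u$ grow the set (here to $\{7,0,1,2,3\}$) and large values shrink it (here to $\{1\}$), while the average size stays at $\pi(A) = 3/8$.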

The walk $\hat K$ is a dual process of $P$.

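Lemma 3.3. If $S \subset \Omega$, $y \in \Omega$ and $\Lambda(S,y) = \frac{\pi(y)}{\pi(S)}1_S(y)$ is the projection linkage, then

$$P\Lambda(S,y) = \Lambda\hat K(S,y).$$

Proof.

$$P\Lambda(S,y) = \sum_{z \in S} \frac{\pi(z)}{\pi(S)}\,P(z,y) = \frac{Q(S,y)}{\pi(S)}$$

$$\Lambda\hat K(S,y) = \sum_{S' \ni y} \hat K(S,S')\,\frac{\pi(y)}{\pi(S')} = \frac{\pi(y)}{\pi(S)} \sum_{S' \ni y} K(S,S') = \frac{Q(S,y)}{\pi(S)}$$

The final equality is because $\sum_{S' \ni y} K(S,S') = \mathrm{Prob}(y \in S') = Q(S,y)/\pi(y)$.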

With duality it becomes easy to write the $n$ step transitions in terms of the walk $\hat K$.

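Lemma 3.4. Let $\hat E_n$ denote expectation under $\hat K^n$. If $x \in \Omega$ and $S_0 = \{x\}$ then

$$P^n(x,y) = \hat E_n\,\pi_{S_n}(y),$$

where $\pi_S(y) = \frac{1_S(y)\,\pi(y)}{\pi(S)}$ denotes the probability distribution induced on set $S$ by $\pi$.

Proof.

$$P^n(x,y) = (P^n\Lambda)(\{x\},y) = (\Lambda\hat K^n)(\{x\},y) = \hat E_n\,\pi_{S_n}(y)$$

The final equality is because $\Lambda(S,y) = \pi_S(y)$.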

Recall from equation (1.1) that if a distance dist(µ,π) is convex in µ then the worst initial

distribution is a point mass. Given the preceding lemma it is easy to show a distance bound for all

such convex distances.

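Theorem 3.5. Consider a finite Markov chain with stationary distribution $\pi$. Any distance $\mathrm{dist}(\mu,\pi)$ which is convex in $\mu$ satisfies

$$\mathrm{dist}(P^n(x,\cdot),\pi) \le \hat E_n\,\mathrm{dist}(\pi_{S_n},\pi)$$

whenever $x \in \Omega$ and $S_0 = \{x\}$.

Proof. By Lemma 3.4 and convexity,

$$\mathrm{dist}(P^n(x,\cdot),\pi) = \mathrm{dist}(\hat E_n\,\pi_{S_n},\pi) \le \hat E_n\,\mathrm{dist}(\pi_{S_n},\pi).$$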

In particular, if $\mathrm{dist}(\mu,\pi) = L_\pi\!\left(\frac{\mu}{\pi}\right)$ for a convex functional $L_\pi : (\mathbb{R}_+)^\Omega \to \mathbb{R}$ then the distance is convex and the conditions of the theorem are satisfied. The total variation distance satisfies this condition with $L_\pi(f) = \frac{1}{2}\|f - 1\|_{1,\pi}$, relative entropy with $L_\pi(f) = E_\pi f\log f$, and $L^2$ distance with $L_\pi(f) = \|f - 1\|_{2,\pi}$, and so the following bounds are immediate:

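Theorem 3.6. If $x \in \Omega$ and $S_0 = \{x\}$ then in discrete time

$$\|P^n(x,\cdot) - \pi\|_{TV} \le \hat E_n\,(1 - \pi(S_n)),$$

$$D(P^n(x,\cdot)\,\|\,\pi) \le \hat E_n \log\frac{1}{\pi(S_n)},$$

$$\|P^n(x,\cdot) - \pi\|_{L^2(\pi)} \le \hat E_n \sqrt{\frac{1 - \pi(S_n)}{\pi(S_n)}}.$$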

Related arguments apply to other distances, such as Hellinger or Wasserstein distances. See

Montenegro [42] for details.

3.2 Mixing Times

Mixing time bounds can be shown through a procedure similar to that of spectral profile bounds. Throughout this section assume that the distance to be studied is of the form

$$\mathrm{dist}(P^n(x,\cdot),\pi) \le \hat E_n f(\pi(S_n))$$

for a decreasing function $f : [0,1] \to \mathbb{R}_+$. For instance, the distance bounds in Theorem 3.6 are all of this form. Let $\tau(\epsilon) = \min\{n : \hat E_n f(\pi(S_n)) \le \epsilon\}$, so that the mixing time in terms of the distance is upper bounded by $\tau(\epsilon)$.

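The analog of spectral profile $\Lambda_{PP^*}(r)$ will be the f-congestion:

Definition 3.7. Given a function $f : [0,1] \to \mathbb{R}_+$ the f-congestion profile is

$$C_f(r) = \max_{A \subset \Omega,\ \pi(A) \le r} C_f(A) \qquad\text{where}\qquad C_f(A) = \frac{\int_0^1 f(\pi(A_u))\,du}{f(\pi(A))}.$$

The f-congestion is $C_f = \max_{A \subset \Omega} C_f(A)$.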

Note that if $f(z) = f(1 - z)$ then $u$-almost everywhere $(A_u)^c = (A^c)_{1-u}$ and a simple calculation shows that $C_f(A) = C_f(A^c)$. Therefore, when $r \ge 1/2$ and $f(z) = f(1 - z)$ then let $C_f(r) = C_f(1/2)$.
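Because $A_u$ only changes at the finitely many thresholds $Q(A,y)/\pi(y)$, the integral defining $C_f(A)$ is a finite sum and can be computed exactly. The sketch below illustrates this for the symmetric choice $f(z) = \sqrt{z(1-z)}$ and checks the identity $C_f(A) = C_f(A^c)$ noted above. The chain (a walk on the 8-cycle with holding probability $1/2$), the set $A = \{0,1,2\}$ and the function names are assumptions of this example.

```python
import math

def lazy_cycle(n, alpha):
    """Walk on the n-cycle holding with probability alpha (assumed example chain)."""
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        P[x][x] += alpha
        P[x][(x + 1) % n] += (1 - alpha) / 2
        P[x][(x - 1) % n] += (1 - alpha) / 2
    return P

def f_sym(z):
    """f(z) = sqrt(z (1 - z)), clamped to [0,1] to absorb rounding error."""
    z = min(1.0, max(0.0, z))
    return math.sqrt(z * (1 - z))

def congestion_of_set(A, P, pi, f):
    """C_f(A) = (integral over u in [0,1] of f(pi(A_u))) / f(pi(A)), computed exactly."""
    n = len(pi)
    ratio = [sum(pi[x] * P[x][y] for x in A) / pi[y] for y in range(n)]
    thresholds = sorted({r for r in ratio if r > 0})
    integral, previous = 0.0, 0.0
    for t in thresholds:
        # For u in (previous, t] the set A_u is {y : ratio[y] >= t}.
        mass = sum(pi[y] for y in range(n) if ratio[y] >= t)
        integral += (t - previous) * f(mass)
        previous = t
    integral += (1 - previous) * f(0.0)  # A_u is empty above the largest threshold
    return integral / f(sum(pi[x] for x in A))

n = 8
P = lazy_cycle(n, 0.5)
pi = [1.0 / n] * n
A = frozenset({0, 1, 2})
A_complement = frozenset(range(n)) - A

c_A = congestion_of_set(A, P, pi, f_sym)
c_Ac = congestion_of_set(A_complement, P, pi, f_sym)
print(c_A, c_Ac)
```

For this set the value works out to $\frac{3}{4} + \frac{1}{4}\sqrt{7/15} \approx 0.92$, strictly below 1, which is what drives the contraction in the lemma that follows.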

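The analog of Lemma 1.13 will be the following:

Lemma 3.8.

$$\hat E_{n+1} f(\pi(S_{n+1})) - \hat E_n f(\pi(S_n)) = -\hat E_n\Big[f(\pi(S_n))\big(1 - C_{zf(z)}(S_n)\big)\Big] \le -(1 - C_{zf(z)})\,\hat E_n f(\pi(S_n))$$

Proof. The inequality is because $1 - C_{zf(z)} \le 1 - C_{zf(z)}(S)$ for all $S \subset \Omega$. For the equality,

$$\hat E_{n+1} f(\pi(S_{n+1})) = \hat E_n \sum_{S} \hat K(S_n,S)\,f(\pi(S)) = \hat E_n\, f(\pi(S_n)) \sum_{S} \frac{K(S_n,S)\,\pi(S)\,f(\pi(S))}{\pi(S_n)\,f(\pi(S_n))} = \hat E_n\, f(\pi(S_n))\,C_{zf(z)}(S_n).$$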

The analog of Corollary 1.14 is the following:

Corollary 3.9. In discrete time

$$\mathrm{dist}(P^n(x,\cdot),\pi) \le C_{zf(z)}^{\,n}\, f(\pi(x)) \qquad\text{and}\qquad \tau(\epsilon) \le \left\lceil \frac{1}{1 - C_{zf(z)}}\,\log\frac{f(\pi_*)}{\epsilon} \right\rceil$$

Proof. By Lemma 3.8, $\hat E_{n+1} f(\pi(S_{n+1})) \le C_{zf(z)}\,\hat E_n f(\pi(S_n))$, and by induction $\hat E_n f(\pi(S_n)) \le C_{zf(z)}^{\,n}\, f(\pi(S_0))$. Solving for when this drops to $\epsilon$ and using the approximation $\log C_{zf(z)} \le -(1 - C_{zf(z)})$, gives the corollary.

This can be generalized to take into consideration set sizes. Theorem 2.6 will have two analogs,

a stronger bound under a fairly weak convexity condition, with about a factor of two lost in the

general case.

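Theorem 3.10. In discrete time, if $f$ is differentiable then

$$\tau(\epsilon) \le \left\lceil \int_{\pi_*}^{f^{-1}(\epsilon)} \frac{-f'(x)\,dx}{f(x)\,(1 - C_{zf(z)}(x))} \right\rceil$$

if $x\,\big(1 - C_{zf(z)}(f^{-1}(x))\big)$ is convex, while in general

$$\tau(\epsilon) \le \left\lceil \int_{f^{-1}(f(\pi_*)/2)}^{f^{-1}(\epsilon/2)} \frac{-2f'(x)\,dx}{f(x)\,(1 - C_{zf(z)}(x))} \right\rceil$$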

Proof. First consider the convex case.

By Lemma 3.8 and Jensen's inequality for the convex function $x\,\big(1 - C_{zf(z)}(f^{-1}(x))\big)$,

$$\begin{aligned}
\hat E_{n+1} f(\pi(S_{n+1})) - \hat E_n f(\pi(S_n)) &= -\hat E_n\Big[f(\pi(S_n))\big(1 - C_{zf(z)}(S_n)\big)\Big] \\
&\le -\hat E_n\Big[f(\pi(S_n))\big(1 - C_{zf(z)}(f^{-1} \circ f(\pi(S_n)))\big)\Big] \\
&\le -\big[\hat E_n f(\pi(S_n))\big]\Big[1 - C_{zf(z)}\big(f^{-1}(\hat E_n f(\pi(S_n)))\big)\Big].
\end{aligned} \tag{3.4}$$

Since $I(n) = \hat E_n f(\pi(S_n))$ and $1 - C_{zf(z)}(f^{-1}(x))$ are non-increasing, the piecewise linear extension of $I(n)$ to $t \in \mathbb{R}_+$ satisfies

$$I'(t) \le -I(t)\Big[1 - C_{zf(z)}\big(f^{-1}(I(t))\big)\Big]$$

At integer $t$ the derivative can be taken from either right or left. This can be solved as in the proof of Theorem 2.6.

For the general case, use Lemma 3.11 instead of convexity at (3.4).

Lemma 3.11. If $Z \ge 0$ is a nonnegative random variable and $g$ is a nonnegative increasing function, then

$$E\,(Z\,g(Z)) \ge \frac{EZ}{2}\,g(EZ/2).$$

Proof. (from [45]) Let $A$ be the event $\{Z \ge EZ/2\}$. Then $E(Z 1_{A^c}) \le EZ/2$, so $E(Z 1_A) \ge EZ/2$. Therefore,

$$E\,(Z\,g(2Z)) \ge E\,(Z 1_A\, g(EZ)) \ge \frac{EZ}{2}\,g(EZ).$$

Let $U = 2Z$ to get the result.
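Lemma 3.11 is easy to test on finite random variables. The following sketch is illustrative only: it samples random discrete distributions on seven atoms and compares $E(Z\,g(Z))$ with $\frac{EZ}{2}g(EZ/2)$ for three increasing functions $g$ chosen for this example, one of which ($g(z) = 1 - e^{-z}$) is concave, so Jensen's inequality alone would not suffice.

```python
import math
import random

def lemma_3_11_sides(values, probs, g):
    """Return (E[Z g(Z)], (EZ/2) g(EZ/2)) for a finite nonnegative random variable Z."""
    ez = sum(p * z for p, z in zip(probs, values))
    lhs = sum(p * z * g(z) for p, z in zip(probs, values))
    rhs = 0.5 * ez * g(ez / 2)
    return lhs, rhs

def random_variable(rng, n):
    """Random atoms in [0, 10) with random probabilities summing to one."""
    w = [rng.random() + 0.01 for _ in range(n)]
    total = sum(w)
    return [rng.random() * 10 for _ in range(n)], [v / total for v in w]

# Nonnegative increasing test functions (assumed for illustration).
test_functions = [
    lambda z: z * z,
    lambda z: 1 - math.exp(-z),
    lambda z: math.sqrt(z),
]

rng = random.Random(0)
worst_margin = float("inf")
for _ in range(500):
    values, probs = random_variable(rng, 7)
    for g in test_functions:
        lhs, rhs = lemma_3_11_sides(values, probs, g)
        worst_margin = min(worst_margin, lhs - rhs)
print(worst_margin >= -1e-12)
```

The factor $1/2$ inside and outside $g$ is the price paid for dropping the convexity assumption, which is the source of the factor of two lost in the general case of Theorem 3.10.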

It is fairly easy to translate these to mixing time bounds. For instance, by Theorem 3.6 it is appropriate to let $f(z) = \sqrt{\frac{1-z}{z}}$ for $L^2$ bounds. Then the bounds from Corollary 3.9 and Theorem 3.10 imply:

(ϵ) ≤

⌈

1 − C

√

z(1−z)

log

ϵ√π

∗

⌉

⌈

∫

1+ϵ

∗

2x(1 − x)(1 − C

√

z(1−z)

(x))⌉

⌈

∫

1+ϵ

4π∗

1+3π∗

x(1 − x)(1 − C

√

z(1−z)

(x))⌉

≤

⌈

1 − C

√

z(1−z)

log

ϵ√π

∗

⌉

⌈

∫

1/ϵ

∗

2u(1 − C

√

z(1−z)

(u))⌉

⌈

∫

4/ϵ