This is the text version of the file http://mathnet.kaist.ac.kr/papers/georgiain/Prasad/survey.ps.
G o o g l e automatically generates text versions of documents as we crawl the web.


Google is neither affiliated with the authors of this page nor responsible for its content.

Page 1
Mathematical Aspects of Mixing Times in Markov Chains
R. Montenegro and P. Tetali
February 3, 2006

Page 2
Abstract
In the past few years we have seen a surge in the theory of finite Markov chains, by way of
new techniques to bounding the convergence to stationarity. This includes functional techniques
such as logarithmic Sobolev and Nash inequalities, refined spectral and entropy techniques, and
the evolving set methodology. We attempt to give a more or less self-contained treatment of some
of these modern techniques. There have been other important contributions to this theory such
as variants on coupling techniques and decomposition methods, which are not included here; our
choice was to keep the analytical methods as the theme of this presentation. We illustrate the
strength of the main techniques by way of simple examples as well as with a brief and improved
analysis of the Thorp shuffle.

Page 3
Contents
Introduction
3
1 Basic Bounds on Mixing Times
5
1.1 Preliminaries: Distances and mixing times . . . . . . . . . . . . . . . . . . . . . . . .
5
1.2 Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
1.3 Discrete Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Does Reversibility Matter? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Advanced Functional Techniques
15
2.1 Log-Sobolev and Nash Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Spectral profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Comparison methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Evolving Set Methods
23
3.1 Bounding Distances by Evolving Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Mixing Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Conductance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Modified Conductance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Lower Bounds on Mixing Times
35
4.1 A geometric lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 A spectral lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 A log-Sobolev lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Examples
42
5.1 Sharpness of bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Discrete Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Comparing Mixing Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 New Cheeger Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5 The Thorp Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.1 Modeling the Thorp Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.2 Spectral Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5.3 Evolving Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5.4 Bounding the l
2
norm of functions . . . . . . . . . . . . . . . . . . . . . . . . 52
1

Page 4
6 Miscellaneous
54
6.1 The Fastest Mixing Markov Process Problem . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Alternative description of the spectral gap, and the entropy constant . . . . . . . . . 57
6.3 Perturbation Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bibliography
60
Appendix
64
2

Page 5
Introduction
While the basic theory of finite Markov chains has been well studied and understood, the advent of
theory of computing has renewed interest in this classical subject. There have already been some
excellent surveys focusing on combinatorial, computational and statistical physics applications of
finite Markov chains. We mention here a few sources, by way of survey articles, for the interested
reader. For a good overview of the basic techniques in estimating the mixing times of finite Markov
chains, see [31], [33], and [29]. The recent manuscript of Dyer et al. [24] describes several comparison
theorems for reversible as well as nonreversible Markov chains. However, much of the theory
surveyed in this article is rather recent theoretical (analytical) development and is so far unavailable
in a unified presentation. The significance of these new methods is as follows.
It is classical and elementary to show that the inverse spectral gap of a reversible Markov chain
captures the mixing time (in L
1
and L
2
) up to a factor of log(1/π
), where π
= min
x
π(x) denotes
the smallest entry in the stationary probability (vector) π of the chain. While the logarithmic
Sobolev constant captures the L
2
-mixing time up to a factor of log log(1/π
), it is typically much
harder to bound – to mention a specific example, the exact constant is open for the 3-point space
with arbitrary invariant measure; also in a few cases, the log-Sobolev constant is known not to
give tight bounds on the L
1
-mixing time. The main strength of the spectral profile techniques and
the evolving set methodology considered in this survey seems to be that of avoiding extra penalty
factors such as log log(1/π
). In the present write-up, the above is illustrated with a couple of
simple examples, and with the now-famous Thorp shuffle, for which an improved O(d
29
) mixing
time is described, building on the proof of Morris that proved the first polynomial (in d) bound of
O(d
44
). The approach to L
2
-mixing time using the spectral profile has the additional advantage of
yielding known (upper) estimates on mixing time, under a log-Sobolev inequality and/or a Nash-
type inequality. Thus various functional analytic approaches to mixing times can be unified with the
approach of bounding the spectral profile. The one exception to this is the approach to stationarity
using relative entropy, and techniques to bounding the so-called entropy constant have been rather
limited.
A brief history of the above development can perhaps be summarized as follows. A fundamental
contribution, by way of initiating several subsequent works, was made by Lovasz and Kannan in
[36] in which they introduced the notion of average conductance to bound the total variation mixing
time. This result was further strengthened and developed by Morris and Peres using the so-called
evolving sets, where they analyze a given chain by relating it to an auxiliary chain (a dual of sorts)
on subsets of the states of the original chain. While this was introduced in [45] in a (martingale-
based) probabilistic language, it turns out to be, retrospectively, an independent and alternative
view of the notion of a Doob transform introduced and investigated by Diaconis and Fill [19].
Further refinement and generalization of the evolving sets approach was done in detail by [42]. The
functional analog of some of this is done via the spectral profile, developed for the present context
3

Page 6
of finite Markov chains, in [26], while having its origins in the developments by [4] and [15] in the
context of manifolds.
Besides summarizing much of the above recent developments in this exciting topic, we address
some classical aspects as well. In discrete-time, much of the literature uses laziness assumptions
to avoid annoying technical difficulties. While laziness is a convenient assumption, it slows down
the chain by a factor of 2, which may not be desirable in practice. We take a closer look at this
issue and report bounds which reflect the precise dependence on laziness. The notion of modified
conductance circumvents laziness altogether, and we discuss this aspect briefly and compare it
to bounds derived from the functional approach. Further details on the modified conductance
and its usefulness can be found in [43]. Another issue is that of the role of reversibility (a.k.a.
detailed balance conditions). We tried to pay particular attention to it, due to current trend
in the direction of analyzing various nonreversible Markov chains. Although often a convenient
assumption, we avoid as much as possible this additional assumption. In particular, we include
a proof of the lower bound on the total variation mixing time in terms of the second eigenvalue
in the general case. Besides providing upper and lower bounds for the mixing time of reversible
and non-reversible chains, we report recent successes (with brief analysis) in the analysis of some
non-reversible chains; see for example, the discrete logarithm example and the Thorp shuffle.
In Chapter 1 we introduce notions of mixing times and prove the basic upper bounds on these
notions using Poincare and logarithmic Sobolev type functional constants. In Chapter 2 we move
on to recent results using the spectral profile, as opposed to using simply the second eigenvalue.
In Chapter 3 we review the evolving set methods. Our treatment of lower bounds on mixing
times is provided in Chapter 4. We consider several examples for illustration in Chapter 5. In
the penultimate chapter, we gather a few recent results together. This includes recent results on
the so-called fastest mixing Markov chain problem, and a recent theorem [40] from perturbation
theory of finite Markov chains; this theorem relates the stability of a stochastic matrix (subject
to perturbations) to the rate of convergence to equilibrium of the matrix. We also recall here an
old but not so widely known characterization of the spectral gap, since there have been recent
results utilizing this formulation. The Appendix contains a discussion on the relations between the
distances considered in this paper, and others such as relative pointwise (L
) distance.
Acknowledgments. We thank Pietro Caputo and Eric Carlen for several helpful discussions while
preparing this write-up.
4

Page 7
Chapter 1
Basic Bounds on Mixing Times
1.1 Preliminaries: Distances and mixing times
Let (Ω,P,π) denote a transition probability matrix (or Markov kernel) of a finite Markov chain on
a finite state space Ω with a unique invariant measure π. That is
P(x,y) ≥ 0, for all x,y ∈ Ω, and ∑
y∈Ω
P(x,y) = 1, for all x ∈ Ω.
x∈Ω
π(x)P(x,y) = π(y), for all y ∈ Ω.
We assume throughout this paper that P is irreducible and that π has full support (Ω). For standard
definitions and introduction to finite Markov chains, we refer the reader to [47] or [1].
It is a classical fact that if P is aperiodic then the measures P
n
(x,·) approach π as n → ∞.
Alternatively, let k
x
n
(y) = P
n
(x,y)/π(y) denote the density with respect to π at time n ≥ 0, or
simply k
n
(y) when the start state or the start distribution is unimportant or clear from the context.
Then the density k
x
n
(y) converges to 1 as n → ∞. A proper quantitative statement may be stated
using any one of several norms. In terms of L
p
-distance
k
n
− 1
p
p
= ∑
y∈Ω
|k
n
(y) − 1|
p
π(y) 1 ≤ p < +∞,
it becomes twice the total variation norm P
n
(x,·) − π
TV
when p = 1. (Note that some authors
prefer to define the total variation norm as simply the L
1
norm.) Another important measure of
closeness (but not a norm) is the informational divergence,
D(P
n
(x,·)|π) = Ent
π
(k
x
n
) = ∫ k
x
n
log k
x
n
dπ = ∑
y
P
n
(x,y)log
P
n
(x,y)
π(y)
,
where the entropy Ent
π
(f) = E
π
f log
f
E
π
f
.
Each of these distances are convex, in the sense that if µ and ν are two distributions, and
s ∈ [0,1] then dist((1 − s)µ + sν,π) ≤ (1 − s)dist(µ,π) + sdist(ν,π). For instance, D(µ|π) =
Ent
π
(µ/π) = E
π
µ
π
log
µ
π
is convex in µ because f log f is convex. A convex distance dist(µ,π)
5

Page 8
satisfies the condition
dist(σP
n
,π) = dist (∑
x∈Ω
σ(x)P
n
(x,·),π)
≤ ∑
x∈Ω
σ(x)dist(P
n
(x,·),π)
≤ max
x∈Ω
dist(P
n
(x,·),π),
(1.1)
and so distance is maximized when the initial distribution is concentrated at a point. To study the
rate of convergence it then suffices to study the rate when the initial distribution is a point mass
δ
x
(where δ
x
is 1 at point x ∈ Ω and 0 elsewhere; likewise, let 1
A
be one only on set A ⊂ Ω).
Definition 1.1. The total variation, relative entropy and L
2
mixing times are defined as follows.
τ(ϵ)
= min{n : P
n
(x,·) − π
TV
=
1
2
k
x
n
− 1
1
≤ ϵ}
τ
D
(ϵ) = min{n : Ent(k
x
n
) = D(P
n
(x,·) | π) ≤ ϵ}
τ
2
(ϵ) = min{n : √Var(k
x
n
) = k
x
n
− 1
2
≤ ϵ}
Some authors consider the chi-square (χ
2
) distance, which is just Var(k
x
n
), in which case the
corresponding chi-square mixing time is τ
χ
2
(ϵ) = τ
2
(√ϵ). In the Appendix it is seen that τ
2
(ϵ)
usually gives a good bound on L
convergence, and so for most purposes nothing stronger than
L
2
mixing need be considered.
An important concept in studying Markov chains is the notion of reversibility. The time-reversal
P
is defined by the identity π(x)P
(x,y) = π(y)P(y,x), x,y ∈ Ω and is the adjoint of P in the
standard inner product for L
2
(π), that is 〈f,Pg〉
π
= 〈P
f,g〉
π
where
〈f,g〉
π
= ∑
x∈Ω
π(x)f(x)g(x)
and a matrix M acts on a function as
M f(x) = ∑
y∈Ω
M(x,y)f(y).
If P
= P then P is said to be time-reversible, or to satisfy the detailed balance condition. Given
a Markov chain P two natural reversible chains that can be constructed are the additive reversibi-
lization
P+P
2
, and multiplicative reversibilization PP
.
The Dirichlet form is an important bilinear form which can be used in a characterization of
eigenvalues of a reversible chain (see Lemma 1.19), and is also used to define the spectral gap and
the logarithmic Sobolev type inequalities:
Definition 1.2. For f,g : Ω → R, let E(f,g) = E
P
(f,g) denote the Dirichlet form,
E(f,g) = 〈f,(I − P)g〉
π
= ∑
x,y
f(x)(g(x) − g(y))P(x,y)π(x).
6

Page 9
If f = g then
E(f,f) =
1
2 ∑
x,y∈Ω
(f(x) − f(y))
2
P(x,y)π(x),
and
E
P
(f,f) = E
P
(f,f) = E
P+P
2
(f,f),
while if P is reversible then also E(f,g) = E(g,f).
1.2 Continuous Time
Many mixing time results arise naturally in the continuous time setting, so we consider this case
first.
Let L denote the (discrete) Laplacian operator given by L = −(I−P). Then for t ≥ 0, H
t
= e
tL
represents the continuized chain [1] (or the heat kernel) corresponding to the discrete Markov kernel
P. The continuized chain simply represents a Markov process {X
t
}
t≥0
in Ω with initial distribution,
µ
0
(say), and transition matrices
H
t
= e
−t(I−P)
=
n=0
t
n
L
n
n!
= e
−t
n=0
t
n
P
n
n!
, t ≥ 0,
with the generator L = −(I−P). Thus H
t
(x,y) denotes the probability that the rate one continuous
Markov chain having started at x is at y at time t. Let h
x
t
(y) = H
t
(x,y)/π(y), for each y ∈ Ω,
denote its density with respect to π at time t ≥ 0, and h
t
(y) when the start state or the start
distribution is unimportant or clear from the context. Also, let
H
t
= e
tL
=
n=0
t
n
(L
)
n
n!
be the semigroup associated to the dual L
= −(I − P
). The following is elementary and a useful
technical fact.
Lemma 1.3. For any h
0
and all t ≥ 0, h
t
= H
∗t
h
0
. Consequently, for any x ∈ Ω,
dh
t
(x)
dt
= L
h
t
(x).
Using Lemma 1.3, the following lemma is easy to establish.
Lemma 1.4.
d
dt
Var(h
t
) = −2E(h
t
,h
t
)
(1.2)
d
dt
Ent(h
t
) = −E(h
t
,log h
t
)
(1.3)
Proof. Indeed,
d
dt
Var(h
t
) = ∫
d
dt
h
2
t
dπ = 2∫ h
t
L
h
t
dπ = 2∫ L(h
t
)h
t
dπ = −2E(h
t
,h
t
).
7

Page 10
d
dt
Ent(h
t
) = ∫
d
dt
h
t
log h
t
dπ = ∫ (log h
t
+ 1)L
h
t
= ∫ L(log h
t
)h
t
dπ = −E(h
t
,log h
t
).
The above motivates the following definitions of the spectral gap λ and the entropy constant
ρ
0
.
Definition 1.5. Let λ > 0 and ρ
0
> 0 be the optimal constants in the inequalities:
λVar
π
f ≤ E(f,f), for all f : Ω → R.
ρ
0
Ent
π
f ≤ E(f,log f), for all f : Ω → R
+
.
(1.4)
Lemma 1.19 (Courant-Fischer theorem) shows that for a reversible Markov chain, the second
largest eigenvalue λ
1
(of P) satisfies the simple relation 1 − λ
1
= λ. However, reversibility is not
needed for the following result.
Corollary 1.6. Let π
= min
x∈Ω
π(x). Then,
τ
2
(ϵ) =
min{t : √Var(h
x
t
) = h
x
t
− 1
2
≤ ϵ}
1
λ (
1
2
log
1 − π
π
+ log
1
ϵ)
(1.5)
τ
D
(ϵ) = min{t : Ent(h
x
t
) = D(H
t
(x,·) | π) ≤ ϵ} ≤
1
ρ
0
(log log 1π
+ log
1
ϵ)
.
(1.6)
Proof. Simply solve the differential equations,
d
dt
Var(h
x
t
) = −2E(h
x
t
,h
x
t
) ≤ −2λVar(h
x
t
)
(1.7)
and
d
dt
Ent(h
x
t
) = −E(h
x
t
,log h
x
t
) ≤ −ρ
0
Ent(h
x
t
),
(1.8)
and note that Var(h
0
) ≤
1−π
π
and Ent(h
0
) ≤ log
1
π
(e.g. by equation (1.1)).
It is worth noting here that the above functional constants λ and ρ
0
indeed capture the rate of
decay of variance and relative entropy, respectively, of H
t
for t > 0:
Proposition 1.7. If c > 0 then
(a) Var
π
(H
t
f) ≤ e
−ct
Var
π
f, for all f and t > 0, if and only if λ ≥ c.
(b) Ent
π
(H
t
f) ≤ e
−ct
Ent
π
f, for all f > 0 and t > 0, if and only if ρ
0
≥ c.
Proof. The “if” part of the proofs follows from (1.7) and (1.8). The only if is also rather elementary
and we bother only with that of part (b): Starting with the hypothesis, we may say, for every f > 0,
and for t > 0,
1
t (
Ent
π
(H
t
f) − Ent
π
f) ≤
1
t (
e
−ct
− 1)Ent
π
f.
Letting t ↓ 0, we get −E(f,log f) ≤ −cEnt
π
f.
8

Page 11
While there have been several techniques (linear-algebraic and functional-analytic) to help
bound the spectral gap, the analogous problem of getting good estimates on ρ
0
seems challenging.
The following inequality relating the two Dirichlet forms introduced above also motivates the study
of the classical logarithmic Sobolev inequality.
Lemma 1.8. If f ≥ 0 then
2E(√f,√f) ≤ E(f,log f)
Proof. Observe that
a(log a − log b) = 2alog
√a
b ≥
2a(1 −
b
√a) = 2
√a(√a
b)
by the relation log c ≥ 1 − c
−1
. Then
E(f,log f) = ∑
x,y
f(x)(log f(x) − log f(y))P(x,y)π(x)
≥ 2∑
x,y
f
1/2
(x)(f
1/2
(x) − f
1/2
(y))P(x,y)π(x)
= 2E(√f,√f)
Let ρ(P) > 0 denote the logarithmic Sobolev constant of P defined as follows.
Definition 1.9.
ρ = ρ(P) = inf
Entf
2
=0
E(f,f)
Entf
2
.
Proposition 1.10. For every irreducible chain P,
2ρ ≤ ρ
0
≤ 2λ.
Proof. The first inequality is immediate, using Lemma 1.8. The second follows from applying (1.4)
to functions f = 1 +ϵg, for g ∈ L
2
(π) with E
π
g = 0. Assume ϵ ≪ 1, so that f ≥ 0. Then using the
Taylor approximation, log(1 + ϵg) = ϵg − 1/2(ϵ)
2
g
2
+ o(ϵ
2
), we may write
Ent
π
(f) =
1
2
ϵ
2
π(g
2
) + o(ϵ
2
),
and
E(f,log f) = −ϵE
π
((Lg)log(1 + ϵg)) = ϵ
2
E(g,g) + o(ϵ
2
).
Thus starting from (1.4), and applying to f as above, we get
ϵ
2
E(g,g) ≥
ρ
0
2
ϵ
2
E
π
g
2
+ o(ϵ
2
).
Canceling ϵ
2
and letting ϵ ↓ 0, yields the second inequality of the proposition, since E
π
g = 0.
Remark 1.11. The relation 2ρ ≤ 2λ can be strengthened somewhat to ρ ≤ λ/2 by a direct
application of the method used above. Also, under the additional assumption of reversibility, the
inequality in Lemma 1.8 can be strengthened by a factor of 2 to match this, as explained in [21],
in turn improving the above corollary to 4ρ ≤ ρ
0
for reversible chains.
9

Page 12
1.3 Discrete Time
In discrete time we consider two approaches to mixing time, both of which are equivalent. The first
approach involves operator norms, and is perhaps the more intuitive of the two methods.
Proposition 1.12. In discrete time,
τ
2
(ϵ) ≤ ⌈
1
1 − P
log
1
ϵ√π
⌉ ,
where
P
= sup
f:Ω→R,
Ef=0
P
f
2
f
2
.
This result has appeared in mixing time literature in many equivalent forms. A few can be
found in Remark 1.17 at the end of this section.
Proof. Since k
i+1
− 1 = P
(k
i
− 1) and E(k
i
− 1) = 0 for all i then
Var(k
n
) =
k
n
− 1
2
2
= P
∗n
(k
0
− 1)
2
2
≤ ( P
∗ n
k
0
− 1
2
)
2
= P
∗ 2n
Var(k
0
).
Solving for when variance drops to ϵ
2
and using the approximations log x ≤ −(1−x) and Var(k
0
) ≤
1−π
π
gives the result.
A good example in which this bound has been used in practice can be found in Section 5.2, in
which we discuss a recent proof that Pollard’s Rho algorithm for discrete logarithm requires order
√nlog
3
n steps to detect a collision, and likely determine the discrete log.
In Proposition 1.12 the mixing bound followed almost immediately from the definition. However,
there is an alternate approach to this problem which bears more of a resemblance to the continuous
time result and is more convenient for showing refined bounds.
The discrete time analog of differentiation is to take the difference Var(P
f) − Var(f). The
analog of equation (1.2) is the following lemma of Mihail [38], as formulated by Fill [25].
Lemma 1.13. Given Markov chain P and function f : Ω → R, then
Var(P
f) − Var(f) = −E
PP
(f,f) ≤ −Var(f)λ
PP
.
Proof. Since E
π
f = E
π
(Kf) for any transition probability matrix K, then
Var(P
f) − Var(f) = 〈P
f,P
f〉
π
− 〈f,f〉
π
= −〈f,(I − PP
)f〉
π
,
giving the equality. The inequality follows from the definition of λ
PP
.
Observe that 1 − λ
PP
is the largest non-trivial singular value of P.
We now use this lemma to bound mixing time of a discrete time chain. It will be seen that
mixing in discrete time is related to the eigenvalue gap of the multiplicative reversibilization PP
,
whereas in the previous section we found that mixing time in continuous time is related to the
eigenvalue gap of the additive reversibilization
P+P
2
(since λ = λ
P
= λ
P+P
2
).
10

Page 13
Corollary 1.14. In discrete time a (non-reversible) Markov chain satisfies
τ
2
(ϵ) ≤ ⌈
2
λ
PP
log
1
ϵ√π
⌉ ≤ ⌈ 1αλ log 1
ϵ√π
where α ∈ [0,1] is such that ∀x ∈ Ω : P(x,x) ≥ α. For a reversible Markov chain
τ
2
(ϵ) ≤ ⌈
1
1 − λ
max
log
1
ϵ√π
⌉ ≤ ⌈
1
min{2α, λ}
log
1
ϵ√π
where λ
max
= max{λ
1
, |λ
n−1
|} when λ
1
= 1 − λ is the largest non-trivial eigenvalue of P and
λ
n−1
≥ −1 is the smallest eigenvalue.
Proof. If k
0
is the initial density then k
n
= (P
)
n
k
0
, and so by Lemma 1.13, Var(k
n
) ≤ Var(k
n−1
)(1−
λ
PP
). By induction
Var(k
n
) ≤ Var(k
0
) (1 − λ
PP
)
n
.
(1.9)
Solving for when variance drops to ϵ
2
and using the approximation log(1 − λ
PP
) ≤ −λ
PP
gives
the first bound.
For the second bound, ∀x ∈ Ω : P
(x,x) = P(x,x) ≥ α and so
π(x)PP
(x,y) ≥ π(x)P(x,x)P
(x,y) + π(x)P(x,y)P
(y,y)
≥ α π(y)P(y,x) + α π(x)P(x,y).
Then
E
PP
(f,f) ≥ α E(f,f) + αE(f,f) = 2αE(f,f)
and so λ
PP
≥ 2αλ.
For the reversible case, Lemma 1.18 shows that P has an eigenbasis. If λ
i
is an eigenvalue of P
with corresponding right eigenvector v
i
then PP
v
i
= P
2
v
i
= λ
2
i
v
i
, and so the eigenvalues of PP
are just {λ
2
i
}. By Lemma 1.19 (i.e. Courant-Fischer) it follows that
λ
PP
= λ
P
2
= 1 − max{λ
2
1
2
n−1
} = 1 − λ
2
max
.
Solving equation (1.9) then gives the first reversible bound.
Finally, if P is reversible then
λ
n−1
−α
1−α
is the smallest eigenvalue of the reversible Markov chain
P−α I
1−α
, so Lemma 1.18 shows that
λ
n−1
−α
1−α
≥ −1. Re-arranging the inequality gives the relation
−λ
n−1
≤ 1 − 2α, and so λ
max
= max{λ
1
,−λ
n−1
} = 1 − min{λ,2α}.
Remark 1.15. As mentioned, at the beginning of this section, the two approaches to bounding
mixing in this section are equivalent.
1 − λ
PP
=
sup
f:Ω→R
Var(f) − E
PP
(f,f)
Var(f)
= sup
f:Ω→R,
Ef=0
〈f,f〉
π
− 〈f,(I − PP
)f〉
π
〈f,f〉
π
=
sup
f:Ω→R,
Ef=0
〈P
f,P
f〉
π
〈f,f〉
π
= P
∗ 2
The second equality is because the numerator and denominator are invariant under addition of a
constant to f, so it may be assumed that Ef = 0.
11

Page 14
Our concluding remark will require knowledge of the L
p
→ L
q
operator norm:
Definition 1.16. Suppose T : R
→ R
is an operator taking functions f : Ω → R to other such
functions. Then, given p,q ∈ [1,∞], let T
p→q
be the optimal constant in the inequality
Tf
q
≤ T
p→q
f
p
, for all f : Ω → R.
Remark 1.17. It has already been seen that P
= 1 − λ
PP
. We now consider a few other
equivalent forms which have appeared in mixing bounds equivalent to Proposition 1.12.
First, consider the operator norm. Let E denote the expectation operator, that is, E is a square
matrix with rows all equal to π. Then PE = EP = E
2
and so (P − E)f = (P − E)(f − Ef). Since
also f
2
≥ f −Ef
2
then the supremum in the definition of P
−E
2→2
will occur with Ef = 0,
in which case also (P
− E)f = Pf. In short P
− E
2→2
= P
.
It may seem more intuitive to work with P instead of P
. In fact, both cases are the same.
P
− E
2→2
=
sup
f
2
=1
(P
− E)f
2
= sup
f
2
=1
sup
g
2
=1
|〈(P
− E)f,g〉
π
|
=
sup
f
2
=1
sup
g
2
=1
|〈f,(P − E)g〉
π
| = sup
g
2
=1
sup
f
2
=1
|〈(P − E)g,f〉
π
|
=
P − E
2→2
.
The second equality is a form of L
p
duality, because f
p
= sup
g
q
=1
|〈f,g〉
π
| when 1/p+1/q = 1,
or equivalently f
p
= sup
g
q
=1
∫ fg dπ∣
. This is really just an extension of the dot product
property that if g = 1 then f · g = f g cos θ is maximized by g = f/ f .
Some authors have worked with complex valued functions. Note that if f : Ω → C and T is a
real valued square matrix then
Tf
2
2
= T(Ref)
2
2
+ T(Imf)
2
2
≤ T
2
2→2
( Ref
2
2
+ Imf
2
2
) = T
2
2→2
f
2
2
,
that is, the worst choice of f is a real function.
In summary,
P
= P
− E
2→2
= P − E
2→2
= sup
f:Ω→C
(P − E)f
2
f
2
= sup
f:Ω→C,
Ef=0
Pf
2
f
2
.
1.4 Does Reversibility Matter?
Many mixing results were originally shown only in the context of a reversible Markov chain. In this
survey we are able to avoid this requirement in most cases. However, there are still circumstances
under which reversible and non-reversible chains behave differently, and not just as an artifact of
the analysis. In this section we discuss these differences, and also prove a few classical lemmas
about reversible chains which were used in the discrete time results given above, and which explain
why the reversibility assumption is helpful.
The difference between reversible and non-reversible results is most apparent when upper
and lower bounds on distances are given. Combining the above work, recalling that λ
max
=
max{λ
1
,|λ
n−1
|}, and the lower bounds of Theorem 4.5, we have
if P reversible
:
1
2
max
|
n
≤ d(n) ≤
1
2
max
|
n
1−π
π
if non-reversible :
1
2
max
i>0
i
|
n
≤ d(n) ≤
1
2
(
√1
− λ
PP
)
n
1−π
π
12

Page 15
where
d(n) = max
x
P
n
(x,·) − π(·)
TV
denotes the worst variation distance after n steps.
In particular, this shows that in the reversible case the variation distance is determined, up to
a multiplicative factor, by the size of the largest magnitude eigenvalue. The rapid mixing property
is then entirely characterized by whether 1 − |λ
max
| is polynomially large or not.
In contrast, Example 5.2 gives a convergent non-reversible chain with complex eigenvalues such
that max
i>0
i
| = 1/
√2 but λ
PP
= 0. The non-reversible lower bound given above then converges
to zero with n, as it should, while the upper bound is constant and useless.
Another case in which reversibility will play a key role is comparison of mixing times. This
is a method by which the mixing time of a Markov chain P can be bounded by instead studying
mixing time of a similar, but easier to analyze chainP. For many Markov chains this is the only
way known to bound mixing time. In Theorem 5.3 we find good comparison is possible if P andP
are reversible, while if P is non-reversible then there is a slight worsening, but ifP is non-reversible
then the comparison is much worse. Example 5.4 shows that even for walks as simple as those on
a cycle Z/nZ each of these three cases are necessary, and not just artifacts of the method of proof.
The main reason why a reversible Markov chain is better behaved is that it has a complete real
valued spectral decomposition, and because the spectral gap λ is exactly related to eigenvalues of
the reversible chain. For the sake of completeness, we now show these classical properties.
Lemma 1.18. If P is reversible and irreducible on state space of size |Ω| = n, then it has a complete
spectrum of real eigenvalues with magnitudes at most one, that is
1 = λ
0
≥ λ
1
≥ ··· ≥ λ
n−1
≥ −1.
A non-reversible chain may have complex valued eigenvalues, as in Example 5.2.
Proof. Let(√π) = diag(√π(1),√π(2),... , √π(n)) denote the diagonal matrix with entries drawn
from π. The matrix M = (√π)P(√π)
−1
is a symmetric matrix because
M(x,y) = √
π(x)
π(y)
P(x,y) =
π(x)P(x,y)
√π(x)π(y)
=
π(y)P(y,x)
√π(x)π(y)
= M(y,x).
Since P is similar to the real valued symmetric matrix M it follows from the spectral theorem that
P has a complete spectrum of real eigenvalues and real eigenvectors.
Left eigenvector v and right eigenvector w are orthogonal if their eigenvalues λ
v
= λ
w
, as
λ
v
v w = (vP)w = v(Pw) = λ
w
v w .
(1.10)
In particular, eigenvalue 1 has right eigenvector 1, and so if eigenvalue λ
i
= 1 has left eigenvector
v
i
then ∑
x
v
i
(x) = v1 = 0. Then v
i
has both positive and negative entries, and for ϵ sufficiently
small σ = π + ϵv
i
is a probability distribution. However, the n step distribution is given by
σP
n
= (π + ϵv
i
)P
n
= π + ϵλ
n
i
v
i
,
and since v
i
has a negative entry then if |λ
i
| > 1 then σP
n
will have a negative entry for sufficiently
large n, contradicting the fact that σP
n
is a probability distribution.
13

Page 16
The Courant-Fischer theorem shows the connection between eigenvalues and Dirichlet forms for
a reversible Markov chain.
Lemma 1.19. In a reversible Markov chain the eigenvalues satisfy
1 − λ
1
=
inf
f:Ω→R,
f=constant
E(f,f)
Var(f)
,
1 + λ
n−1
=
inf
f:Ω→R,
f=constant
F(f,f)
Var(f)
,
where
F(f,f) = 〈f,(I + P)f〉
π
=
1
2 ∑
x,y∈Ω
(f(x) + f(y))
2
P(x,y)π(x).
In particular, 1 − λ
1
= λ.
In Section 5.4 we will find that in the non-reversible case this becomes an inequality, with
1 − Reλ
i
≥ λ.
Proof. The numerator and denominator in the infinum are invariant under adding a constant to f,
so it may be assumed that Ef = 0, that is, 〈f,1〉
π
= 0.
Let {v
i
} be a set of right eigenvectors of P forming an orthonormal eigenbasis for R
, with
v
0
= 1. Given f : Ω → R then f = ∑c
i
v
i
with c
i
= 〈f,v
i
π
, and so
E(f,f) = 〈f,(I − P)f〉
π
= ∑
i,j∈Ω
c
i
c
j
〈v
i
,(I − P)v
j
π
= ∑
i
c
2
i
(1 − λ
i
) ≥ ∑
i
c
2
i
(1 − λ
1
)
with an equality when f = v
1
. Also,
Var(f) = 〈f,f〉
π
− 〈f,1〉
π
= 〈∑
i
c
i
v
i
, ∑
j
c
j
v
j
π
− 〈∑
i
c
i
v
i
,1〉
π
= ∑ c
2
i
and so
E(f,f)
Var(f) ≥
1 − λ
1
with an equality when f = v
1
. The result then follows.
The same argument, but with I + P instead of I − P gives the result for λ
n−1
.
14

Page 17
Chapter 2
Advanced Functional Techniques
The relation between functional constants and mixing time bounds was studied in Chapter 1. In
this section it is shown that information on functions of large variance, or on functions with small
support, can be exploited to show better mixing time bounds.
The basic argument is this. Recall that
d
dt
Var(h
t
) = −2E(h
t
,h
t
). If E(f,f) ≥ G(Var(f)) for
some G : R
+
→ R
+
and for all f : Ω → R
+
with Ef = 1, then it follows that
d
dt
Var(h
t
) =
−2E(h
t
,h
t
) ≤ −2G(Var(h
t
)). Setting I(t) = Var(h
t
) then this implies
τ
2
(ϵ) = ∫
τ
2
(ϵ)
0
1dt ≤ ∫
ϵ
2
Var(h
0
)
dI
−2G(I)
.
(2.1)
The argument carries over to discrete time if G is non-decreasing and α ∈ [0,1] is such that
∀x ∈ Ω : P(x,x) ≥ α. To see this, observe that by Lemma 1.13
Var(k
n+1
) − Var(k
n
) = −E
PP
(k
n
,k
n
) ≤ −2αE(k
n
,k
n
) ≤ −2αG(Var(k
n
)).
Since both I(n) = Var(k
n
) and G(x) are non-decreasing, the piecewise linear extension of I(n) to
t ∈ R
+
will satisfy
dI
dt ≤ −
2αG(I).
At integer t, the derivative can be taken from either right or left. It follows that
τ
2
(ϵ) = ∫
τ
2
(ϵ)
0
1dt ≤ ⌈∫
ϵ
2
Var(h
0
)
dI
−2αG(I)
⌉ .
(2.2)
If instead E
PP
(f,f) ≥ G(Var(f)) then simply drop the 2α from this bound.
This idea will shortly be exploited to obtain fairly simply proofs relating the log-Sobolev constant
and Nash inequalities to continuous and discrete time L
2
mixing, and to generalize spectral gap
bounds on mixing.
2.1 Log-Sobolev and Nash Inequalities
Some of the best bounds on L
2
mixing times were shown by use of the log-Sobolev constant, a
method developed in the finite Markov chain setting by Diaconis and Saloff-Coste [21]. Equation
(2.1) will show a bound in terms of the log-Sobolev constant if we can show a relation between the
15

Page 18
Dirichlet form E(h
t
,h
t
) and a function of the variance Var(h
t
). The following lemma establishes
this connection.
Lemma 2.1. If f is non-negative then
Ent(f
2
) ≥ Ef
2
log
Ef
2
(Ef)
2
,
and in particular, if Ef = 1 then
E(f,f) ≥ ρEnt(f
2
) ≥ ρ(1 + Var(f))log(1 + Var(f)).
Proof. By definition
Ent(f
2
) = Ef
2
log
f
2
Ef
2
= 2Ef
2
log
fEf
Ef
2
+ Ef
2
log
Ef
2
(Ef)
2
Now, apply the approximation log x ≥ 1 − 1/x.
Ef
2
log
fEf
Ef
2
≥ Ef
2
(1 − Ef
2
fEf )
= Ef
2
− Ef
2
Ef
Ef
= 0
Those familiar with cross-entropy might prefer to rewrite this proof as follows. Noting that the
cross-entropy H(f,g) = Ef log
f
g
≥ 0 for densities f, g, the proof is equivalent to the statement
Ent(f
2
) = 2Ef
2
H (
f
2
Ef
2
,
f
Ef )
+ Ef
2
log
Ef
2
(Ef)
2
≥ Ef
2
log
Ef
2
(Ef)
2
.
Diaconis and Saloff-Coste also showed that the Nash-inequality can be used to study L
2
mixing
[22]. The Dirichlet form can also be lower bounded in terms of variance by using a Nash inequality.
Lemma 2.2. Given a Nash Inequality
f
2+1/D
2
≤ C [E(f,f) +
1
T
f
2
2
] f
1/D
1
which holds for every function f : Ω → R and some constants C, D, T ∈ R
+
, then whenever f ≥ 0
and Ef = 1 then
E(f,f) ≥ (1 + Var(f)) (
(1 + Var(f))
1/D
C
1
T )
Proof. The Nash inequality can be rewritten as
E(f,f) ≥ f
2
2
( 1C
( f
2
f
1
)
1/D
1
T )
However, f
1
= E|f| = 1, and Var(f) = f
2
2
− 1, giving the result.
16

Page 19
Corollary 2.3. Given the spectral gap λ and the log-Sobolev constant ρ and/or a Nash inequality
with DC ≥ T and D ≥ 2, and given ϵ ≤ 2, then the continuous time Markov chain satisfies
τ
2
(ϵ) ≤
1
log log
1
π
+
1
λ (
1
4
+ log
1
ϵ)
τ
2
(ϵ) ≤ T +
1
λ (
D
2
log
DC
T
+ log
1
ϵ)
τ
2
(ϵ) ≤ T +
1
log log (
DC
T )
D
+
1
λ (
1
4
+ log
1
ϵ)
Upper bounds for the discrete time Markov chain are a factor of 2 larger when Nash, log-Sobolev
and spectral gap are computed in terms of the chain PP
, while when computed for P then they are
a factor α
−1
larger for α ∈ [0,1] satisfying ∀x ∈ Ω : P(x,x) ≥ α.
Proof. Apply equation (2.1) with the log-Sobolev bound of Lemma 2.1 when Var(h
t
) ≥ 4, and the
spectral gap bound E(f,f) ≥ λVar(f) when Var(h
t
) < 4, to obtain
τ
2
(ϵ) ≤ ∫
4
Var(h
0
)
dI
−2ρ(1 + I)log(1 + I)
+ ∫
ϵ
2
4
dI
−2λI
=
1
−2ρ
(log log(1 + 4) − log log(1 + Var(h
0
))) +
1
−2λ
log
ϵ
2
4
Simplify this by Var(h
0
) ≤
1−π
π
, and apply ρ ≤ λ/2 to the log log(5) term.
For the second mixing bound use the Nash bound of Lemma 2.2 when Var(h
t
) ≥ (DC/T)
D
−1,
and the spectral bound when Var(h
t
) < (DC/T)
D
− 1. The Nash portion of the integral is then
(DC/T)
D
−1
Var(h
0
)
dr
−2(1 + I) (
(1+I)
1/D
C
1
T
)
= −
DT
2
log (1 −
C/T
(1 + I)
1/D
)∣
(DC/T)
D
−1
Var(h
0
)
≤ −
DT
2
log(1 − 1/D) ≤
DT
2(D − 1)
≤ T
The second inequality is because log(1 − 1/x) ≥ −
1
x−1
.
For the third bound use the Nash bound when Var(h
t
) ≥ (DC/T)
D
−1, the log-Sobolev bound
for (DC/T)
D
− 1 > Var(h
t
) ≥ 4 and the spectral bound when Var(h
t
) < 4.
For the discrete time case proceed similarly, but with equation (2.2) instead of equation (2.1).
The continuous time log-Sobolev bound is comparable to a result of Diaconis and Saloff-Coste
[21], while the discrete time log-Sobolev bound is comparable to a bound of Miclo [37].
For a reversible, continuous time chain, hypercontractivity ideas can be used to improve the
log-Sobolev portion of these bounds by a factor of two. A tedious differentiation and a few ap-
proximations (see around equation (3.2) of [21]) show that for any t
0
∈ R, a reversible chain will
satisfy
d
dt
h
t
1+e
4ρ(t−t
0
)
≤ 0,
and so the norm is decreasing in t. If t
0
= k +
1
log [log(1 + Var(h
k
)) − 1] for some k ∈ R
+
then
h
t
0
2
≤ h
k 1+e
−4ρt
0
≤ h
k
1−2/ log(1+Var(h
k
))
1
h
k
2/ log(1+Var(h
k
))
2
= e.
17

Page 20
The second inequality was the relation f
q
≤ f
1−2/q
1
f
2/q
2
when q ≥ 2 and q
∈ [1,2] is the
conjugate exponent (i.e. 1/q + 1/q
= 1), see Chapter 8, Lemma 41 of [1]. The equality is because
h
k 1
= 1 and h
k
2
2
= 1 + Var(h
k
).
This shows that between steps k and t
0
the variance drops from Var(h
k
) to Var(h
t
0
) = h
t
0
2
2
1 ≤ e
2
− 1. When combined with our earlier Nash and spectral work we obtain
τ
2
(ϵ) ≤ ∫
(DC/T)
D
−1
Var(h
0
)
dr
−2(1 + I) (
(1+I)
1/D
C
1
T
)
+
1
log [log(DC/T)
D
− 1] +
ϵ
2
e
2
−1
dI
−2λI
< T +
1
log log (
DC
T )
D
+
1
λ (
1 + log
1
ϵ)
.
The factor of two loss in the non-reversible case seems to be unavoidable, and is likely due to the
fact that E
P
(f,f) = E
P
(f,f) = E
P+P
2
(f,f) and so ρ and λ cannot capture the difference between
the non-reversible chain P and it’s additive reversibilization
P+P
2
. For instance, in Corollary 1.14
and Remark 1.11, reversibility also affected the mixing time by a factor of two.
2.2 Spectral profile
Faber-Krahn inequalities were developed by Grigor’yan, Coulhon and Pittet [15] (see also [27] and
[16]) to study the rate of decay of the heat kernel, and in the finite Markov setting by Goel,
Montenegro and Tetali [26]. The results of Chapter 1 will be improved by proving a better lower
bound on E(f,f) and then solving a subsequent differential equation.
Definition 2.4. For a non-empty subset S ⊂ Ω the first Dirichlet eigenvalue on S is given by
λ
1
(S) = inf
f∈c
+
0
(S)
E(f,f)
Var(f)
where c
+
0
(S) = {f ≥ 0 : supp(f) ⊂ S} is the set of non-negative functions supported on S. The
spectral profile Λ : [π
,∞) → R is given by Λ(r) = inf
π
≤π(S)≤r
λ
1
(S).
To utilize the spectral profile Λ(r) in studying mixing time we require a lemma relating the
Dirichlet form E(f,f) to Λ(r), improving on the basic bound E(f,f) ≥ λVar(f) used earlier.
Lemma 2.5. For every non-constant function f : Ω → R
+
,
E(f,f) ≥
1
2
Λ(
4(Ef)
2
Var f )
Var(f).
Proof. Given a ∈ R use the notation a
+
= max{a,0} to denote the positive part. For c constant,
E(f,f) = E(f −c,f −c). Also, E(f −c,f −c) ≥ E((f −c)
+
,(f −c)
+
) because ∀a,b ∈ R : (a−b)
2
(a
+
− b
+
)
2
. It follows that when 0 ≤ c < maxf then
E(f,f) ≥ E((f − c)
+
,(f − c)
+
)
≥ Var((f − c)
+
)
inf
u∈c
+
0
(f>c)
E(u,u)
Var(u)
≥ Var((f − c)
+
)Λ(π(f > c)).
18

Page 21
The inequalities ∀a,b ≥ 0 : (a − b)
2
+
≥ a
2
− 2ba and (a − b)
+
≤ a show that
Var((f − c)
+
) = E(f − c)
2
+
− (E(f − c)
+
)
2
≥ Ef
2
− 2cEf − (Ef)
2
.
Let c = Var(f)/4Ef and apply Markov’s inequality π(f > c) < (Ef)/c,
E(f,f) ≥ (Var(f) − 2cEf)Λ(Ef/c) =
1
2
Var(f)Λ (
4(Ef)
2
Var f )
A mixing time theorem then follows easily.
Theorem 2.6. In continuous time
τ
2
(ϵ) ≤ ∫
1/2
4 π
dr
r Λ(r)
+
1
λ
log
2√2
ϵ
.
In discrete time
τ
2
(ϵ) ≤ ⌈∫
1/2
4 π
2dr
r Λ
PP
(r)
+
2
λ
PP
log
2√2
ϵ ⌉
≤ ⌈∫
1/2
4 π
dr
α r Λ(r)
+
1
αλ
log
2√2
ϵ ⌉
where α ∈ [0,1] is such that ∀x ∈ Ω : P(x,x) ≥ α.
Proof. Apply equation (2.1) with the spectral profile bound of Lemma 2.5 to obtain
τ
2
(ϵ) ≤ ∫
8
Var(h
0
)
dI
−I Λ(4/I)
+ ∫
ϵ
2
8
dI
−2λI
.
A change of variables to r = 4/I(t) implies that
τ
2
(ϵ) ≤ ∫
1/2
4/Var(h
0
)
dr
r Λ(r)
+
1
log
8
ϵ
2
and the final simplification Var(h
0
) ≤
1
π
− 1 < 1/π
leads to the theorem.
For the discrete time case use the remark after equation (2.2) instead. The second inequality
follows from Λ
PP
(r) ≥ 2αΛ(r), an immediate consequence of the relation E
PP
(f,f) ≥ 2αE(f,f)
shown in the proof of Corollary 1.14.
Observe that trivially Λ(r) ≥ λ, and so Theorem 2.6 leads to the bound τ
2
(ϵ) ≤
1
λ
log(1/2
2 π
ϵ),
about a factor of two worse than the conventional spectral bound of equation (1.5). The same fac-
tor of two loss occurs in the discrete case. However, both theorems can be significantly better if
Λ(r) ≫ λ for small values of r. As an example of this, in [26] the authors show that given the
log-Sobolev constant and a Nash inequality, then
Λ(r) ≥ ρ
log(1/r)
1 − r
and Λ(r) ≥
1
C r
1/2D
1
T
.
19

Page 22
By applying the Nash bound on Λ(r) for r ≤ (T/2DC)
2D
and the log-Sobolev bound when
(T/2DC)
2D
≤ r ≤ 1/2, then integration in Theorem 2.6 establishes the bound
τ
2
(ϵ) ≤ 2T +
1
ρ
log log (
2DC
T )
2D
+
1
λ
log
2√2
ϵ
(2.3)
This is only a factor two weaker than that found with our more direct approach earlier. A further
example of spectral profile will be given in our study of the Thorp shuffle later.
2.3 Comparison methods
It sometimes happens that a Markov chain is difficult to study, but a related chain is more man-
ageable. In this situation the comparison method has been widely used to bound spectral gap,
log-Sobolev constant and Nash inequalities (see [49, 20, 21, 22]). The argument applies to the
bounds in this chapter as well.
Theorem 2.7. Consider two Markov chains P andP on the same state space Ω, and for every
x,y ∈ Ω withP(x,y) > 0 define a directed path γ
xy
from x to y along edges in P. Let Γ denote the
set of all such paths. Then
E
P
(f,f) ≥
1
A E
P
(f,f), Var
π
(f) ≤ M Var
π
(f), Ent
π
(f
2
) ≤ M Ent
π
(f
2
),
where M = max
x
π(x)
π(x)
and
A = A(Γ) = max
a,b∈Ω,
P(a,b)=0
1
π(a)P(a,b)
x,y∈Ω,
(a,b)∈γ
xy
π(x)P(x,y)|γ
xy
|.
Proof. First, consider the Dirichlet forms:
E
P
(f,f) =
1
2 ∑
x,y
(f(x) − f(y))
2
π(x)P(x,y)
=
1
2 ∑
x,y
(
(a,b)∈γ
xy
(f(a) − f(b))
)
2
π(x)P(x,y)
1
2 ∑
x,y
(a,b)∈γ
xy
(f(a) − f(b))
2
xy
|π(x)P(x,y)
=
1
2 ∑
a,b
(f(a) − f(b))
2
π(a)P(a,b)
1
π(a)P(a,b) ∑
x,y
(a,b)∈γ
xy
π(x)P(x,y)|γ
xy
|
≤ E
P
(f,f)A.
For variance we have
Var
π
(f) = inf
c∈R
E
π
(f(x) − c)
2
≤ inf
c∈R
M E
π
(f(x) − c)
2
= M Var
π
(f).
20

Page 23
For entropy, observe that
Ent
π
(f
2
) = ∑
x∈Ω
π(x) (f
2
(x)log
f
2
(x)
E
π
f
2
− f
2
(x) + E
π
f
2
)
= inf
c>0
x∈Ω
π(x) (f
2
(x)log
f
2
(x)
c
− f
2
(x) + c)
≤ inf
c>0
M ∑
x∈Ω
ˆπ(x) (f
2
(x)log
f
2
(x)
c
− f
2
(x) + c)
= M Ent
π
(f
2
).
The second equality follows from differentiating with respect to c to see that the minimum occurs
at c = E
π
f
2
, while the inequality required the fact that a log
a
b
− a + b ≥ a(1 −
b
a
) − a + b = 0 and
so f
2
log
f
2
c
− f
2
+ c ≥ 0.
An easy consequence of this is that spectral gap, log-Sobolev and spectral profile bounds can
be compared.
Corollary 2.8.
λ
P
1
M A
λ
P
, ρ
P
1
M A
ρ
P
, Λ
P
(r) ≥
1
M A
Λ
P
(r).
The log-Sobolev and spectral profile mixing time bounds of P are thus at worst a factor MA
times larger than those ofP.
If the distribution π =π then a Nash inequality forP, along with the relation E
P
(f,f) ≥
1
A
E
P
(f,f), immediately yields a Nash inequality for P. It is not immediately clear how to compare
Nash inequality bounds if π =π. However, one can compare the spectral profile bounds used to
show equation (2.3), and so the mixing time of P is at most M A times the bound equation (2.3)
gives forP. Alternatively, one can compare E
P
(f,f) to E
P
(f,f) and Var
π
(f) to Var
π
(f) in the
original proofs of the mixing times.
In the case of a reversible chain Diaconis and Saloff-Coste [20] observe that it is also possible
to compare λ
n−1
if the paths are of odd length.
Theorem 2.9. Consider two Markov chains P andP on the same state space Ω, and for every
x ∈ Ω define a directed circuit γ
x
from x to itself along edges in P. Let Γ
denote the set of all
such paths. Then
F
P
(f,f) ≥
1
A
F
P
(f,f),
where M = max
x
π(x)
π(x)
and
A
= A
) = max
a,b∈Ω,
P(a,b)=0
1
π(a)P(a,b)
x,y∈Ω,
(a,b)∈γ
xy
π(x)P(x,y)|γ
xy
|r
xy
(a,b),
where r
xy
(a,b) is the number of times the edge (a,b) appears in path γ
xy
.
21

Page 24
Proof. The proof is identical to that for comparison of E(f,f), except that if the path γ
xy
is given
by x = x
0
,x
1
,x
2
... ,x
m
= y for m odd then f(x) + f(y) is rewritten as
f(x) + f(y) = (f(x) + f(x
1
)) − (f(x
1
) + f(x
2
)) + ···
−(f(x
m−2
) + f(x
m−1
)) + (f(x
m−1
) + f(y)).
In particular,
1 − λ
max
(P) ≥
1
MA
(1 − λ
max
(P)).
22

Page 25
Chapter 3
Evolving Set Methods
Many mixing time results first estimate set expansion and then relate it to mixing time bounds.
An early breakthrough in the study of mixing times was the conductance bound
λ ≥ Φ
2
/2
where
Φ = min
A⊂Ω
Q(A,A
c
)
min{π(A),π(A
c
)}
(see [30, 34]). Essentially the same proof can be used (see [26]) to show a conductance profile
bound, that
Λ(r) ≥ Φ
2
(r)/2
where
Φ(r) = min
A⊂Ω,
π(A)≤r
Q(A,A
c
)
min{π(A),π(A
c
)}
.
Given α ∈ [0,1] such that ∀x ∈ Ω : P(x,x) ≥ α this can be boosted to
Λ(r) = (1 − α)Λ
P−αI
1−α
(r) ≥
1 − α
2
Φ
2
P−αI
1−α
(r) =
1 − α
2
( Φ(r)
1 − α)
2
=
Φ
2
(r)
2(1 − α)
(3.1)
and so by Corollary 1.14 and Theorem 2.6 a discrete time chain mixes in time
τ
2
(ϵ) ≤ ⌈
2
α
1−α
Φ
2
log
1
ϵ√π
and
τ
2
(ϵ) ≤ ⌈∫
4/ϵ
2
2dr
α
1−α
2
(r)⌉
.
(3.2)
In the common setting of a reversible, lazy (i.e. α ≥ 1/2) chain Corollary 1.14 also implies the
slightly stronger bound
τ
2
(ϵ) ≤ ⌈
1
Φ
2
log
1
ϵ√π
⌉ .
(3.3)
In this section we develop a more direct method of proof. This can give stronger set bounds,
bounds for distances other than L
2
-distance, and also leads to an extension of conductance which
applies even with no holding probability. All work will be done in discrete time, but carries over
easily to continuous time, as discussed at the end of the chapter. The results and their proofs are
based on the work of Morris and Peres [45] and Montenegro [42].
3.1 Bounding Distances by Evolving Sets
Results in this section are found by working with a dual process. Given a Markov chain on Ω with
transition matrix P, a dual process consists of a walk P
D
on some state space V and a link, or
23

Page 26
transition matrix, Λ from V to Ω such that
PΛ = ΛP
D
.
In particular, P
n
Λ = ΛP
n
D
and so the evolution of P
n
and P
n
D
will be closely related. This relation
is given visually by Figure 3.1.
P
D
P
D
P
D
Walk on
V
Walk on
P
P
P
Λ
Λ
Λ
Λ
Λ
Figure 3.1: The dual walk P
D
projects onto the original chain P.
Diaconis and Fill [19] studied the use of dual Markov chains in bounding separation distance.
Independently, Morris and Peres [45] proposed the same walk on sets and used it to bound L
2
distance. Montenegro [42] sharpened this technique and extended it to other distances.
In order to relate a property of sets (conductance) to a property of the original walk (mixing
time) we construct a walk on sets that is a dual to the original Markov chain. A natural candidate
to link a walk on sets to a walk on states is the projection Λ(S,y) =
π(y)
π(S)
1
S
(y). Diaconis and
Fill [19] have shown that for certain classes of Markov chains that the walkK below is the unique
dual process with link Λ, so this is the walk on sets that should be considered. We use notation of
Morris and Peres [45].
Definition 3.1. Given set A ⊂ Ω a step of the evolving set process is given by choosing u ∈ [0,1]
uniformly at random, and transitioning to the set
A
u
= {y ∈ Ω : Q(A,y) ≥ uπ(y)} = {y ∈ Ω : P
(y,A) ≥ u}
The walk is denoted by S
0
, S
1
, S
2
, ..., S
n
, with transition kernel K
n
(A,S) = Prob(S
n
= S|S
0
= A).
The Doob transform of this process is the Markov chain on sets given byK(S,S ) =
π(S )
π(S)
K(S,S ),
with n-step transition probabilitiesK
n
(S,S ) =
π(S )
π(S)
K
n
(S,S ).
Heuristically, a step of the evolving set process consists of choosing a uniform value of u, and
then A
u
is the set of vertices y that get at least a u-fraction of their size π(y) from the set A.
The Doob transform produces another Markov chain because of a Martingale property.
Lemma 3.2. If A ⊂ Ω then
A ⊂Ω
π(A )K(A,A ) = ∫
1
0
π(A
u
)du = π(A)
Proof.
1
0
π(A
u
)du = ∑
y∈Ω
π(y)Prob(y ∈ A
u
) = ∑
y∈Ω
π(y)
Q(A,y)
π(y)
= π(A)
24

Page 27
The walkK is a dual process of P.
Lemma 3.3. If S ⊂ Ω, y ∈ Ω and Λ(S,y) =
π(y)
π(S)
1
S
(y) is the projection linkage, then
PΛ(S,y) = Λ
K
(S,y).
Proof.
PΛ(S,y) = ∑
z∈S
π(z)
π(S)
P(z,y) =
Q(S,y)
π(S)
Λ
K
(S,y) = ∑
S y
K(S,S )
π(y)
π(S )
=
π(y)
π(S) ∑
S y
K(S,S ) =
Q(S,y)
π(S)
The final equality is because ∑
S y
K(S,S ) = Prob(y ∈ S ) = Q(S,y)/π(y).
With duality it becomes easy to write the n step transitions in terms of the walkK.
Lemma 3.4. LetE
n
denote expectation underK
n
. If x ∈ Ω and S
0
= {x} then
P
n
(x,y) =E
n
π
S
n
(y),
where π
S
(y) =
1
S
(y)π(y)
π(S)
denotes the probability distribution induced on set S by π.
Proof.
P
n
(x,y) = (P
n
Λ)({x},y) = (Λ
K
n
)({x},y) =E
n
π
S
n
(y)
The final equality is because Λ(S,y) = π
S
(y).
Recall from equation (1.1) that if a distance dist(µ,π) is convex in µ then the worst initial
distribution is a point mass. Given the preceding lemma it is easy to show a distance bound for all
such convex distances.
Theorem 3.5. Consider a finite Markov chain with stationary distribution π. Any distance
dist(µ,π) which is convex in µ satisfies
dist(P
n
(x,·),π) ≤
Ê
n
dist(π
S
n
,π)
whenever x ∈ Ω and S
0
= {x}.
Proof. By Lemma 3.4 and convexity,
dist(P
n
(x,·),π) = dist(E
n
π
S
n
,π) ≤
Ê
n
dist(π
S
n
,π).
In particular, if dist(µ,π) = L
π
(
µ
π
) for a convex functional L
π
: (R
+
)
→ R then the distance
is convex and the conditions of the theorem are satisfied. The total variation distance satisfies this
condition with L
π
(f) =
1
2
f −1
1,π
, relative entropy with L
π
(f) = E
π
f log f, and L
2
distance with
L
π
(f) = f − 1
2,π
, and so the following bounds are immediate:
25

Page 28
Theorem 3.6. If x ∈ Ω and S
0
= {x} then in discrete time
P
n
(x,·) − π
TV
Ê
n
(1 − π(S
n
)),
D(P
n
(x,·) π)
Ê
n
log
1
π(S
n
)
,
P
n
(x,·) − π
2
Ê
n
√1 − π(S
n
)
π(S
n
)
.
Related arguments apply to other distances, such as Hellinger or Wasserstein distances. See
Montenegro [42] for details.
3.2 Mixing Times
Mixing time bounds can be shown through a procedure similar to that of spectral profile bounds.
Throughout this section assume that the distance to be studied is of the form
dist(P
n
(x,·),π) ≤
Ê
n
f(π(S
n
))
for a decreasing function f : [0,1] → R
+
. For instance, the distance bounds in Theorem 3.6 are all
of this form. Let τ(ϵ) = min{n :E
n
f(π(S
n
)) ≤ ϵ}, so that the mixing time in terms of the distance
is upper bounded by τ(ϵ).
The analog of spectral profile Λ
PP
(r) will be the f-congestion:
Definition 3.7. Given a function f : [0,1] → R
+
the f-congestion profile is
C
f
(r) = max
A⊂Ω,
π(A)≤r
C
f
(A)
where
C
f
(A) = ∫
1
0
f(π(A
u
))
f(π(A))
du.
The f-congestion is C
f
= max
A⊂Ω
C
f
(A).
Note that if f(z) = f(1 − z) then u-almost everywhere A
u
= (A
c
)
c
1−u
and a simple calculation
shows that C
f
(A) = C
f
(A
c
). Therefore, when r ≥ 1/2 and f(z) = f(1 − z) then let C
f
(r) = C
f
=
C
f
(1/2).
The analog of Lemma 1.13 will be the following:
Lemma 3.8.E
n+1
f(π(S
n+1
)) −
Ê
n
f(π(S
n
)) = −
Ê
n
f(π(S
n
))(1 − C
zf(z)
(S
n
))
≤ −(1 − C
zf(z)
)E
n
f(π(S
n
))
Proof. The inequality is because 1 − C
zf(z)
≤ 1 − C
zf(z)
(S) for all S ⊂ Ω. For the equality,
E
n+1
f(π(S
n+1
)) =E
n
S
K(S
n
,S)f(π(S))
=E
n
f(π(S
n
))∑
S
K(S
n
,S)π(S)f(π(S))
π(S
n
)f(π(S
n
))
=E
n
f(π(S
n
))C
zf(z)
(S
n
)
26

Page 29
The analog of Corollary 1.14 is the following:
Corollary 3.9. In discrete time
dist(P
n
(x,·),π) ≤ C
n
zf(z)
f(π(x)) and τ(ϵ) ≤ ⌈
1
1 − C
zf(z)
log
f(π
)
ϵ ⌉
.
Proof. By Lemma 3.8,E
n+1
f(π(S
n+1
)) ≤ C
zf(z)
E
n
f(π(S
n
)), and by inductionE
n
f(π(S
n
)) ≤
C
n
zf(z)
f(π(S
0
)). Solving for when this drops to ϵ and using the approximation log C
zf(z)
≤ −(1 −
C
zf(z)
), gives the corollary.
This can be generalized to take into consideration set sizes. Theorem 2.6 will have two analogs,
a stronger bound under a fairly weak convexity condition, with about a factor of two lost in the
general case.
Theorem 3.10. In discrete time, if f is differentiable then
τ(ϵ) ≤ ⌈∫
f
−1
(ϵ)
π
−f (x)dx
f(x)(1 − C
zf(z)
(x))⌉
if x(1 − C
zf(z)
(f
−1
(x))) is convex, while in general
τ(ϵ) ≤ ⌈∫
f
−1
(ϵ/2)
f
−1
(f(π
)/2)
−2f (x)dx
f(x)(1 − C
zf(z)
(x))⌉
Proof. First consider the convex case.
By Lemma 3.8 and Jensen’s inequality for the convex function x (1 − C
zf(z)
(f
−1
(x))),
E
n+1
f(π(S
n+1
)) −
Ê
n
f(π(S
n
)) = −
Ê
n
f(π(S
n
))(1 − C
zf(z)
(S
n
))
≤ −
Ê
n
f(π(S
n
)) [1 − C
zf(z)
(f
−1
◦ f(π(S
n
)))]
≤ −[E
n
f(π(S
n
))] [1 − C
zf(z)
(f
−1
(E
n
f(π(S
n
))))] .
(3.4)
Since I(n) =E
n
f(π(S
n
)) and 1 − C
zf(z)
(f
−1
(x)) are non-increasing, the piecewise linear extension
of I(n) to t ∈ R
+
satisfies
I (t) ≤ −I(t) [1 − C
zf(z)
(f
−1
(I(t)))]
At integer t the derivative can be taken from either right or left. This can be solved as in the proof
of Theorem 2.6.
For the general case, use Lemma 3.11 instead of convexity at (3.4).
Lemma 3.11. If Z ≥ 0 is a nonnegative random variable and g is a nonnegative increasing
function, then
E (Z g(Z)) ≥
EZ
2
g(EZ/2).
Proof. (from [45]) Let A be the event {Z ≥ EZ/2}. Then E(Z 1
A
c
) ≤ EZ/2, so E(Z1
A
) ≥ EZ/2.
Therefore,
E (Z g(2Z)) ≥ E (Z1
A
g(EZ)) ≥
EZ
2
g(EZ).
Let U = 2Z to get the result.
27

Page 30
It is fairly easy to translate these to mixing time bounds. For instance, by Theorem 3.6 it is
appropriate to let f(z) = √
1−z
z
for L
2
bounds. Then the bounds from Corollary 3.9 and Theorem
3.10 imply:
τ
2
(ϵ) ≤
1
1 − C
z(1−z)
log
1
ϵ√π
1
1+ϵ
2
π
dx
2x(1 − x)(1 − C
z(1−z)
(x))⌉
1
1+ϵ
2
/4
4π∗
1+3π∗
dx
x(1 − x)(1 − C
z(1−z)
(x))⌉
1
1 − C
z(1−z)
log
1
ϵ√π
1/ϵ
2
π
du
2u(1 − C
z(1−z)
(u))⌉
4/ϵ