Resampling Markov Chain Monte Carlo Algorithms: Basic Analysis and Empirical Comparisons

Sampling from complex distributions is an important but challenging topic in scientific and statistical computation. We synthesize three ideas (tempering, resampling, and Markov moving) and propose a general framework of resampling Markov chain Monte Carlo (MCMC). This framework not only accommodates various existing algorithms, including resample-move, importance resampling MCMC, and equi-energy sampling, but also leads to a generalized resample-move algorithm. We provide some basic analysis of these algorithms within the general framework, and present three simulation studies to compare these algorithms together with parallel tempering in the difficult situation where new modes emerge in the tails of previous tempering distributions. Our analysis and empirical results suggest that generalized resample-move tends to perform the best among all the algorithms studied when the Markov kernels lead to fast mixing, or even locally so, toward restricted distributions, whereas parallel tempering tends to perform the best when the Markov kernels lead to slow mixing, without even converging fast to restricted distributions. Moreover, importance resampling MCMC and equi-energy sampling perform similarly to each other, often worse than independence Metropolis resampling MCMC. Therefore, different algorithms seem to have advantages in different settings.


INTRODUCTION
Sampling from complex distributions is an important but challenging topic in scientific and statistical computation (e.g., Liu 2001; Robert and Casella 2005). Several classes of Monte Carlo techniques have been introduced. The class of Markov chain Monte Carlo (MCMC) techniques includes Metropolis-Hastings sampling, Gibbs sampling, and various extensions, for example, parallel tempering (Geyer 1991). The class of sequential Monte Carlo (SMC) techniques includes sequential importance resampling or particle filtering (Gordon, Salmond, and Smith 1993; Kong, Liu, and Wong 1994), resample-move (Gilks and Berzuini 2001), and various extensions (e.g., Del Moral, Doucet, and Jasra 2006). Recent developments of Monte Carlo techniques include equi-energy sampling (Kou, Zhou, and Wong 2006), importance resampling MCMC (Atchadé 2009), and self-interacting MCMC (Andrieu, Jasra, and Doucet 2011).
The foregoing review of Monte Carlo techniques is far from complete, but is intended to highlight several basic ideas shared by these techniques. The first idea, tempering, is broadly to consider a sequence of distributions P_1, …, P_m with unnormalized densities q_1(x), …, q_m(x) on the same sample space. The classical example is a sequence of Boltzmann distributions with q_j(x) = exp{−u(x)/T_j}, where u(x) is a potential function and T_j is a temperature. There are at least two distinct situations. In one situation, all m distributions are of interest, as in the example of Potts models in Section 3.3. In another situation, only P_m is of interest and the rest are constructed to facilitate sampling from P_m, as in the example of Gaussian mixtures in Section 3.2. In the latter case, it is typically expected that P_1, …, P_m are increasingly difficult to simulate, although the levels of difficulty in sampling from distributions are usually complicated to categorize. Moreover, this condition is not strictly required for the validity of the algorithms discussed later, although its violation can have considerable impacts on their performances.
The second idea, resampling, is to randomly select elements with replacement from a sample for P_{j−1} as a way to generate a sample for P_j. Importance resampling (Rubin 1987) is central to SMC techniques, where selection probabilities are proportional to q_j(x)/q_{j−1}(x) and the selected elements are then automatically accepted, except in some extensions of SMC, for example, rejection control by Liu, Chen, and Wong (1998). Alternatively, as discussed in Section 2.3.2, independence Metropolis resampling is essentially used in equi-energy sampling (Kou, Zhou, and Wong 2006) and a modified version by Atchadé (2010), where selection probabilities are uniform but the selected elements are either accepted or rejected according to the rejection rule in independence Metropolis sampling (e.g., Liu 2001, sec. 5.4.2).
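As a concrete illustration of the resampling idea, the following minimal sketch (with hypothetical function names) draws from a sample for P_{j−1} with selection probabilities proportional to q_j(x)/q_{j−1}(x), as in importance resampling:

```python
import numpy as np

def importance_resample(sample, q_prev, q_next, k, rng):
    """Draw k elements of `sample` with replacement, with selection
    probabilities proportional to q_next(x)/q_prev(x)."""
    w = np.array([q_next(x) / q_prev(x) for x in sample])
    p = w / w.sum()
    idx = rng.choice(len(sample), size=k, replace=True, p=p)
    return [sample[i] for i in idx]
```

The independence Metropolis variant discussed in Section 2.3.2 instead keeps the uniform draw and adds an acceptance-rejection step on each selected element.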
The third idea, Markov moving, is, given an initial observation, to generate new observations for P_j using a Markov kernel that leaves P_j invariant. This exercise may lead directly to MCMC. But the use of Markov moving in the resample-move algorithm (Gilks and Berzuini 2001) is different: one Markov move is performed on each element in an intermediate sample, which is obtained for P_j by importance resampling. The effect of Markov moving is twofold. First, the new sample remains approximately distributed according to P_j, by the invariance of P_j under the Markov kernel. Second, an observation may be selected into the intermediate sample several times, but the copies of this observation can then be moved to different points. Therefore, more diversity is introduced into the new sample.
In this article, we synthesize these ideas and propose a general framework of resampling MCMC algorithms. This framework accommodates various existing algorithms, including resample-move, importance resampling MCMC, and equi-energy sampling, although not parallel tempering. We also propose a generalized resample-move algorithm by performing and retaining multiple Markov moves given each draw from importance resampling, while reducing the number of such draws. We provide some basic analysis of these algorithms within the general framework, which sheds new light on their connections and comparisons.
Finally, we present three simulation studies to compare various algorithms including parallel tempering.

A GENERAL FRAMEWORK
Let P_1, …, P_m be a sequence of distributions and Ψ_1, …, Ψ_m be the corresponding Markov transition kernels such that P_j is invariant under Ψ_j for j = 1, …, m. Pick an initial value x_{j0} for sampling from P_j for j = 1, …, m. We propose the following framework of algorithms, called resampling MCMC, to generate samples S_1, …, S_m of sizes n_1, …, n_m for P_1, …, P_m, respectively.
1. Generate a sample S_1 = {x_{11}, …, x_{1n_1}} for P_1.
2. For j = 2, …, m:
2a. Choose the number of draws k_j and the chain lengths ℓ_{j1}, …, ℓ_{jk_j} such that ℓ_{j1} + · · · + ℓ_{jk_j} = n_j.
2b. Draw k_j observations {z_{j1}, …, z_{jk_j}} by a resampling scheme from the sample S_{j−1}.
2c. For i = 1, …, k_j, accept y_{ji} = z_{ji}, or reject z_{ji} and let y_{ji} = x_{j0} if i = 1 or y_{ji} = y_{j,i−1:ℓ_{j,i−1}} if i ≥ 2, by a rejection rule, and then generate a chain of ℓ_{ji} observations {y_{ji:1}, …, y_{ji:ℓ_{ji}}} using the Markov kernel Ψ_j with the initial value y_{ji}.
2d. Define a sample S_j = {x_{j1}, …, x_{jn_j}} for P_j as the collection of {y_{ji:1}, …, y_{ji:ℓ_{ji}}} over i = 1, …, k_j.

Table 1 illustrates the relationship between the different samples:

x_{j−1,1}          z_{j1}      y_{j1}      (y_{j1:1}, …, y_{j1:ℓ_{j1}})
···                ···         ···         ···
x_{j−1,i}          z_{ji}      y_{ji}      (y_{ji:1}, …, y_{ji:ℓ_{ji}})
···                ···         ···         ···
x_{j−1,n_{j−1}}    z_{jk_j}    y_{jk_j}    (y_{jk_j:1}, …, y_{jk_j:ℓ_{jk_j}})

The general algorithm involves two main operations, resampling (including acceptance-rejection) and Markov moving. As discussed in the Introduction, resampling is used to build an intermediate sample R_j = {y_{j1}, …, y_{jk_j}} from S_{j−1}, and Markov moving is then used to rejuvenate the intermediate sample R_j and obtain a final sample S_j.

It is helpful to see heuristically why resampling MCMC can be justified before delving into specific algorithms. The basic idea is an induction argument, which can be found in the formal proofs of asymptotic results on various specific algorithms, for example, Chopin (2004) on the resample-move algorithm, Kou, Zhou, and Wong (2006) on equi-energy sampling, and Atchadé (2010) on importance resampling and independence Metropolis resampling MCMC. First, P_1 is usually constructed such that S_1 can be directly obtained either by independent sampling or by MCMC with fast mixing. Suppose that S_{j−1} is a valid sample from P_{j−1} for a certain j ≥ 2. The resampling operation is designed, possibly with acceptance-rejection, such that the intermediate sample R_j becomes a valid sample from P_j. Then, by Markov moving that leaves P_j invariant, the final sample S_j remains valid for P_j.
Even given the sequence of distributions P_1, …, P_m and Markov kernels Ψ_1, …, Ψ_m, a number of factors remain to be specified in the general framework: how S_1 is generated in Step 1, how k_j and ℓ_{ji} are generated in Step 2a, what resampling scheme is used in Step 2b, and what rejection rule is used in Step 2c. Different choices of these factors lead to different classes of algorithms, including but not restricted to the various existing algorithms mentioned in the Introduction.
At one extreme, generating S_1 by MCMC, taking k_j = 1 and ℓ_{j1} = n_j, and always rejecting z_{j1} and letting y_{j1} = x_{j0} leads to standard MCMC algorithms for sampling from P_1, …, P_m separately. No resampling is done. At the other extreme, sequential importance resampling (Gordon, Salmond, and Smith 1993; Kong, Liu, and Wong 1994) corresponds to generating S_1 by independent sampling (made possible by the choice of P_1), taking k_j = n_j and ℓ_{ji} = 1, using importance resampling in Step 2b, and always accepting y_{ji} = z_{ji} but skipping Markov moving so that y_{ji:1} = z_{ji} in Step 2c. The resample-move algorithm (Gilks and Berzuini 2001) is obtained in the same way as sequential importance resampling, except that Markov moving is indeed performed in Step 2c. Therefore, the general framework of resampling MCMC accommodates both standard MCMC and SMC algorithms as special cases.

GENERALIZED RESAMPLE-MOVE
The foregoing discussion of the resample-move algorithm suggests an interesting extension in the general framework. Let ℓ_j be a divisor of n_j. Instead of drawing a sample of size n_j from S_{j−1} and performing one Markov move to obtain S_j, it is possible to draw a subsample of size k_j = n_j/ℓ_j from S_{j−1}, and then perform ℓ_j successive Markov moves from each draw and retain all ℓ_j observations in S_j. The resulting algorithm, called generalized resample-move, is as follows. Only the initial value x_{10} for P_1 is used; the other values x_{20}, …, x_{m0} are ignored.
2b. Draw k_j = n_j/ℓ_j observations R_j = {y_{j1}, …, y_{jk_j}} with replacement by importance resampling from the sample S_{j−1}.
2c. For i = 1, …, k_j, generate a chain of ℓ_j observations {y_{ji:1}, …, y_{ji:ℓ_j}} using the kernel Ψ_j with the initial value y_{ji}.
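For one stage (from P_{j−1} to P_j), Steps 2b and 2c above can be sketched as follows. This is a minimal illustration with hypothetical names; the Markov kernel is passed in as a function that performs one P_j-invariant transition:

```python
import numpy as np

def generalized_resample_move(S_prev, q_prev, q_cur, kernel, n_cur, ell, rng):
    """One stage of generalized resample-move (sketch): draw k = n_cur/ell
    starting points by importance resampling from S_prev, then run ell Markov
    moves from each starting point and keep every intermediate state."""
    assert n_cur % ell == 0  # ell must divide n_cur
    k = n_cur // ell
    w = np.array([q_cur(x) / q_prev(x) for x in S_prev])
    idx = rng.choice(len(S_prev), size=k, replace=True, p=w / w.sum())
    S = []
    for i in idx:
        y = S_prev[i]
        for _ in range(ell):
            y = kernel(y, rng)  # one transition leaving P_cur invariant
            S.append(y)
    return S
```

Setting ell = 1 recovers the resample-move algorithm, and ell = n_cur corresponds to a single resampled starting point followed by a full MCMC run.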
The performance of the algorithm depends on the choices of ℓ_j, the distances between distributions P_{j−1} and P_j, and mixing of Markov chains with kernels Ψ_j. To make clear the effects of these factors, we provide a general asymptotic result for two successive distributions P_{j−1} and P_j, represented below by P_1 and P_2. First, we give a definition formalizing the condition that a sample may consist of dependent observations, such as a Markov chain or a bootstrap sample or observations dependent in other complicated ways, but some types of laws of large numbers and central limit theorems remain valid for sample averages of functions of the observations.

Definition 1. Let S_1 = {x_1, …, x_{n_1}} be a sample from P_1. Say that S_1 is an asymptotically regular sample if, for any function ϕ(·), the sample average φ̄_1 = n_1^{−1} Σ_{i=1}^{n_1} ϕ(x_i) converges in probability to E_1(ϕ), where E_1(ϕ) is the expectation of ϕ(x) for x drawn from P_1, and n_1^{−1} v_1(ϕ) is the O(n_1^{−1}) term in the asymptotic variance of φ̄_1. If S_1 is an independent and identically distributed sample, then v_1(ϕ) = V_1(ϕ), the variance of ϕ(x) for x drawn from P_1. In general, it is expected that v_1(ϕ) ≥ V_1(ϕ).

Proposition 1. Let S_1 = {x_1, …, x_{n_1}} be an asymptotically regular sample from P_1, and R_2 = {y_1, …, y_k} be a sample of size k = n_2/ℓ for P_2 obtained by importance resampling from S_1. Let r(·) = p_2(·)/p_1(·), where p_1(·) and p_2(·) are the normalized densities for P_1 and P_2. For i = 1, …, k, let {y_{i:1}, …, y_{i:ℓ}} be a chain of observations obtained using the kernel Ψ_2 with the initial value y_i. Then for any function g(·), under suitable conditions, as n_1 → ∞ and n_2 → ∞ with either ℓ or k being bounded, the estimator ḡ_2 = n_2^{−1} Σ_{i=1}^{k} Σ_{j=1}^{ℓ} g(y_{i:j}) satisfies E(ḡ_2) = E_2(g) + remainder and

var(ḡ_2) = n_1^{−1} v_1[r{h_ℓ − E_2(g)}] + k^{−1} var_{y_0∼P_2}{ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j})} + remainder,   (1)

where E_2(g) is the expectation of g(x) for x drawn from P_2, {y_{0:1}, …, y_{0:ℓ}} is a chain of observations obtained using the kernel Ψ_2 with an initial value y_0, h_ℓ(·) = E{ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j}) | y_0 = ·}, and var_{y_0∼P_2}(·) denotes the variance with y_0 drawn from P_2.
The remainder terms in E(ḡ_2) and var(ḡ_2) are provided to accommodate both the case of bounded ℓ and k → ∞ and the case of ℓ → ∞ and k bounded, which are treated separately in the proof in Appendix I. The foregoing result does not provide the specific regularity conditions needed. But the variance formula (1) evidently agrees with the formal results in Gilks and Berzuini (2001) and Chopin (2004) in the special case of n_1 = n_2 = k and ℓ = 1. Our result can also be extended by induction to m (> 2) distributions as in Gilks and Berzuini (2001) and Chopin (2004). A rigorous treatment of these issues is beyond the scope of the current article.
There are clear, interesting interpretations of the leading terms in (1). The first term in (1) depends on three factors: the distance from P_1 to P_2, encoded by the density ratio r(·); how flat the conditional expectation h_ℓ(·) is about the constant E_2(g) due to Markov transitions; and the dependency of observations in S_1. In fact, the first two factors directly affect V_1[r{h_ℓ − E_2(g)}], the variance of r{h_ℓ − E_2(g)} under P_1, whereas the third factor affects how large v_1[r{h_ℓ − E_2(g)}] is relative to V_1[r{h_ℓ − E_2(g)}]. The second term in (1) reflects the variance of the sample average ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j}) from the Markov chain. In the following, we discuss how var(ḡ_2) is affected by the choice of ℓ, for fixed (P_1, S_1) and (P_2, Ψ_2).
For the special case of ℓ = 1, corresponding to the resample-move algorithm (Gilks and Berzuini 2001), the proof of Proposition 1 gives

var(ḡ_2) = n_1^{−1} v_1[r{h_1 − E_2(g)}] + n_2^{−1} V_2(g) + remainder,

where V_2(g) is the variance of g(x) for x drawn from P_2 and h_1(·) = E{g(y_{0:1}) | y_0 = ·}, the conditional expectation after one Markov transition. For ℓ = 0, that is, sequential importance resampling without Markov moving (Gordon, Salmond, and Smith 1993; Kong, Liu, and Wong 1994), Lemma 1 in Appendix I gives

var(ḡ_2) = n_1^{−1} v_1[r{g − E_2(g)}] + n_2^{−1} V_2(g) + remainder.   (2)

The two asymptotic variances differ only in their first terms. The term v_1[r{h_1 − E_2(g)}] tends to be smaller than v_1[r{g − E_2(g)}], because the conditional expectation h_1(·) is usually flatter than g(·) about E_2(g), as can be seen from the inequality V_2(h_1) ≤ V_2(g). If, hypothetically, the kernel Ψ_2 were perfect, such that y_{0:1} is generated directly from P_2 regardless of y_0, then h_1(·) = E_2(g) and v_1[r{h_1 − E_2(g)}] = 0. But when the kernel Ψ_2 leads to moderate to slow mixing, h_1(·) may not differ much from g(·), and v_1[r{h_1 − E_2(g)}] may not be much smaller than v_1[r{g − E_2(g)}].

For the generalized resample-move algorithm, there is a tradeoff in the effects of increasing ℓ on var(ḡ_2) for fixed n_2 = kℓ. On one hand, h_ℓ(·) is based on ℓ Markov transitions and hence can be much flatter than g(·) about E_2(g), so that the first term of var(ḡ_2) in (1) can be much smaller for a large ℓ than for ℓ = 1. On the other hand, the second term in (1) typically increases as ℓ increases, by the standard formula for the normalized variance,

ℓ var_{y_0∼P_2}{ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j})} = V_2(g) + 2 Σ_{j=1}^{ℓ−1} (1 − j/ℓ) cov_{y_0∼P_2}{g(y_{0:1}), g(y_{0:1+j})}.   (3)

In fact, for a larger ℓ, not only are there more autocovariance terms in (3), but the coefficient, 1 − j/ℓ, also increases for the autocovariance at each lag j. Therefore, as ℓ increases with n_2 = kℓ fixed, var(ḡ_2) depends on two competing factors: how much h_ℓ(·) is flattened and hence v_1[r{h_ℓ − E_2(g)}] is reduced by Markov transitions, and how much var_{y_0∼P_2}{ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j})} is inflated due to autocovariances.
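The growth of the normalized variance in (3) with ℓ can be evaluated numerically. The sketch below (hypothetical function name) takes the stationary variance V_2(g) and an autocovariance function as inputs; the geometrically decaying autocovariance used in the demo is an illustrative assumption corresponding to fast mixing, not a quantity computed in this article:

```python
def normalized_chain_variance(V, autocov, ell):
    """Normalized variance of an ell-step chain average, as in eq. (3):
    ell * var = V + 2 * sum_{j=1}^{ell-1} (1 - j/ell) * autocov(j)."""
    s = V
    for j in range(1, ell):
        s += 2.0 * (1.0 - j / ell) * autocov(j)
    return s

# Under geometric decay the sum approaches a finite limit (here 3.0),
# so increasing ell beyond a moderate value adds little.
fast = [normalized_chain_variance(1.0, lambda j: 0.5 ** j, ell)
        for ell in (1, 10, 100)]
```

For autocovariances that do not vanish (the locally fast mixing case discussed below), the same quantity keeps growing in ℓ instead of leveling off.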
At the two extremes, ℓ = 0 versus ℓ = n_2 (no versus full Markov moving), the comparison of asymptotic variances reduces essentially to the difference between the first term in (2) and the second term in (3) divided by ℓ = n_2.
Based on the foregoing discussion, we examine qualitatively how a good choice of ℓ depends on three factors: the distance from P_1 to P_2, mixing of the Markov chain for P_2, and dependency within the sample S_1 (which, by the sequential structure, depends on the distances between successive distributions prior to P_1 and mixing of the Markov chains for P_1 and earlier distributions). There are at least two distinct cases, depending on mixing of the Markov chains:
• If the Markov chain is fast mixing, or even locally so (to be explained below), for P_2, then a choice of moderate ℓ, say 10 or 100, might lead to satisfactory results. This is because, after a moderate number of Markov transitions, the conditional expectation h_ℓ(·) can already be reduced to a nearly flat (or piecewise flat) function, but the sum of autocovariances in (3) can keep increasing in ℓ. Choosing a larger ℓ would not be worthwhile.
• If the Markov chain is slow mixing for P_2, then a choice of large ℓ, say 1000 or more, might be warranted. This is because, even after a moderate number of Markov transitions, h_ℓ(·) can be only slightly flattened, and the accumulation of autocovariances in (3) can be far from its limit as ℓ → ∞. Choosing a larger ℓ would lead to a better performance.
These two cases are, in fact, observed respectively in the two numerical examples in Section 3. See Appendix IV for further discussion.
Several ideas in the first case above need to be explained. We say that a Markov chain with transition kernel Ψ locally geometrically converges to P (or, informally, is locally fast mixing) if the sample space can be partitioned into J regions B_1, …, B_J such that the total variation distance between Σ_{j=1}^{J} P(B_j) Ψ^n(x_j, ·) and P decreases at an exponential rate as n → ∞, uniformly over x_j ∈ B_j (j = 1, …, J), where Ψ^n(x, ·) is the n-step transition kernel. That is, if the chain is initialized, with stationary probability P(B_j), from any x_j ∈ B_j, then the chain geometrically converges to P. This concept reduces to the usual geometric convergence for J = 1, but can be much weaker for J ≥ 2. For example, a random-walk Metropolis chain with Gaussian proposals for sampling a well-separated Gaussian mixture is easily trapped in the region where the initial value is located, but it is locally fast mixing with the sample space partitioned according to the mixture components. Locally fast mixing Markov chains have unusual features. First, as mentioned above, h_ℓ(y_0) defined in Proposition 1 converges only to a piecewise constant function, because the limiting value depends on the region where y_0 is located. Second, the stationary autocorrelation, cor_{y_0∼P}(y_{0:1}, y_{0:j}), does not, in general, converge to 0 as j → ∞, even though the sample autocorrelation converges to 0 for a chain initialized with any fixed y_0. Then the normalized variance ℓ var_{y_0∼P_2}{ℓ^{−1} Σ_{j=1}^{ℓ} g(y_{0:j})} in (3) diverges as ℓ increases, in contrast with the usual case for geometrically convergent chains. We provide numerical illustration in Appendix IV, and leave further analysis to future work.
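The trapping behavior behind local fast mixing can be seen in a small simulation. The sketch below is an illustrative toy (hypothetical function name, not one of the examples in Section 3): a random-walk Metropolis chain with small Gaussian proposals, targeting an equal mixture of N(−10, 1) and N(10, 1) and started in the left component, converges quickly within that component but essentially never crosses to the other:

```python
import math
import random

def rw_metropolis_mixture(x0, steps, step_sd, rng):
    """Random-walk Metropolis chain targeting an equal mixture of
    N(-10, 1) and N(10, 1), with Gaussian proposals of sd step_sd."""
    def logq(x):
        # log of the unnormalized mixture density
        return math.log(math.exp(-0.5 * (x + 10.0) ** 2)
                        + math.exp(-0.5 * (x - 10.0) ** 2))
    x, chain = x0, []
    for _ in range(steps):
        z = x + rng.gauss(0.0, step_sd)
        # Metropolis acceptance probability min(1, q(z)/q(x))
        if rng.random() < math.exp(min(0.0, logq(z) - logq(x))):
            x = z
        chain.append(x)
    return chain

# Started at -10 with step sd 1, the chain stays in the left component.
chain = rw_metropolis_mixture(-10.0, 2000, 1.0, random.Random(0))
```

Within the left component the chain behaves like a geometrically convergent chain for N(−10, 1), which is exactly the partition structure in the definition of local geometric convergence.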
Finally, the generalized resample-move algorithm for two distributions (m = 2) is reminiscent of the strategy for posterior simulation in Bayesian statistics recommended by Gelman et al. (2003). For that strategy, P_2 is the posterior distribution and P_1 is a mode-based, possibly mixture, approximation to P_2. First, a sample S_1 is drawn from P_1. Then a few (say k) observations are obtained by importance resampling from S_1 and used as starting points to run parallel Markov chains, each of length ℓ, for P_2. Gelman et al. (2003) seemed to emphasize that importance resampling is useful to provide over-dispersed starting points for MCMC, and focused on the case where k is much smaller than ℓ. By comparison, we allow the case where ℓ is smaller than or comparable to k, and aim to exploit importance resampling as a technique to build an intermediate sample for P_2 given a good sample for P_1.

OTHER ALGORITHMS
The main feature of the generalized resample-move algorithm in the framework of resampling MCMC is that both k_j and ℓ_{ji} are prespecified constants. We show that choosing k_j and ℓ_{ji} randomly leads to other existing algorithms, including importance resampling MCMC (Atchadé 2009), equi-energy sampling (Kou, Zhou, and Wong 2006), and modified equi-energy sampling by Atchadé (2010). Moreover, we examine the relationship between these algorithms and parallel tempering (Geyer 1991), which, however, does not belong to resampling MCMC.
For each j ≥ 2, consider a Bernoulli process to choose k_j and ℓ_{ji}. First, generate a sequence of independent Bernoulli variables (ε_{j1}, …, ε_{jn_j}) with a fixed success probability 0 < α_j < 1. The total number of successes is K_j = Σ_{i=1}^{n_j} ε_{ji}. Then, k_j and ℓ_{ji} are determined as follows. The two cases, depending on ε_{j1}, are mainly created to accommodate certain details at the beginning of various existing algorithms, even though such details are asymptotically negligible.
• If ε_{j1} = 1, then let k_j = K_j and, for i ≥ 1, let ℓ_{ji} be the number of trials from the ith success until and excluding the (i + 1)th success, or until and including the n_jth trial. For example, n_j = 9, k_j = 3, and (ℓ_{j1}, ℓ_{j2}, ℓ_{j3}) = (3, 4, 2) for the Bernoulli sequence 100100010.
• If ε_{j1} = 0, then let k_j = 1 + K_j, let ℓ_{j1} be the number of trials before the first success, and, for i ≥ 2, let ℓ_{ji} be the number of trials from the (i − 1)th success until and excluding the ith success, or until and including the n_jth trial. For example, n_j = 9, k_j = 3, and (ℓ_{j1}, ℓ_{j2}, ℓ_{j3}) = (2, 4, 3) for the Bernoulli sequence 001000100.

Therefore, k_j is essentially a binomial variable with mean n_j α_j, and ℓ_{ji} is a geometric variable with mean α_j^{−1}, except for i = 1 or k_j. If α_j is specified as ℓ_j^{−1}, then the expected size of resampling and the expected number of Markov moves are the same as those in the generalized resample-move algorithm.
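The construction of k_j and the chain lengths ℓ_{ji} from a Bernoulli sequence can be sketched as follows (hypothetical function name); it reproduces the two worked examples above:

```python
def lengths_from_bernoulli(eps):
    """Given Bernoulli trials eps (a list of 0/1), return (k, lengths)
    following the two cases above: if eps[0] == 1, k equals the number of
    successes; otherwise k is one plus that, with an initial segment of
    trials before the first success."""
    n = len(eps)
    succ = [i for i, e in enumerate(eps) if e == 1]
    K = len(succ)
    if eps and eps[0] == 1:
        k = K
        lengths = []
    else:
        k = 1 + K
        # trials before the first success (all n trials if no success)
        lengths = [succ[0]] if K > 0 else [n]
    bounds = succ + [n]
    lengths += [bounds[i + 1] - bounds[i] for i in range(K)]
    return k, lengths
```

By construction the lengths always sum to n, matching the requirement that the chains collected in Step 2d form a sample of size n_j.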

Importance Resampling MCMC.
To obtain importance resampling MCMC (Atchadé 2009) in the framework of resampling MCMC, consider a Bernoulli process to choose k_j and ℓ_{ji} in Step 2a, importance resampling in Step 2b, and the following rejection rule in Step 2c:
• If ε_{j1} = 1, then accept y_{ji} = z_{ji} for all i = 1, …, k_j.
• If ε_{j1} = 0, then reject z_{j1} and let y_{j1} = x_{j0}, and accept y_{ji} = z_{ji} for i = 2, …, k_j.
All observations z_{ji} are accepted except z_{j1} obtained when ε_{j1} = 0 (not when ε_{j1} = 1). By concatenation of the individual chains (y_{ji:1}, …, y_{ji:ℓ_{ji}}), the resulting algorithm can be equivalently described as follows, such that observations are generated in a single chain for each distribution, as in MCMC.
Static importance resampling MCMC:
• For j = 2, …, m,
 - For i = 1, …, n_j,
  * Generate a Bernoulli variable ε_{ji} with success probability α_j. If ε_{ji} = 1, then draw ξ_{ji} by importance resampling from S_{j−1}; if ε_{ji} = 0, then let ξ_{ji} = x_{j,i−1} (or x_{j0} if i = 1).
  * Generate x_{ji} using the kernel Ψ_j with the initial value ξ_{ji}.
 - Define a sample for P_j as S_j = {x_{j1}, …, x_{jn_j}}.
It is interesting to see how the foregoing algorithm agrees with the general framework. For each success ε_{ji} = 1, a draw ξ_{ji} is obtained by importance resampling from S_{j−1}. Then, a chain of observations is generated using the kernel Ψ_j with the initial value ξ_{ji}, until the next success. The number of draws from S_{j−1} is the number of successes, and the length of the chain generated given a draw from S_{j−1} is an inter-success time, or one truncated at the end of the trials.
The original algorithm of importance resampling MCMC in Atchadé (2009) is the same as the foregoing algorithm except for an important difference: the reservoir for drawing ξ_{ji} by importance resampling is not fixed as the entire sample S_{j−1}, but restricted to the history of observations {x_{j−1,1}, …, x_{j−1,i}} up to time i. For this reason, the foregoing algorithm or Atchadé's (2009) algorithm is called, respectively, static or dynamic importance resampling MCMC. Assume that the sample size n_j is equal to n for each j = 1, …, m. By an exchange of the order of the iterations in j and i, dynamic importance resampling MCMC is equivalent to running m parallel chains as in the original description of Atchadé (2009).
Dynamic importance resampling MCMC (Atchadé 2009):
• For i = 1, …, n,
 - Generate x_{1i} using the kernel Ψ_1 with the initial value x_{1,i−1}.
 - For j = 2, …, m,
  * Generate a Bernoulli variable ε_{ji} with success probability α_j. If ε_{ji} = 1, then draw ξ_{ji} by importance resampling from {x_{j−1,1}, …, x_{j−1,i}}; if ε_{ji} = 0, then let ξ_{ji} = x_{j,i−1}.
  * Generate x_{ji} using the kernel Ψ_j with the initial value ξ_{ji}.
There is another, so far ignored, difference between dynamic importance resampling MCMC described above and the original algorithm in Atchadé (2009). For the latter algorithm, x_{ji} is typically taken directly as a weighted draw from the previous chain up to time i if ε_{ji} = 1, and generated using the kernel Ψ_j with the initial value x_{j,i−1} if ε_{ji} = 0, so that the total number of Markov moves is random with mean (1 − α_j)n for the jth chain. For either static or dynamic importance resampling MCMC presented here, the total number of Markov moves is fixed at n for each sample. This property makes it convenient to compare various algorithms with the same number of Markov moves used.
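The single-chain form of static importance resampling MCMC described above can be sketched for one stage as follows (a minimal sketch with hypothetical names, using a fixed restart probability alpha and exactly n Markov moves):

```python
import numpy as np

def static_ir_mcmc_stage(S_prev, q_prev, q_cur, kernel, n, alpha, x0, rng):
    """One stage of static importance resampling MCMC (sketch): at each time,
    with probability alpha restart from an importance-resampling draw out of
    the full previous-stage sample S_prev; then apply one Markov transition,
    so exactly n Markov moves are used in total."""
    w = np.array([q_cur(x) / q_prev(x) for x in S_prev])
    p = w / w.sum()
    y, S = x0, []
    for _ in range(n):
        if rng.random() < alpha:
            y = S_prev[rng.choice(len(S_prev), p=p)]
        y = kernel(y, rng)  # one transition leaving the current target invariant
        S.append(y)
    return S
```

The dynamic version would differ only in drawing from the growing history of the previous chain rather than from the fixed sample S_prev.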

Independence Metropolis Resampling MCMC and Equi-Energy Sampling.
Generalized resample-move and importance resampling MCMC share the feature that importance resampling is used in Step 2b and all selected observations z_{ji} are accepted in Step 2c, except for an artificial rejection at the beginning of the latter algorithm, in the framework of resampling MCMC. An alternative approach is independence Metropolis resampling, where simple random sampling from S_{j−1} is used and the selected observations z_{ji} are either rejected or accepted according to the rejection rule in independence Metropolis sampling (e.g., Liu 2001, sec. 5.4.2). The corresponding Steps 2b and 2c in Section 2.1 are as follows.
2b. Draw k_j observations {z_{j1}, …, z_{jk_j}} with replacement by simple random sampling from the sample S_{j−1}.
2c. For i = 1, …, k_j, accept y_{ji} = z_{ji} with probability min[1, q_j(z_{ji}) q_{j−1}(y′)/{q_{j−1}(z_{ji}) q_j(y′)}], where y′ = x_{j0} if i = 1 or y′ = y_{j,i−1:ℓ_{j,i−1}} if i ≥ 2, and, with the remaining probability, reject z_{ji} and let y_{ji} = y′; then generate a chain of ℓ_{ji} observations {y_{ji:1}, …, y_{ji:ℓ_{ji}}} using the kernel Ψ_j with the initial value y_{ji}.
It is important to notice that each acceptance or rejection of z_{ji} involves a reference observation, which is specifically set to y_{j,i−1:ℓ_{j,i−1}}, the last observation in the chain generated with the initial value y_{j,i−1}.
With a Bernoulli process to choose k_j and ℓ_{ji}, the resulting algorithm can be equivalently described in the form of running a single chain for each distribution, similarly to static importance resampling MCMC in Section 2.3.1.
Static independence Metropolis resampling MCMC:
• For j = 2, …, m,
 - For i = 1, …, n_j,
  * Generate a Bernoulli variable ε_{ji} with success probability α_j. If ε_{ji} = 1, then draw ζ_{ji} by simple random sampling from S_{j−1}, and let ξ_{ji} = ζ_{ji} with probability min[1, q_j(ζ_{ji}) q_{j−1}(x_{j,i−1})/{q_{j−1}(ζ_{ji}) q_j(x_{j,i−1})}] or ξ_{ji} = x_{j,i−1} with the remaining probability; if ε_{ji} = 0, then let ξ_{ji} = x_{j,i−1} (with x_{j0} used for i = 1).
  * Generate x_{ji} using the kernel Ψ_j with the initial value ξ_{ji}.
 - Define a sample for P_j as S_j = {x_{j1}, …, x_{jn_j}}.
Changing the reservoir from S_{j−1} to the history {x_{j−1,1}, …, x_{j−1,i}} up to time i for drawing ζ_{ji} in the foregoing algorithm leads to dynamic independence Metropolis resampling MCMC, which, similarly to dynamic importance resampling MCMC, can be equivalently described as running m parallel chains.
Dynamic independence Metropolis resampling MCMC or modified equi-energy sampling (Atchadé 2010):
• For i = 1, …, n,
 - Generate x_{1i} using the kernel Ψ_1 with the initial value x_{1,i−1}.
 - For j = 2, …, m, proceed as in static independence Metropolis resampling MCMC, but with ζ_{ji} drawn by simple random sampling from the history {x_{j−1,1}, …, x_{j−1,i}}.
This algorithm is identical to modified equi-energy sampling in Atchadé (2010), except for a technical difference (as at the end of Section 2.3.1). For the latter algorithm, x_{ji} is generated using the kernel Ψ_j with the initial value ξ_{ji} if ε_{ji} = 0, but taken directly as ξ_{ji} if ε_{ji} = 1, so that the total number of Markov moves is random with mean (1 − α_j)n for the jth chain. In contrast, the total number of Markov moves is fixed for each sample for static or dynamic independence Metropolis resampling MCMC presented here. Another related algorithm is sequentially interacting MCMC (Brockwell, Del Moral, and Doucet 2010), designed for sampling from distributions on nested spaces. When restricted to a common state space, the algorithm of Brockwell et al. corresponds to dynamic independence Metropolis resampling MCMC, with α_j = 1 (i.e., always drawing ζ_{ji}) and no Markov moving (hence x_{ji} = ξ_{ji}).
Equi-energy sampling (Kou, Zhou, and Wong 2006) is the same as dynamic independence Metropolis resampling MCMC described above, except for the construction of the reservoir for drawing ζ_{ji}. Consider the case where q_j(x) = exp{−max{u(x), U_j}/T_j}, where U_1 < U_2 < · · · < U_m are fixed energy levels and U_1 is below the minimum energy. Two versions of equi-energy sampling are shown as follows, with the dynamic version corresponding more closely to the original algorithm of Kou, Zhou, and Wong (2006) (see Remark 6 in Section 2.4).
• Static equi-energy sampling: Same as static independence Metropolis resampling MCMC, except that the resampling step is to draw ζ_{ji} by simple random sampling from D̃_{j−1,h}, the subset of S_{j−1} falling in the energy ring D_h = {x : U_h ≤ u(x) < U_{h+1}} (with U_{m+1} = ∞) that contains x_{j,i−1}.
• Dynamic equi-energy sampling (Kou, Zhou, and Wong 2006): Same as dynamic independence Metropolis resampling MCMC, except that the resampling step is to draw ζ_{ji} by simple random sampling from D̃_{j−1,i,h}, the subset of the history {x_{j−1,1}, …, x_{j−1,i}} falling in the energy ring containing x_{j,i−1}.

Therefore, ζ_{ji} is drawn from the higher-temperature ring D̃_{j−1,h} constructed from the entire sample S_{j−1}, or from D̃_{j−1,i,h} constructed from the history {x_{j−1,1}, …, x_{j−1,i}}, with the ring index h determined by x_{j,i−1}.
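The ring construction can be sketched as follows (hypothetical function names; the ring definition D_h = {x : U_h ≤ u(x) < U_{h+1}} follows the equi-energy construction, with the last ring unbounded above):

```python
def energy_ring_index(u_x, U):
    """Index h of the energy ring containing a state with energy u_x,
    given cutoffs U[0] < U[1] < ... (U[0] below the minimum energy;
    the last ring extends to infinity)."""
    h = 0
    while h + 1 < len(U) and u_x >= U[h + 1]:
        h += 1
    return h

def build_rings(sample, u, U):
    """Partition a sample into energy rings D_h = {x : U[h] <= u(x) < U[h+1]}."""
    rings = [[] for _ in U]
    for x in sample:
        rings[energy_ring_index(u(x), U)].append(x)
    return rings
```

In the resampling step, ζ would then be drawn uniformly from the ring whose index matches the current state's energy, which keeps proposals at a comparable energy level and raises the acceptance rate.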

Parallel Tempering.
Parallel tempering (Geyer 1991) involves running m parallel chains, similarly to the dynamic versions of resampling MCMC. However, there are fundamental differences. For resampling MCMC, an observation from a higher-temperature chain can be fed into a lower-temperature chain by resampling and acceptance-rejection, but there is no mechanism for an observation from a lower-temperature chain to be taken into a higher-temperature chain. For parallel tempering, an observation from a higher-temperature chain and another from a lower-temperature chain can literally be exchanged under some criterion. An advantage of parallel tempering is that the m chains jointly constitute a Markov chain in the product sample space. None of the resampling MCMC algorithms discussed have this property.
To connect resampling MCMC and parallel tempering (rather than to introduce a competitive algorithm), consider the following procedure, obtained from dynamic independence Metropolis resampling MCMC by changing the reservoir for drawing ζ_{ji} to the singleton {x_{j−1,i−1}}.

Independence Metropolis tempering:
• For i = 1, …, n,
 - Generate x_{1i} using the kernel Ψ_1 with the initial value x_{1,i−1}.
 - For j = 2, …, m,
  * Generate a Bernoulli variable ε_{ji} with success probability α_j. If ε_{ji} = 1, then let ζ_{ji} = x_{j−1,i−1}, and let ξ_{ji} = ζ_{ji} with probability

min[1, q_j(ζ_{ji}) q_{j−1}(x_{j,i−1})/{q_{j−1}(ζ_{ji}) q_j(x_{j,i−1})}]   (5)

or ξ_{ji} = x_{j,i−1} with the remaining probability; if ε_{ji} = 0, then let ξ_{ji} = x_{j,i−1}.
  * Generate x_{ji} using the kernel Ψ_j with the initial value ξ_{ji}.
The foregoing algorithm appears similar to parallel tempering described below, with the same formula (5) for the acceptance probability. However, in the case of acceptance, x_{j−1,i−1} is substituted for, not exchanged with, x_{j,i−1}. For the modification from dynamic independence Metropolis resampling MCMC, it is also possible to take ζ_{ji} = x_{j−1,i}, which would be as good as taking ζ_{ji} = x_{j−1,i−1}, but with x_{j−1,i} in place of x_{j−1,i−1} in the acceptance probability (5).

Parallel tempering (Geyer 1991):
• For i = 1, …, n,
 - For j = 2, 3, …, m,
  * Generate a Bernoulli variable ε_{ji} with success probability α_j. If ε_{ji} = 1, then exchange x_{j−1,i−1} and x_{j,i−1} with probability (5) and, with the remaining probability, take no action. If ε_{ji} = 0, then take no action.
 - For j = 1, …, m, generate x_{ji} using the kernel Ψ_j with the initial value x_{j,i−1} (after any exchanges).
Our implementation of parallel tempering technically differs from those in the literature (e.g., Geyer 1991; Liu 2001, sec. 10.4). At each time i, a proposal of exchange is made with some probability between every two adjacent chains, so there can be exchanges between multiple pairs of adjacent chains, in the order of decreasing temperatures. Moreover, at each time i, there is always one Markov move for each chain. The total number of Markov moves is thus fixed at n for each sample, similarly to the resampling MCMC algorithms discussed earlier.
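One sweep of this implementation can be sketched as follows (a minimal sketch with hypothetical names; the exchange probability uses the standard ratio form, which we assume matches formula (5)):

```python
import numpy as np

def pt_sweep(x, q_list, kernels, alpha, rng):
    """One sweep of parallel tempering as implemented here: exchange proposals
    between every pair of adjacent chains, then one Markov move per chain.
    q_list[j] is the unnormalized density targeted by chain j."""
    m = len(x)
    x = list(x)
    for j in range(1, m):  # adjacent pairs, traversed in index order
        if rng.random() < alpha:
            # standard exchange probability (assumed form of formula (5))
            r = (q_list[j](x[j - 1]) * q_list[j - 1](x[j])) / \
                (q_list[j](x[j]) * q_list[j - 1](x[j - 1]))
            if rng.random() < min(1.0, r):
                x[j - 1], x[j] = x[j], x[j - 1]
    # exactly one Markov move per chain per sweep
    return [kernels[j](x[j], rng) for j in range(m)]
```

Over n sweeps this uses exactly n Markov moves per chain, which makes the comparisons with the resampling MCMC algorithms move-for-move fair.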

COMPARISONS
Given the multitude of algorithms discussed, it is important to understand how their performances compare with each other. A complete theoretical study of this problem is beyond the scope of this article. In Section 3, we evaluate various algorithms empirically in three simulation studies. In this section, we provide some basic analysis of the relationships between these algorithms.
Remark 1. Among the resampling MCMC algorithms discussed in Section 2.3, static importance resampling MCMC is the most closely related to generalized resample-move. The two algorithms differ only in their choices of k_j and ℓ_{ji}. For each j, generalized resample-move involves drawing a sample of fixed size k_j from S_{j−1} with importance weights and, given the ith draw, generating a chain of fixed length ℓ_{ji} = ℓ_j. By comparison, static importance resampling MCMC involves drawing a sample of random size k_j from S_{j−1} with importance weights and, given the ith draw, generating a chain of random length ℓ_{ji}, where k_j and {ℓ_{ji} : i = 1, …, k_j} are determined from a Bernoulli process. The Bernoulli process is independent of other parts of the algorithm, and hence seems only to introduce additional randomness (or noise) into the algorithm. By this relationship, we expect that static importance resampling MCMC might not be as efficient as generalized resample-move.
Remark 2. Static independence Metropolis resampling MCMC appears to differ from static importance resampling MCMC only in its resampling scheme (including the acceptance-rejection step), as the two names suggest. However, a careful examination of the details reveals subtler differences between the two algorithms than the names alone convey.
Formally, a single iteration of independence Metropolis resampling is as follows. Let $S_1$ be a sample for $P_1$ and $y_{i-1}$ be the previous (or initial) observation for $P_2$. Then, draw one observation $z_i$ from $S_1$ with uniform probabilities, and accept $y_i = z_i$ with probability $\min[1, q_2(z_i)q_1(y_{i-1})/\{q_1(z_i)q_2(y_{i-1})\}]$ and, with the remaining probability, reject $z_i$ and let $y_i = y_{i-1}$. A seemingly small but important issue is how the previous observation $y_{i-1}$ is specified. For static independence Metropolis resampling MCMC in Section 2.3.2, the previous observation used in the acceptance or rejection of a draw $z_{ji}$ from $S_{j-1}$ is set to $y_{j,i-1:\ell_{j,i-1}}$, the last observation generated by Markov moving from $y_{j,i-1}$, instead of $y_{j,i-1}$ itself. (The index $j$ can be taken as 2 and then removed to understand the notation.) In fact, changing the previous observation $y_{j,i-1:\ell_{j,i-1}}$ to $y_{j,i-1}$ leads to an entirely different algorithm, which we call lazy independence Metropolis resampling MCMC.
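A single iteration as defined above can be sketched as follows; the function and argument names are ours, and log densities are used for numerical stability.

```python
import math, random

def im_resample_step(S1, y_prev, log_q1, log_q2, rng=random.Random(2)):
    """One independence Metropolis resampling step, as described in the text.

    S1:             reservoir of draws for P_1
    y_prev:         previous observation for P_2
    log_q1, log_q2: unnormalized log-densities of P_1 and P_2
    """
    z = rng.choice(S1)                        # uniform draw from the reservoir
    log_ratio = (log_q2(z) + log_q1(y_prev)) - (log_q1(z) + log_q2(y_prev))
    if math.log(rng.random()) < min(0.0, log_ratio):
        return z                              # accept: y_i = z_i
    return y_prev                             # reject: y_i = y_{i-1}
```

In the static algorithm of Section 2.3.2, `y_prev` would be the last observation produced by Markov moving, not the previous accepted draw itself.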
Lazy independence Metropolis resampling MCMC can be organized such that the sample $\{y_{j1}, \ldots, y_{jk_j}\}$ is completely obtained from $S_{j-1}$ with the initial value $x_{j0}$ by independence Metropolis resampling and then, for $i = 1, \ldots, k_j$, a chain of $\ell_{ji}$ observations is generated from $y_{ji}$ by Markov moving. Therefore, this algorithm is the same as static importance resampling MCMC, except that, in the latter algorithm, $\{y_{j1}, \ldots, y_{jk_j}\}$ is drawn from $S_{j-1}$ by importance resampling without using the initial value $x_{j0}$. However, by this relationship, we expect that lazy independence Metropolis resampling MCMC might not be as efficient as importance resampling MCMC for the following reason. It is known that importance sampling is more efficient than independence Metropolis sampling (Tan 2006). Suppose that $\{x_1, \ldots, x_n\}$ is an independent and identically distributed sample from $P_1$ and $\{y_1, \ldots, y_n\}$ is a sequence of observations obtained for $P_2$ by independence Metropolis sampling. Under mild regularity conditions, Tan (2006) showed that for the expectation of a function $\phi(x)$ under $P_2$, the ratio estimator $\sum_{i=1}^n \phi(x_i)w(x_i)/\sum_{i=1}^n w(x_i)$ has no greater asymptotic variance than $n^{-1}\sum_{i=1}^n \phi(y_i)$, where $w(x) = q_2(x)/q_1(x)$. Resampling is more complicated than sampling, especially when resampling is done from a dependent sample. Nevertheless, the relative efficiency of importance resampling over independence Metropolis resampling is likely to carry over to the comparison between the corresponding resampling algorithms.
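The Tan (2006) comparison can be checked numerically in a toy setting. The setup below, with a $N(0, 2^2)$ proposal for $P_1$, a $N(0, 1)$ target $P_2$, and $\phi(x) = x^2$, is our own illustration rather than an example from the article; it contrasts the importance-weighted ratio estimator with the independence Metropolis average driven by the same proposal draws.

```python
import math, random

def compare_is_vs_im(n=2000, reps=200, seed=3):
    """Monte Carlo check of the Tan (2006) ordering: the importance-weighted
    ratio estimator should have no greater (asymptotic) variance than the
    independence Metropolis average. Toy setup: P1 = N(0, 4), P2 = N(0, 1)."""
    rng = random.Random(seed)
    # log{q2(x)/q1(x)} up to a constant: -x^2/2 + x^2/8
    log_w = lambda x: -0.5 * x * x + 0.5 * (x / 2.0) ** 2
    mse_is = mse_im = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
        ws = [math.exp(log_w(x)) for x in xs]
        est_is = sum(x * x * w for x, w in zip(xs, ws)) / sum(ws)
        # independence Metropolis chain using the same proposal draws
        y, s = xs[0], 0.0
        for x in xs:
            if math.log(rng.random()) < min(0.0, log_w(x) - log_w(y)):
                y = x
            s += y * y
        est_im = s / n
        mse_is += (est_is - 1.0) ** 2 / reps  # true E(x^2) = 1 under P2
        mse_im += (est_im - 1.0) ** 2 / reps
    return mse_is, mse_im
```

In repeated runs of this sketch, the ratio estimator's MSE comes out smaller, in line with the asymptotic ordering cited above.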
For static independence Metropolis resampling MCMC, $y_{ji}$ is determined by acceptance or rejection, depending on the last observation $y_{j,i-1:\ell_{j,i-1}}$ generated by Markov moving from $y_{j,i-1}$. Therefore, independence Metropolis resampling and Markov moving are interwoven in an interactive manner. This structure sets static independence Metropolis resampling MCMC apart from both lazy independence Metropolis and importance resampling MCMC. The foregoing argument based on Tan (2006), comparing the latter two algorithms, is no longer applicable to the comparison between static independence Metropolis and importance resampling MCMC.
Remark 3. The interweaving structure can give independence Metropolis resampling MCMC an advantage over importance resampling MCMC, and even over generalized resample-move with the choice of $\alpha_j$ or $\ell_j$ matched, when both the distance from $P_{j-1}$ to $P_j$ is substantial and the Markov kernel $\Psi_j$ leads to slow mixing.
We provide some heuristic reasons for this claim. For independence Metropolis resampling MCMC, whenever a draw $z_{ji}$ from $S_{j-1}$ is rejected, $y_{ji}$ is set to $y_{j,i-1:\ell_{j,i-1}}$. Then, the concatenation of $\{y_{j,i-1:1}, \ldots, y_{j,i-1:\ell_{j,i-1}}\}$ and $\{y_{j,i:1}, \ldots, y_{j,i:\ell_{j,i}}\}$ forms a longer chain of successive observations under the kernel $\Psi_j$. If two or more consecutive draws from $S_{j-1}$ are rejected, then the corresponding chains can be concatenated into an even longer chain. But as a consequence of a substantial distance from $P_{j-1}$ to $P_j$, a large proportion of the draws $\{z_{ji}: i = 1, \ldots, k_j\}$ are likely to be rejected. Then, the individual chains $\{(y_{j,i:1}, \ldots, y_{j,i:\ell_{j,i}}): i = 1, \ldots, k_j\}$ in $S_j$ can be grouped into a smaller number of longer chains, each generated with the initial value a draw $z_{ji}$ from $S_{j-1}$ that is accepted as $y_{ji}$. Such longer chains tend to represent the distribution $P_j$ better than a larger number of shorter chains, because $S_{j-1}$, a sample for $P_{j-1}$, might not sufficiently cover $P_j$ and the Markov chain mixes slowly under $\Psi_j$. This is similar to a related situation for generalized resample-move in Section 2.2: if the Markov chain mixes slowly, reducing the size $k_j$ of resampling from $S_{j-1}$ while increasing the number $\ell_j$ of Markov moves might improve the performance. The analogy is not exact: the number of accepted draws and the lengths of Markov moves are dynamically determined for independence Metropolis resampling MCMC. Nevertheless, the concatenation of shorter chains into longer ones seems to be advantageous in the case of a large distance from $P_{j-1}$ to $P_j$ and slow mixing under $\Psi_j$.
Remark 4. Equi-energy sampling differs from independence Metropolis resampling MCMC only in the use of energy rings to construct the resampling reservoir. Atchadé (2010) studied the dynamic version of the latter algorithm as modified equi-energy sampling, and claimed that using energy rings "does not add any significant feature to the algorithm from the theoretical standpoint." However, we argue that, by the use of energy rings, equi-energy sampling is more similar to importance resampling MCMC than to independence Metropolis resampling MCMC.
For static equi-energy sampling, recall from Section 2.3.2 that $\zeta_{ji}$ is drawn from the energy ring of $x_{j,i-1}$, estimated based on $S_{j-1}$. Then, the energy $u(\zeta_{ji})$ is similar to $u(x_{j,i-1})$, so that the probability (4) of accepting $\zeta_{ji}$ is large or close to 1, depending on how finely the sample space is partitioned into energy rings. Furthermore, $\zeta_{ji}$ given $x_{j,i-1}$ is uniformly distributed over the estimated energy ring of $x_{j,i-1}$. But what is approximately the distribution of $\zeta_{ji}$ integrated over $x_{j,i-1}$? If, hypothetically, $x_{j,i-1}$ were considered a simple random draw from $S_j$ and if $S_{j-1}$ and $S_j$ were fixed, then, by direct calculation, $\zeta_{ji}$ would be a random draw from $S_{j-1}$ with probabilities proportional to $\hat q_j(x_{j-1,\nu})/\hat q_{j-1}(x_{j-1,\nu})$ for $x_{j-1,\nu} \in S_{j-1}$, where $\hat q_{j-1}(x_{j-1,\nu})$ or $\hat q_j(x_{j-1,\nu})$ is the relative frequency of the energy ring of $x_{j-1,\nu}$ based on $S_{j-1}$ or $S_j$, respectively. [In fact, the probability of $x_{j,i-1}$ falling into the energy ring of $x_{j-1,\nu}$ is proportional to $\hat q_j(x_{j-1,\nu})$, and the probability of selecting $x_{j-1,\nu}$ within its energy ring based on $S_{j-1}$ is proportional to $\hat q_{j-1}^{-1}(x_{j-1,\nu})$.] In other words, $\zeta_{ji}$ would be drawn from $S_{j-1}$ by importance resampling, except that the ratio of unnormalized densities $q_j(x_{j-1,\nu})/q_{j-1}(x_{j-1,\nu})$ is estimated by the ratio of relative frequencies $\hat q_j(x_{j-1,\nu})/\hat q_{j-1}(x_{j-1,\nu})$. Although not rigorous, the foregoing analysis strongly suggests that equi-energy sampling is similar to importance resampling MCMC in Section 2.3.1, where $\zeta_{ji}$ is drawn from $S_{j-1}$ by importance resampling and then always accepted as $\xi_{ji}$.
Remark 5. A dynamic resampling MCMC algorithm is less efficient than its static counterpart, because the resampling reservoir is only the history of observations in the previous chain. On the other hand, a dynamic algorithm can be implemented by running $m$ parallel chains. Such an implementation saves time on parallel computing systems, although the total computational load is unchanged. As a compromise, a semi-dynamic scheme can be used as in Kou, Zhou, and Wong (2006). For $j \geq 2$, the chain for $P_j$ is not started until the chain for $P_{j-1}$ has been run for a certain number of iterations (which we call the waiting period), in addition to any burn-in period. An important issue is then how to choose the length of the waiting period. Owing to space limitations, we do not investigate the semi-dynamic option further.
Remark 6. Parallel tempering differs fundamentally from resampling MCMC algorithms, as discussed in Section 2.3.3. Nevertheless, there is a mechanism similar to that in independence Metropolis resampling MCMC, by which shorter sequences of Markov moves are concatenated into longer ones. First, the $j$th chain $S_j$ can be segmented into subchains at times $i$ where an exchange is proposed between the $j$th chain and an adjacent chain. The number of subchains and their lengths are controlled by the Bernoulli processes $\{\epsilon_{j-1,i}\}$ and $\{\epsilon_{ji}\}$, independently of other parts of the algorithm. But whenever a proposed exchange is rejected, the initial value for the subsequent subchain is exactly the last observation in the previous subchain, so that the two subchains can be smoothly connected. Then, the subchains in $S_j$ can be grouped into a smaller number of longer chains. By this structure, parallel tempering might perform better than importance resampling MCMC and generalized resample-move in the case of a substantial distance from $P_{j-1}$ to $P_j$ and slow mixing under the kernel $\Psi_j$, similarly as discussed in Remark 3.

Remark 7. A number of algorithms discussed in the previous sections are also studied, and were sometimes developed independently, in statistical physics. Parallel tempering is known as replica exchange MCMC (Hukushima and Nemoto 1996). Independence Metropolis tempering and static independence Metropolis resampling MCMC are essentially equivalent to two implementations of the J-walking method (Frantz, Freeman, and Doll 1990): the former involves running two walkers (or chains) in tandem at two temperatures, and the latter involves generating and storing a chain at a higher temperature beforehand. The resample-move algorithm appears to have been introduced under the name of population annealing in statistical physics (Machta 2010).
For all the algorithms except parallel tempering, a sample for $P_1$ is obtained by directly running a Markov chain using the kernel $\Psi_1$ with an initial value $x_{10}$. The other initial values $x_{j0}$ are set to $x_{10}$ for $j \geq 2$. The sample sizes are set to facilitate a fair comparison between algorithms with different burn-in schedules. For our examples, $m = 5$ is used. Let $n$ and $b$ be roughly the sample and burn-in sizes. For the dynamic algorithms labeled 8-11, a sample for $P_j$ is obtained with the first $b$ iterations of the chain discarded as burn-in. A chain for $P_{j+1}$ is also started after the first $b$ iterations are done in the chain for $P_j$. For $(P_1, P_2, \ldots, P_5)$, the numbers of iterations are set respectively to $(n+5b, n+4b, n+3b, n+2b, n+b)$, so that the sample sizes are $(n+4b, n+3b, n+2b, n+b, n)$. For the static algorithms labeled 1-6, a sample for $P_1$ is still obtained with the first $b$ iterations discarded, but the remaining samples are defined without burn-in. For the same numbers of iterations as above, the sample sizes are then $(n+4b, n+4b, n+3b, n+2b, n+b)$ for $(P_1, P_2, \ldots, P_5)$. Finally, for parallel tempering, the number of parallel iterations is set to $n+3b$, and the sample size is $n+2b$ with the first $b$ iterations discarded as burn-in for all the distributions. Therefore, the combined number of iterations over the 5 distributions is $5n+15b$, the same for all the 11 algorithms.
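The iteration accounting above can be verified directly; the function below is a simple bookkeeping sketch of the stated schedules (the name is ours).

```python
def iteration_budget(n, b, m=5):
    """Total iterations for the chain-based schedules vs parallel tempering."""
    # per-chain iteration counts (n+5b, n+4b, ..., n+b) for (P_1, ..., P_m)
    per_chain = [n + (m + 1 - j) * b for j in range(1, m + 1)]
    total_chains = sum(per_chain)             # = 5n + 15b when m = 5
    total_pt = m * (n + 3 * b)                # parallel tempering: m * (n + 3b)
    return total_chains, total_pt
```

With $n = 40{,}000$ and $b = 4000$ as in Section 3.2, both totals equal $5n + 15b = 260{,}000$.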
We present three simulation studies, each with interesting features that make the sampling problem difficult. The first two studies involve sampling from Gaussian mixtures with well-separated modes, using a sequence of tempering distributions. Of particular interest is that the difficulty of such problems can increase substantially when the Gaussian components have different variances and weights. In fact, the sequence of tempering distributions in the second study exhibits a phenomenon somewhat similar to phase transition in statistical physics. The third study considers the Potts model near phase transition, where the energy histogram is bimodal. Such a phase transition is known to be a major obstacle to Monte Carlo sampling.
The main difficulty is that the weight of a larger-variance component is exponentially amplified, relative to a smaller-variance component, in the geometric sequence of tempering distributions for a Gaussian mixture. For the bimodal mixture $p(x)$, we show in Appendix I that for $T \leq 10$, the tempering distribution with the unnormalized density $\{p(x)\}^{1/T}$ is approximately a Gaussian mixture with two components, with means $\mu_1$ and $\mu_2$, variances $\sigma_1^2 T I_d$ and $\sigma_2^2 T I_d$, and weights proportional to $\lambda_1^{1/T}\sigma_1^{(1-1/T)d}$ and $\lambda_2^{1/T}\sigma_2^{(1-1/T)d}$, respectively, where $I_d$ is the $d \times d$ identity matrix. (For $T$ large enough, the tempering distribution is essentially unimodal but not Gaussian.) The relative weight of the first, larger-variance component to the second component is $(\lambda_1/\lambda_2)^{1/T}(\sigma_1/\sigma_2)^{(1-1/T)d} = 2^{d-(d+1)/T}$, which is exponential in $d$ for fixed $T$ and increasing in $T$ for fixed $d$. Even for moderate $d = 6$, the first component quickly becomes dominant, in both variance and weight, over the second component as $T$ increases, in contrast with the fact that, at temperature $T = 1$, the first component has a weight of 1/3, only half the weight of the second component. This phenomenon in the geometric tempering of a Gaussian mixture appears not to be sufficiently recognized in the literature, although it is related to torpid mixing of parallel and simulated tempering in Woodard, Schmidler, and Huber (2009).
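The relative-weight formula can be evaluated numerically; the helper below is our own sketch, with defaults matching the text's example ($\lambda_1/\lambda_2 = 1/2$, $\sigma_1/\sigma_2 = 2$, giving $2^{d-(d+1)/T}$).

```python
def tempered_relative_weight(d, T, lam_ratio=0.5, sigma_ratio=2.0):
    """Relative weight of the larger-variance component under tempering
    with {p(x)}^{1/T}: (lam1/lam2)^{1/T} * (sigma1/sigma2)^{(1-1/T)d}."""
    return lam_ratio ** (1.0 / T) * sigma_ratio ** ((1.0 - 1.0 / T) * d)
```

At $T = 1$ with $d = 6$ this gives 1/2, matching the target mixture; as $T$ grows it climbs toward $2^d = 64$, illustrating how quickly the larger-variance component dominates.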
For illustration, we take a sequence of $m = 5$ tempering distributions, $q_j(x) = \{p(x)\}^{1/T_j}$, where $(T_1, T_2, T_3, T_4, T_5) = (50.0, 18.80, 7.07, 2.66, 1.0)$ are evenly spaced in the logarithm between 50 and 1. Figure 1 shows the smoothed histograms of the $d$th component of $x$ at the five temperatures, based on single runs of the generalized resample-move algorithm using the random-walk Metropolis kernel $\Psi_j$ with proposal standard deviation $0.1\sqrt{T_j}$ or $0.2\sqrt{T_j}$ at temperature $T_j$. See Figures A8-A9 for a clearer view of the same histograms in separate plots, instead of a single plot. Consider the histograms under the former choice of $\Psi_j$. The two histograms for $P_1$ and $P_2$ are essentially unimodal at $\mu_1$; those for $P_3$ and $P_4$ are bimodal, with a major mode at $\mu_1$ and a minor one at $\mu_2$; and that for the target $P_5$ is bimodal, although the weights of the two modes differ somewhat from the truth. These results, in agreement with the foregoing analysis of tempering, make clear how challenging it is to sample from such tempering distributions. As the temperature goes down to 1, the mode at $\mu_2$ emerges from being a negligible one, to a minor one, and eventually to a major one, all in the tail area of the other mode at $\mu_1$. The challenge is borne out in the results under the second choice of the kernel $\Psi_j$. The bulk of $P_3$ or $P_4$ appears to be adequately sampled, but the minor mode at $\mu_2$ is not well sampled in the tail of $P_3$ or $P_4$. These small discrepancies are then transferred into substantial ones for the sample of $P_5$, where the relative weights of the two modes are reversed.
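A ladder of temperatures evenly spaced in the logarithm can be generated as follows (a sketch; the function name is ours).

```python
def geometric_ladder(t_hi=50.0, t_lo=1.0, m=5):
    """Temperatures evenly spaced in the logarithm between t_hi and t_lo."""
    return [t_hi * (t_lo / t_hi) ** (j / (m - 1)) for j in range(m)]
```

With the defaults, this reproduces the sequence $(50.0, 18.80, 7.07, 2.66, 1.0)$ used in the text, up to rounding.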
In the following subsections and Appendix III, we present three simulation experiments to compare various algorithms, depending on different choices of Markov kernels, subsampling strategies, and tempering distributions.

Experiment on Markov Kernels.
We compare the seven algorithms labeled 1-7, excluding dynamic algorithms, under two choices of the random-walk Metropolis kernel $\Psi_j$ with proposal standard deviation $\tau_j = 0.1\sqrt{T_j}$ or $0.2\sqrt{T_j}$. The parameter $\ell_j$ is set to 10 and $\alpha_j$ to 10%. The initial value $x_{10} = \cdots = x_{50}$ is generated as a Gaussian vector with mean 0 and variance $I_6$. The sample sizes are set with $n = 40{,}000$ and $b = 4000$. For equi-energy sampling, $u(x)$ is defined as $-\log\{(2\pi)^{d/2}p(x)\}$ and $(U_1, U_2, \ldots, U_5)$ are set to $(-9.25, -8.12, -5.12, 2.88, 24.12)$ such that $(U_{h+1} - U_h)/T_{m+1-h} \approx 1.13$ for $h = 1, \ldots, 5$ with $U_6 = 80.62$, in accord with the suggestion of Kou, Zhou, and Wong (2006). Figure 2 shows the squared biases and MSEs for the estimates of $E(x^{(d)})$ and $\mathrm{var}(x^{(d)})$ under $P_5$ based on 200 repeated simulations, where $x^{(d)}$ is the $d$th component of $x$. [Figure 2 caption: squared biases and MSEs for $E(x^{(d)})$ in the first and third plots and $\mathrm{var}(x^{(d)})$ in the second and fourth plots under the target $P_5$, depending on the proposal standard deviation $\tau_j$; the dotted lines are placed at 1, 2, 3, and 4 times the MSE based on generalized resample-move.] The performances of the algorithms depend on whether the random-walk Metropolis kernel $\Psi_j$ is chosen with proposal standard deviation $0.1\sqrt{T_j}$ or $0.2\sqrt{T_j}$. For all six algorithms using Markov moving, the results under the first choice of $\Psi_j$, with proposal standard deviation $0.1\sqrt{T_j}$, are substantially more accurate than under the second choice. A possible explanation is that using the proposal standard deviation $0.1\sqrt{T_j}$ leads to better mixing near the mode at $\mu_2$, by the result that the scale of an optimal jumping rule is $2.4/\sqrt d = 0.98$ times the scale of the target distribution (Gelman, Roberts, and Gilks 1996). The accuracy of sampling near this mode is crucial to obtaining a good sample for the target $P_5$, as discussed earlier.
For the first choice of $\Psi_j$, generalized resample-move and independence Metropolis resampling MCMC have the smallest MSEs. For the second choice of $\Psi_j$, independence Metropolis resampling MCMC and parallel tempering have the smallest MSEs. By Remark 3 in Section 2.4, there is, on average, a longer stretch of successive Markov moves starting from each draw accepted by resampling or swapping in independence Metropolis resampling MCMC or parallel tempering than in generalized resample-move. This difference may contribute to the better performances of the former two algorithms under the second choice of the kernel $\Psi_j$. As discussed above, Markov moving does not mix efficiently near the mode at $\mu_2$ for this choice of $\Psi_j$, but the accuracy of sampling near $\mu_2$ is crucial.
The MSEs of importance resampling MCMC and equi-energy sampling are similar to each other, consistent with Remark 4 in Section 2.4. The performances of these two algorithms are comparable to or worse than that of generalized resample-move under both choices of the kernel $\Psi_j$.

Experiment on Subsampling Strategies.
For the proposal standard deviation $0.2\sqrt{T_j}$, we study the following ways of improving accuracy by increasing the number of iterations and then performing subsampling:
• Make no change except redefining the kernel $\Psi_j$ as the transition from an initial value to the 10th iteration by random-walk Metropolis sampling.
• Increase the number of overall iterations in $i$ by 10 times, with the same $\Psi_j$ and $\ell_j = 10$ or $\alpha_j = 10\%$ as before, and then perform subsampling of the draws at the rate of 1 in 10, successively, for $(P_1, \ldots, P_m)$.
• Increase the number of overall iterations in $i$ by 10 times, with the same $\Psi_j$ as before but $\ell_j = 100$ or $\alpha_j = 1\%$, and then perform subsampling of the draws at the rate of 1 in 10, successively, for $(P_1, \ldots, P_m)$.
For convenience, the three strategies are referred to as ITER, SUB1, and SUB2. For generalized resample-move, the strategy ITER is equivalent to SUB2, both corresponding to increasing $\ell_j$ by 10 times to 100 with the same $k_j$ and then subsampling, whereas SUB1 involves increasing $k_j$ by 10 times with the same $\ell_j = 10$ and then subsampling. For the algorithms labeled 4-7, the strategy ITER differs in a subtle manner from SUB2, although the expected numbers of operations of resampling or swapping are the same for both strategies. In Appendix III, Figure A3 shows the squared biases and MSEs based on 200 repeated simulations. A summary of the results is as follows. Among the three strategies, SUB1 is the most effective for parallel tempering, whereas SUB2 is the most effective for resampling MCMC algorithms (including ITER for generalized resample-move). Moreover, if the seven algorithms are compared with their best strategies of subsampling, independence Metropolis resampling MCMC and parallel tempering remain the most competitive, as in the simple case before the increase of iterations.

Setup and Difficulty.
The Potts model is important in statistical physics and has been used in various applications such as image analysis and spatial statistics. Consider a 10-state Potts model on a $20 \times 20$ lattice with periodic boundary conditions in the absence of a magnetic field. Each observation $x$ corresponds to a collection of $K = 20^2$ spins $(s_1, \ldots, s_K)$ on the lattice, where $s_i$ takes $q = 10$ possible values. At a temperature $T$, the density function of the Potts distribution is $p(x) = Z^{-1}\exp\{-u(x)/T\}$, where $u(x) = -\sum_{i \sim j} 1\{s_i = s_j\}$, with $i \sim j$ indicating that sites $i$ and $j$ are nearest neighbors, and $Z = \sum_x \exp\{-u(x)/T\}$ is the normalizing constant. Statistically, the Potts distribution belongs to an exponential family, with canonical statistic $-u(x)$ and natural parameter $\beta = T^{-1}$. Let $U = E\{u(x)\}$ and $C = \mathrm{var}\{u(x)\}$ under the Potts distribution. For notational simplicity, the dependency of $Z$, $U$, and $C$ on $T$ or $\beta$ is suppressed. Then $U = -(d/d\beta)\log Z$ and $C = (d^2/d\beta^2)\log Z$ by the theory of exponential families. In statistical physics, $Z$ is called the partition function, $U$ the internal energy, and $C/T^2$ the specific heat (Newman and Barkema 1999).
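The energy function $u(x)$ on the periodic lattice can be computed directly; the sketch below represents a configuration as a nested list of spin values and counts each nearest-neighbor bond once.

```python
def potts_energy(spins):
    """u(x) = -(number of nearest-neighbor pairs with equal spins)
    on an L x L lattice with periodic boundary conditions."""
    L = len(spins)
    u = 0
    for i in range(L):
        for j in range(L):
            u -= (spins[i][j] == spins[(i + 1) % L][j])   # vertical bond
            u -= (spins[i][j] == spins[i][(j + 1) % L])   # horizontal bond
    return u
```

For the $20 \times 20$ lattice with all spins equal, this gives $u = -2K = -800$, the ground-state energy.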
A special case of the Potts model with two states ($q = 2$) is equivalent to the Ising model, where $u(x) = -\sum_{i \sim j} s_i s_j$ and each $s_i$ is either $-1$ or 1. Like the Ising model, the Potts model on an infinite lattice exhibits a phase transition at the inverse temperature $\beta_c = T_c^{-1} = \log(1 + \sqrt q)$, about 1.426 for $q = 10$. But the critical behavior is richer and more general than that of the Ising model (Wu 1982). As shown later by Figure 3, the histograms of $u(x)$, known as the energy histograms, are bimodal near the critical temperature $T_c$. In contrast, the energy histograms are unimodal, centered at different locations for different temperatures, under the Ising model (e.g., Newman and Barkema 1999, Figure 8). See the appendix figures for a clearer view of the same histograms in separate plots, instead of a single plot. The Markov kernel $\Psi_j$ is defined as a random sweep using the single-spin-flip Metropolis algorithm (Newman and Barkema 1999, sec. 4.5.1). Each sweep consists of $K$ iterations, 1 per spin, where each iteration involves randomly picking a spin $s_i$, choosing a new value from the $q - 1$ remaining values, and then accepting or rejecting the new value by the Metropolis rule. The two sets of energy histograms obtained by generalized resample-move and parallel tempering appear similar to each other, but a careful examination shows that there are differences in the weights of the two modes. At the higher temperature $T_1$ or $T_2$, the energy histogram per spin is essentially unimodal, centered about $-0.85$, corresponding to the disordered phase. At the lower temperature $T_3$, $T_4$, or $T_5$, the energy histogram per spin becomes bimodal, with one mode still at $-0.85$ and a new mode at $-1.7$, corresponding to the ordered phase. As the temperature decreases, the mode located at about $-1.7$ grows in its weight, from being a negligible one, to a minor one, and eventually to a major one, so that the spin system moves from the disordered phase to the ordered one.
These changes in the modes are somewhat similar to those in Figure 1 for the bimodal Gaussian mixture, and hence sampling from the Potts distributions faces difficulties similar to those discussed in Section 3.2.1. On the other hand, there is a subtle difference between Figures 1 and 3. The former shows the histograms of a component $x^{(d)}$ of $x$, whereas the latter shows the histograms of $u(x)/K$. For the Gaussian mixture in Section 3.2, the histograms of $u(x)$ or $-\log\{p(x)\}$ (not shown to save space) differ markedly from those of $x^{(d)}$: they are positively skewed and unimodal except at temperature $T_5 = 1$, where the energy histogram is bimodal.
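For reference, the random sweep of the single-spin-flip Metropolis algorithm described above can be sketched as follows; the in-place update and the local computation of the energy change are our own implementation choices.

```python
import math, random

def metropolis_sweep(spins, q, T, rng):
    """One random sweep of single-spin-flip Metropolis: K = L*L iterations,
    each picking a random spin, proposing one of the q-1 other values, and
    accepting by the Metropolis rule at temperature T."""
    L = len(spins)
    for _ in range(L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        old = spins[i][j]
        new = rng.randrange(q - 1)
        if new >= old:
            new += 1                           # one of the q-1 remaining values
        # energy change: only the 4 bonds touching site (i, j) matter
        nbrs = [spins[(i + 1) % L][j], spins[(i - 1) % L][j],
                spins[i][(j + 1) % L], spins[i][(j - 1) % L]]
        du = sum(s == old for s in nbrs) - sum(s == new for s in nbrs)
        if du <= 0 or rng.random() < math.exp(-du / T):
            spins[i][j] = new
    return spins
```

Here `du` is the change in $u(x) = -\sum_{i\sim j} 1\{s_i = s_j\}$ from flipping the chosen spin, so acceptance with probability $\min(1, e^{-\Delta u/T})$ targets the Potts density $\propto \exp\{-u(x)/T\}$.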

Simulation Details.
We compare the algorithms labeled 3-7 in Section 3.1, excluding sequential importance resampling, resample-move, and dynamic algorithms. Without sufficient Markov moving, sequential importance resampling and resample-move perform poorly relative to the other static algorithms. Instead, we include in the study two standard MCMC algorithms, applied separately to the Potts distributions at the selected temperatures. One is the single-spin-flip Metropolis algorithm mentioned above and the other is the Swendsen and Wang (1987) cluster algorithm, which is known to work more efficiently near the phase transition than the single-spin-flip Metropolis algorithm. For the algorithms labeled 3-7, the Markov kernel $\Psi_j$ corresponds to a random sweep using the single-spin-flip Metropolis algorithm. The Swendsen-Wang algorithm is used mainly to provide benchmark results. It is possible to combine resampling MCMC or parallel tempering with a cluster algorithm, with the Markov kernel $\Psi_j$ defined as the cluster update (e.g., Bittner and Janke 2011).
The initial value $x_{10} = \cdots = x_{50}$ is generated by randomly setting each spin. For equi-energy sampling, the cutoff energy levels $(U_1, U_2, \ldots, U_5)$ are set to $(-800, -606, -459, -348, -264)$ such that $U_{h+1}/U_h \approx 0.76$ for $h = 1, \ldots, 5$ with $U_6 = -200$. Such choices seem appropriate given that $(T_1^{-1}, \ldots, T_m^{-1})$ are close to each other. The sample sizes are set as described in Section 3.1 with $n = 400{,}000$ and $b = 40{,}000$. But, to obtain reasonably accurate results, we increase the number of overall iterations by 10 times and then perform subsampling for all seven algorithms under study. Following the discussion at the end of Section 3.2.3, we employ different subsampling strategies for different algorithms as follows.
• For generalized resample-move labeled 3, redefine the kernel $\Psi_j$ as 10 random sweeps using the single-spin-flip Metropolis algorithm, without increasing the number of overall iterations in $i$. Take $\ell_j = 10$, 100, or 1000.
• For Metropolis sampling, the Swendsen-Wang algorithm, resampling MCMC labeled 4-6, and parallel tempering labeled 7, use the single-sweep kernel $\Psi_j$, increase the number of overall iterations in $i$ by 10 times, and then perform subsampling at the rate of 1 in 10. For the latter four algorithms, take $\alpha_j = 10\%$, 1%, or 0.1%, corresponding to $\ell_j = 10$, 100, or 1000.
For convenience, the strategies are referred to as ITER and SUB. The strategy ITER can also be seen as subsampling. For $\ell_j = 100$, ITER is equivalent to increasing $\ell_j$ by 10 times to 1000 with the same $k_j$ and then subsampling.

Results With Five Tempering Distributions.
Consider simulations from the Potts model at the five temperatures specified earlier. Different algorithms appear to achieve their own best performances using different choices of $\ell_j$ or $\alpha_j$. Figure 4 shows the Monte Carlo means and standard deviations for the estimates of $-U/K$, the internal energy per spin, and $C/K$, the specific heat per spin multiplied by $T^2$, for each algorithm with the two best-performing choices of $\ell_j$ or $\alpha_j$ under the strategy ITER or SUB. The results are based on 100 repeated simulations. For resampling MCMC including generalized resample-move labeled 3-6, the most accurate estimates are obtained using $\ell_j = 1000$ or $\alpha_j = 0.1\%$, whereas the estimates with the choice $\ell_j = 100$ or $\alpha_j = 1\%$ have both greater variances and greater absolute biases. For parallel tempering, the most accurate estimates are obtained for $\alpha_j = 10\%$, and the estimates are only slightly less accurate for $\alpha_j = 1\%$.
Among the algorithms labeled 3-7, parallel tempering appears to yield the most accurate estimates, especially at temperatures $T_1$ (not shown), $T_2$, and $T_3$. The four resampling MCMC algorithms are ranked in their performances roughly as independence Metropolis resampling MCMC, generalized resample-move, equi-energy sampling, and importance resampling MCMC. The single-spin-flip Metropolis algorithm performs reasonably well, even better than the four resampling MCMC algorithms with the choice $\ell_j = 100$ or $\alpha_j = 1\%$. But Metropolis sampling yields less accurate estimates, most noticeably at temperatures $T_4$ and $T_5$, than independence Metropolis resampling MCMC and generalized resample-move with $\ell_j = 1000$ or $\alpha_j = 0.1\%$. A partial explanation is that the mode located at about $-1.7$ in the energy histogram becomes nonnegligible at temperature $T_3$ or lower. Then the importance weights between the successive Potts distributions are sufficiently controlled, so that variance can be reduced by combining importance resampling with shorter-than-complete Markov moving. See the discussion on the choice of $\ell_j$ in Section 2.2.

Results With More Tempering Distributions.
Similarly to the Gaussian mixture example, we study the effect of using more finely spaced tempering distributions than the five specified ones, with a comparable total number of iterations (after the subsampling described in Section 3.3.2):
• Nine distributions at temperatures $(t_1, t_3, t_6, t_8, t_{10}, t_{12}, t_{14}, t_{17}, t_{19})$, with the numbers of iterations $(n + 9b, n + 8b, \ldots, n + b)$ for the static algorithms 3-6, where $n = 10b$ and $b = 20{,}000$ is the nearest integer that is divisible by 1000 and no smaller than $40{,}000 \times \sum_{j=1}^{5}(10 + j)/\sum_{j=1}^{9}(10 + j) \approx 19{,}260$.
The total number of iterations for parallel tempering is set to match that of the other algorithms. The temperatures $t_1, \ldots, t_{19}$ (ending in 1.423833 and 1.426) are specified manually, but mimic an adaptive approach for choosing distributions in SMC (e.g., Jasra et al. 2011). For example, the nine temperatures are chosen by examining the relative variances of importance weights (described below) from the repeated simulations with five temperatures and placing additional temperatures between those associated with large relative variances.
For the generalized resample-move algorithm, the distance between two successive distributions, $P_{j-1}$ and $P_j$, can be assessed by the relative variance of the unnormalized importance weights $w^{(j-1)}(x) = q_j(x)/q_{j-1}(x)$: $\mathrm{RV}_j = \{n_{j-1}^{-1}\sum_{\nu=1}^{n_{j-1}} w^{(j-1)2}(x_{j-1,\nu})\}/\{\bar w^{(j-1)}\}^2 - 1$, where $\bar w^{(j-1)} = n_{j-1}^{-1}\sum_{\nu=1}^{n_{j-1}} w^{(j-1)}(x_{j-1,\nu})$. The relative variance is directly related to the effective sample size commonly used in SMC (e.g., Del Moral, Doucet, and Jasra 2006), $\mathrm{ESS}_j = n_{j-1}/(1 + \mathrm{RV}_j)$. Moreover, there is an interesting interpretation for summing $\mathrm{RV}_j$ over $j$. In fact, $\bar w^{(j-1)}$ serves as an estimator of $Z_j/Z_{j-1}$, where $Z_{j-1}$ and $Z_j$ are the normalizing constants of $q_{j-1}(\cdot)$ and $q_j(\cdot)$, respectively. By the delta method, if $\{x_{j-1,1}, \ldots, x_{j-1,n_{j-1}}\}$ were independent from $P_{j-1}$, then $\mathrm{RV}_j/n_{j-1}$ would asymptotically be the variance of $\log \bar w^{(j-1)}$. Therefore, if the dependency were ignored in and between the samples, then $\mathrm{RV}_2/n_1 + \cdots + \mathrm{RV}_j/n_{j-1}$ would approximately be the variance of $\log \bar w^{(1)} + \cdots + \log \bar w^{(j-1)}$, an estimator of $\log(Z_j/Z_1)$. Figure 5 shows the cumulative relative variances $\mathrm{RV}_2 + \cdots + \mathrm{RV}_j$ and the effective sample fractions $\mathrm{ESS}_j/n_{j-1}$, with $\mathrm{RV}_j$ averaged over 100 repeated simulations. Compared with the case of five temperatures, using 9, 12, or 18 temperatures significantly reduces the relative variances and leads to effective sample fractions of 80% or higher. However, the part of Monte Carlo variability from importance resampling, represented by $v_1[r\{h - E_2(g)\}]$ in (1), can still be large. See Appendix IV for discussion. Figure 6 shows the Monte Carlo means and standard deviations for the estimates of $-U/K$ and $C/K$ at temperature $T = 1.413^{-1}$, starting from which the mode corresponding to the ordered phase becomes nonnegligible for the spin system (see Figure 3). The algorithms are used with their best strategies of subsampling studied in Section 3.3.3: ITER with $\ell_j = 1000$ for generalized resample-move, SUB with $\ell_j = 1000$ for resampling MCMC labeled 4-6, and SUB with $\ell_j = 10$ for parallel tempering. The results are based on 100 repeated simulations.
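The relative variance and effective sample fraction used above can be computed from log-weights as follows; this is a sketch with our own function names, using the standard SMC relation $\mathrm{ESS}_j/n_{j-1} = 1/(1 + \mathrm{RV}_j)$.

```python
import math

def relative_variance(log_w):
    """Sample relative variance of unnormalized importance weights,
    RV = (n^{-1} sum w_i^2) / (n^{-1} sum w_i)^2 - 1, from log-weights."""
    mx = max(log_w)
    w = [math.exp(v - mx) for v in log_w]     # rescaling cancels in the ratio
    n = len(w)
    mean = sum(w) / n
    mean_sq = sum(v * v for v in w) / n
    return mean_sq / mean ** 2 - 1.0

def effective_sample_fraction(log_w):
    """ESS/n = 1 / (1 + RV), the effective sample fraction."""
    return 1.0 / (1.0 + relative_variance(log_w))
```

Equal weights give RV = 0 and an effective sample fraction of 1; increasingly uneven weights drive the fraction toward 0, flagging successive distributions that are too far apart.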
Among the algorithms labeled 3-7, parallel tempering appears to yield the most accurate estimates, for $m = 5$ as seen in Section 3.3.3 and also for the larger values $m = 9, 12, 18$. In fact, parallel tempering performs almost the same, in terms of both Monte Carlo means and variances, for $m$ ranging from 5 to 18. The four resampling MCMC algorithms, including generalized resample-move, show no significant improvement, or even deteriorate with greater variances, as $m$ increases from 5 to 18. Therefore, it seems not worthwhile to use more tempering distributions, at the cost of a smaller sample size per distribution, in this example.
In Appendix IV, we provide additional discussion on the comparison of algorithms between the Potts model and the Gaussian mixture example.

SUMMARY
There are three main factors for both the implementation and performance of resampling MCMC and parallel tempering: tempering distributions, Markov kernels, and resampling rates or swapping probabilities. Our basic analysis and simulation studies suggest a number of interesting findings in the difficult situation where new modes emerge in the tails of previous tempering distributions.
If the Markov kernels lead to fast mixing or even locally so toward restricted distributions, then generalized resample-move tends to perform the best among all the algorithms studied. If the Markov kernels lead to slow mixing, without even converging fast to restricted distributions, then parallel tempering tends to perform the best. The performance of independence Metropolis resampling MCMC is often a close second best. Importance resampling MCMC and equi-energy sampling perform similarly to each other, often worse than independence Metropolis resampling MCMC.
To improve accuracy in the case of slow-mixing Markov chains, different strategies of increasing iterations and subsampling seem to be effective for different algorithms. For resampling MCMC algorithms, it is effective to keep the size of resampling fixed, increase the lengths of Markov moving, and then perform subsampling. For parallel tempering, it is effective to keep the probability of swapping between adjacent chains fixed, say 10% in our examples, increase the overall number of iterations, and then perform subsampling.
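The subsampling step common to both strategies amounts to thinning a stored chain. The sketch below is our own minimal illustration of such thinning; the function name and the interval are assumptions, not details from the paper.

```python
def subsample(chain, interval):
    """Keep every `interval`-th draw of a stored chain.

    Thinning reduces storage and the serial correlation among the
    retained draws, at the cost of discarding the rest."""
    return chain[interval - 1 :: interval]

draws = list(range(1, 101))      # stand-in for 100 stored iterations
thinned = subsample(draws, 10)   # retain every 10th draw
print(len(thinned))
```

With the strategies above, the total number of iterations grows while the number of retained draws stays fixed, so the thinning interval grows proportionally with the run length.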
There are various open problems raised by our work. First, for generalized resample-move, further research is needed to develop adaptive choices of tempering distributions, Markov kernels, and resampling rates, using techniques from adaptive MCMC and SMC. In fact, parallel subchains are simulated for each tempering distribution, and this unique, rich structure remains to be fully exploited. Second, for resampling MCMC, it is interesting to study possible ways to exchange information between tempering distributions as in parallel tempering, instead of feeding information unidirectionally. Finally, different algorithms seem to have advantages in different settings, as seen in our simulation studies. It is important to further study this phenomenon and develop appropriate guidelines.

Computer codes: C codes to perform the simulations on the multivariate Gaussian mixture in Section 3.2 and the Potts model in Section 3.3. Documentation of the codes is also provided. (RMCMC-codes.tar.gz, GNU zipped tar file)