An Adaptive Exchange Algorithm for Sampling From Distributions With Intractable Normalizing Constants

Sampling from the posterior distribution for a model whose normalizing constant is intractable is a long-standing problem in statistical research. We propose a new algorithm, the adaptive auxiliary variable exchange algorithm, or, in short, the adaptive exchange (AEX) algorithm, to tackle this problem. The new algorithm can be viewed as an adaptive MCMC extension of the exchange algorithm, which generates auxiliary variables via an importance sampling procedure from a Markov chain running in parallel. The convergence of the algorithm is established under mild conditions. Compared to the exchange algorithm, the new algorithm removes the requirement that the auxiliary variables must be drawn using a perfect sampler, and thus can be applied to many models for which a perfect sampler is unavailable or very expensive. Compared to approximate exchange algorithms, such as the double Metropolis-Hastings sampler, the new algorithm overcomes their theoretical difficulty in convergence. The new algorithm is tested on spatial autologistic and autonormal models. The numerical results indicate that the new algorithm is particularly useful for problems in which the underlying system is strongly dependent. Supplementary materials for this article are available online.


INTRODUCTION
Sampling from the posterior distribution for a model whose normalizing constant is intractable is a long-standing problem in statistical research. Formally, this problem can be posed as follows: Suppose that we have a dataset y generated from a model with the likelihood function given by f(y|θ) = ϕ(y, θ)/κ(θ), where θ denotes the parameter vector of the model, ϕ(y, θ) denotes the unnormalized density, and κ(θ) = ∫ ϕ(y, θ) dy is the normalizing constant, which depends on θ and is unavailable in closed form. Examples of such models include the Ising and Potts models used in image analysis (Hurn, Husby, and Rue 2003), the autologistic and autonormal models used in spatial data analysis (Besag 1974), and exponential random graph models used in social network analysis (see, e.g., Snijders et al. 2006), among others. Let π(θ) denote the prior density of θ. The posterior distribution of θ is then given by π(θ|y) ∝ π(θ)ϕ(y, θ)/κ(θ). Because κ(θ) is intractable, sampling from π(θ|y) has posed a great challenge to existing statistical methods. It is known that the Metropolis-Hastings (MH) algorithm cannot be directly applied to sample from π(θ|y), as its acceptance probability would involve the unknown normalizing constant ratio κ(θ′)/κ(θ), where θ′ denotes the proposed value. To tackle this problem, various methods have been proposed in the literature. These methods can be classified into two categories according to the strategies employed by them: the approximation-based methods and the auxiliary variable-based methods.
The methods in the first category approximate the likelihood function, the normalizing constant κ(θ), or the normalizing constant ratio κ(θ′)/κ(θ) using various approaches. Besag (1974) proposed the so-called maximum pseudo-likelihood estimator (MPLE) method, in which the likelihood function is approximated by a product of a series of conditional likelihood functions. Since this approximation ignores certain dependence within the components of y, the performance of MPLE is often unsatisfactory, especially when the dependence between the components of y is strong. Geyer and Thompson (1992) proposed an importance sampling-based method to approximate κ(θ). Let θ^(0) be an initial estimate of θ and let x_1, ..., x_m denote a set of random samples drawn from f(x|θ^(0)), which can be obtained via Markov chain Monte Carlo (MCMC) simulations. The log-likelihood function can then be approximated, up to an additive constant, by log f_m(y|θ) = log ϕ(y, θ) − log[(1/m) Σ_{i=1}^m ϕ(x_i, θ)/ϕ(x_i, θ^(0))], which approaches log f(y|θ) (up to that constant) as m → ∞, and the resulting estimator θ̂ = arg max_θ log f_m(y|θ) is called the Monte Carlo maximum likelihood estimator (MCMLE). The performance of this method depends on the choice of θ^(0): If θ^(0) is near the true maximum likelihood estimator, it can produce a good estimate of θ; otherwise, it may converge to a suboptimal solution. Liang (2007) proposed an alternative Monte Carlo method to approximate κ(θ), where κ(θ) is viewed as the marginal density function of the unnormalized distribution ϕ(y, θ) and is estimated using an adaptive kernel smoothing method with Monte Carlo draws. A similar method for approximating κ(θ) can be found in Atchade, Lartillot, and Robert (2013). Toward sampling from the posterior π(θ|y), Liang and Jin (2013) proposed the so-called Monte Carlo MH (MCMH) algorithm, which approximates the normalizing constant ratio κ(θ′)/κ(θ) at each iteration using samples simulated from either f(x|θ) or f(x|θ′) through a finite run of a Markov chain.
Since the convergence of a finite run of a Markov chain cannot be guaranteed, the algorithm is only approximately correct, at least in theory. The Bayesian stochastic approximation Monte Carlo (SAMC) algorithm suggested in Jin and Liang (2014) and the marginal particle Markov chain Monte Carlo (PMCMC) algorithm suggested in Everitt (2012) suffer from a similar problem: a large number of samples needs to be simulated at each iteration to ensure their convergence.
The methods in the second category aim to have the normalizing constant ratio κ(θ′)/κ(θ) canceled in simulations by augmenting appropriate auxiliary variables to the target distribution and/or the proposal distribution. Along this direction, Møller et al. (2006) proposed an algorithm that augments both the target and proposal distributions, and Murray, Ghahramani, and MacKay (2006) proposed the so-called exchange algorithm, which augments only the proposal distribution. These two algorithms are usually termed auxiliary variable MCMC algorithms in the literature. Although the underlying idea is very attractive, these algorithms require the auxiliary variables to be drawn using a perfect sampler (Propp and Wilson 1996). Since perfect sampling can be very expensive or impossible for many models with intractable normalizing constants, the applications of these algorithms are severely hindered. To address this issue, Liang (2010) proposed the double Metropolis-Hastings (DMH) sampler, in which each auxiliary variable is drawn through a short run of the MH algorithm initialized with the observation y. As noted in Liang (2010), initializing the auxiliary MH chain with the observation y leads to improved convergence of the algorithm. Similar algorithms have been applied to social network analysis by Caimo and Friel (2011) and Everitt (2012). Since, in these algorithms, a finite MCMC run has to be used for generating auxiliary samples at each iteration, the resulting estimates are only approximately correct no matter how long these algorithms are run. A brief review of the exchange algorithm and the DMH algorithm is given in Section 2.
In this article, we propose a new algorithm, the adaptive auxiliary variable exchange algorithm, or, in short, the adaptive exchange (AEX) algorithm, for sampling from the posterior π(θ|y). AEX is an adaptive Monte Carlo version of the exchange algorithm, in which the auxiliary variables are generated via an importance sampling procedure from a Markov chain running in parallel. The convergence of the algorithm is established under mild conditions. Compared to the existing auxiliary variable MCMC algorithms, AEX removes the requirement of perfect sampling and thus can be applied to many models for which perfect sampling is unavailable or very expensive. Compared to the DMH sampler, AEX overcomes the theoretical difficulty caused by the nonconvergence of finite MCMC runs. The new algorithm is tested on spatial autologistic models and autonormal models. The numerical results indicate that the new algorithm is particularly useful for problems in which the underlying system is strongly dependent.
The remainder of this article is organized as follows. In Section 2, we describe the AEX algorithm. In Section 3, we present some theoretical results on the convergence of AEX. In Section 4, we test AEX on a spatial autologistic model along with extensive comparisons with the exchange algorithm. In Section 5, we conclude the article with a brief discussion.

THE ADAPTIVE EXCHANGE ALGORITHM
In this section, we first give a brief review of the exchange and approximate exchange algorithms, and then describe the AEX algorithm.

The Exchange and Approximate Exchange Algorithms
Let θ_t denote the draw of θ at iteration t. One iteration of the exchange algorithm consists of the following steps:

Exchange Algorithm
1. Propose a candidate point θ′ from a proposal distribution denoted by q(θ′|θ_t).
2. Generate an auxiliary variable x ∼ f(x|θ′) using a perfect sampler (Propp and Wilson 1996).
3. Accept θ′ with probability min{1, r(θ_t, x, θ′)}, where
r(θ_t, x, θ′) = [π(θ′)ϕ(y, θ′)ϕ(x, θ_t)q(θ_t|θ′)] / [π(θ_t)ϕ(y, θ_t)ϕ(x, θ′)q(θ′|θ_t)],   (4)
in which the normalizing constants κ(θ_t) and κ(θ′) cancel; set θ_{t+1} = θ′ if accepted, and set θ_{t+1} = θ_t with the remaining probability.
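For concreteness, one iteration of the algorithm can be sketched in code. The sketch below is purely illustrative and not the article's implementation: it uses a toy exponential model f(x|θ) = θ exp(−θx), chosen only because perfect sampling is trivial by inversion, and the names phi, perfect_sample, and exchange_step are our own. Only the unnormalized density ϕ(x, θ) = exp(−θx) appears in the acceptance ratio, so the normalizing constants cancel exactly as in (4).

```python
import math
import random

def phi(x, theta):
    # unnormalized density of a toy exponential model f(x|theta) = theta*exp(-theta*x);
    # we use only phi(x, theta) = exp(-theta*x), pretending kappa(theta) is intractable
    return math.exp(-theta * x)

def perfect_sample(theta, rng):
    # inversion sampling; stands in for the perfect sampler the algorithm assumes
    return -math.log(rng.random()) / theta

def exchange_step(theta_t, y, rng, step=0.5, lo=0.1, hi=10.0):
    # 1. propose theta' from a symmetric random-walk proposal
    theta_p = theta_t + rng.uniform(-step, step)
    if not (lo < theta_p < hi):          # flat prior on (lo, hi): reject outside
        return theta_t
    # 2. draw the auxiliary variable x ~ f(x|theta')
    x = perfect_sample(theta_p, rng)
    # 3. exchange acceptance ratio: the normalizing constants cancel
    r = (phi(y, theta_p) * phi(x, theta_t)) / (phi(y, theta_t) * phi(x, theta_p))
    return theta_p if rng.random() < min(1.0, r) else theta_t

rng = random.Random(1)
y = 0.5                                   # a single toy observation
chain = [1.0]
for _ in range(2000):
    chain.append(exchange_step(chain[-1], y, rng))
```

Because x is drawn exactly from f(x|θ′), the ratio ϕ(x, θ_t)/ϕ(x, θ′) serves as an unbiased one-sample importance estimate of κ(θ_t)/κ(θ′), which is the intuition behind the validity of the exchange move.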
This algorithm is called the exchange algorithm because of the similarity of (4) to the acceptance probability of the swapping operation in exchange Monte Carlo (Geyer 1991; Hukushima and Nemoto 1996). The exchange algorithm differs from the conventional MH algorithm in that the proposal q(θ′|θ_t)f(x|θ′) contains a randomization component, namely the random draw of x. To see why the exchange algorithm works, define s(θ, x, θ′) = α(θ, x, θ′)q(θ′|θ)f(x|θ′), where α(θ, x, θ′) denotes the acceptance probability in step 3; it is then easy to see that π(θ|y)s(θ, x, θ′) is symmetric about θ and θ′. This implies the detailed balance condition π(θ|y)P(θ, dθ′) = π(θ′|y)P(θ′, dθ), where P(θ, dθ′) denotes the Markov transition kernel of the exchange algorithm, obtained by integrating s(θ, x, θ′) over x and placing the remaining probability mass at θ. Therefore, the exchange algorithm defines a valid Markov chain for simulating from π(θ|y).
As aforementioned, to ease the sampling of auxiliary variables, the DMH sampler (Liang 2010) draws the auxiliary variable x through a finite run of the MH algorithm initialized at the observation y. Under the assumption of equilibrium, the acceptance probability for the candidate point θ′ reduces to (4). However, since equilibrium can only be reached approximately within a finite number of iterations, the DMH sampler is only approximately correct. It is worth mentioning that DMH generally performs much better than its competitor, MPLE, in parameter estimation.

The Adaptive Exchange Algorithm
To overcome the theoretical difficulty of DMH in convergence, we propose the AEX algorithm. The basic idea of AEX can be loosely described as follows: AEX consists of two chains running in parallel. The first chain is auxiliary; it is run in the data space X (y ∈ X) and aims to draw samples from a family of distributions f(x|θ^(1)), ..., f(x|θ^(m)) defined on a set of prespecified parameter values θ^(1), ..., θ^(m). The second chain is the target chain, which makes use of the auxiliary chain and aims to draw samples from the target posterior π(θ|y). The target chain is run in the parameter space Θ (θ ∈ Θ). For a candidate point θ′, an auxiliary variable x is resampled from the past samples of the auxiliary chain via an importance sampling procedure. Here we assume that the neighboring distributions f(x|θ^(i)) satisfy the following condition:
(A_0) The sample spaces of neighboring f(x|θ^(i))'s have a reasonable overlap, and the parameters {θ^(1), ..., θ^(m)} cover the major part of the support of π(θ|y), for example, ∫_{C_θ} π(θ|y)dθ > 0.9999, where C_θ denotes the convex hull formed by θ^(1), ..., θ^(m).
These assumptions ensure that the auxiliary chain can mix reasonably fast and thus that the target chain can converge to the right posterior distribution π(θ|y) as the number of iterations becomes large. This is, in fact, the key to the success of AEX. How to choose the auxiliary parameters {θ^(1), ..., θ^(m)} is described in Section 2.3. To draw samples from the family of distributions f(z|θ), θ ∈ {θ^(1), ..., θ^(m)}, we adopt the SAMC algorithm (Liang, Liu, and Carroll 2007). SAMC ensures that each of the distributions f(z|θ^(1)), ..., f(z|θ^(m)) is sampled with a prespecified frequency, while overcoming the local-trap problem, that is, the problem that the simulation can get trapped at a single distribution or a few distributions. We note that some other MCMC algorithms, such as the reversible jump MCMC algorithm (Green 1995) and evolutionary Monte Carlo (Liang and Wong 2001), could also be used here to draw samples from the family of distributions, but they would need to deal with the local-trap problem. When evolutionary Monte Carlo is used, the θ^(i)'s can be treated as different temperatures. To implement the SAMC algorithm, we define p = (p_1, ..., p_m) to be the desired sampling frequencies from the respective distributions f(z|θ^(1)), ..., f(z|θ^(m)), where 0 < p_i < 1 and Σ_{i=1}^m p_i = 1; and we specify a positive, nonincreasing sequence {a_t}, the so-called gain factor sequence, which satisfies the condition:
(A_1) Σ_{t=1}^∞ a_t = ∞ and Σ_{t=1}^∞ a_t^η < ∞ for some η ∈ (1, 2).
In this article, we set p_1 = ··· = p_m = 1/m and
a_t = t_0 / max(t_0, t),   (6)
for some known constant t_0 > 1. Let w_t^(i) denote an abundance factor attached to the distribution f(z|θ^(i)) at iteration t, let w_t = (w_t^(1), ..., w_t^(m)), and let W denote the sample space of w_t. Let {K_s, s ≥ 0} be a sequence of compact subsets of W such that
∪_{s≥0} K_s = W and K_s ⊂ int(K_{s+1}), s ≥ 0,   (7)
where int(A) denotes the interior of set A. Let X_0 be a subset of X, and let T: X × W → X_0 × K_0 be a truncation function that is measurable and maps a point in X × W to a random point in X_0 × K_0.
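As a minimal illustration of the gain factor sequence, assuming the common SAMC choice a_t = t_0/max(t_0, t) in the form (6) (the function name gain is ours, not the article's):

```python
def gain(t, t0=25000):
    # gain factor a_t = t0 / max(t0, t): stays at 1 for t <= t0, then decays
    # like t0/t, so the partial sums of a_t diverge while those of a_t^2 converge
    return t0 / max(t0, t)

# the sequence is positive and nonincreasing, as condition (A_1) requires
seq = [gain(t, t0=100) for t in range(1, 1001)]
```

The flat initial segment (t ≤ t_0) keeps updates large enough for the sampler to reach all m distributions early on, which is why t_0 is set to a large value for complex problems.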
Let σ t denote the number of truncations performed until iteration t. Let z t denote the samples generated by the auxiliary chain at iteration t, and let ϑ t denote the parameter value associated with z t . Let S t denote the set of auxiliary samples collected by iteration t. The AEX algorithm starts with a random point in X 0 × K 0 and then iterates in the following steps:
For this algorithm, we have a few remarks:
• Part I of the algorithm uses the SAMC algorithm to draw samples from the mixture distribution p(z) = Σ_{i=1}^m p_i f(z|θ^(i)), meanwhile providing estimates for the normalizing constants κ(θ^(1)), ..., κ(θ^(m)). As shown in (14), w_t will converge to (κ(θ^(1))/p_1, ..., κ(θ^(m))/p_m) (up to a multiplicative factor) almost surely as t → ∞. Part II is essentially to run the exchange algorithm for simulation of the posterior π(θ|y), where the auxiliary variable is drawn via a dynamic importance sampling procedure; the dynamic importance function used at step t is proportional to Σ_{i=1}^m w_t^(i) [ϕ(x, θ′)/ϕ(x, θ^(i))] I(ϑ = θ^(i)), where I(·) is the indicator function. The validity of this procedure is established in Lemma 3.1. Since the underlying true proposal distribution for generating auxiliary variables in part II changes from iteration to iteration, the new algorithm falls into the class of adaptive MCMC algorithms (for which the proposal distribution changes from iteration to iteration). For this reason, we call the new algorithm the AEX algorithm.
• To prepare a good collection of auxiliary samples, one may first run the auxiliary chain alone for a large number of iterations, and then run the auxiliary and target chains in parallel.
Certainly, this will improve the convergence of the target chain.
• The proposal q(θ′|θ_t) can depend on y; that is, it can be written in the form q(θ′|θ_t, y). For simplicity, we notationally suppress the dependence of the proposal on y.
• On the choice of {a_t} and the convergence of AEX: In this article, we set the gain factor in the form (6) with a free parameter t_0. As discussed by Liang, Liu, and Carroll (2007), a large value of t_0 forces the sampler to reach all distributions f(z|θ^(i)) quickly. Therefore, t_0 should be set to a large number for a complex problem. In this article, t_0 is set to 20,000 for the U.S. cancer mortality data example and 25,000 for all other examples. In general, the choice of t_0 should be associated with the choice of N, the total number of iterations of a run. The appropriateness of these choices can be diagnosed by checking the convergence of the auxiliary and target chains. The convergence of the auxiliary chain can be checked through an examination of the realized sampling frequencies (p̂_1, ..., p̂_m), where p̂_i denotes the realized sampling frequency from the distribution f(z|θ^(i)). If (p̂_1, ..., p̂_m) is not close to (p_1, ..., p_m), the auxiliary chain should be diagnosed as nonconvergent.
In this case, the algorithm should be rerun with a larger value of N or a larger value of t 0 or both. Note that for the convergence diagnosis of the auxiliary chain, multiple runs are not necessary under the scenario considered in the article, as it is known that each of the distributions f (z|θ (i) )'s is valid. However, to check the convergence of the target chain, multiple runs are still necessary.
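The frequency check just described can be sketched as follows; realized_frequencies and diagnose are our own illustrative names, and the 10% relative tolerance is an arbitrary choice for the sketch, not a recommendation from the article.

```python
from collections import Counter

def realized_frequencies(visits, m):
    # visits: the sequence of indices i recording which distribution
    # f(z | theta^(i)) the auxiliary chain sampled from at each iteration
    counts = Counter(visits)
    n = len(visits)
    return [counts.get(i, 0) / n for i in range(m)]

def diagnose(p_hat, p, tol=0.10):
    # declare the auxiliary chain converged only if every realized frequency
    # lies within a relative tolerance of its desired value p_i
    return all(abs(a - b) <= tol * b for a, b in zip(p_hat, p))
```

With the uniform choice p_1 = ... = p_m = 1/m used in the article, this amounts to checking that the chain has visited all m distributions roughly equally often.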

Choice of Auxiliary Parameters
To choose the auxiliary parameters {θ^(1), ..., θ^(m)}, we suggest a procedure based on a fractional DMH algorithm. Let θ_t denote the draw of θ at iteration t. One iteration of the fractional DMH algorithm can be described as follows:

Fractional DMH Algorithm
1. Propose a candidate point θ′ from a proposal distribution denoted by q(θ′|θ_t).
2. Generate an auxiliary variable x through a finite run of the MH algorithm that admits f(x|θ′) as the invariant distribution and is initialized at the observation y.
3. Accept θ′ with probability min{1, r^ζ(θ_t, x, θ′)} for some ζ ∈ (0, 1], where r(θ_t, x, θ′) denotes the acceptance ratio in (4); set θ_{t+1} = θ′ if accepted and θ_{t+1} = θ_t otherwise.
If ζ = 1, the algorithm reduces to the DMH algorithm. If 0 < ζ < 1, the algorithm can result in an expanded support of the true posterior: taking a fractional power of r(θ_t, x, θ′) reduces the rejection rate and thus allows some samples with low posterior density values to be accepted. Hence, when ζ is small, the samples {θ_t, t = 1, 2, ...} generated by this algorithm are expected to cover the major part of the support of the posterior π(θ|y). In this article, we set ζ = 0.5 for all examples.
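The effect of the fractional power is easy to see numerically. In the sketch below (the function name is ours), the DMH acceptance ratio r is raised to the power ζ before the accept/reject decision, so a proposal with small r is accepted far more often when ζ = 0.5 than when ζ = 1:

```python
def frac_accept_prob(r, zeta=0.5):
    # fractional DMH accepts theta' with probability min(1, r**zeta);
    # zeta = 1 recovers the ordinary DMH acceptance probability
    return min(1.0, r ** zeta)
```

For example, a proposal with r = 0.04 is accepted with probability 0.2 under ζ = 0.5 but only 0.04 under ζ = 1, which is how the support of the sampled parameters gets expanded beyond the high-density region of the posterior.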
The standardization in Step 1 ensures that each dimension of θ contributes equally to the distance function d_{st}, and thus SAMC can mix equally in all dimensions of θ. The max-min rule used in Step 4 ensures that the selected samples {θ^(1), ..., θ^(m)} and the full samples {θ_t, t = 1, ..., n} have about the same convex hull when m is reasonably large, and thus {θ^(1), ..., θ^(m)} are distributed over the support of {θ_1, ..., θ_n}; that is, condition (A_0) is satisfied. Note that under the max-min rule, θ^(2) is always the sample furthest from θ^(1) among {θ_1, ..., θ_n}.
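A minimal sketch of the max-min rule follows. The implementation and function name are ours, and the convention that θ^(1) is the first sample is an assumption made only for illustration:

```python
def max_min_select(samples, m, dist):
    # greedy max-min selection: theta^(1) is the first sample, and each
    # subsequent point maximizes its minimum distance to those already chosen,
    # so the selection spreads over the convex hull of the full sample
    selected = [samples[0]]
    while len(selected) < m:
        best = max((s for s in samples if s not in selected),
                   key=lambda s: min(dist(s, t) for t in selected))
        selected.append(best)
    return selected
```

For standardized one-dimensional draws [0, 1, 10, 5] with m = 3 and absolute distance, the rule picks 0, then 10 (the furthest point from θ^(1), as noted above), then 5.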
The number m can be determined by trial-and-error; that is, choosing m such that the auxiliary chain can converge reasonably fast, for example, within 10 3 m ∼ 10 4 m iterations. How to diagnose the convergence of the auxiliary chain has been discussed at the end of Section 2.2.
Besides the auxiliary parameters and the gain factor sequence, the convergence of the auxiliary chain can depend on the proposal distributions used in the simulations. In our simulations, we usually set T_2(·|·) to the Gibbs sampler for updating auxiliary variables, and set T_1(·|·) to the uniform distribution over a prespecified set of nearest neighbors of each θ^(i). In this article, we set m = 100 and the neighboring size m_0 = 10 in all simulations. Different settings have been tried, such as m = 50 and 200 and m_0 = 5; the results are similar. Since the auxiliary parameters are essentially generated from the same posterior distribution, the neighboring distributions f(y|θ^(i)) are always reasonably overlapped; note that they all share, at least approximately, the same sample, namely the observation y. This ensures a smooth transition of the auxiliary chain between different auxiliary parameters. More importantly, this property holds independently of the dimension of θ. Hence, AEX can work well for problems in which the dimension of θ is high. We note that when choosing the values of m and m_0, the multimodality of the posterior π(θ|y) needs to be checked: if multiple modes exist, special care needs to be taken in choosing m and m_0 such that the resulting auxiliary chain is irreducible.
Finally, we note that there can be many variations of the auxiliary parameter selection procedure proposed above. For example, the fractional DMH algorithm can be replaced by any other algorithm that can result in an expanded support of the target posterior π(θ|y), for example, the approximate Bayesian computation (ABC) algorithm (Beaumont, Zhang, and Balding 2002). For models in which the dimension of θ is low, say, the spatial autologistic model studied in Section 4, {θ^(1), ..., θ^(m)} can be set to some grid points around the MPLE of θ. In the max-min procedure, the standardization can be replaced by a regular normalization procedure, that is, setting θ̃_i = Σ_n^{−1/2}(θ_i − θ̄_n), where θ̄_n and Σ_n denote, respectively, the sample mean and covariance matrix of {θ_1, ..., θ_n}.
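The per-dimension standardization of Step 1 can be sketched as follows (the function name is ours); the normalization variant mentioned above would replace the coordinatewise scaling by multiplication with Σ_n^{−1/2}:

```python
import math

def standardize(samples):
    # scale each dimension of theta to mean 0 and standard deviation 1 so that
    # every dimension contributes equally to the distance in the max-min rule
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    sds = [math.sqrt(sum((s[j] - means[j]) ** 2 for s in samples) / n)
           for j in range(d)]
    return [[(s[j] - means[j]) / sds[j] for j in range(d)] for s in samples]

# toy two-dimensional draws with very different scales per dimension
std = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

After standardization, distances computed on std weight both dimensions equally, whereas raw Euclidean distances on the toy input would be dominated by the second coordinate.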

CONVERGENCE OF THE ADAPTIVE EXCHANGE ALGORITHM
As aforementioned, AEX falls into the class of adaptive MCMC algorithms. Since the transition kernel of the exchange step may admit different stationary distributions at different iterations, the convergence theory developed for conventional adaptive MCMC algorithms (see, e.g., Andrieu and Moulines 2006; Roberts and Rosenthal 2007) is not applicable here. The ergodicity theory developed by Fort, Moulines, and Priouret (2011) for adaptive MCMC allows the stationary distribution to change across iterations and is indeed applicable to AEX. However, the strong law of large numbers (SLLN) established therein requires some strong conditions that cannot be verified for AEX. For this reason, we develop some theory for adaptive MCMC, including ergodicity and a weak law of large numbers (WLLN), that is applicable to AEX.
The remainder of this section is organized as follows. In Section 3.1, we prove two theorems, ergodicity and WLLN, for general adaptive MCMC algorithms with changing stationary distributions. In Section 3.2, we consider the weak convergence of auxiliary variables drawn from the auxiliary chain. In Section 3.3, we establish the ergodicity and WLLN for AEX.

Convergence of Adaptive MCMC With Changing Stationary Distributions
To facilitate our study, we first define, following Roberts and Rosenthal (2007), some notation for adaptive Markov chains. Consider a state space (X, F), where F = B(X) denotes the Borel σ-field defined on X. Let X_t ∈ X denote the state of the Markov chain at iteration t, and let P_{γ_t} denote the transition kernel at iteration t, where γ_t is a realization of a Y-valued random variable Γ_t. In simulations, γ_t is updated according to specified rules. Let G_t = σ(X_0, ..., X_t, Γ_0, ..., Γ_t) be the filtration generated by the states and the adaption indices. Let P_γ^t(x, ·) denote the t-step transition probability for the Markov chain with the fixed transition kernel P_γ and the initial condition X_0 = x, and let A^t((x, γ), ·) denote the t-step transition probability for the adaptive Markov chain with the initial conditions X_0 = x and Γ_0 = γ. Let T(x, γ, t) = ||A^t((x, γ), ·) − π(·)|| denote the total variation distance between the distribution of the adaptive Markov chain at time t and the target distribution π(·). The adaptive Markov chain is said to be ergodic if lim_{t→∞} T(x, γ, t) = 0 for all x ∈ X and γ ∈ Y. Theorem 3.1 concerns the ergodicity of an adaptive chain with changing stationary distributions; its proof is given in the supplementary material of the article.
Theorem 3.1. (Ergodicity) Consider an adaptive Markov chain defined on the state space (X, F) with the adaption index Γ_t, and assume the following conditions hold:
(a) (Stationarity) There exists a stationary distribution π_{γ_t}(·) for each transition kernel P_{γ_t}, where γ_t denotes a realization of the random variable Γ_t.
(b) (Asymptotic Simultaneous Uniform Ergodicity) For any ε > 0, there exists a measurable set E_1(ε) in the probability space such that P(E_1(ε)) ≥ 1 − ε, and on this set E_1(ε) there exist constants K(ε) > 0 and N(ε) > 0 such that sup_{x∈X} ||P_{γ_t}^n(x, ·) − π(·)|| ≤ ε for all t > K(ε) and n > N(ε).
(c) (Diminishing Adaptation) D_t = sup_{x∈X} ||P_{Γ_{t+1}}(x, ·) − P_{Γ_t}(x, ·)|| → 0 in probability as t → ∞.
Then the adaptive Markov chain is ergodic.
Theorem 3.1 is a slight extension of theorem 1 of Roberts and Rosenthal (2007), where ergodicity is established for an adaptive Markov chain for which all transition kernels admit the same stationary distribution and are simultaneously uniformly ergodic. Note that conditions (a) and (b) of Theorem 3.1 imply that, on the set E_1(ε), ||π_t(·) − π(·)|| ≤ ε for t > K(ε) and n > N(ε); that is, as t → ∞, π_t(·) converges to π with probability greater than 1 − ε. Hence, in the limiting case, Theorem 3.1 reduces to the ergodicity theorem of Roberts and Rosenthal (2007). The proof of Theorem 3.1 is also based on coupling theory, as in Roberts and Rosenthal (2007). We note that under a slightly weaker version of condition (b), Fort, Moulines, and Priouret (2011) established the same ergodicity result as Theorem 3.1; in this sense, Theorem 3.1 is redundant, but since part of its proof will be used in the proof of the next theorem, it is presented in the supplementary material. Fort, Moulines, and Priouret (2011) also established an SLLN for an adaptive Markov chain with changing stationary distributions, but under rather strong conditions; for example, an explicit decreasing rate of D_t is required. Figuring out the decreasing rate of D_t is very difficult, if not impossible, for AEX. To address this issue, we develop the next theorem, in which D_t may converge to zero in probability at any rate. The proof of Theorem 3.2 can be found in the supplementary material of the article.
Theorem 3.2. (WLLN) Assume the conditions of Theorem 3.1 hold. Then, for any bounded measurable function g, (1/n) Σ_{t=1}^n g(X_t) → π(g), in probability, as n → ∞, where π(g) = ∫_X g(x)π(dx).

Weak Convergence of Auxiliary SAMC Samples
To study the weak convergence of the auxiliary samples collected via SAMC, we first describe a general SAMC algorithm.
Let π(x) ∝ ψ(x), x ∈ X̃, denote the distribution that we want to draw samples from, where X̃ denotes the sample space of π(x). Let E_1, ..., E_m denote m nonoverlapping subregions that form a partition of X̃, that is, X̃ = ∪_{i=1}^m E_i. Let w_t = (w_t^(1), ..., w_t^(m)) denote the vector of abundance factors attached to the m subregions at iteration t, and let W denote the sample space of w_t. Let {K_s, s ≥ 0} be a sequence of compact subsets of W as defined in (7). Let X̃_0 be a subset of X̃, and let T: X̃ × W → X̃_0 × K_0 be a truncation function that is measurable and maps a point in X̃ × W to a random point in X̃_0 × K_0. Let σ_t denote the number of truncations performed until iteration t. Let S denote the collection of the indices of the subregions from which a sample has been proposed; that is, S contains the indices of all subregions that are known to be nonempty. The algorithm starts with a random point drawn in X̃_0 × K_0 and then iterates in the following steps:

A General SAMC Sampler
(a) (Sampling) Simulate a sample x_{t+1} by a single MH update with the target distribution given by
π_{w_t}(x) ∝ Σ_{i=1}^m [ψ(x)/w_t^(i)] I(x ∈ E_i).   (13)
(a.1) Generate y according to a proposal distribution q(y|x_t). If J(y) ∉ S, then set S ← S + {J(y)}, where J(y) denotes the index of the subregion to which the sample y belongs.
(a.2) Calculate the ratio r = [ψ(y)/w_t^(J(y))] / [ψ(x_t)/w_t^(J(x_t))] × q(x_t|y)/q(y|x_t).
(a.3) Accept the proposal with probability min(1, r). If it is accepted, set x_{t+1} = y; otherwise, set x_{t+1} = x_t.
(b) (Abundance factor updating) For all i ∈ S, set log(w_{t+1/2}^(i)) = log(w_t^(i)) + a_{t+1}(e_{t+1,i} − p_i), where a_{t+1} is as defined in (A_1), and e_{t+1,i} = 1 if x_{t+1} ∈ E_i and 0 otherwise. If w_{t+1/2} ∈ K_{σ_t}, then set (w_{t+1}, x_{t+1}) = (w_{t+1/2}, x_{t+1}) and σ_{t+1} = σ_t; otherwise, set (w_{t+1}, x_{t+1}) = T(w_{t+1/2}, x_{t+1}) and σ_{t+1} = σ_t + 1.
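Step (b) above, the abundance factor update on the log scale, can be sketched as follows. The function name is ours, the truncation step is omitted for brevity, and the gain factor a_t = t_0/max(t_0, t) is the common SAMC choice used in this article:

```python
def samc_update(log_w, i_visited, t, p, t0=25000):
    # log w^(i) <- log w^(i) + a_{t+1} (e_{t+1,i} - p_i), where e_{t+1,i}
    # indicates whether the new sample fell in subregion E_i
    a = t0 / max(t0, t + 1)
    return [lw + a * ((1.0 if i == i_visited else 0.0) - p[i])
            for i, lw in enumerate(log_w)]
```

The visited subregion has its abundance factor raised and all others lowered; since the working density in (13) is ψ(x)/w^(i) on E_i, this penalizes revisiting and drives the realized sampling frequencies toward p.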
It is easy to see that the SAMC algorithm falls into the class of varying truncation stochastic approximation MCMC algorithms. The convergence of varying truncation stochastic approximation MCMC algorithms has been studied in Andrieu, Moulines, and Priouret (2005) and Andrieu and Moulines (2006), where it is assumed that the Markov transition kernel P_w, which is induced by (13) and depends on w, satisfies drift and minorization conditions such that the resulting Markov chain is V-uniformly ergodic; the function V: X̃ → [1, ∞) is called the drift function. For SAMC, since the function H(w, x) is bounded, and thus the mean field function and the observation noise are bounded, we can set V(x) ≡ 1. Therefore, the resulting Markov chain is uniformly ergodic. For this reason, we assume that the Markov transition kernel P_w satisfies the Doeblin condition:
(A_2) (Doeblin condition) For any given w ∈ W, the Markov transition kernel P_w is irreducible and aperiodic. In addition, for any compact subset K ⊂ W, there exist an integer l, a constant 0 < δ < 1, and a probability measure ν such that P_w^l(x, A) ≥ δν(A) for any x ∈ X̃, any A ∈ B_X̃, and any w ∈ K, where B_X̃ denotes the Borel σ-field of X̃; that is, the whole support X̃ is a small set for each kernel P_w, w ∈ K.
To verify (A_2), one may assume that X̃ is compact, ψ(x) is bounded away from 0 and ∞ on X̃, and the proposal distribution q(y|x) satisfies the following local positive condition:
(Q) There exist δ_q > 0 and ε_q > 0 such that, for every x ∈ X̃, |x − y| ≤ δ_q implies q(y|x) ≥ ε_q.
The condition (A_2) then holds by theorem 2.2 of Roberts and Tweedie (1996), where it is shown that if the target distribution is bounded away from 0 and ∞ on every compact set of its support X̃, then the MH chain with a proposal satisfying (Q) is irreducible and aperiodic, and every nonempty compact set is a small set. Proposals satisfying the local positive condition can be easily designed for both continuous and discrete systems. For continuous systems, q(y|x) can be set to a random walk Gaussian proposal, y ∼ N(x, σ²I_{d_x}), where σ² can be calibrated to achieve a desired acceptance rate, for example, 0.2 to 0.4. For discrete systems, q(y|x) can be set to a discrete distribution defined on a neighborhood of x. Besides the single-step MH move, the multiple-step MH move, the Gibbs sampler, and the Metropolis-within-Gibbs sampler can also be shown to satisfy (A_2) under appropriate conditions; see, for example, Rosenthal (1995) and Liang (2009) for proofs. Note that X̃ is not necessarily compact for (A_2) to be satisfied; Rosenthal (1995) gave an example for which the sample space is unbounded, yet the Markov chain is uniformly ergodic.
Under conditions (A_1) and (A_2), in a similar way to Liang, Liu, and Carroll (2007), we can verify all the conditions given in Andrieu, Moulines, and Priouret (2005) for the convergence of varying truncation stochastic approximation MCMC algorithms. Hence, we claim that for SAMC the number of truncations is almost surely finite, and for all i ∈ S,
w_t^(i) → C ξ_i/(p_i + ν), a.s., as t → ∞,   (14)
where C denotes a constant, ξ_i = ∫_{E_i} ψ(x)dx, ν = Σ_{j∈{i: ξ_i=0}} p_j/(m − m_0), and m_0 is the cardinality of the set {i: ξ_i = 0}. The existence of empty subregions, that is, subregions with ξ_i = 0, is due to an inappropriate partition of the sample space, but SAMC does allow for the existence of empty subregions. Further, we can show that an SLLN holds for SAMC: If ψ(x) is bounded away from 0 and ∞ on X̃ and X̃ is compact, then for any bounded measurable function G(x, w) that is Lipschitz continuous with respect to w, that is, |G(x, w_1) − G(x, w_2)| ≤ L||w_1 − w_2|| for some L > 0, we have
(1/n) Σ_{t=1}^n G(X_t, w_t) → π*(G), a.s., as n → ∞,   (15)
where π*(·) denotes the limit distribution of π_t(·), and π*(G) denotes the expectation of G(·) with respect to π*(·). A proof of (15) is given in the supplementary material (Lemma A.1). A similar result has been established in Atchade and Liu (2010, theorem 4.1) for a slightly different version of SAMC, in which the gain factor sequence is self-adapted; however, verification of their conditions on the stopping time is difficult.
For the auxiliary chain of AEX, we have X_t = (z_t, ϑ_t), X̃ = X × {θ^(1), ..., θ^(m)}, and E_i = X × {θ^(i)}. In addition, ν, defined in (14), is equal to 0, as ∫_X ϕ(x, θ^(i))dx > 0 for each i. Let N denote the total number of iterations of an AEX run, and let e_t denote the index of ϑ_t, that is, ϑ_t = θ^(e_t). Then it follows from (14) and (15) that
(1/N) Σ_{t=1}^N w_t^(e_t) ϕ(z_t, θ)/ϕ(z_t, ϑ_t) → Cmκ(θ), a.s.,   (16)
as N → ∞, provided that X is compact and thus the ratio ϕ(z, θ)/ϕ(z, θ^(i)) is bounded for all z ∈ X. Note that as t → ∞, the marginal distribution of z_t converges to the mixture distribution p(z) = Σ_{i=1}^m p_i f(z|θ^(i)). Since here we have implicitly assumed that the distributions f(z|θ^(1)), ..., f(z|θ^(m)) share the same sample space X, the condition (A_0) is not necessary for, but does benefit, the convergence in (16). Recall that the major role of the auxiliary parameters is to provide a good trial distribution for drawing the auxiliary variables used in the target chain, and a good trial distribution can always improve the convergence of the importance sampling procedure. Similarly, for any Borel set A ⊂ X, we have
(1/N) Σ_{t=1}^N w_t^(e_t) [ϕ(z_t, θ)/ϕ(z_t, ϑ_t)] I(z_t ∈ A) → Cm ∫_A ϕ(z, θ)dz, a.s.   (17)
Putting (16) and (17) together, we have, as N → ∞,
P(x ∈ A | z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N; θ) = [Σ_{t=1}^N w_t^(e_t) (ϕ(z_t, θ)/ϕ(z_t, ϑ_t)) I(z_t ∈ A)] / [Σ_{t=1}^N w_t^(e_t) ϕ(z_t, θ)/ϕ(z_t, ϑ_t)] → ∫_A f(z|θ)dz, a.s.,   (18)
which, by Lebesgue's dominated convergence theorem, implies that P(x ∈ A|θ) = E P(x ∈ A | z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N; θ) → ∫_A f(z|θ)dz, where (z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N) denotes a set of samples generated by the auxiliary chain, and x denotes a sample resampled from (z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N). Furthermore, if the parameter space Θ is compact and the function ϕ(z, θ) is upper semicontinuous in θ for all z ∈ X, then the convergence in (18) is uniform over Θ; refer to Ferguson (1996, theorem 16(a)) for a proof of the uniform SLLN. In summary, we have the following lemma:
Lemma 3.1. Assume that the conditions (A_1) and (A_2) are satisfied, X is compact, and ϕ(x, θ) is bounded away from 0 and ∞ on X × Θ. Let {z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N} denote a set of samples generated by SAMC in an AEX run, and let x_1, ..., x_n denote the distinct samples among {z_1, ..., z_N}. Resample a random variable/vector X from {z_1, ϑ_1, w_1; ...; z_N, ϑ_N, w_N} such that P(X = z_t) ∝ w_t^(e_t) ϕ(z_t, θ)/ϕ(z_t, ϑ_t); then the distribution of X converges to f(·|θ) almost surely as N → ∞. Furthermore, if Θ is compact and the unnormalized density function ϕ(z, θ) is upper semicontinuous in θ for all z ∈ X, then the convergence is uniform over Θ.
As aforementioned, since the f(z|θ^(i))'s share the same sample space X, the condition (A_0) is not necessary for, but does benefit, the convergence stated in the lemma. For the efficiency of the algorithm, however, (A_0) is generally required to be satisfied.
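As an illustration of the resampling step underlying Lemma 3.1, the following sketch draws X from the stored triples with probability proportional to w_t ϕ(z_t, θ)/ϕ(z_t, ϑ_t), the usual dynamic importance-sampling correction. Since the display equation is omitted in this excerpt, the exact resampling weight, the function names, and the toy family are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_auxiliary(z, log_w, log_phi, theta, vartheta, size=None):
    """Importance resampling sketch (cf. Lemma 3.1): draw X from the stored
    auxiliary states z_1,...,z_N with probability proportional to
    w_t * phi(z_t, theta) / phi(z_t, vartheta_t), so that the law of X
    approaches f(.|theta) as N grows."""
    log_p = log_w + log_phi(z, theta) - log_phi(z, np.asarray(vartheta))
    log_p -= log_p.max()                 # stabilize before exponentiating
    p = np.exp(log_p); p /= p.sum()
    return z[rng.choice(len(z), size=size, p=p)]

# toy check: phi(z, theta) = exp(theta * z) on z in {0, 1}
log_phi = lambda z, th: th * z
z = rng.integers(0, 2, size=50000).astype(float)   # stored states, ~f(.|0)
vartheta = np.zeros_like(z)                        # all drawn at theta = 0
x = resample_auxiliary(z, np.zeros_like(z), log_phi, 1.0, vartheta)
```

In the toy check the resampled draws should be distributed approximately as f(·|θ = 1), that is, P(X = 1) ≈ e/(1 + e) ≈ 0.73, even though the stored states were generated at θ = 0.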

Convergence of AEX
To study the convergence of AEX, we first note that {θ_t} forms an adaptive Markov chain with transition kernel P̃_l(θ, ·), where α(θ, x, θ′) is defined in (10), l denotes the cardinality of the set of auxiliary variables collected from the auxiliary Markov chain, that is, l = |S_l|, and ν_l(x|θ′) denotes the true distribution of x resampled from S_l. For mathematical simplicity, we assume that the parameter space Θ (of θ) and the sample space X (of x) are both compact. Although this makes our theory a little restrictive, such an assumption is quite common in the study of adaptive Markov chain theory; see, for example, Haario, Saksman, and Tamminen (2001). Under this assumption, we have the following lemma, whose proof can be found in the supplementary material.
Lemma 3.2. Assume that conditions (A_1) and (A_2) are satisfied, both X and Θ are compact, and the unnormalized density function ϕ(x, θ) is continuously differentiable in θ for all x ∈ X and bounded away from 0 and ∞ on X × Θ. Furthermore, assume π(θ) and q(θ′|θ) are continuously differentiable in θ and θ′ as well. Define D_l = sup_{θ ∈ Θ} ||P̃_l(θ, ·) − P(θ, ·)||, where P(θ, ·), as defined in (5), denotes the transition kernel of the exchange algorithm with perfect auxiliary variables.
It is known that P(θ, dθ′) can induce a Markov chain that is irreducible and aperiodic and admits π(θ|y) as the invariant distribution, provided an appropriate proposal q(·|·) has been used therein. Define the ratio r_v(θ, x, θ′). Then it is easy to see that the Markov chain defined by the acceptance rule min{1, r_v(θ, x, θ′)} is irreducible and aperiodic and admits the invariant distribution π(θ|y), provided an appropriate proposal q(·, ·) has been used therein. To ensure the convergence of the Markov chain, ν_l(x|θ′) need not have a support as large as X; in fact, its support can be only a subset of X. Let P_v(θ, dθ′) denote the transition kernel of the Markov chain induced by the acceptance rule min{1, r_v(θ, x, θ′)}. Lemma 3.3 shows that P̃_l(θ, dθ′), defined in (19), is also irreducible and aperiodic and admits an invariant distribution. This is done by showing that the accessible sets of P_v are included in those of P̃_l. The details are given in the supplementary material.

Lemma 3.3. Assume (i) the conditions of Lemma 3.2 hold and (ii) P is irreducible and aperiodic and admits an invariant distribution. Then P̃_l, defined in (19), is irreducible and aperiodic, and hence there exists a stationary distribution π̃_l(θ|y) such that for any θ_0 ∈ Θ, lim_{k→∞} ||P̃_l^k(θ_0, ·) − π̃_l(·|y)|| = 0. If the proposal q(·, ·) satisfies the local positive condition (Q), then the ergodicity is uniform over Θ.

Lemma 3.4 concerns the simultaneous uniform ergodicity of the kernels P̃_l, whose proof can be found in the supplementary material.

SPATIAL AUTOLOGISTIC MODEL
The spatial autologistic model (Besag 1974) has been widely used in spatial data analysis (e.g., Preisler 1993; Sherman, Apanasovich, and Carroll 2006). Let y = {y_i : i ∈ D} denote the observed binary data, where y_i is called a spin and D is the set of indices of the spins. Let |D| denote the total number of spins in D, and let n(i) denote the set of neighbors of spin i. In the likelihood function of the model, the parameter α determines the overall proportion of y_i = +1, the parameter β determines the intensity of interaction between y_i and its neighbors, and κ(α, β) is the intractable normalizing constant defined in (22). An exact evaluation of κ(α, β) is prohibitive even for a moderately sized system. To conduct a Bayesian analysis for the model, a uniform prior on Θ is assumed for the parameters, which restricts Θ to be a compact set. Since D is finite, the sample space X (of y) is also finite.
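The unnormalized part of the autologistic likelihood can be evaluated directly. The sketch below computes α Σ_i y_i + (β/2) Σ_i Σ_{j∈n(i)} y_i y_j on a regular lattice with free boundary, counting each neighbor pair once; the factor-of-two convention and the regular-lattice geometry are our assumptions for illustration, since the display equation is omitted in this excerpt.

```python
import numpy as np

def autologistic_log_unnorm(y, alpha, beta):
    """Unnormalized log-likelihood of the +/-1 autologistic model on a
    regular lattice with free boundary: alpha * sum_i y_i plus beta times
    the sum over (unordered) nearest-neighbor pairs of y_i * y_j."""
    interactions = 0.0
    # free boundary: boundary spins simply have fewer neighbors
    interactions += np.sum(y[:-1, :] * y[1:, :])   # vertical neighbor pairs
    interactions += np.sum(y[:, :-1] * y[:, 1:])   # horizontal neighbor pairs
    return alpha * np.sum(y) + beta * interactions

y = np.ones((3, 3))          # all-(+1) configuration on a 3x3 lattice
val = autologistic_log_unnorm(y, alpha=0.0, beta=1.0)   # 12 neighbor pairs
```

On the 3 × 3 all-ones lattice there are 6 vertical and 6 horizontal neighbor pairs, so the interaction term equals 12; the α term alone equals the number of spins.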

U.S. Cancer Mortality Data
U.S. cancer mortality maps have been compiled by Riggan et al. (1987) for investigating the possible association of cancer with unusual demographic, environmental, and industrial characteristics, or employment patterns. Figure 1(a) shows the mortality map of liver and gallbladder cancers (including bile ducts) in white males during the decade 1950-1959, which indicates some apparent geographic clustering. See Sherman, Apanasovich, and Carroll (2006) for more description of the data. Following Sherman, Apanasovich, and Carroll (2006), we modeled the data with a spatial autologistic model. The total number of spins is |D| = 2293. A free boundary condition is assumed for the model, under which the boundary points have fewer neighboring points than the interior points. This assumption is natural for this dataset, as the lattice has an irregular shape.
The AEX algorithm was first applied to this example. It was run 10 times independently. Each run consisted of three stages. The first stage was to choose the auxiliary parameters {θ^(1), …, θ^(m)}. For this purpose, the fractional DMH algorithm was run for N_1 = 5500 iterations, 5000 samples of θ were collected after a burn-in period of 500 iterations, and then m = 100 auxiliary parameters were selected from the 5000 collected samples using the max-min procedure. The second stage was to build up an initial database of auxiliary variables. In this stage, only the auxiliary chain was run. The auxiliary chain started with the observed data y and w_0 = (1, …, 1), and consisted of N_2 = 1.1 × 10^6 iterations. The first 10^5 iterations were discarded as burn-in, and then a database of 50,000 auxiliary variables was collected from the remainder of the run at an equal spacing of s_0 = 20 iterations. In the simulations, we set X_0 = X and K_i = [0, 10^{100+10i}] for i = 0, 1, 2, …. In the third stage, the auxiliary chain and the target chain were run simultaneously. The auxiliary chain was run for N_3 = 4 × 10^5 iterations and the samples generated by it were continuously collected (at a spacing of s_0 = 20 iterations) and added to the database, and the target chain was run for 20,000 iterations. The target chain started with the average of the θ samples collected from the fractional DMH run, and was run for one iteration each time a new auxiliary variable was added to the database. The CPU time for the whole run is 107 sec, measured on a single core of an Intel Xeon CPU E5-2690 (2.90 GHz) (the same computer was used for all simulations in this article). The samples generated in the target chain were used for Bayesian inference of the model.
For both the fractional DMH chain and the target chain, a Gaussian random walk proposal distribution with covariance matrix 0.03^2 I_2 was used, where I_2 denotes the 2 × 2 identity matrix. The overall acceptance rate of the target chain is 0.21, and that of the fractional DMH chain is 0.33. For the auxiliary chain, the proposal distribution T_2(·|·) was set to a single cycle of Gibbs updates. Figure 2(a) shows the scatterplot of the fractional DMH samples and the selected auxiliary parameters. The scatterplot of the selected auxiliary parameters alone is shown in Figure 2(b). As expected, the selected auxiliary parameters are distributed over the space of fractional DMH samples, and the convex hulls formed by the two sets are about the same. Figure 2(c) shows the histogram of the 100 auxiliary parameters visited by the auxiliary chain in a run with the uniform desired sampling distribution. The flatness of the histogram implies that the auxiliary chain has converged.
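A single cycle of Gibbs updates of the kind used for T_2(·|·) can be sketched as follows, redrawing each ±1 spin from its full conditional under the free boundary condition. This is a generic illustration for the autologistic model on a regular lattice; the function names are ours.

```python
import numpy as np

def gibbs_cycle(y, alpha, beta, rng):
    """One cycle of single-site Gibbs updates for the +/-1 autologistic
    model: each spin is redrawn from its full conditional
    P(y_i = +1 | rest) = 1 / (1 + exp(-2 * (alpha + beta * s_i))),
    where s_i is the sum of the current neighboring spins."""
    M, N = y.shape
    for i in range(M):
        for j in range(N):
            s = 0.0                  # free boundary: edge sites have fewer neighbors
            if i > 0:     s += y[i - 1, j]
            if i < M - 1: s += y[i + 1, j]
            if j > 0:     s += y[i, j - 1]
            if j < N - 1: s += y[i, j + 1]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (alpha + beta * s)))
            y[i, j] = 1.0 if rng.random() < p_plus else -1.0
    return y

rng = np.random.default_rng(2)
y = np.where(rng.random((20, 20)) < 0.5, 1.0, -1.0)   # random initial spins
for _ in range(200):
    y = gibbs_cycle(y, alpha=1.0, beta=0.1, rng=rng)
```

With a positive external-field parameter α, the chain should settle into configurations with a clear majority of +1 spins.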
An MCMC convergence diagnosis based on the Gelman-Rubin shrink factor (Gelman and Rubin 1992), as shown in Figure 3, indicates that for this example AEX can converge very fast, usually within a few hundred iterations. Figure 3 was generated using the R package coda (Plummer et al. 2012) based on 10 independent runs of AEX. As with conventional MCMC algorithms, the samples generated in the early stage of these runs can be discarded as burn-in. Table 1 summarizes the parameter estimation results for the model based on the 10 independent runs.
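The shrink factor reported by coda can be sketched as follows. This mirrors the Gelman-Rubin potential scale reduction factor for a scalar parameter, up to coda's small degrees-of-freedom correction; the simplified formula is our assumption.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (Gelman and Rubin 1992) computed
    from m parallel chains of length n for one scalar parameter; values
    near 1 indicate approximate convergence."""
    chains = np.asarray(chains)             # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * means.var(ddof=1)               # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
chains = rng.standard_normal((10, 2000))    # 10 well-mixed toy chains
rhat = gelman_rubin(chains)
```

For independent, identically distributed chains the factor should be very close to 1, as in Figure 3 after the burn-in period.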
To assess the validity of the AEX algorithm, the exchange algorithm was applied to this example with the same proposal distribution as used in the target chain of AEX. As previously mentioned, the exchange algorithm is an auxiliary variable MCMC algorithm that requires a perfect sampler for generating auxiliary variables, and it samples correctly from the posterior distribution when the number of iterations is large. Hence, the estimates produced by the exchange algorithm can be used as the test standard for assessing the performance of AEX. For this example, the auxiliary variables were generated using the summary state algorithm (Childs, Patterson, and MacKay 2001), which is suitable for high-dimensional binary spaces. The exchange algorithm was run 10 times independently. Each run consisted of 20,500 iterations, where the first 500 iterations were discarded as burn-in, and the remaining 20,000 iterations were used for estimation of θ. Therefore, in each run, the exchange and AEX algorithms produced the same number of samples. Each run of the exchange algorithm cost about 106 sec CPU time, and the overall acceptance rate was 0.21, which indicates the algorithm has been implemented efficiently. The numerical results are summarized in Table 1. The estimates produced by the AEX and exchange algorithms are also very close to those reported in the literature.

Figure 1. U.S. cancer mortality data. Left: The mortality map of liver and gallbladder cancers (including bile ducts) for white males during the decade 1950-1959. Black squares denote counties of high cancer mortality rate, and white squares denote counties of low cancer mortality rate. Right: Estimated cancer mortality rates using the autologistic model with the model parameters replaced by their approximate Bayesian estimates. The gray level of each square represents the cancer mortality rate of the corresponding county.
Liang (2007) analyzed these data using a contour Monte Carlo method and produced the estimate (−0.3008, 0.1231). The contour Monte Carlo method approximates the normalizing constant function on a given region and then estimates the parameters based on the approximated normalizing constant function. As reported by Liang (2007), this algorithm took several hours of CPU time to approximate the normalizing constant function. Sherman, Apanasovich, and Carroll (2006) analyzed the data using the MCMLE method (Geyer and Thompson 1992), and obtained the estimate (−0.304, 0.117), which is different from other Bayesian estimates. This may reflect the difference between the posterior mean and posterior mode.
For a thorough comparison, we have also applied the DMH and MCMH algorithms to this example. As aforementioned, the DMH sampler is the same as the exchange algorithm except that it draws the auxiliary variable x at each iteration through a finite run of the MH algorithm initialized at the observation y. In our implementation, we set the length of the finite MH run to K = 50. DMH was run 10 times. Each run consisted of 20,500 iterations and cost about 142 sec, where the samples generated in the last 20,000 iterations were used for parameter estimation. The results are summarized in Table 1.
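The inner mechanics of a DMH iteration described above can be sketched generically: propose θ′, run a K-step MH (or Gibbs) chain targeting f(·|θ′) started at the observation y, and accept with an exchange-type ratio in which the intractable normalizing constants cancel. All function names here are hypothetical, and a flat prior on θ is assumed for simplicity.

```python
import numpy as np

def dmh_step(theta, y, log_phi, sample_proposal, log_q_ratio, inner_kernel,
             K, rng):
    """One iteration of a double MH (DMH) sketch: draw the auxiliary x
    from a K-step inner MH run initialized at the observation y, then
    accept theta' with the exchange-type ratio (flat prior assumed)."""
    theta_new = sample_proposal(theta, rng)
    x = y
    for _ in range(K):                       # finite inner MH run started at y
        x = inner_kernel(x, theta_new, rng)
    # log acceptance ratio: the normalizing constants kappa(.) cancel
    log_r = (log_phi(y, theta_new) - log_phi(y, theta)
             + log_phi(x, theta) - log_phi(x, theta_new)
             + log_q_ratio(theta, theta_new))
    if np.log(rng.random()) < log_r:
        return theta_new
    return theta

# toy run: phi(x, th) = exp(th * x) on x in {0, 1}, exact inner Gibbs draw
rng = np.random.default_rng(5)
log_phi = lambda x, th: th * x
perfect = lambda x, th, r: float(r.random() < 1.0 / (1.0 + np.exp(-th)))
theta = 0.0
for _ in range(100):
    theta = dmh_step(theta, 1.0, log_phi,
                     lambda t, r: t + 0.5 * r.standard_normal(),
                     lambda a, b: 0.0, perfect, K=10, rng=rng)
```

When the proposal is degenerate (θ′ = θ), the log-ratio is exactly zero and the move is always accepted, which gives a simple sanity check on the implementation.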
The MCMH algorithm works on a different principle from DMH. Rather than canceling the unknown normalizing constant ratio using auxiliary variables, it approximates the ratio using auxiliary variables. Let θ_t denote the sample of θ drawn at iteration t, and let x_1^(t), …, x_K^(t) denote the auxiliary samples simulated from the distribution f(x|θ_t). One iteration of the MCMH algorithm consists of the following steps: Monte Carlo MH algorithm: 1. Draw θ′ from a proposal distribution q(θ′|θ_t).
3. Set θ_{t+1} = θ′ with probability min{1, r̂(θ_t, x_1, …, x_K, θ′)} and set θ_{t+1} = θ_t with the remaining probability. We note that the MCMH algorithm, an early version of which is presented in the book by Liang, Liu, and Carroll (2010), is similar to the marginal PMCMC algorithm suggested by Everitt (2012). The MCMH algorithm uses auxiliary variables to approximate the unknown normalizing constant ratio, while the marginal PMCMC algorithm, based on the work of Andrieu, Doucet, and Holenstein (2010), uses sequential Monte Carlo samples to approximate the unknown normalizing constant.
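The normalizing-constant ratio that MCMH approximates admits the importance-sampling identity κ(θ′)/κ(θ) = E_{x∼f(·|θ)}[ϕ(x, θ′)/ϕ(x, θ)], which the following sketch estimates by averaging over the auxiliary samples. The stabilized log-mean-exp form and the toy family are our illustrative choices; the paper's exact estimator display is omitted in this excerpt.

```python
import numpy as np

def mcmh_log_ratio_estimate(xs, theta, theta_new, log_phi):
    """Monte Carlo estimate of log[kappa(theta') / kappa(theta)] via
    kappa(theta')/kappa(theta) = E_{x~f(.|theta)}[phi(x,theta')/phi(x,theta)],
    averaged over auxiliary samples x_1,...,x_K drawn from f(.|theta)."""
    log_w = log_phi(xs, theta_new) - log_phi(xs, theta)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # stabilized log-mean-exp

# toy check with a known ratio: phi(x, th) = exp(-th * x) on x >= 0,
# so kappa(th) = 1/th and kappa(2)/kappa(1) = 0.5
rng = np.random.default_rng(3)
xs = rng.exponential(1.0, size=200000)              # draws from f(.|theta = 1)
log_phi = lambda x, th: -th * x
est = np.exp(mcmh_log_ratio_estimate(xs, 1.0, 2.0, log_phi))
```

In the toy check the estimate should be close to the exact ratio 0.5; plugging such an estimate into the MH ratio gives the r̂ used in step 3 above.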
In our implementation of MCMH, we set K = 100 and the proposal q(·|·) to be a Gaussian random walk with covariance matrix 0.03^2 I_2. The algorithm was run 10 times independently. Each run consisted of 20,500 iterations, where the first 500 iterations were discarded as burn-in, and cost about 143 sec CPU time. The overall acceptance rate was about 0.35. The results are summarized in Table 1. Table 1 shows that all the algorithms, AEX, DMH, MCMH, and exchange, work well for this example. Treating the estimates of the exchange algorithm as the testing standard, it is easy to see that the estimates resulting from the other three algorithms are unbiased. This is because the underlying system of this example dataset is only weakly dependent, and thus independent auxiliary samples can easily be drawn at each iteration by running a short Markov chain. In terms of efficiency, MCMH is the best among the three approximate MCMC algorithms. This is reasonable, as in MCMH the auxiliary samples are only drawn when a new sample of θ is accepted and all the auxiliary samples are used in simulating new samples of θ. For this example, MCMH has about the same efficiency as the exchange algorithm after taking account of their CPU times and the standard deviations of the estimates. Compared to MCMH, AEX and DMH are less efficient in the use of auxiliary samples. AEX makes use of only a small proportion of the auxiliary samples and discards all the others. DMH is similar; it uses only the last sample generated by the short Markov chain at each iteration.
In Section 4.2, we present an example for which the underlying system is strongly dependent. Since, for such a system, independent auxiliary samples are difficult to draw with a short Markov chain run, AEX outperforms the DMH and MCMH algorithms.

A Simulation Study
To assess the performance of the AEX algorithm for strongly dependent systems, we simulated one U.S. cancer mortality dataset using the summary state algorithm with α = 0 and β = 0.45. Since the lattice is irregular, the free boundary condition was again assumed in the simulation. Note that under the setting α = 0, the autologistic model is reduced to the Ising model. Hence, the underlying system for this simulated dataset is strongly dependent by noting that the interaction parameter β is greater than the critical value (≈ 0.44) of the Ising model.
AEX was first applied to this example. It was run as for the last example except that the simulations in the second and third stages were lengthened. For this example, we set N_2 = 6 × 10^6, s_0 = 50, and N_3 = 10^6, where the first 10^6 iterations in the second stage were used as burn-in. AEX was run 10 times independently, and each run cost about 327 sec. The numerical results are summarized in Table 2.
For comparison, we have also applied the exchange, DMH, and MCMH algorithms to this example. The exchange algorithm was run 10 times. Each run consisted of 1200 iterations, where the first 200 iterations were discarded as burn-in and the samples generated in the remaining iterations were used for parameter estimation. The proposal distribution is the same as that used for the last example. For the exchange algorithm, due to the difficulty in generating perfect samples, the CPU time of a run can be very long even though it consists of only 1200 iterations. In addition, the CPU times of different runs can differ substantially. Childs, Patterson, and MacKay (2001) studied the behavior of the perfect sampler for the Ising model. They fitted an exponential law for the convergence time, and reported that the perfect sampler may not work well when the value of β is close to the critical value (≈ 0.44). Their finding is consistent with our results reported in Table 2.
In applying DMH to this example, we tried different values of K, namely K = 10, 100, 500, and 1000, to assess the effect of the mixing of the short Markov chain on parameter estimation. For MCMH, we tried the same settings of K. Both DMH and MCMH were run 10 times independently, and each run consisted of 20,500 iterations with the first 500 iterations discarded as burn-in. The results are summarized in Table 2. As shown in Table 2, for the same value of K, MCMH generally costs less CPU time than DMH, as MCMH only generates auxiliary variables when a new value of θ is accepted, while DMH must do so at each iteration. Table 2 also shows that both the DMH and MCMH estimates of β decrease toward the true value 0.45 as K increases. However, even with K = 1000, at which both DMH and MCMH cost more CPU time than AEX, their estimates are still far from the true value. In contrast, although AEX costs much less CPU time than these two algorithms (with K = 1000), its estimate of β is much closer to both the true value and the estimate of the exchange algorithm.
To conduct a Bayesian analysis for the model, we assume the priors given in (24), where I(·) is the indicator function. Under the free boundary condition, for which the boundary pixels have fewer neighbors, we have the posterior distribution shown in the display. Although σ^2 can be integrated out from the posterior, we do not suggest doing so: working with the joint posterior eases the generation of auxiliary variables in AEX.

Wheat Yield Data
The wheat yield data were collected on a 20 × 25 rectangular lattice (Andrews and Herzberg 1985, table 6.1). The data are shown in Figure 5(a), which indicates positive correlation between neighboring observations. These data have been analyzed by a number of authors, for example, Besag (1974), Huang and Ogata (1999), and Gu and Zhu (2001). Following these authors, we subtracted the mean from the data and then fitted the autonormal model. In our analysis, the free boundary condition is assumed. This is natural, as the lattice for real data is often irregular.
The AEX algorithm was applied to this example in a similar way as to the cancer mortality data example. The algorithm was run 10 times independently. Each run consisted of three stages, with the same settings as the runs for the simulated cancer mortality data except for the proposal distributions. For both the fractional DMH chain and the target chain, a Gaussian random walk proposal with covariance matrix 0.01^2 I_4 was used, where I_4 denotes the 4 × 4 identity matrix. For the auxiliary chain, the proposal T_2(·|·) was set to a single cycle of Gibbs updates for i = 1, …, M and j = 1, …, N, where (β_{h,t}, β_{v,t}, β_{d,t}, σ_t^2) denotes the value of θ at iteration t. Each run of AEX cost about 402 sec. Figure 5(b) shows the histogram of the 100 auxiliary parameters visited by the auxiliary chain at the end of the second stage. It indicates that the auxiliary chain mixes very well across the different auxiliary parameters. The parameter estimation results are summarized in Table 3.
For this example, the exchange algorithm is not applicable, as no perfect sampler is available for the autonormal model. However, under the free boundary condition, the log-likelihood is available in closed form, where S_y, Y_h, Y_v, and Y_d are as defined in (25). Bayesian inference for the model is then standard, with the priors specified in (24). For comparison, the MH algorithm was applied to simulate from the resulting analytic posterior distribution. It was run 10 times with each run consisting of 20,500 iterations, where the first 500 iterations were discarded as burn-in. The proposal distribution adopted here was a Gaussian random walk with the same covariance matrix as that used in AEX. The overall acceptance rate is about 0.45. The resulting parameter estimates, the so-called true Bayes estimates, are summarized in Table 3. The comparison indicates that AEX works well for this example, and implies that AEX can work well for reasonably high-dimensional problems. As aforementioned, the key to the success of AEX is the ability to generate auxiliary variables at a set of auxiliary parameters that cover the support of the true posterior. In AEX, all the auxiliary parameters are selected from the samples generated by the fractional DMH algorithm, which results in an expanded support of the true posterior. This, together with the max-min procedure, ensures that the selected auxiliary parameters cover the support of the true posterior. Since the auxiliary parameters are essentially generated from the same posterior distribution, the neighboring distributions f(x|θ^(i)) reasonably overlap, noting that they all share, at least, the same sample, the observation Y. This ensures a smooth transition of the auxiliary chain between different auxiliary parameters and thus the success of AEX. More importantly, this property of AEX holds independently of the dimension of θ.
For the wheat yield data example, the overall acceptance rate for the transitions between different auxiliary parameters is about 0.25.

CONCLUSION
We have proposed a new algorithm, the adaptive exchange (AEX) algorithm, for sampling from distributions with intractable normalizing constants. The new algorithm can be viewed as an MCMC extension of the exchange algorithm, which generates auxiliary variables via an importance sampling procedure from a Markov chain running in parallel. The convergence of the new algorithm, including ergodicity and the WLLN, has been established under mild conditions. Compared to the exchange algorithm, the new algorithm removes the requirement of perfect sampling, and thus can be applied to many models for which perfect sampling is not available or very expensive. The new algorithm has been tested on the spatial autologistic and autonormal models. The numerical results indicate that the new algorithm can outperform other approximate MCMC algorithms, such as the DMH and MCMH algorithms, for strongly dependent systems.
We have also applied the AEX algorithm to some more challenging problems, such as exponential random graph models for social networks. Our numerical results indicate that the AEX algorithm can outperform other approximate algorithms for large social networks, since when the network is large, generating independent auxiliary networks with a short Markov chain is often difficult. Due to the space limit, these results will be reported elsewhere.
Our implementation of AEX is plain, and its efficiency can be improved in various respects. For example, in the current implementation, the auxiliary parameters are selected using a max-min procedure. In the future, a self-learning process may be introduced to the auxiliary Markov chain toward an optimal selection of the auxiliary parameters. To improve the mixing of the target chain, a population-based MCMC algorithm may be used, for example, adaptive direction sampling, parallel tempering, or evolutionary Monte Carlo. We note that AEX can slow down with iterations, because it needs to resample from an increasingly large database.
Finally, we would like to compare, mainly conceptually, the AEX algorithm and the Russian Roulette sampling algorithm (Lyne et al. 2014). The latter is developed from the pseudo-marginal MCMC approach of Andrieu and Roberts (2009), which requires an unbiased independent estimate of the intractable normalizing constant at each iteration. Hence, the algorithm can be extremely slow. For an Ising model example of 100 spins, where the normalizing constant is estimated using annealed importance sampling (Neal 1998) at each iteration, the algorithm costs over 100 CPU minutes (on the same computer as used by AEX for the previous examples) to run 1000 iterations under the authors' default setting. The AEX algorithm also relies on estimates of intractable normalizing constants, but only at a number of prespecified points. Moreover, the estimates are obtained in an online manner and improve with iterations. Hence, the AEX algorithm can be much more efficient than the Russian Roulette sampling algorithm. This is evidenced by Table 2, where it is reported that AEX cost only 5.5 CPU minutes to run 20,000 iterations for a two-parameter spatial autologistic model of 2293 spins.
For large-scale problems in which the dataset Y consists of a large number of observations, the difficulty arises from evaluation of the likelihood function. The theory of the Russian Roulette sampling algorithm allows it to work with an unbiased estimate of the likelihood function. This is a strength of the Russian Roulette sampling algorithm, although achieving such an estimate at each iteration is again difficult; sophisticated computational techniques, such as parallel computing and subsampling, may be needed there. For the AEX algorithm, this difficulty can be tackled with a split-and-merge strategy; that is, one can split the big dataset into many small subsets (e.g., through subsampling), perform AEX simulations for each subset independently, and then combine the simulation results from the subsets to obtain inference based on the entire dataset. For example, if we are merely interested in parameter estimation, the estimates from different subsets can be combined using a meta-analysis method (see, e.g., Chang 2011). If we are interested in posterior samples, the combination might be done via a random set intersection method, which is, to some extent, related to Dempster's rule of combination in the Dempster-Shafer theory of belief functions (see, e.g., Martin et al. 2010). This method has been explored in Lin (2014) for the case in which the subset data are mutually independent. It would be of great interest to extend this method to the case in which the subset data are generally dependent.
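As one concrete instance of combining subset estimates by meta-analysis, a fixed-effect inverse-variance combination can be sketched as follows. The specific combination rule and function names are illustrative assumptions on our part, not the method prescribed by Chang (2011).

```python
import numpy as np

def combine_subset_estimates(estimates, variances):
    """Fixed-effect meta-analysis combination of per-subset parameter
    estimates, as one instance of the split-and-merge strategy: weight
    each subset estimate by the inverse of its variance."""
    w = 1.0 / np.asarray(variances)
    est = np.sum(w * np.asarray(estimates)) / np.sum(w)
    var = 1.0 / np.sum(w)                 # variance of the combined estimate
    return est, var

# three hypothetical subset estimates of one parameter, with their variances
est, var = combine_subset_estimates([0.10, 0.14, 0.12], [0.04, 0.04, 0.02])
```

The combined estimate down-weights noisy subsets and has a smaller variance than any single subset estimate, which is the property that makes the split-and-merge strategy attractive for parameter estimation.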