The Controlled Thermodynamic Integral for Bayesian Model Evidence Evaluation

ABSTRACT Approximation of the model evidence is well known to be challenging. One promising approach is based on thermodynamic integration, but a key concern is that the thermodynamic integral can suffer from high variability in many applications. This article considers the reduction of variance that can be achieved by exploiting control variates in this setting. Our methodology applies whenever the gradient of both the log-likelihood and the log-prior with respect to the parameters can be efficiently evaluated. Results obtained on regression models and popular benchmark datasets demonstrate a significant and sometimes dramatic reduction in estimator variance and provide insight into the wider applicability of control variates to evidence estimation. Supplementary materials for this article are available online.


Introduction
In hypothesis-driven research, we are presented with data y that are assumed to have arisen under one of two (or more) putative models m_i, characterized by probability densities p(y|m_i). Given a priori model probabilities p(m_i), the data y induce a posteriori probabilities p(m_i|y) that are the basis for Bayesian model comparison. Since any prior probability distribution is transformed into a posterior probability distribution through consideration of the data, the transformation itself represents the evidence provided by the data (Kass and Raftery 1995). For the simple case of two models, this transformation follows from Bayes' rule as

p(m_2|y) / p(m_1|y) = [p(y|m_2) / p(y|m_1)] × [p(m_2) / p(m_1)],     (1)

that is, posterior odds = Bayes factor × prior odds. Thus, the influence of the data on the posterior probability distribution is captured through the Bayes factor B_21 = p(y|m_2)/p(y|m_1) in favor of Model 2 over Model 1. Rearranging, we can interpret the Bayes factor as the ratio of the posterior odds to the prior odds. A natural approach to the computation of Bayes factors is to directly compute the evidence

p(y|m_i) = ∫ p(y|θ, m_i) p(θ|m_i) dθ     (2)

provided by data y in favor of model m_i, where θ are the parameters associated with model m_i. Yet for almost all models of interest, the evidence is unavailable in closed form and must be approximated. Numerous techniques have been proposed to approximate the model evidence (Equation (2)), a selection of which includes path sampling (Ogata 1989; Gelman and Meng 1998), thermodynamic integration (TI; Frenkel and Smit 2002), harmonic means (Gelfand and Dey 1994), and Chib's method (Chib and Jeliazkov 2001). In this article, we combine TI with zero-variance (ZV) control variates to construct improved estimators of model evidence. The methodology applies whenever the gradient of the log-likelihood (and the log-prior) can be evaluated and can therefore be used "for free" when differential geometric sampling schemes are employed in the construction of the Markov chain (Papamarkou, Mira, and Girolami 2014). Theoretical results are provided that guide maximal variance reduction in practice.
Results on popular benchmark datasets demonstrate a substantial reduction in variance compared to existing estimators and the method is shown to be exact in the special case of Bayesian linear regression. The article proceeds as follows: Section 2 recalls key ideas from TI and ZV that we use in our methodology. In Section 3, we derive control variates for TI and provide theoretical results that guide maximal variance reduction in practice. Section 4 compares the proposed methodology to the state-of-the-art estimators of model evidence applied to popular benchmark datasets. Section 5 investigates scenarios where the proposed methodology is likely to fail. Finally, Section 6 provides more general insight into the use of control variates in estimation of model evidence, drawing an important distinction between "equilibrium" and "nonequilibrium" estimators that determines whether or not control variates may be applicable.

Thermodynamic Integration
Path sampling and the closely related technique of TI emerged from the physics community as computational approaches to the evaluation of normalizing constants (Gelman and Meng 1998). Recent empirical investigations, including Vyshemirsky and Girolami (2008) and Friel and Wyse (2012), have revealed that TI is among the most promising approaches to estimation of model evidence. Below we provide the relevant background on TI, referring the reader to Calderhead and Girolami (2009) for further details. TI targets the model evidence directly; in what follows we therefore implicitly condition on a model m and aim to compute the evidence p(y) = p(y|m) provided by data y in favor of model m. Following the presentation of Friel and Pettitt (2008), the power posterior is defined as p(θ|y, t) = p(y|θ)^t p(θ) / Z_t(y), where the normalizing constant is given by

Z_t(y) = ∫ p(y|θ)^t p(θ) dθ.

Here, t is known as an inverse temperature parameter and, by analogy, the process of increasing t is known as annealing. Note that p(θ|y, t = 0) is the density of the prior distribution, whereas p(θ|y, t = 1) is the density p(θ|y) of the posterior distribution. Varying t ∈ (0, 1) produces a continuous path between these two distributions and in this article it is assumed that all intermediate distributions exist and are well defined. The normalizing constant Z_0(y) is equal to one, while Z_1(y) is equal to p(y), the model evidence that we aim to estimate.
The standard thermodynamic identity is

log(p(y)) = ∫_0^1 E_{θ|y,t}[log(p(y|θ))] dt,     (3)

where the expectation in the integrand is with respect to the power posterior whose density is given above. The correctness of Equation (3) is established in, for example, Friel and Pettitt (2008). In TI, this one-dimensional integral is evaluated numerically using a quadrature approximation over a discrete temperature ladder, whereas in the related approach of path sampling the integral is evaluated using MCMC. Note that the use of quadrature methods introduces bias into the estimator of the model (log-)evidence; it is therefore important to select an accurate quadrature approximation (Appendix A).
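As an illustration, the identity in Equation (3) can be checked numerically on a toy conjugate model; the model, sample sizes, and ladder below are our own assumptions for illustration, not taken from this article. Each power posterior is sampled directly, the log-likelihood is averaged, and first-order (trapezoidal) quadrature is applied over a discretized ladder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (assumed): y_j ~ N(theta, 1) with prior theta ~ N(0, 1),
# so every power posterior p(theta | y, t) is Gaussian and the exact
# log-evidence is available for comparison.
n = 20
y = rng.normal(1.0, 1.0, size=n)

def power_posterior_params(t):
    # p(theta | y, t) ∝ N(y; theta, I)^t N(theta; 0, 1) = N(m_t, v_t)
    v = 1.0 / (t * n + 1.0)
    return v * t * y.sum(), v

def loglik(theta):
    # log p(y | theta) evaluated at a vector of theta samples
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y - theta[:, None]) ** 2).sum(1)

ts = (np.arange(51) / 50.0) ** 5          # quintic ladder, as in Section 3.3
integrand = []
for t in ts:
    m, v = power_posterior_params(t)
    theta = rng.normal(m, np.sqrt(v), size=5000)
    integrand.append(loglik(theta).mean())  # estimate of E_{theta|y,t}[log p(y|theta)]

# First-order (trapezoidal) quadrature over the ladder
ti_estimate = sum(0.5 * (integrand[i] + integrand[i + 1]) * (ts[i + 1] - ts[i])
                  for i in range(len(ts) - 1))

# Exact log-evidence: marginally y ~ N(0, I + 11^T)
S = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(S)
exact = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))
print(ti_estimate, exact)
```

The two printed values agree up to Monte Carlo and quadrature error, with most of the residual error attributable to the high-variance integrand near t = 0.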

Control Variates and The ZV Technique
Control variates are often employed when we aim to estimate, with reduced variance, the expectation E_π[g(θ)] of a function g(θ) of a random variable θ that is distributed according to a (possibly unnormalized) density π(θ). In this article, we focus on real-valued θ ∈ Θ ⊆ R^d and we aim to approximate

E_π[g(θ)] = ∫ g(θ) π(θ) dθ / ∫ π(θ) dθ.     (4)

The generic control variate principle relies on constructing an auxiliary function g̃(θ) = g(θ) + h(θ), where h(θ) has zero mean under π(θ), so that E_π[g̃(θ)] = E_π[g(θ)] while the variance V_π[g̃(θ)] may differ from V_π[g(θ)]. In many cases, it is possible to choose h(θ) such that V_π[g̃(θ)] < V_π[g(θ)], leading to a reduction in Monte Carlo variance. Intuitively, greater variance reduction can occur when h(θ) is negatively correlated with g(θ) under π(θ), since much of the randomness "cancels out" in the auxiliary function g̃(θ). In the classical literature, h(θ) is formed as a sum φ_1 h_1(θ) + ··· + φ_m h_m(θ), where the h_i(θ) have zero mean under π(θ) and are known as control variates, while the φ_i are coefficients that must be specified. For estimation based on Markov chains, Andradóttir, Heyman, and Ott (1993) proposed control variates for discrete state spaces. Later, Mira, Tenconi, and Bressanini (2003) extended this approach to continuous state spaces, observing that the optimal choice of h(θ) is intimately associated with the solution of the Poisson equation h(θ) = E_π[g(θ)] − g(θ), and proposing to solve this equation numerically. Further work on constructing control variates for Markov chains includes Hammer and Tjelmeland (2008) for Metropolis-Hastings chains and Dellaportas and Kontoyiannis (2012) for Gibbs samplers.
In this article, we consider the particularly tractable class of ZV control variates that are expressed as functions of the gradient ∇ θ log π (θ) of the log-target density (i.e., the score function).
Here ∇_θ = [∂/∂θ_1, ..., ∂/∂θ_d]^T. More specifically, Mira, Solgi, and Imparato (2013) proposed to use

h(θ) = ∆_θ Q(θ) + ∇_θ Q(θ) · u(θ),     (5)

where ∆_θ = ∇_θ · ∇_θ is the Laplacian operator, the trial function Q(θ) is a polynomial in θ, and

u(θ) = ∇_θ log(π(θ))     (6)

is the score function. (In Mira, Solgi, and Imparato (2013) an unnecessary constant of proportionality was used, which we omit here.) In this article, we adopt the convention that both θ and u(θ) are d × 1 vectors. The trial function Q(θ) should be viewed as approximating the function of interest g(θ) based on the value of the score u(θ). An optimal choice for the trial function Q is given by the solution of the Poisson equation

∆_θ Q(θ) + ∇_θ Q(θ) · u(θ) = E_π[g(θ)] − g(θ).     (7)

For arbitrary functions g(θ) and densities π(θ), the solution of Equation (7) will not be available (since the right-hand side is unknown), but we may be able to obtain a good polynomial approximation to the solution; this is the basis of the ZV methodology.
The thermodynamic identity (Equation (3)) is based on expected values of the log-likelihood log(p(y|θ)). Since u(θ) is closely related to log(π(θ)), ZV control variates appear to be a natural strategy for achieving variance reduction in TI. As shown in Mira, Solgi, and Imparato (2013), ZV control variates arise naturally in certain Gaussian models, leading, in some cases, to exact (i.e., deterministic) estimators that have zero variance. Intuitively, any density π(θ) that approximates a Gaussian is a suitable candidate for the ZV scheme. Theoretical conditions for asymptotic unbiasedness of ZV have been established (Appendix B).
ZV control variates are particularly tractable for two reasons: (i) for many models of interest it is possible to obtain a closed-form expression for Equation (5), in contrast to alternatives that require numerical solution of the Poisson equation, and (ii) as recently noticed by Papamarkou, Mira, and Girolami (2014), the ZV technique can be applied essentially "for free" inside differential-geometric MCMC sampling schemes, for which the score function is a prerequisite for sampling.
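A minimal sketch of the ZV construction on an assumed one-dimensional Gaussian target, where a degree 1 polynomial already yields an exact estimator (the distribution and all numbers below are illustrative assumptions): with Q(θ) = φθ, the control variate is h(θ) = φ u(θ), and the plug-in coefficient recovers the zero-variance solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed target: pi = N(mu, sigma^2); function of interest g(theta) = theta.
mu, sigma = 2.0, 1.5
theta = rng.normal(mu, sigma, size=5000)

g = theta
u = -(theta - mu) / sigma**2     # score function u(theta) = d/dtheta log pi(theta)

# Degree 1 trial function Q(theta) = phi * theta gives h(theta) = phi * u(theta);
# the variance-minimizing coefficient is phi = -Cov[g, u] / Var[u]
phi = -np.cov(g, u)[0, 1] / np.var(u, ddof=1)
g_zv = g + phi * u               # controlled values

print(g.mean(), g_zv.mean(), np.var(g_zv, ddof=1) / np.var(g, ddof=1))
```

Because g(θ) = θ is an exact affine function of the score for a Gaussian target, the controlled values collapse to the constant μ, illustrating the zero-variance phenomenon described above.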

Methodology
In Section 3.1, we develop a control variate scheme for the estimation of model evidence, taking TI as our base estimator whose variance we propose to reduce. The main methodological challenge in this setting is the elicitation of both the optimal control variate coefficients φ and the optimal temperature ladder that underlies TI. In Section 3.2, we derive optimal expressions for both these quantities and in Section 3.3 we describe how coefficients and temperature ladders are selected in practice.

The Controlled Thermodynamic Integral
Taking the target density π(θ) to be the power posterior p(θ|y, t), it follows from Equation (6) that

u(θ|y, t) = t ∇_θ log(p(y|θ)) + ∇_θ log(p(θ)).     (8)

The ZV control variates (Equation (5)) are then

h(θ|y, t) = ∆_θ Q(θ|φ(y, t)) + ∇_θ Q(θ|φ(y, t)) · u(θ|y, t),     (9)

where u(θ|y, t) is as defined in Equation (8). Here, the notation Q(θ|φ(y, t)) indicates that the polynomial Q has coefficients φ ≡ φ(y, t) that will in general depend on both the data y and the inverse temperature t. Integrating these control variates into TI, we obtain the "controlled thermodynamic integral" (CTI)

log(p(y)) = ∫_0^1 E_{θ|y,t}[log(p(y|θ)) + h(θ|y, t)] dt.     (10)

To use CTI to estimate the model (log-)evidence, we need to specify both (i) the polynomial coefficients φ(y, t) and (ii) an appropriate discretization 0 = t_0 < t_1 < ··· < t_m = 1 (the temperature ladder) of the one-dimensional integral. Specification of both the polynomial coefficients and the temperature ladder should be targeted at minimizing the variance of CTI (see below).

Optimal Coefficients and Ladders
We derive the jointly optimal, variance-minimizing polynomial coefficients and temperature ladder. For the latter, note that there is a surjective mapping from partitions 0 = t_0 < t_1 < ··· < t_m = 1 to probability distributions on [0, 1] with density function p(t), given by requiring that t_i be the (i/m)-quantile of p(t). For the development below, it is convenient to focus on optimizing the density p(t), mapping back to the temperature ladder during implementation (see Section 3.3). For clarity of exposition, write ℓ(θ) = log(p(y|θ)), where we temporarily suppress dependence on both the data y and the model m. The CTI identity can be rewritten as

log(p(y)) = E_{θ,t|y}[(ℓ(θ) + h(θ|t)) / p(t)],     (11)

where the final expectation is taken with respect to the distribution with density p(θ, t|y) = p(θ|y, t) p(t). Under the approximation that Monte Carlo samples are obtained independently ("perfect transitions"), the variance of the estimator of the model (log-)evidence is given by

V = (1/N) { ∫_0^1 E_{θ|y,t}[(ℓ(θ) + h(θ|t))^2] / p(t) dt − log(p(y))^2 },     (12)

where N is the number of Monte Carlo samples. The optimal choices of polynomial coefficients φ(t) and temperature ladder p(t) are defined as the pair that jointly minimize Equation (12). Specifically, we seek to minimize the Lagrangian

L[p, φ] = ∫_0^1 E_{θ|y,t}[(ℓ(θ) + h(θ|t))^2] / p(t) dt + λ ( ∫_0^1 p(t) dt − 1 )     (13)

over densities p : [0, 1] → [0, ∞) and coefficient functions φ : [0, 1] → R^e, where e is the dimension of φ and depends on the degree of the polynomial Q(θ|φ) that is being employed. Here, λ is a Lagrange multiplier that is used to ensure ∫ p(t) dt = 1. Below we consider degree 1 polynomials Q(θ|φ) = θ^T φ, so that h(θ|t) = φ(t)^T u(θ|t), but the derivation applies analogously to higher degree polynomials, as explained in Appendix C. The solution (p*, φ*) of the Lagrangian optimization problem (Equation (13)) is

φ*(t) = −V_{θ|y,t}[u(θ)]^{-1} E_{θ|y,t}[ℓ(θ)u(θ)],     (14)
p*(t) ∝ { E_{θ|y,t}[ℓ(θ)^2] − E_{θ|y,t}[ℓ(θ)u(θ)]^T V_{θ|y,t}[u(θ)]^{-1} E_{θ|y,t}[ℓ(θ)u(θ)] }^{1/2},     (15)

where V_{θ|y,t}[u(θ)] and E_{θ|y,t}[ℓ(θ)u(θ)] denote, respectively, the variance and cross-covariance matrices (since E_{θ|y,t}[u(θ)] = 0).
Notice that the optimal temperature ladder for CTI is not the same as the optimal ladder for standard TI, which is given by p*(t) ∝ (E_{θ|y,t}[ℓ(θ)^2])^{1/2}. It can be shown (Rubinstein and Marcus 1985) that the choice of polynomial coefficients φ = φ* is characterized as the minimizer of the variance ratio

R(t) = V_{θ|y,t}[ℓ(θ) + h(θ|t)] / V_{θ|y,t}[ℓ(θ)],     (16)

and at this minimum

R(t) = 1 − ρ(t)^2,     (17)

where ρ(t) is the (canonical) correlation between ℓ(θ) and the control variates under the power posterior, so that greater variance reduction is expected when a linear combination of the elements of the vector u(θ) is highly correlated with the target function ℓ(θ).

Implementation
For most models of interest, neither Equation (14) nor Equation (15) possesses a closed-form expression, and it becomes necessary to employ estimates or approximations of the optimal values. We begin by noting that Equation (14) in fact defines the optimal, variance-minimizing coefficients independently of the choice of temperature ladder p(t); this is directly verified from the Euler-Lagrange equations applied to φ : [0, 1] → R^e with p(t) held fixed. This observation allows us to discuss these two aspects of the implementation separately.

Polynomial Coefficients
Optimal coefficients for control variates are typically estimated from the same sequence of MCMC samples that will subsequently be used to compute the controlled expectations (Robert and Casella 2004). Specifically, to estimate the optimal control variate coefficients φ*(t), we exploit the MCMC samples to estimate both the covariance V_{θ|y,t}[u(θ)] and the cross-covariance E_{θ|y,t}[ℓ(θ)u(θ)]. These estimates are then plugged directly into Equation (14) to obtain an estimate of the optimal coefficients. Further discussion of "plug-in" estimators for control coefficients can be found in Dellaportas and Kontoyiannis (2012).
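The plug-in step can be sketched as follows for degree 1 polynomials; the synthetic numbers are assumptions, and in practice ell and u would be the log-likelihood values and score vectors from the MCMC output at a single rung.

```python
import numpy as np

def plugin_phi(ell, u):
    """Plug-in estimate of phi*(t) = -V[u]^{-1} E[ell u] (Equation (14))
    from MCMC output at one rung: ell is (N,), u is (N, d)."""
    u_c = u - u.mean(0)                  # centre; E[u] = 0 holds only in theory
    V = u_c.T @ u_c / (len(u) - 1)       # sample covariance of the score
    c = u_c.T @ (ell - ell.mean()) / (len(u) - 1)   # sample cross-covariance
    return -np.linalg.solve(V, c)

# Synthetic check (assumed toy numbers): ell is a noisy linear function of u,
# so the controlled values ell + u @ phi should have far smaller variance.
rng = np.random.default_rng(2)
u = rng.normal(size=(4000, 3))
ell = u @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=4000)
phi = plugin_phi(ell, u)
controlled = ell + u @ phi
print(phi, np.var(controlled) / np.var(ell))
```

The recovered coefficients are close to the negated regression weights, and the variance of the controlled values is smaller by several orders of magnitude.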

Temperature Ladder
To estimate the optimal temperature ladder of Equation (15), one obvious numerical approach would be to first estimate p*(t) up to proportionality over a uniform grid {t_i}, using a preliminary MCMC run to estimate both E_{θ|y,t}[ℓ(θ)^2] and the covariance and cross-covariance matrices V_{θ|y,t}[u(θ)] and E_{θ|y,t}[ℓ(θ)u(θ)]. Nonparametric density estimation could then be applied to obtain an estimate of the optimal ladder {t_i}. However, this two-step procedure is computationally burdensome. Neal (1996) showed that a geometric temperature ladder is optimal for annealing on the scale parameter of a Gaussian, and Behrens, Friel, and Hurn (2012) extended this result to a wider class of target distributions that includes Gaussians. In this article, we fix a quintic temperature ladder t_i = (i/50)^5 for use in all applications; this ladder is widely used in the TI literature and has demonstrated strong performance in empirical studies (e.g., Calderhead and Girolami 2009; Friel, Hurn, and Wyse 2014a). The question of how to select appropriate temperature ladders in practice is an ongoing area of research, and the recent contributions of Miasojedow, Moulines, and Vihola (2012), Behrens, Friel, and Hurn (2012), Zhou, Johansen, and Aston (2013), and Friel, Hurn, and Wyse (2014a) are compatible with our methodology.
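A natural way to map any estimated ladder density p(t) back to a temperature ladder is to take t_i as its (i/m)-quantile, inverting the partition-to-density correspondence of Section 3.2 (the construction and grid sizes below are our own assumptions). As a sanity check, the density p(t) ∝ t^{-4/5} has quantile function q^5, so it should reproduce the quintic ladder t_i = (i/50)^5 used in this article.

```python
import numpy as np

def ladder_from_density(p_vals, grid, m):
    # Unnormalised density values p_vals on an increasing grid in (0, 1]:
    # build a trapezoidal CDF, then take t_i as the (i/m)-quantile by
    # inverse interpolation.
    cdf = np.concatenate([[0.0],
                          np.cumsum(0.5 * (p_vals[1:] + p_vals[:-1]) * np.diff(grid))])
    cdf /= cdf[-1]
    return np.interp(np.arange(m + 1) / m, cdf, grid)

# p(t) ∝ t^{-4/5}, evaluated on a geometric grid so that the very small
# rungs near t = 0 are resolved
grid = np.geomspace(1e-30, 1.0, 4000)
ladder = ladder_from_density(grid ** (-0.8), grid, 50)
print(ladder[[1, 25, 50]])
```

The heavy concentration of rungs near t = 0 mirrors the skew of the quintic ladder, which targets the high-variance region of the TI integrand.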

Quadrature
The second-order quadrature method of Friel, Hurn, and Wyse (2014a), described in Appendix A, additionally requires us to estimate the variance V_{θ|y,t}[ℓ(θ)] at each rung of the temperature ladder.
In the experiments below, we use ZV control variates to estimate this variance via the identity

V_{θ|y,t}[ℓ(θ)] = E_{θ|y,t}[ℓ(θ)^2] − E_{θ|y,t}[ℓ(θ)]^2,

applying control variates in the estimation of each of these expectations.
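The decomposition V[ℓ] = E[ℓ^2] − E[ℓ]^2 combines with any controlled mean estimator; a minimal sketch on an assumed Gaussian toy target (all numbers illustrative) applies a degree 1 ZV correction to each expectation separately:

```python
import numpy as np

def zv_mean(f, u):
    # Degree 1 ZV-controlled estimate of E[f] from sample values f (N,)
    # and score vectors u (N, d), using plug-in coefficients
    u_c = u - u.mean(0)
    phi = -np.linalg.solve(u_c.T @ u_c, u_c.T @ (f - f.mean()))
    return (f + u @ phi).mean()

# Assumed toy target: theta ~ N(mu, 1) with ell(theta) = theta, so the true
# values are E[ell] = mu, E[ell^2] = mu^2 + 1, and V[ell] = 1.
rng = np.random.default_rng(3)
mu = 3.0
theta = rng.normal(mu, 1.0, size=20000)
u = -(theta - mu)[:, None]          # score of N(mu, 1), as an (N, 1) array

var_hat = zv_mean(theta**2, u) - zv_mean(theta, u) ** 2
print(var_hat)
```

The controlled estimate of E[ℓ] is exact here (ℓ is affine in the score), so the residual error in the variance estimate comes entirely from the E[ℓ^2] term.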

Applications
We present several empirical studies that compare CTI to the state-of-the-art TI estimators of Friel, Hurn, and Wyse (2014a). In all applications below, we base estimation on the output of a population MCMC sampler (Jasra, Stephens, and Holmes 2007) limited to N iterations at each of the 51 rungs of the temperature ladder (a total of 51 × N evaluations of the likelihood function). In brief, the within-temperature proposal was provided by the manifold Metropolis-adjusted Langevin algorithm (mMALA) of Girolami and Calderhead (2011), while the between-temperature proposal randomly chooses a pair of (inverse) temperatures t i and t j , proposing to swap their state vectors with probability given by the Metropolis-Hastings ratio (Calderhead and Girolami 2009). To ensure fairness, the same samples were used as the basis for all estimators of model evidence, ensuring that all estimators require essentially the same amount of computation (since the score function is computed as a matter of course in mMALA). Moreover, to explore the statistical properties of the estimators themselves, we generated 100 independent realizations of the population MCMC and thus 100 realizations of each estimator. Full details are provided in the supplementary materials.

Known Precision
We begin with an analytically tractable problem in Bayesian linear regression. The log-likelihood function is given by

ℓ(β) = −(n/2) log(2πσ^2) − (1/(2σ^2)) (y − Xβ)^T (y − Xβ),

where y is n × 1, X is n × d, and β is d × 1. In the simulations below, we took σ = 1, d = 3, and β = [0, 1, 2]^T. The design matrix X was populated with n = 100 rows by drawing each entry independently from the standard normal distribution, and data y were then generated from N(Xβ, σ^2 I_{n×n}); both X and y were then fixed for all experiments below. From the Bayesian perspective, we take a conjugate prior β ∼ N(0, ζ^2 I_{d×d}) with ζ = 1.
In this section, we assume σ is fixed and known, but we relax this assumption in the next section. Thus, the unknown model parameters here are θ = β ∈ R d and we aim to compute the evidence p(y) by marginalizing over these parameters. This example is an ideal benchmark since it is possible to obtain many relevant quantities in closed form; see Appendix D.1 for full details.
Before applying CTI, we are required to check that the sufficient conditions for the unbiasedness of ZV estimators are satisfied (see Appendix B). This amounts to noticing that the tails of the power posterior p(β|y, t) decay exponentially in β (Appendix D.1). Using the plug-in estimates (Equation (18)), we obtain estimates of the optimal coefficients φ*, which are shown in supplementary Figure S6. For degree 2 polynomials, we see that the plug-in estimator is deterministic. Indeed, by direct calculation,

u(β|y, t) = −Σ(t)^{-1} (β − μ(t)),

an invertible affine transformation of the parameter vector β, where Σ(t) = ((t/σ^2) X^T X + (1/ζ^2) I)^{-1} and μ(t) = (t/σ^2) Σ(t) X^T y. This allows us to intuit that CTI based on degree 2 polynomials will produce an exact estimate of the (log-)evidence (up to quadrature error), as we explain below. Indeed, by another invertible affine transformation, we can map u(β|y, t) → y − Xβ which, combined with a suitable degree 2 trial function Q(β|φ), produces a quantity proportional to (y − Xβ)^T (y − Xβ) that is perfectly correlated with the log-likelihood under the power posterior. It then follows from Equation (16) that CTI will possess zero variance. This argument is made rigorous in the supplementary materials.
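The exactness at a single rung can be verified numerically. The sketch below (problem sizes, rung t, and sample count are our own assumptions) samples the Gaussian power posterior directly, builds degree 2 ZV features u_j and β_i u_j + δ_ij (all mean zero under the power posterior), and fits the coefficients by least squares; the controlled log-likelihood values collapse to a constant equal to E_{β|y,t}[ℓ(β)].

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy sizes for the known-precision linear model of this section
n, d, sigma, zeta, t = 30, 3, 1.0, 1.0, 0.3
X = rng.normal(size=(n, d))
y = X @ np.array([0.0, 1.0, 2.0]) + sigma * rng.normal(size=n)

# Gaussian power posterior p(beta | y, t) = N(mu_t, Sigma_t)
P = (t / sigma**2) * X.T @ X + np.eye(d) / zeta**2    # precision Sigma(t)^{-1}
Sigma_t = np.linalg.inv(P)
mu_t = Sigma_t @ ((t / sigma**2) * X.T @ y)
beta = rng.multivariate_normal(mu_t, Sigma_t, size=500)

resid = y - beta @ X.T                                # (N, n) residuals
ell = -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 / sigma**2 * (resid**2).sum(1)
u = (t / sigma**2) * resid @ X - beta / zeta**2       # score of the power posterior

# Degree 2 ZV features: u_j and beta_i * u_j + delta_ij
feats = [u]
for i in range(d):
    f = beta[:, [i]] * u
    f[:, i] += 1.0
    feats.append(f)
Z = np.hstack(feats)

# Least-squares plug-in coefficients, then the controlled values
phi, *_ = np.linalg.lstsq(Z - Z.mean(0), -(ell - ell.mean()), rcond=None)
controlled = ell + Z @ phi
print(controlled.std(), controlled.mean())
```

Because ℓ(β) is a quadratic form in β and the score is affine in β, ℓ lies exactly in the span of the degree 2 features, so the least-squares residual, and hence the estimator variance, vanishes up to floating-point error.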
In supplementary Figure S7, we plot 100 independent estimates of the integrand E β|y,t [ (β)] at each of the 51 temperatures in the ladder for polynomial trial functions of degree 0 (i.e., standard TI), 1, and 2. It is apparent that estimator variance is greatest at lower values of t; this motivates the heavily skewed temperature ladder used by ourselves and others, as we wish to target our computational effort on this high-variance region. We quantify the reduction in estimator variance at an (inverse) temperature t using the variance ratio R(t ) as estimated from the MCMC samples. Figure 1 shows that degree 1 polynomials achieve (on average) variance reduction at all temperatures, with the greatest reduction occurring in the region where t is small. This is encouraging as the region where t is small is most important for variance reduction of TI, as discussed above. For degree 2 polynomials, we have R(t ) = 0 for all t, which recapitulates the exactness of the CTI estimator in this example.
Finally, we explore the quality of the estimators of the model evidence themselves. For this model the (log-)evidence is available in closed form (Appendix D.1), which allows us to compute the mean square error (MSE) over all 100 independent realizations of each estimator. Results, shown in Table 1, demonstrate that CTI with degree 2 polynomials achieves a two-fold reduction in MSE compared to standard TI when both estimators are based on first-order quadrature. However, first-order quadrature is known to lead to significant estimator bias (Friel, Hurn, and Wyse 2014a) and, when estimators are instead based on the more accurate second-order quadrature, CTI is seen to be approximately 10,000× better than TI in terms of MSE; a dramatic difference. We also compared TI approaches against annealed importance sampling (AIS; Neal 2001), as described in the supplementary materials. In this case, CTI (degree 2) is over 10,000× more accurate than AIS (supplementary Figure S9(a)).

Unknown Precision (Radiata Pine)
We now relax the assumption of known precision τ = 1/σ^2; we will see that in these circumstances CTI is no longer exact. Specifically, we consider data from Williams (1959) on n = 42 specimens of radiata pine. This dataset is well known in the multivariate statistics literature and was recently used by Friel and Wyse (2012) and Friel, Hurn, and Wyse (2014a) to benchmark estimators of model evidence. The data consist of the maximum compression strength parallel to the grain y_i as a function of density x_i and density adjusted for resin content z_i. Density x_i and adjusted density z_i have comparable standard deviations of 4.5445 and 4.6593, respectively. It is wished to determine whether density or resin-adjusted density is the better predictor of compression strength parallel to the grain. Following Friel, Hurn, and Wyse (2014a), we consider Bayesian model comparison between a pair of competing models:

Model 1: y_i = α + β(x_i − x̄) + ε_i, ε_i ∼ N(0, τ^{-1}),
Model 2: y_i = γ + δ(z_i − z̄) + η_i, η_i ∼ N(0, λ^{-1}).

Here x̄ and z̄ are the sample means of the x_i and z_i, respectively. The priors for (α, β) and (γ, δ) are both Gaussian with common mean B_0 = [3000, 185]^T and precisions τQ_0 and λQ_0, where Q_0 = diag(0.06, 6). Both τ and λ were assigned gamma priors with shape 6 and rate 4 × 300^2. To compare these models we consider estimates of the log-Bayes factor log(B_21), obtained as the difference between independent estimates of the log-evidence of each model.

Table 1. Bayesian linear regression with (a) known precision and (b) unknown precision. Mean square error (MSE) for estimates of the log-evidence in (a) and the log-Bayes factor in (b), based on 100 independent runs of population MCMC, along with estimates of the standard errors (SE). Here, deg(Q) is the degree of the ZV polynomial Q(θ), with 0 denoting standard TI; Quadr. is the order of the numerical quadrature scheme; N is the number of MCMC iterations.
This application is interesting for two reasons: First, one can directly calculate the log-Bayes factor for this example as log(B 21 ) = 8.7086, so that we have a gold standard performance benchmark. Second, when the precision τ (or λ) is unknown, ZV methods are no longer exact. We therefore have an opportunity to assess the performance of CTI in a nontrivial setting.
Formulas in Appendix D.2 demonstrate that the sufficient condition for unbiasedness of ZV methods is satisfied. Results in Figure 2 show that CTI (degree 1) achieves a modest reduction in variance across temperatures t, whereas CTI (degree 2) achieves a massive variance reduction. Computing the MSE relative to the true log-Bayes factor, we see that CTI (degree 2) is over 500× more accurate than TI, though the variance of the estimator is not identically zero in this case (Table 1). As before, MSE is further reduced by applying second-order quadrature. AIS performed slightly worse than the methods based on TI in this example (supplementary Figure S9(b)).

Bayesian Logistic Regression (Pima Indians)
Here, we examine data that contain instances of diabetes and a range of possible diabetes indicators for n = 532 women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona. This dataset is frequently used as a benchmark for supervised learning methods (e.g., Marin and Robert 2011). Friel, Hurn, and Wyse (2014a) considered five predictors of diabetes recorded for this group: number of pregnancies (NP), plasma glucose concentration (PGC), body mass index (BMI), diabetes pedigree function (DP), and age (AGE). Diabetes incidence y_i ∈ {0, 1} in person i is modeled by the binomial likelihood

p(y|β) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i},

where the probability of incidence p_i for person i is related to the covariates x_{i,•} through logit(p_i) = x_{i,•}β. Consider Bayesian comparison between the two candidate models

Model 1: logit(p) = β_0 + β_1 NP + β_2 PGC + β_3 BMI + β_4 DP     (26)
Model 2: logit(p) = β_0 + β_1 NP + β_2 PGC + β_3 BMI + β_4 DP + β_5 AGE     (27)

subject to the common prior belief β ∼ N(0, τ^{-1} I). Following Friel and Wyse (2012), we set τ = 0.01. The unbiasedness criterion in Appendix B is seen to be satisfied and we have

u(β|y, t) = t X^T (y − p) − τβ,

where the ith row of X is x_{i,•} and p is the vector of incidence probabilities. In Figure 3, we see that degree 1 ZV methods achieve a greater variance reduction at smaller t, and moreover that degree 2 ZV methods continue to achieve a substantial variance reduction at all temperatures. In Table 2, we display the mean of each estimator, computed over all 100 independent runs of population MCMC, together with the standard deviation of this collection of estimates; the best-performing estimators are highlighted in bold.
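The score of the power posterior for this logistic model, u(β|y, t) = t X^T(y − p) − τβ, is cheap to evaluate; a sketch (function and variable names are our own), with a finite-difference check in the test:

```python
import numpy as np

def score_power_posterior(beta, X, y, t, tau):
    """u(beta | y, t) = t X^T (y - p) - tau * beta for Bayesian logistic
    regression with prior beta ~ N(0, tau^{-1} I)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # incidence probabilities
    return t * X.T @ (y - p) - tau * beta

# Assumed toy data for a quick check
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
beta0 = rng.normal(size=4)
y = (rng.random(50) < 1.0 / (1.0 + np.exp(-X @ beta0))).astype(float)

u = score_power_posterior(beta0, X, y, t=0.3, tau=0.01)
print(u)
```

This is the quantity that is already available whenever differential-geometric MCMC is used to sample the power posterior.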

Limitations of CTI
We have demonstrated, using standard benchmark datasets, that CTI is well suited to Bayesian model comparison between regression models. Regression analyses continue to be widely applicable in disciplines such as econometrics, epidemiology, political science, psychology, and sociology, so that these findings have significant implications. Nevertheless in many disciplines such as engineering, geophysics, and systems biology, statistical models are significantly more complex, often based on a mechanistic understanding of the underlying process. Below we provide such an example and find that CTI offers little improvement over TI; this allows us to explore the limitations of our approach and, conversely, to understand in what circumstances it is likely to be successful.

A Negative Example (Goodwin Oscillator)
We consider nonlinear dynamical systems of the form

dx/dt = f(x, θ, t), x(0) = x_0,     (29)

with unknown parameters θ. We assume that only a subset of the variables is observed under noise, so that x = [x_a, x_b] and y is a d × n matrix of observations of the variables x_a. Model comparison for systems specified by nonlinear differential equations is known to be profoundly challenging, even when the number of models is small. Write s_1 < s_2 < ··· < s_n for the times at which observations are obtained, such that y(s_j) = y_{•,j}. We consider a Gaussian observation process with likelihood

p(y|θ, x_0) = ∏_{j=1}^n N(y_{•,j}; x_a(s_j; θ, x_0), σ^2 I),     (30)

where x_a(s_j; θ, x_0) denotes the solution of the system in Equation (29). For the Gaussian observation model, it can be shown that a sufficient condition for unbiasedness of ZV is that the prior density p(θ) vanishes faster than r^{−(d+k−2)} as r = ‖θ‖ → ∞. Here d = dim(θ) and k is the degree of the polynomial that is being employed (see Appendix B). Assuming the sufficient condition for ZV is satisfied, we have

u_i(θ|y, t) = (t/σ^2) ∑_j ∑_k (y_{k,j} − x_k(s_j)) S^i_{j,k} + ∂ log(p(θ))/∂θ_i,     (31)

where S^i is a matrix of sensitivities with entries S^i_{j,k} = ∂x_k(s_j)/∂θ_i. Note that in Equation (31), the index k ranges over 1 ≤ k ≤ dim(x_a), corresponding only to the observed variables. In general, the sensitivities S^i will be unavailable in closed form, but they may be computed numerically by augmenting the system of ordinary differential equations (ODEs) in Equation (29), as described in Appendix E. Indeed, these sensitivities are already computed when differential-geometric sampling schemes are employed, so the evaluation of Equation (31) incurs negligible computational cost.
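The augmentation of the ODE system can be sketched on a one-state toy model (not the Goodwin system; the model and step size are assumptions): for dx/dt = f(x, θ), the sensitivity S = ∂x/∂θ satisfies dS/dt = (∂f/∂x) S + ∂f/∂θ, which is integrated alongside the state.

```python
import numpy as np

# Toy model (assumed): dx/dt = -theta * x with x(0) = x0, so analytically
# x(T) = x0 * exp(-theta * T) and S(T) = dx(T)/dtheta = -T * x(T).
theta, x0 = 0.7, 2.0

def rhs(z):
    x, S = z
    # f = -theta * x;  dS/dt = (df/dx) S + df/dtheta = -theta * S - x
    return np.array([-theta * x, -theta * S - x])

def rk4(z, h, steps):
    # Classical fourth-order Runge-Kutta on the augmented system
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1)
        k3 = rhs(z + 0.5 * h * k2)
        k4 = rhs(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

x_T, S_T = rk4(np.array([x0, 0.0]), h=0.01, steps=100)   # integrate to T = 1
print(x_T, S_T)
```

The integrated state and sensitivity match the analytic solution to high accuracy; for a g-state system with p parameters the same construction yields g × p additional sensitivity equations.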
We focus on a dynamical model of oscillatory enzymatic control due to Goodwin (1965), which was recently considered in the context of Bayesian model comparison by Calderhead and Girolami (2009). This kinetic model, specified by a system of g ODEs, describes how a negative feedback loop between protein expression and mRNA transcription can induce oscillatory dynamics, as experimentally observed in circadian regulation (Locke, Millar, and Turner 2005). A full specification is provided in Appendix E. As shown in Calderhead and Girolami (2009), the Goodwin oscillator induces a highly multi-modal posterior distribution that renders estimation of the model evidence extremely challenging. We consider Bayesian comparison of two models: a simple model with one intermediate protein species (g = 3) and a more complex model with two intermediate protein species (g = 4). Figure 4 demonstrates that in this extremely challenging example the benefits of the control variate schemes that we previously observed are heavily reduced. Since the variance ratio R(t) is related to the canonical correlation between the control variates and the log-likelihood under the power posterior (Equation (17)), we hypothesize that the extreme multi-modality of the power posterior distribution limits the extent to which strong canonical correlation can be achieved. This is confirmed in Figure 5, where we plot values of the target function ℓ(θ) against the control variates h(θ) obtained from MCMC sampling in the posterior. We observe much reduced correlation in the case of the Goodwin oscillator, a consequence of the complex nature of the likelihood surface. Turning to the Bayes factor itself, in Table 2 we display the mean of each estimator of the log-Bayes factor, together with the standard deviation of this collection of estimates. We find that CTI (degree 1) provides a negligible reduction in variance and CTI (degree 2) provides an insignificant 15% reduction in variance.
In this example, AIS consistently produced lower estimates for log-Bayes factors (supplementary Figure S9(d)). This likely reflects the low number N of Monte Carlo iterations that are characteristic of such computationally demanding applications.

Discussion
To the best of our knowledge, this is the first article to consider the use of control variates for the purpose of estimating the model evidence. Motivated by previous empirical studies, we focused on TI estimators for the model (log-)evidence. However, in general, control variate techniques could be leveraged whenever estimators of the evidence (or Bayes factors) take the form of a Monte Carlo expectation. General control variate schemes for MCMC rely on the fact that the expectation of the control variates along the MCMC sample path will be approximately zero. We thus draw a distinction between "equilibrium" Monte Carlo estimators for the model evidence, such as TI and path sampling, which require the underlying Markov chain to have converged, and "nonequilibrium" estimators such as AIS and sequential Monte Carlo that do not require convergence. The former class is amenable to existing control variate schemes, whereas the latter is not. This motivates the "equilibration" of these nonequilibrium estimators.
Given its close connection with TI (Gelman and Meng 1998), we considered whether an equilibrated version of AIS, which jointly samples from all rungs of the temperature ladder at once, would benefit from application of ZV control variates. In contrast to CTI, the controlled AIS estimator (CAIS) demonstrated an increase in variance compared to standard AIS. Full details are provided in the supplementary materials, in addition to results on each of the applications considered in this article. To understand these counter-intuitive results, notice that control variates must be constructed simultaneously over all m rungs of the temperature ladder, so that for degree 1 polynomials we have to jointly estimate md coefficients, where d denotes the number of model parameters, and for degree 2 polynomials we have to jointly estimate md(d + 3)/2 polynomial coefficients. To achieve this using the plug-in principle, we must estimate covariance matrices containing O(m 2 d 2 ) and O(m 2 d 4 ) entries, respectively. Our results are therefore consistent with the finding that poor estimation of the polynomial coefficients can actually increase estimator variance (Glasserman 2004). It remains unclear how to develop control variates for these nonequilibrium estimators.
We exploited the ZV control variate scheme due to Mira, Solgi, and Imparato (2013), which permits the automatic construction of control variates for any statistical model in which the gradient of the log-likelihood (and the log-prior) are available. More generally, we envisage that for models where these gradients are unavailable in closed form, the use of numerical approximations could provide a successful strategy (Calderhead and Sustik 2012). For a large class of intractable likelihood models, Friel, Mira, and Oates (2014b) showed how to obtain a stochastic approximation to ZV control variates that could be directly leveraged within our methodology. Results on benchmark datasets demonstrate that CTI outperforms standard TI, but that the difference in performance is reduced when the likelihood function is strongly multi-modal. A natural direction for further research is to explore whether alternative control variates are better suited to these challenging problems. The recent work of Oates, Girolami, and Chopin (2014) indicates that nonparametric alternatives to ZV control variates offer much potential.
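A minimal sketch of the degree-1 ZV construction follows, using exact Gaussian samples in place of MCMC output so that the score of the target is available in closed form:

```python
import numpy as np

# Degree-1 zero-variance (ZV) control variates in the style of Mira, Solgi,
# and Imparato (2013), on a toy target pi = N(0, 1) so that the score
# z(theta) = d/dtheta log pi(theta) = -theta is exact, with E[z] = 0 under pi.
rng = np.random.default_rng(1)
theta = rng.normal(size=200_000)     # exact samples stand in for MCMC output

f = theta + theta**2                 # integrand; true posterior mean is 1
z = -theta                           # score of the standard normal target

# Plug-in estimate of the optimal coefficient a* = -Var(z)^{-1} Cov(z, f).
a = -np.cov(z, f)[0, 1] / np.var(z)
f_zv = f + a * z                     # controlled integrand, same expectation

print(f.mean(), np.var(f))
print(f_zv.mean(), np.var(f_zv))     # variance reduced, mean unchanged
```

For this integrand the optimal degree-1 coefficient cancels the linear part of f exactly, which is the "zero-variance" ideal that motivates the name of the scheme.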
CTI clearly inherits the theoretical and methodological challenges that are associated with control variates more generally. In particular, ZV control variates are not parameterization-invariant and it is unclear how to select an optimal variance-minimizing parameterization. Pertinent to CTI in particular, the optimal coefficients φ*(t) vary smoothly with (inverse) temperature t (supplementary Figure S6(b)), yet the conventional plug-in approach to estimation treats each rung t_i of the temperature ladder independently, leading to rough trajectories (supplementary Figure S6(a)). It would therefore be interesting to design an information-sharing scheme that jointly estimates all coefficients.
The development of low-cost computational approaches to Bayesian model comparison is necessary for the widespread adoption of Bayesian methodology in hypothesis-driven research. When the number of models is not large and model dimensionality is not high, control variate strategies offer a promising route toward achieving this goal.

(1/2σ²)(y − Xμ(t))^T (y − Xμ(t)) − (1/2σ²) tr(X^T X Σ(t)),

and the model evidence is

p(y) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2) y^T Σ^{−1} y),   (C.2)

where Σ = σ²I + ζ²XX^T.
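The closed-form evidence (C.2), with Σ = σ²I + ζ²XX^T, can be evaluated directly as a sanity check; the sketch below uses illustrative (hypothetical) values of σ, ζ, X, and y:

```python
import numpy as np

# Evaluate the log of the closed-form evidence p(y) = N(y; 0, Sigma),
# Sigma = sigma^2 I + zeta^2 X X^T, at hypothetical parameter values.
rng = np.random.default_rng(2)
n, d = 20, 3
sigma, zeta = 1.0, 2.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

Sigma = sigma**2 * np.eye(n) + zeta**2 * X @ X.T
sign, logdet = np.linalg.slogdet(Sigma)      # stable log-determinant
log_evidence = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
                - 0.5 * y @ np.linalg.solve(Sigma, y))
print(log_evidence)
```

Working on the log scale with `slogdet` avoids the underflow that evaluating p(y) itself would cause for even moderate n.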
Write X for the design matrix with ith row [1, x_i]. The model evidence, which is the object we wish to estimate, is given in closed form by

p(y) = (2π)^{−n/2} (|Q_0| / |Q_n|)^{1/2} (Γ(a_n) / Γ(a_0)) (b_0^{a_0} / b_n^{a_n}),

where a_n = a_0 + n/2, Q_n = Q_0 + X^T X, B_n = Q_n^{−1}(X^T y + Q_0 B_0), and b_n = b_0 + (1/2)(y^T y + B_0^T Q_0 B_0 − B_n^T Q_n B_n). Derivations for Model 2 are analogous.
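The posterior quantities a_n, Q_n, and B_n defined above are cheap to compute; a minimal sketch with hypothetical prior settings a_0, Q_0, B_0 and simulated data:

```python
import numpy as np

# Compute the conjugate posterior quantities a_n, Q_n, B_n for Model 1,
# with hypothetical prior settings and simulated data (X, y).
rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # ith row is [1, x_i]
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

a_0 = 1.0
Q_0 = np.eye(2)
B_0 = np.zeros(2)

a_n = a_0 + n / 2
Q_n = Q_0 + X.T @ X
B_n = np.linalg.solve(Q_n, X.T @ y + Q_0 @ B_0)        # Q_n^{-1}(X^T y + Q_0 B_0)
print(a_n, B_n)
```

Solving the linear system rather than forming Q_n^{−1} explicitly is the standard numerically stable choice.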

Appendix E: Formulas for Goodwin Oscillator
The Goodwin oscillator with g species is given by

ẋ_1 = a_1 / (1 + a_2 x_g^ρ) − α x_1,
ẋ_j = k_{j−1} x_{j−1} − α x_j,   j = 2, …, g.   (E.4)

Here, x_1 represents the concentration of mRNA for a target gene and x_2 represents its corresponding protein product. Additional variables x_3, …, x_g represent intermediate protein species that facilitate a cascade of enzymatic activation that ultimately leads to a negative feedback, via x_g, on the rate at which mRNA is transcribed. The solution x(s; θ, x_0) of this dynamical system depends upon the synthesis rate constants a_1, k_1, …, k_{g−1} and the degradation rate constants a_2, α. The Goodwin oscillator permits oscillatory solutions only when ρ > 8; following Calderhead and Girolami (2009), we therefore set ρ = 10 as a fixed parameter. A g-variable Goodwin model as described above therefore has g + 2 uncertain parameters (a_1, a_2, k_1, …, k_{g−1}, α). The Goodwin oscillator does not permit a closed-form solution, meaning that each evaluation of the likelihood function requires numerical integration of the system in Equation (E.4). Due to the substantial computational challenges associated with model comparison in this setting, we considered only 10 independent runs of population MCMC, each using only N = 1000 iterations.
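Each likelihood evaluation therefore involves a call to an ODE solver. A minimal sketch, assuming the repressive Hill form of the Goodwin model used by Calderhead and Girolami (2009) with g = 3, ρ = 10 fixed, and illustrative parameter values:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Numerically integrate a g = 3 Goodwin oscillator; all parameter values
# and initial conditions below are illustrative, not fitted.
a1, a2, alpha, rho = 1.0, 3.0, 0.5, 10.0
k = [2.0, 1.0]                       # k_1, ..., k_{g-1}

def goodwin(s, x):
    dx = np.empty_like(x)
    # mRNA: synthesis repressed by the final species x_g, plus degradation
    dx[0] = a1 / (1.0 + a2 * x[-1]**rho) - alpha * x[0]
    # downstream species: activated by the previous one, degraded at rate alpha
    for j in range(1, len(x)):
        dx[j] = k[j - 1] * x[j - 1] - alpha * x[j]
    return dx

sol = solve_ivp(goodwin, (0.0, 80.0), y0=[0.1, 0.2, 0.3],
                dense_output=True, rtol=1e-8, atol=1e-10)
print(sol.y[:, -1])                  # state at the final time
```

In a likelihood evaluation, `sol` would be interrogated at the observation times (via `sol.sol(s)` here, since `dense_output=True`) and compared with the data under the noise model.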
In practice, we work with the log-transformed parameters θ. In particular, this allows us to verify that the ZV methods are valid, since the tails of p(θ) vanish exponentially quickly. The sensitivities S_{ij,k}, defined in the main text, satisfy

(d/ds)(∂x_k/∂θ_i) = ∑_{l=1}^{g} (∂f_k/∂x_l)(∂x_l/∂θ_i) + ∂f_k/∂θ_i,   (E.5)

where f_k denotes the kth component of the right-hand side of Equation (E.4) and ∂x_k/∂θ_i = 0 at s = 0. Equation (E.5) provides a route to compute the sensitivities numerically, when they cannot be obtained analytically, by augmenting the state vector of the dynamical system to include the S_{ij,k}.
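The augmentation trick behind Equation (E.5) is easiest to see on a toy model. Here we append the sensitivity S = ∂x/∂θ of the hypothetical one-parameter system ẋ = −θx, x(0) = 1 (not the Goodwin model) to the state and integrate both jointly; for this model the sensitivity equation is Ṡ = −x − θS with S(0) = 0, and the exact answer S(s) = −s e^{−θs} is available for comparison:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Forward sensitivity analysis by state augmentation, on a toy ODE simple
# enough to check by hand: x' = -theta * x, so S = dx/dtheta satisfies
# S' = -x - theta * S, with x(0) = 1 and S(0) = 0.
theta = 0.7

def augmented(s, state):
    x, S = state
    return [-theta * x, -x - theta * S]

sol = solve_ivp(augmented, (0.0, 2.0), y0=[1.0, 0.0], rtol=1e-10, atol=1e-12)
x_T, S_T = sol.y[:, -1]
print(S_T, -2.0 * np.exp(-theta * 2.0))  # numerical vs analytic sensitivity
```

For the Goodwin oscillator the augmented system contains g states plus g(g + 2) sensitivities, but the construction is identical.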

Supplementary Materials
The supplementary materials contain additional theoretical details and empirical results.