Scalable Bayesian Model Averaging Through Local Information Propagation

This article shows that a probabilistic version of the classical forward-stepwise variable inclusion procedure can serve as a general data-augmentation scheme for model space distributions in (generalized) linear models. This latent variable representation takes the form of a Markov process, thereby allowing information propagation algorithms to be applied for sampling from model space posteriors. In particular, we propose a sequential Monte Carlo method for achieving effective unbiased Bayesian model averaging in high-dimensional problems, using proposal distributions constructed through local information propagation. The method—called LIPS for local information propagation based sampling—is illustrated using real and simulated examples with dimensionality ranging from 15 to 1,000, and its performance in estimating posterior inclusion probabilities and in out-of-sample prediction is compared to that of several other methods—namely, MCMC, BAS, iBMA, and LASSO. In addition, it is shown that the latent variable representation can also serve as a modeling tool for specifying model space priors that account for knowledge regarding model complexity and conditional inclusion relationships. Supplementary materials for this article are available online.


Introduction
We consider Bayesian model averaging (BMA) (Hoeting et al., 1999) in Gaussian linear regression, though the methodology developed in this work is directly applicable to generalized linear models. Suppose the data are n observations with p potential predictors and a response. Let Y = (y_1, y_2, . . ., y_n) denote the response vector and X_j = (x_1j, x_2j, . . ., x_nj) the jth predictor vector for j = 1, 2, . . ., p. We consider linear regression models of the form

Y = α 1_n + X_γ β_γ + ε,

where 1_n stands for an n-vector of "1"s; ε = (ε_1, ε_2, . . ., ε_n) is a vector of i.i.d. Gaussian noise with mean 0 and variance 1/ϕ; γ = (γ_1, γ_2, . . ., γ_p) ∈ {0, 1}^p ≡ Ω_M is the model (identifier) vector, that is, a vector of indicators whose jth element γ_j = 1 if and only if the jth variable X_j enters the model; and X_γ and β_γ represent the corresponding design matrix and coefficients. Because γ and the model M_γ are in one-to-one correspondence, we will use the two interchangeably. The marginal likelihood of model M_γ is

p(D | γ) = ∫ p(D | θ_γ, γ) p(θ_γ | γ) dθ_γ,

where θ_γ = (α, β_γ, ϕ) represents the parameters given model M_γ, and p(θ_γ | γ) is the prior on these parameters given M_γ. This integral can be evaluated analytically for many common priors and can often be approximated by Laplace approximation otherwise (Liang et al., 2008).
BMA then concerns the prediction or estimation of some quantity of interest ∆, and suggests using the corresponding posterior mean, E(∆ | D) = Σ_γ E(∆ | γ, D) p(γ | D), as the predicted value. Extensive studies on BMA show that it has various desirable theoretical properties in terms of predictive performance, including minimizing the mean squared error and being optimal under the log-score criterion (Raftery and Zheng, 2003, Sec. 2).
In low-dimensional problems with fewer than 30 variables, it is possible to carry out BMA exactly by enumerating the model space and computing the model space posterior p(γ|D) exhaustively. Modern applications often involve many more variables, however, for which model space enumeration is infeasible. In such cases BMA is carried out through Monte Carlo methods, and its efficacy relies critically on effectively sampling from the posterior on the large model space.
In problems with hundreds of variables or more, efficient exploration of the vast 2^p model space is particularly challenging. In such cases, MCMC-based methods often have difficulty with convergence and mixing. In recent years, several scalable stochastic search and adaptive sampling algorithms have been proposed to tackle this challenge—see for example Hans et al. (2007); Scott and Carvalho (2008); Clyde et al. (2011). Rather than sampling from the actual posterior, these methods aim at searching the parts of the model space deemed to contain the "best" models. The idea is quite natural—if but a tiny fraction of the models can be explored in practical time, one might as well focus on the models best supported by the data. However, because the models sampled by these methods do not arise from the actual posterior, sample averages are biased estimates of the posterior mean (Heaton and Scott, 2010; Clyde and Ghosh, 2012).
When prediction or estimation is the primary inferential goal, unbiased estimation of posterior means according to the BMA recipe is often desirable.However, for stochastic search and adaptive sampling algorithms, the sampling distribution of the models is typically unknown, and therefore one cannot easily correct the sampling bias (Clyde and Ghosh, 2012).
In this work we introduce a data augmentation scheme and use it to design a new sampling method for achieving (approximately) unbiased BMA in high-dimensional problems.
Our starting point, however, may seem implausibly simple: the familiar forward-stepwise variable selection procedure taught in every first course on regression. We show that a probabilistic version of this procedure provides a latent variable representation for all model space distributions. This representation takes the form of a Markov process, and thus it allows BMA to be carried out through information propagation (forward-summation-backward-sampling) algorithms (Liu, 2001, Sec. 2.4). In high-dimensional problems where exact information propagation is infeasible, we construct a sequential Monte Carlo (SMC) sampler (Liu, 2001; Del Moral et al., 2006; Cappé et al., 2007) using k-step local information propagation based proposals, which achieves very efficient, scalable, unbiased BMA.
As a desirable by-product, we show that the Markov latent representation can also be used as a modeling tool for specifying model space priors. In particular, we show that prior knowledge regarding model complexity and conditional inclusion relationships among the predictors can be more conveniently specified under this representation than directly on the original model space. We argue that proper specification of these aspects of the model space prior is important in high-dimensional settings to address multiple testing.
The rest of the work is organized as follows. In Section 2 we introduce the probabilistic forward-stepwise (pFS) representation as a data augmentation scheme for model space distributions. We show that all model space distributions can be augmented this way, show how to use this representation as a tool for specifying model space priors, and explain why it is useful for addressing multiplicity in high-dimensional settings. We establish the Markov nature of the pFS representation, and construct a forward-backward information propagation algorithm for finding pFS representations of model space posteriors. We then consider high-dimensional settings where exact information propagation is infeasible, and present an SMC sampler for BMA that operates in the augmented model space using local information propagation proposals. In Section 3 we apply our method to several real and simulated examples with dimensionality ranging from 15 to 1,000, and investigate its performance for estimation and prediction in comparison to Markov chain Monte Carlo (Madigan et al., 1995; Brown et al., 1998; Fernández et al., 2001), Bayesian adaptive sampling (Clyde et al., 2011), iterated BMA (Annest et al., 2009), and LASSO (Tibshirani, 2011). We conclude in Section 4 with some discussion.
We close this introduction by relating the current work to some particularly relevant literature. It is worth noting that both the idea of data augmentation and that of SMC have been considered (separately) for Bayesian model choice and averaging by other authors. In particular, Ghosh and Clyde (2011) proposed a data augmentation technique to transform a general design matrix into an orthogonal one, which allows posterior sampling to be achieved much more effectively. Schäfer and Chopin (2013) proposed a general SMC algorithm for Bayesian variable selection that uses a geometric bridge (Gelman and Meng, 1998; Chopin, 2002) to create a path of target distributions, and uses the binary nature of each dimension in the model space, through the "logistic conditionals" model, to construct proposals. In comparison, the data augmentation technique we introduce substantially simplifies the technicality of the SMC sampler. Finally, Annest et al. (2009) also considered the problem of scalable BMA. Their method—the iterated BMA—iteratively screens the predictors down to a small number (∼ 30) and carries out exact BMA on this reduced model space.

The probabilistic forward-stepwise representation
In the classical FS variable selection procedure, a regression model is built sequentially by adding one variable at a time until a certain stopping criterion is met. At each step of the procedure, the statistician makes two decisions: (i) a stopping decision—whether to terminate the procedure, and (ii) a selection decision—if we do not stop, which variable to include next. This procedure is deterministic—with the same stopping and selection criteria, two statisticians working on the same data will end up with exactly the same result.
Now we introduce a randomized procedure—or a probabilistic model—for generating regression models that imitates the classical FS procedure. This probabilistic forward-stepwise (pFS) procedure is obtained by randomizing the stopping and selection decisions.
The probabilistic forward-stepwise procedure. The procedure also starts with the null model and adds variables one at a time until stopping. It consists of p steps of operation.
Let γ_(t) = (γ_(t),1, γ_(t),2, . . ., γ_(t),p) ∈ Ω_M denote the tentative model after the tth step. To begin, we have t = 0 and γ_(0) = (0, 0, . . ., 0), the null model. At the end of the p steps, we have the final model γ_(p). The flow of the procedure is thus γ_(0) → γ_(1) → · · · → γ_(p). Each of the p steps can be described inductively as follows. Suppose we have completed the first (t − 1) steps. Then in the tth step, if the procedure has not "stopped"—what "stopping" means will be described shortly—we first draw a Bernoulli random variable

S_t ∼ Bernoulli(ρ(γ_(t−1))),

where, as the notation indicates, the "success" probability ρ(γ_(t−1)) can depend on γ_(t−1).
If S_t = 1, then the procedure stops in the tth step and we set γ_(t) = γ_(t−1). Thus we call ρ(γ_(t−1)) the stopping probability. If instead S_t = 0, then the procedure does not stop in the tth step. In this case, we randomly select a new variable to add to the model. More specifically, we draw a multinomial Bernoulli variable J_t from {1, 2, . . ., p},

J_t ∼ Multi-Bern({1, 2, . . ., p}; λ(γ_(t−1))),

where λ(γ_(t−1)) = (λ_1(γ_(t−1)), λ_2(γ_(t−1)), . . ., λ_p(γ_(t−1))) represents the selection probabilities, with λ_j(γ_(t−1)) for each j = 1, 2, . . ., p being the probability that J_t = j. The selection probabilities can also depend on the model from the previous step, γ_(t−1). If J_t = j, we add the jth predictor to the model and get γ_(t) = γ_(t−1)^{+j}, the model obtained from γ_(t−1) by setting its jth element to 1. On the other hand, if the procedure has already stopped within the first t − 1 steps, then in the tth step it remains stopped—we set S_t = 1 and γ_(t) = γ_(t−1). In this case, the selection variable J_t does not matter at all, but for the sake of completeness, we still assume it is distributed as the multinomial Bernoulli given above. This completes our inductive description of the tth step of the procedure for t = 1, 2, . . ., p.
The final model γ_(p) is determined by S_1, J_1, S_2, J_2, . . ., S_p, J_p, which we call the (latent) decision variables. A pFS procedure is specified by their distributions through two mappings

ρ : Ω_M \ {(1, 1, . . ., 1)} → [0, 1] and λ : Ω_M \ {(1, 1, . . ., 1)} → [0, 1]^p,

where Ω_M \ {(1, 1, . . ., 1)} is the set of all non-full models. For each non-full γ, λ satisfies λ_j(γ) ≥ 0 for all j = 1, 2, . . ., p, Σ_{l=1}^p λ_l(γ) = 1, and λ_j(γ) = 0 for all j such that γ_j = 1. These constraints ensure that the multinomial Bernoulli distribution for each J_t is well-defined, and that only variables not already in the tentative model can be selected.
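As an illustration, the pFS generative procedure can be sketched in a few lines of Python. The function and variable names (`pfs_sample`, `rho`, `lam`) are ours, and the stopping and selection maps in the toy example below are arbitrary illustrative choices, not ones used in the paper:

```python
import random

def pfs_sample(p, rho, lam, rng):
    """Draw one model from a pFS procedure with stopping map rho and
    selection map lam (each takes the current model as a tuple)."""
    gamma = [0] * p
    for _ in range(p):
        if rng.random() < rho(tuple(gamma)):      # stopping decision S_t
            break
        probs = lam(tuple(gamma))                 # selection decision J_t
        j = rng.choices(range(p), weights=probs)[0]
        gamma[j] = 1                              # move to gamma^{+j}
    return tuple(gamma)

# Toy example: stop with probability 1/2 at every step; select uniformly
# among the variables not yet in the model.
p = 4
rho = lambda g: 0.5 if sum(g) < p else 1.0
lam = lambda g: [0.0 if gj else 1.0 / (p - sum(g)) for gj in g]
models = [pfs_sample(p, rho, lam, random.Random(i)) for i in range(1000)]
```

With this choice, about half of the draws should stop immediately at the null model, which is an easy sanity check on the stopping mechanism.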
The following theorem establishes the Markov property of the pFS procedure.

Theorem 1. The tentative models γ_(0), γ_(1), . . ., γ_(p) generated by a pFS procedure form a Markov chain on Ω_M; for a procedure that has not yet stopped, the transition probabilities in the tth step are ρ(γ_(t−1)) for staying at γ_(t−1) and (1 − ρ(γ_(t−1))) λ_j(γ_(t−1)) for moving to γ_(t−1)^{+j}.
Proof. See Online Supplementary Materials S1.
Let π be a probability measure on the model space Ω_M. If there exists a pFS procedure such that π is the marginal distribution of the final model γ_(p), then the pFS procedure is said to be a pFS representation for π. Our next theorem shows that all model space distributions have pFS representations, and thus the pFS procedure can serve as a general data-augmentation tool, or latent variable representation, for model space distributions.
Theorem 2. Every probability distribution π on Ω_M has a pFS representation.

Proof. See Online Supplementary Materials S1.
This theorem implies that any model space prior can be augmented by a pFS procedure with appropriately chosen ρ and λ mappings. In the proof of Theorem 2, we provide a general recipe for finding a pFS representation for any model space prior. We devote Section 2.2 to showing how to use the pFS representation as a tool for choosing model space priors that take into account prior information regarding model complexity and conditional inclusion relationships, both of which are important in high-dimensional problems for addressing multiplicity.
The Markov property of the pFS representation has an important implication—inference on posterior model space distributions can be carried out through information propagation, or forward-summation-backward-sampling algorithms (Liu, 2001, Sec. 2.4). Our next theorem presents the corresponding result in the current context—it shows how to find pFS representations for model space posteriors through information propagation.

Theorem 3. Suppose the prior π on Ω_M has a pFS representation with mappings ρ and λ. Then the posterior π(· | D) has a pFS representation with mappings ρ* and λ* given by

ρ*(γ) = ρ(γ) BF_0(γ) / φ(γ) and λ*_j(γ) = λ_j(γ) φ(γ^{+j}) / Σ_{l: γ_l = 0} λ_l(γ) φ(γ^{+l}),

where BF_0(γ) = p(D | γ) / p(D | 0) is the Bayes factor (BF) of the model γ with respect to the null model 0 = (0, 0, . . ., 0), and φ : Ω_M → R is a mapping defined recursively as follows: φ(γ) = BF_0(γ) for the full model γ = (1, 1, . . ., 1), and

φ(γ) = ρ(γ) BF_0(γ) + (1 − ρ(γ)) Σ_{j: γ_j = 0} λ_j(γ) φ(γ^{+j})

for every non-full γ.

Remark I: Note that the mapping φ is recursive in the sense that for each non-full model γ, φ(γ) is computed based on models including one more predictor. So once we have computed φ(γ) for all γ of size t ∈ {1, 2, . . ., p}, we can compute φ(γ) for all γ of size t − 1.
Remark II: The Bayes factor, BF 0 (γ), can be computed either exactly or through Laplace approximation for many common priors on regression parameters.We give explicit formulas for two such priors-Zellner's g (Zellner, 1981) and the hyper-g (Liang et al., 2008) in Online Supplementary Materials S2.
Remark III: The pFS representation and the above theorem work for generalized linear models as well. The only complication is that one will again need numerical methods such as Laplace approximation to compute the Bayes factors.
Proof of Theorem 3. See Online Supplementary Materials S1.
The computation of φ in the above theorem corresponds to the "forward summation" step of information propagation. On the other hand, the updating formulas for the posterior parameters correspond to the "backward sampling" stage of the information propagation.
The careful reader may notice that Theorem 3 by itself does not address the dimensionality problem because the recursion for φ requires an enumeration of the model space. However, as we will see in Section 2.3, a local version of the information propagation algorithm can help us construct highly effective proposal distributions for sequential importance sampling of high-dimensional model space posteriors.
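For small p, where enumeration is feasible, the forward summation can be sketched by sweeping over models one size at a time. This is a schematic reading of the backward recursion described above, with `bf0` standing in for whatever routine computes the Bayes factor of a model against the null; note that with a flat Bayes factor (bf0 ≡ 1) the recursion only averages probabilities, so φ must be identically 1, which gives a handy sanity check:

```python
from itertools import product

def forward_summation(p, rho, lam, bf0):
    """Compute phi(gamma) for every model by recursing from the full
    model down to the null model, one model size at a time."""
    full = (1,) * p
    phi = {full: bf0(full)}
    for t in range(p - 1, -1, -1):                 # sizes p-1, ..., 0
        for g in product((0, 1), repeat=p):
            if sum(g) != t:
                continue
            cont = sum(lam(g)[j] * phi[g[:j] + (1,) + g[j + 1:]]
                       for j in range(p) if g[j] == 0)
            phi[g] = rho(g) * bf0(g) + (1 - rho(g)) * cont
    return phi

# Sanity check: with bf0 = 1 for every model, phi is 1 everywhere.
phi = forward_summation(
    3, lambda g: 0.5,
    lambda g: [0.0 if gj else 1.0 / (3 - sum(g)) for gj in g],
    lambda g: 1.0)
```

The double loop makes the cost of exact propagation explicit: all 2^p models must be visited, which is precisely why a local (truncated) version is needed in high dimensions.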

Specifying model space priors through pFS representation
While the main point of this work is to use the pFS representation to design a scalable method for carrying out BMA, we take a slight diversion in this section and show that the pFS representation can serve as a useful modeling tool for specifying model space priors as well. This is because the two mappings—ρ and λ—correspond to two essential aspects of model space priors. More specifically, ρ characterizes prior information on the model complexity, whereas λ characterizes the relative importance of the predictors. We next elaborate on each.
The mapping ρ encodes the prior information about model complexity. A convenient but flexible way to specify ρ is to let ρ(γ) depend on γ only through the model size |γ|, that is,

ρ(γ) = h(|γ|) for some function h : {0, 1, . . ., p − 1} → [0, 1].

Intuitively, h(s) is the prior probability that the model is of size s given that its size is at least s.
Any prior marginal distribution on the model size can be specified through a proper choice of the function h. More specifically, suppose one wishes to specify a prior distribution (q_0, q_1, q_2, . . ., q_p) on the model size, where q_s is the prior probability for |γ| = s for s = 0, 1, 2, . . ., p. Then one can let

h(s) = q_s / (q_s + q_{s+1} + · · · + q_p).

In particular, for the uniform prior on model size, q_s = 1/(p + 1) for all s, Scott and Berger (2010) show that this choice helps achieve effective control on "false inclusions".
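Converting a size prior (q_0, . . ., q_p) into the stopping function h is a short computation; the sketch below uses names of our choosing and checks that the induced size distribution reproduces q:

```python
def stopping_from_size_prior(q):
    """Turn a prior (q_0, ..., q_p) on model size into the stopping
    function h(s) = q_s / (q_s + ... + q_p) for s = 0, ..., p-1."""
    p = len(q) - 1
    tails = [sum(q[s:]) for s in range(p + 1)]
    return [q[s] / tails[s] if tails[s] > 0 else 1.0 for s in range(p)]

# Uniform-on-size prior with p = 4: q_s = 1/5 for every size s.
h = stopping_from_size_prior([0.2] * 5)

# Check: the induced size distribution h(s) * prod_{l<s} (1 - h(l))
# should reproduce q.
size_prob, keep = [], 1.0
for s in range(4):
    size_prob.append(keep * h[s])
    keep *= 1.0 - h[s]
size_prob.append(keep)                    # remaining mass goes to size p
```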
The prior selection probability mapping λ encodes the prior information regarding the relative importance of the predictors. If no such prior information is available, a simple choice is the uniform selection probabilities, that is, to let

λ_j(γ) = 1 / (p − |γ|)

for all j such that γ_j = 0. Often, however, one does have prior knowledge about the relative importance of the predictors. For example, in genetic applications where the predictors are gene markers, the biologist may know that some markers are more likely to be related to the response (e.g., a phenotype measurement) than others. For instance, based on past experience one may believe that markers lying in genes are more likely to be associated with the response than those outside of genes. One can then choose λ such that

λ_j(γ) = c / k(γ) if marker j lies in a gene, and λ_j(γ) = 1 / k(γ) otherwise,

for all j such that γ_j = 0, where c ≥ 1 is a constant that specifies how much more likely markers in genes are to be included, and k(γ) is a constant that ensures Σ_j λ_j(γ) = 1. In this particular example, k(γ) depends on the number of markers in (and out of) genes that are not already in γ.

Sometimes one has prior knowledge regarding the importance of predictors in relation to the other variables in the model. A predictor X_i may be more (or less) likely to be in the model when another set of variables is also included. For example, gene markers lying in the same biological pathway may be expected to enter the model together. Incorporating such conditional knowledge is often difficult when specifying a model space prior directly, but is made straightforward under the pFS representation. If, say, markers i, j, and k are in the same pathway, then we can let

λ_i(γ) = c / k(γ) if γ_j = 1 or γ_k = 1, and λ_i(γ) = 1 / k(γ) otherwise,

for some c > 1, and again k(γ) is a known constant that ensures Σ_j λ_j(γ) = 1.
A general challenge in high-dimensional inference is proper adjustment for multiplicity (or multiple testing). In the context of Bayesian model choice and averaging, multiplicity is reflected in the dilution of the model space prior over a tremendous number of models, many of which are "bad". In addressing multiplicity, it is important to control the rate of "false" inclusions of irrelevant predictors while maximizing the power, or sensitivity, for recovering "true" inclusions of relevant ones. It has been shown that simple strategies, such as assuming a priori that each predictor is included independently with equal probability, do not properly control for false inclusion (Scott and Berger, 2010). Stronger control on the model complexity is necessary. At the same time, to improve inference, one should incorporate as much relevant background knowledge as possible, thereby concentrating more prior probability mass on models that are more likely to be true. In this regard, the pFS representation allows specifying prior model complexity and relative importance of predictors in a decoupled, flexible fashion, and so is particularly desirable in high-dimensional settings.
Specifying symmetric model space priors. We have seen that the pFS representation provides an intuitive framework for specifying model space priors. Of course, things can go in the other direction too—one can start from a given model space prior chosen by some other means and find a corresponding pFS representation. In the proof of Theorem 2, we provide a general though complex recipe for finding a pFS representation for any model space distribution. For "symmetric" priors—those that assign equal probability to models of equal size—finding a pFS representation is extremely easy. Specifically, for any symmetric prior π we can let ρ(γ) = h(|γ|) with h(0) = π((0, 0, . . ., 0)), and inductively set

h(t) = π(γ) / ( t! ∏_{s=0}^{t−1} (1 − h(s)) / (p − s) ),

where γ is any model of size t. In addition, we let λ_j(γ) = 1/(p − |γ|) · 1(γ_j = 0). One can verify that the marginal distribution of the final model is exactly π.
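One way to carry out this verification numerically is to propagate the pFS probability mass exactly through the p steps. The sketch below (with names of our choosing) confirms that, with p = 3, a uniform-on-size stopping rule combined with uniform selection recovers a symmetric prior placing total mass 1/4 on each model size:

```python
def pfs_marginal(p, rho, lam):
    """Exact marginal distribution of the final model gamma_(p),
    obtained by propagating probability mass through the p steps."""
    active = {(0,) * p: 1.0}          # mass that has not yet stopped
    final = {}                        # mass of stopped models
    for _ in range(p):
        nxt = {}
        for g, m in active.items():
            final[g] = final.get(g, 0.0) + m * rho(g)   # stop here
            w = lam(g)
            for j in range(p):
                if g[j] == 0 and w[j] > 0:
                    g2 = g[:j] + (1,) + g[j + 1:]
                    nxt[g2] = nxt.get(g2, 0.0) + m * (1 - rho(g)) * w[j]
        active = nxt
    for g, m in active.items():       # remaining mass sits on the full model
        final[g] = final.get(g, 0.0) + m
    return final

# Symmetric target on p = 3: equal total mass 1/4 on each model size.
h = [0.25, 1.0 / 3.0, 0.5]
marg = pfs_marginal(
    3, lambda g: h[sum(g)],
    lambda g: [0.0 if gj else 1.0 / (3 - sum(g)) for gj in g])
```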

Scalable BMA through LIPS
Next we shift from a modeling perspective to a computational focus. In particular, we use the pFS representation to construct a scalable algorithm for BMA. Our basic strategy is sequential importance sampling—drawing models from a sequentially constructed proposal distribution and using the (sequentially updated) ratio between the actual model space posterior and the proposal as weights to correct for the sampling bias. The effectiveness of the sampling depends critically on the choice of the proposal. In high-dimensional settings, choosing proposal distributions that are close to the model space posterior is especially important. To this end, the pFS representation provides a powerful and natural means of constructing effective proposals using local information propagation.
We now describe our main algorithm, called local information propagation based sampling, or LIPS, for drawing weighted samples from the model space posterior and constructing (approximately) unbiased BMA estimates. We first describe the general schema of LIPS, and then provide details for each of its main components—particle propagation, weight updates, and the construction of (approximately) unbiased estimates. Readers may refer directly to the box "Algorithm 1" for the pseudo-code of the entire algorithm.
First we introduce some more notation. Let π(·) be our model space prior, which has a pFS representation with mappings ρ and λ, and π(· | D) the corresponding model space posterior. Next, for t = 0, 1, 2, . . ., p, we define π_{0:t}(·) as the model space prior induced by the pFS representation with mappings ρ_{0:t} and λ_{0:t} defined as follows:

ρ_{0:t}(γ) = ρ(γ) if |γ| < t, ρ_{0:t}(γ) = 1 if |γ| = t, and λ_{0:t}(γ) = λ(γ),

for all γ ∈ Ω_M. In other words, π_{0:t}(·) is the model space prior that one would get if we forced the pFS procedure for π(·) to stop by the (t + 1)st step. The prior π_{0:t}(·) places all probability mass on models of size up to t, and assigns probability in the same way as π(·) to models of size up to t − 1, while concentrating the mass that π(·) places on models of size ≥ t on those of size t. In particular, we have π_{0:p}(·) = π(·). LIPS propagates particles through the sequence of posteriors corresponding to π_{0:t}(·) for t = 0, 1, . . ., p, producing weighted samples from the model space posterior. We use these weighted samples to carry out BMA by computing weighted averages in the form of Horvitz-Thompson (H-T) estimators (Horvitz and Thompson, 1952; Clyde et al., 2011).
Next we elaborate on each component of LIPS. Due to space constraints, we do not provide all the relevant background on SMC; interested readers may refer to Liu (2001) for an excellent coverage of the topic.
Particle propagation. We propagate N particles in parallel by simulating N pFS procedures. More specifically, for the ith particle, where i = 1, 2, . . ., N, we simulate a sequence of tentative models γ^i_(0), γ^i_(1), . . ., γ^i_(p) from a pFS procedure with proposal mappings ρ(· | D) and λ(· | D). The "| D" notation indicates that the proposal mappings may depend on the data.
From Theorem 1, we know that the sequence γ^i_(0), γ^i_(1), . . ., γ^i_(p) forms a Markov chain whose transition probability in the tth step is given by the corresponding proposal kernel—for a particle that has not yet stopped, probability ρ(γ^i_(t−1) | D) of staying at γ^i_(t−1) and probability (1 − ρ(γ^i_(t−1) | D)) λ_j(γ^i_(t−1) | D) of moving to γ^{i,+j}_(t−1). The choice of the proposal kernel, or equivalently that of ρ(· | D) and λ(· | D), is critical to ensuring the efficiency of the algorithm.
Local information propagation. The pFS representation allows us to use local information propagation (LIP) to find effective proposals. To see how to carry this out in practice, first for each model vector ζ ∈ Ω_M, we define a mapping φ_k(ζ) that approximates φ(ζ) by running the backward recursion of Theorem 3 only k steps ahead from ζ; we then plug φ_k into the posterior mappings of Theorem 3 in place of φ to obtain the proposal mappings. The larger k is, the more closely the proposal mappings approximate the actual posterior mappings. In particular, if k = p then the proposal mappings exactly recover the latter, though this is unachievable in high-dimensional problems. We note that the above LIP proposal can be shown to be a special case of the so-called k-step look-ahead proposal (Lin et al., 2013).
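A natural reading of the k-step construction is a depth-truncated version of the φ recursion. The sketch below assumes that form (the truncation rule and the names `phi_k` and `bf0` are our illustrative choices, not code from the paper). With a toy Bayes factor, the approximation grows with k and stabilizes once k covers all remaining steps:

```python
def phi_k(zeta, k, rho, lam, bf0, p):
    """k-step look-ahead approximation to phi: expand the recursion
    only k levels deep, bottoming out at the plain Bayes factor."""
    if sum(zeta) == p or k == 0:
        return bf0(zeta)
    cont = sum(lam(zeta)[j] *
               phi_k(zeta[:j] + (1,) + zeta[j + 1:], k - 1, rho, lam, bf0, p)
               for j in range(p) if zeta[j] == 0)
    return rho(zeta) * bf0(zeta) + (1 - rho(zeta)) * cont

p = 3
rho = lambda g: 0.5
lam = lambda g: [0.0 if gj else 1.0 / (p - sum(g)) for gj in g]
bf0 = lambda g: 2.0 ** sum(g)        # toy Bayes factor favoring larger models
approx = [phi_k((0, 0, 0), k, rho, lam, bf0, p) for k in range(5)]
```

Each extra look-ahead step costs a factor of roughly (p − |ζ|) evaluations, which is why k must stay small when p is in the hundreds or thousands.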
Weight update. Next we derive the appropriate sequential weight updates for the particles.
To begin, set w^i_(0) = 1 for all i. Then for t ≥ 1, by Bayes' theorem the weight for the ith particle after the tth step is updated to

w^i_(t) = w^i_(t−1) × K^π_t(γ^i_(t−1), γ^i_(t)) × BF(γ^i_(t), γ^i_(t−1)) / K_t(γ^i_(t−1), γ^i_(t)),

where BF(γ^i_(t), γ^i_(t−1)) = BF_0(γ^i_(t)) / BF_0(γ^i_(t−1)), K_t denotes the proposal transition kernel, and K^π_t(γ^i_(t−1), γ^i_(t)) is exactly the prior transition probability under the corresponding pFS representation for the prior, as given in Theorem 1. Therefore, once we compute the BF between γ^i_(t) and γ^i_(t−1), we can update the weights sequentially. We provide the specific form of this BF under two common priors on the regression coefficients—the g-prior and the hyper-g prior—in Online Supplementary Materials S2.
BMA through H-T estimation. At the end of each particle simulation, we end up with a final model γ^i_(p) and a weight w^i_(p). Suppose now we are interested in estimating the posterior mean of a quantity ∆, that is, E(∆ | D), and that conditional on any given model γ, E(∆ | γ, D) can be evaluated. Then a consistent and approximately unbiased estimator, called the Horvitz-Thompson (H-T) estimator (Horvitz and Thompson, 1952; Clyde and Ghosh, 2012), for E(∆ | D) is given by

δ_HT = Σ_{i=1}^N w^i_(p) E(∆ | γ^i_(p), D) / Σ_{i=1}^N w^i_(p).
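As a minimal sketch (not the paper's implementation), the self-normalized weighted average takes one line:

```python
def ht_estimate(weights, values):
    """Self-normalized weighted average: sum_i w_i v_i / sum_i w_i,
    with v_i = E(Delta | gamma_i, D) and w_i the particle weights."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

With equal weights this reduces to the plain sample mean, while heavily weighted particles dominate otherwise.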
Let Z_i = w^i_(p) and Z̄ = Σ_i Z_i / N. Then we can estimate the variance of δ_HT by (Liu, 2001, p. 35)

Var(δ_HT) ≈ Σ_{i=1}^N Z_i^2 ( E(∆ | γ^i_(p), D) − δ_HT )^2 / (N Z̄)^2.

When multiple CPU cores are available or when the available computing memory is not enough to store all N particles, one can divide the N particles into L "bags", each containing N/L particles. One can then compute the H-T estimate using the particles in each bag. We call the average of the L estimators a bagged (H-T) estimator.
Specifically, let δ_HT,l be the H-T estimator for the lth bag; then the bagged estimator is given by δ̄_HT = Σ_{l=1}^L δ_HT,l / L, and its variance can be estimated by Σ_l (δ_HT,l − δ̄_HT)^2 / (L(L − 1)). The L bags can be generated either on L CPU cores in an (embarrassingly) parallel fashion, or sequentially on a single computing core using only the memory required to store N/L particles. This bagging idea for variance reduction and parallelization is sometimes referred to as islanding in the literature. See for example Lakshminarayanan et al. (2013).
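A minimal sketch of the bagged estimator and its variance estimate, assuming the particles are simply split into L consecutive, equal-sized bags (names are ours):

```python
def bagged_ht(weights, values, L):
    """Split the particles into L consecutive equal-sized bags, form the
    H-T estimate in each bag, and return (bagged estimate, variance)."""
    size = len(weights) // L
    ests = []
    for l in range(L):
        w = weights[l * size:(l + 1) * size]
        v = values[l * size:(l + 1) * size]
        ests.append(sum(wi * vi for wi, vi in zip(w, v)) / sum(w))
    mean = sum(ests) / L
    var = sum((e - mean) ** 2 for e in ests) / (L * (L - 1))
    return mean, var
```

Because each bag's estimate is computed independently, the same code parallelizes trivially across cores, matching the islanding idea above.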
[Algorithm 1: pseudo-code for the LIPS algorithm, covering particle propagation, weight updates, and the computation of the Horvitz-Thompson estimate for E(∆ | D).]

Examples
In this section we apply LIPS to several real and simulated data sets. In these examples, we use LIPS to carry out two types of BMA estimation. The first is estimating the posterior inclusion probability (PIP) of the predictors, which corresponds to setting ∆ = 1(γ_j = 1) for j = 1, 2, . . ., p.
The other is out-of-sample prediction, which corresponds to setting ∆ = Y_new, where Y_new is the response of a new observation with predictor values X_new.
We start with a low-dimensional example (US crime data) in which the actual model space posterior can be computed through enumeration, so that comparison to the exact truth can be made. We then investigate a moderate-dimensional example with 88 predictors (protein activity data). In this case, it is no longer possible to evaluate the exact posterior analytically. Instead, we apply an MCMC-based method to approximate the "truth" by running a very long chain. We compare the estimated PIPs produced by LIPS and BAS to these "true" values. In addition, we compare the out-of-sample prediction of LIPS to those of three additional existing methods—BAS, iBMA, and LASSO. Our last two examples are based on simulation—they involve strongly correlated predictors and are of two different dimensionalities, p = 100 and p = 1,000. For p = 100, we again compare the results from our method to MCMC, BAS, iBMA, and LASSO. For the p = 1,000 case, BAS cannot be applied and the MCMC approach has difficulty exploring the vast model space effectively, so we compare among LIPS, iBMA, and LASSO for prediction. The results show that LIPS is robust to both predictor correlation and increasing dimensionality. For simplicity, in all the examples we adopt the g-prior for the regression coefficients with the shrinkage parameter g set to the number of observations.
Example 2 (US Crime data). We first apply the LIPS algorithm to a classical data set introduced in Vandaele (1978). It contains 15 variables, and so an exhaustive computation of the marginal likelihoods of all 2^15 models is possible. Following Clyde et al. (2011), we adopt a model space prior, called the "Beta-Binomial(1,1)" prior, that assigns equal total probability—1/16—to each model size, and evenly splits the mass among models of each size.
In practice, for such low-dimensional problems one can enumerate the model space in little time, and so sampling-based methods are not really necessary. We use this example as a first demonstration of our method in a setting where the truth is known exactly.
In particular, we run LIPS with k = 4 and 5,000 particles to estimate the PIPs. To investigate the sampling distribution of the H-T estimators, we repeatedly apply LIPS 200 times. Figure 1 summarizes the results, where we plot the mean and interquartile range of each estimate over the 200 repeats. Note that each repeat is essentially a bag of 5,000 particles, and so the average over the repeats gives a bagged estimate δ̄_HT. We mark the true PIPs computed through model space enumeration. The (approximate) unbiasedness of the estimates is clearly demonstrated. In fact, each bag of 5,000 particles alone provides a fairly reliable estimate of the PIPs (with small variances), while the bagged estimates essentially recover the true values.
Example 3 (Protein activity data). Our second example is a protein activity data set from Clyde and Parmigiani (1998). Following those authors, we code the categorical variables by indicators and consider main effects and first-order interactions, resulting in a total of p = 88 predictors. For problems with this many predictors, it is no longer possible to compute the exact model space posterior through enumeration. To get a handle on the truth as a baseline for comparison, we apply a popular MCMC-based method for BMA—Markov chain Monte Carlo model composition, or MC^3, with add, delete, and swap moves (Madigan et al., 1995; Brown et al., 1998; Fernández et al., 2001)—using the R package BMS. We estimate the PIPs by running a very long MC^3 chain (2 × 10^7 iterations with 1 × 10^7 burn-in steps), and use these as proxies for the truth, to which we compare the estimates obtained from LIPS. (We will see that the results from LIPS match these proxies almost perfectly, confirming that the proxies are probably very close to the truth and the MC^3 chain is long enough.) We again adopt a Beta-Binomial(1,1) model space prior and apply LIPS with k = 3.
We generate 50,000 particles and repeat the computation 200 times, effectively creating 200 bags of particles. The sampling distributions of the 200 sets of estimates are given in Figure 2 (left). The bagged estimates, that is, the averages over the 200 repeats, match the estimates from MC^3 very closely.
For comparison, we apply another method, BAS, to estimate the PIPs. BAS samples models without replacement in an adaptive manner and carries out BMA based on the sampled models. Similarly, to get the sampling distribution of the BAS estimates, we repeatedly apply BAS (with 1 × 10^6 model draws) 200 times and compute the PIP estimates for each run. Figure 2 (right) presents the results.
Comparing the results from LIPS and BAS, we find the two methods are complementary in several ways. First, LIPS gives unbiased estimates while BAS does not, but the LIPS estimates tend to be more variable than those from BAS. Moreover, LIPS tends to provide the most accurate estimates for PIPs that are very large or small, while giving the most variable estimates for PIPs that are close to 50%; in contrast, BAS gives its least biased estimates for PIPs near 50%. Next we examine out-of-sample prediction. For each training/testing split of the data, we fit a model space posterior on the training set, with which we predict the 6 testing observations using BMA. We carry out the same prediction task using four methods—LIPS (with k = 3 and 20,000 particles), BAS (with 1 million model draws), iBMA, and LASSO. For iBMA, we use the default setting that preserves a maximum of 30 predictors after iterated screening.
For the LASSO, we use 10-fold cross-validation to select the shrinkage parameter.
We randomly divide the data into a training set and a testing set 200 times. For each random split, we apply the four methods and measure their performance by the average squared error (ASE),

\[
\mathrm{ASE} = \frac{1}{n_{\mathrm{test}}} \sum_{j=1}^{n_{\mathrm{test}}} \bigl(\hat{Y}_{\mathrm{test},j} - Y_{\mathrm{test},j}\bigr)^2,
\]

where \(\hat{Y}_{\mathrm{test},j}\) is the predicted value for the jth testing observation \(Y_{\mathrm{test},j}\). In addition, we use the ASE averaged over the 200 random splits, denoted by \(\overline{\mathrm{ASE}}\), as an overall performance metric for each method. Figure 3 presents a scatterplot matrix of the ASEs for the four methods LIPS, BAS, iBMA, and LASSO (lower left triangle) as well as the pairwise differences in ASE (upper right triangle). Overall, we see that the predictive performance of LIPS and that of BAS are very similar in this example: their difference is not statistically significant, and both perform substantially better than iBMA and LASSO.
We know that LIPS is unbiased but more variable than BAS, so here this bias-variance trade-off roughly breaks even. For this example, LIPS, BAS, and iBMA all outperform LASSO by a substantial margin.
Example 4 (Highly correlated predictors). In this example, we apply LIPS to a simulation scenario with a moderate number of dimensions (p = 100) in which the predictors are strongly correlated. The strong correlation makes the task of variable selection through estimating PIPs particularly challenging. Out-of-sample prediction, however, is not expected to become more difficult, as one does not need to pin down the "causal" predictors to obtain good predictions.
We simulate a data set of 350 observations, each with 100 predictors, X_1, X_2, ..., X_100, and a response Y. The predictors are simulated from a multivariate normal distribution with marginal means all equal to 0, marginal variances all equal to 1, and the correlation structure

\[
\mathrm{corr}(X_i, X_j) = (1 - 0.05\,|i-j|)\,\mathbf{1}_{\{|i-j| \le 20\}}.
\]

So "close-by" predictors are highly correlated, and the correlation decays with their "distance". Such a local correlation structure is common in applications. For example, in genomic studies where the predictors are gene markers, nearby markers on a chromosome are generally strongly correlated.
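This correlation function is the triangular (Bartlett) window with bandwidth 20, so predictors with exactly this structure can be generated as moving averages of i.i.d. standard normals. A minimal sketch (our own construction, not the article's code):

```python
import random

def simulate_predictors(n=350, p=100, w=20, seed=1):
    """Each X_j is the sum of w consecutive i.i.d. N(0,1) draws, scaled by
    1/sqrt(w).  Overlapping windows share w - |i-j| draws, which gives
    corr(X_i, X_j) = (1 - |i-j|/w) for |i-j| <= w and 0 otherwise; with
    w = 20 this is exactly (1 - 0.05|i-j|) 1{|i-j| <= 20}."""
    rng = random.Random(seed)
    X = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p + w - 1)]
        X.append([sum(z[j:j + w]) / w ** 0.5 for j in range(p)])
    return X
```

Adjacent columns then have correlation 0.95, while columns 20 or more apart are independent.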
The response Y is simulated according to the following model, where the errors are independent draws from a Normal(0, 10^2) distribution. We again adopt a Beta-Binomial(1,1) model space prior, which places equal probability, 1/101, on each model size from 0 to 100.
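Under the Beta-Binomial(1,1) prior, a model's probability depends only on its size: mass 1/(p + 1) per model size, split evenly among the models of that size. A minimal sketch (function name ours):

```python
from math import comb

def beta_binomial_11_prior(model_size, p):
    """Beta-Binomial(1,1) model space prior probability of one model with
    |gamma| = model_size predictors out of p: each of the p + 1 sizes gets
    total mass 1/(p + 1), divided among its C(p, size) models."""
    return 1.0 / ((p + 1) * comb(p, model_size))

# With p = 100, the C(100, 3) models of size 3 jointly receive mass 1/101.
total_size_3 = comb(100, 3) * beta_binomial_11_prior(3, 100)
assert abs(total_size_3 - 1 / 101) < 1e-12
```

This is the sense in which the prior adjusts for multiplicity: individual models of a common size are penalized as their number grows.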
Again, we apply LIPS with k = 3 and 50,000 particles as well as BAS with 1 million model draws to estimate the PIPs, and compare the estimates to those obtained from a very long MC^3 chain with add, delete, and swap steps (2 × 10^7 iterations with 1 × 10^7 burn-in steps). The results are presented in Figure 4.
the previous example: a scenario with 1,000 potential predictors, X_1, X_2, ..., X_1000. The predictors are again multivariate normal, with marginal means all equal to 0 and marginal variances all equal to 1, along with the same correlation structure as in the previous example, corr(X_i, X_j) = (1 − 0.05|i−j|) 1_{|i−j|≤20}. A response Y is simulated from a model whose errors are independent draws from a Normal(0, 10^2) distribution. We simulate such data sets of size 700. We adopt a symmetric model space prior that puts equal prior probability on each model size from 0 to 100.
As before, to evaluate the sampling properties of the estimates, we carry out 200 runs of LIPS with k = 2 and 20,000 particles. The version of the BAS package in R cannot be run on problems with p ≥ n or p ≥ 1000.
We have tried MC^3 with add, delete, and swap moves on this example using the R package BMS, as we did for the 100-dimensional example. However, we could not get the chain to converge within a reasonable time. With 100 million iterations and 50 million burn-in steps, about 5 days of running time on a single Intel Core i7 3.6GHz CPU core, the estimated PIPs from the MCMC run are still not sensible, and so they are omitted. We acknowledge that someone more experienced with MCMC may be able to construct a more effective sampler that converges and mixes better, possibly using population-based strategies such as those developed in Liang and Wong (2000), Jasra et al. (2007), and Bottolo and Richardson (2010).
We again consider the problem of out-of-sample prediction and compare LIPS to iBMA and LASSO. We randomly simulate a training and a testing set, with n_train = 600 and n_test = 100, 200 times. For each simulated training and testing pair, we apply LIPS (with k = 2 and 20,000 particles), iBMA, and LASSO (with 10-fold CV). Figure 7 presents the pairwise comparison of the ASEs. LIPS shows the best predictive performance among the three, followed by LASSO and then iBMA. While we report results for LIPS with 20,000 particles, interestingly we found that in this example LIPS with even just 200 particles outperforms the other two methods in predictive performance. On the other hand, with so few particles, the models that LIPS samples become very variable due to the high predictor correlation, and so the corresponding PIP estimates are unreliable. This is exactly what we expect: the strong predictor correlation makes estimating PIPs harder, but not out-of-sample prediction.

Discussion
In this work we have shown that one can use a probabilistic version of the classical forward-stepwise variable selection procedure as a data-augmentation scheme for model space distributions, and that, owing to the Markov property of this representation, information propagation methods can be applied to achieve scalable posterior sampling on model spaces.
One may consider building probabilistic versions of the backward-stepwise or bi-directional stepwise procedures as well, and these could also serve as latent variable representations for model space distributions. They too have the Markov property, and therefore similar information propagation algorithms for BMA could be constructed. The forward-stepwise procedure, however, is a natural choice in that it starts from the null model and adds variables, which is best suited when the underlying model is sparse, a common and often necessary assumption in many high-dimensional applications. When p is large, going backward from very large models is computationally impractical and statistically undesirable. In some cases, one may know that certain subsets of the predictors are very likely to be in the model. Then, in addition to incorporating such knowledge into the prior specification, one may consider a bi-directional Markov augmentation that starts from an initial model consisting of those predictors and then adds and removes variables in subsequent steps from this baseline model.
This could reduce the Monte Carlo variability in the H-T estimators from LIPS relative to those from the forward-stepwise augmentation.
The complementarity of LIPS and BAS observed in the numerical examples, namely that LIPS does best at estimating very large and very small PIPs while BAS does best at moderate PIPs, suggests a potentially powerful hybrid method for estimating PIPs that combines the two. Finally, another method that we considered in the comparative study is ODA, which also uses data augmentation to achieve computationally efficient BMA. However, the current implementation of ODA can only be applied to model space priors that assume independent inclusion of the predictors, which is inappropriate for high-dimensional problems due to the multiplicity issue. One author of ODA informed us through personal communication that new software for ODA is under development to accommodate more general classes of priors, such as the Beta-Binomial(1,1) prior.
Proof of Theorem 3. Let S_1, J_1, S_2, J_2, ..., S_p, J_p be the latent decision variables of the pFS representation of π under consideration. We let (Ω_d, F_d) be the probability space on which these decision variables are jointly defined. The sequence of models γ^(1), γ^(2), ..., γ^(p) are functions of the decision variables and thus also measurable with respect to (Ω_d, F_d).
Fixing the data D, the marginal likelihood under the final model, p(D | γ^(p)), is also a random variable on (Ω_d, F_d). For any γ ∈ Ω_M, we define U_γ to be the event on (Ω_d, F_d) that γ is a submodel of the final model γ^(p), that is, that γ^(p) contains all of the predictors included in γ.

S2. Bayes factors under g and hyper-g priors
For many common priors on the regression coefficients, the BF term in the weight update can be computed either in closed form or approximated well numerically. Here we consider two popular choices: the g-prior and the hyper-g prior.
Given a particular model γ, Zellner's g-prior in its most popular form places the following prior on the regression coefficients and the noise variance:

\[
p(\varphi) \propto 1/\varphi \quad \text{and} \quad \beta_\gamma \mid \varphi, \gamma \sim \mathrm{N}\bigl(\beta^0_\gamma,\; g\,(X_\gamma^T X_\gamma)^{-1}/\varphi\bigr),
\]

where \(\beta^0_\gamma\) and g are hyperparameters. Following the exposition in Liang et al. (2008), we assume without loss of generality that the predictor variables X_1, X_2, ..., X_p have all been mean centered at zero. Then we can place a common non-informative flat prior on the intercept α for all models, so that p(α, ϕ) ∝ 1/ϕ. Under this prior setup, one can show that the BF for a model γ versus the null model is

\[
\mathrm{BF}_0(\gamma) = \frac{(1+g)^{(n-1-|\gamma|)/2}}{\bigl[1 + g(1 - R^2_\gamma)\bigr]^{(n-1)/2}},
\]

where \(R^2_\gamma\) is the coefficient of determination for model M_γ.
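For numerical stability this BF is best evaluated on the log scale. A minimal sketch of the computation (function name ours):

```python
import math

def log_bf0_g_prior(r2, n, k, g):
    """Log of BF_0(gamma) = (1+g)^((n-1-k)/2) / [1 + g(1-r2)]^((n-1)/2),
    where k = |gamma| and r2 is the coefficient of determination of the
    model, under Zellner's g-prior."""
    return (0.5 * (n - 1 - k) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))

# The null model itself (k = 0, r2 = 0) has log BF exactly 0.
assert log_bf0_g_prior(0.0, n=100, k=0, g=50.0) == 0.0
```

A model explaining more variance at the same size receives a larger BF, as the second term shows.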
To avoid undesirable features of the g-prior, such as Bartlett's paradox and the information paradox (Berger and Pericchi, 2001), Liang et al. (2008) proposed the use of mixtures of g-priors. In particular, they introduced the hyper-g prior, which puts the following hyperprior on g:

\[
\frac{g}{1+g} \sim \mathrm{Beta}(1, a/2 - 1).
\]
This prior also yields a closed-form representation for the model-specific marginal likelihood, and thus for the corresponding BFs. In particular, Liang et al. (2008) showed that the BF for a model γ versus the null model is

\[
\mathrm{BF}_0(\gamma) = \frac{a-2}{|\gamma| + a - 2}\; {}_2F_1\!\Bigl(\frac{n-1}{2},\, 1;\, \frac{|\gamma|+a}{2};\, R^2_\gamma\Bigr),
\]

where \({}_2F_1\) is the Gaussian hypergeometric function, which for c > 1 admits the integral representation \({}_2F_1(b, 1; c; z) = (c-1) \int_0^1 (1-t)^{c-2} (1 - tz)^{-b}\, dt\).
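The hyper-g BF can be evaluated directly from this one-dimensional integral. The sketch below (our own, using simple midpoint quadrature rather than a special-function routine) illustrates the computation:

```python
def bf0_hyper_g(r2, n, k, a=3.0, grid=20000):
    """BF_0(gamma) = (a-2)/(k+a-2) * 2F1((n-1)/2, 1; (k+a)/2; r2) under the
    hyper-g prior, with the hypergeometric function evaluated through
    2F1(b, 1; c; z) = (c-1) * int_0^1 (1-t)^(c-2) (1-tz)^(-b) dt
    by midpoint quadrature on a uniform grid."""
    b, c = (n - 1) / 2.0, (k + a) / 2.0
    h = 1.0 / grid
    s = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        s += (1.0 - t) ** (c - 2.0) * (1.0 - t * r2) ** (-b)
    return (a - 2.0) / (k + a - 2.0) * (c - 1.0) * h * s
```

In practice one would call a library hypergeometric routine (e.g. scipy.special.hyp2f1) instead; the quadrature is shown only to make the formula concrete. For the null model (k = 0, r2 = 0) the BF is 1, which the sketch recovers up to quadrature error.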
Therefore, with either the g-prior or the hyper-g prior, the BF between two models in the weight update can be computed as the ratio of their null-based BFs, BF(γ' : γ) = BF_0(γ')/BF_0(γ).
from a pFS procedure with mappings ρ̂(· | D) and λ̂(· | D). We call ρ̂(· | D) and λ̂(· | D) the proposal mappings. The "hat" notation reflects the fact that they typically are not exactly the mappings for the actual posterior π(· | D), but they can be chosen to approximate the latter.

Figure 1: Left: sampling distribution of the estimated PIPs by LIPS with k = 4 and 5,000 particles for the US crime data. Gray band: the inter-quantile range of 200 estimates; red triangle: the bagged estimate, that is, the average of the 200 estimates; blue cross: true PIPs obtained through model space enumeration. Right: the bagged estimates for the PIPs vs. the true PIPs. The line shown is the 45 degree line.

Figure 2: Left: sampling distribution of the estimated PIPs by LIPS with k = 3 and 50,000 particles for the protein activity data. Right: sampling distribution of the estimated PIPs by BAS with 1 million model draws. Gray band: the inter-quantile range of 200 estimates; red triangle: the average of the 200 estimates; blue cross: PIPs estimated from a long MC^3 chain.

Figure 3: Pairwise comparison of ASEs under 200 random splits of training and testing data for the protein activity example. Lower left triangle: scatterplot matrix of ASEs for the four methods; the 45 degree lines are indicated by red dashes. Upper right triangle: the difference ASE_v − ASE_h, with the corresponding standard error given in parentheses, where ASE_v is for the method on the vertical axis and ASE_h for the method on the horizontal axis.

Figure 5: Pairwise comparison of ASEs under 200 simulations of training and testing data for Example 4. Lower left triangle: scatterplot matrix of ASEs for the four methods; the 45 degree lines are indicated by red dashes. Upper right triangle: the difference ASE_v − ASE_h, with the corresponding standard error given in parentheses, where ASE_v is for the method on the vertical axis and ASE_h for the method on the horizontal axis.

Figure 6: Sampling distribution of estimated PIPs by LIPS with k = 2 and 20,000 particles in the high-dimensional example. Gray band: inter-quantile range of 200 estimates; red triangle: the bagged estimate, that is, the average of the 200 estimates. MC^3 with add, delete, and swap moves fails to produce meaningful estimates after 100 million iterations with 50 million burn-in steps, and so the corresponding results are not plotted.

Figure 7: Pairwise comparison of ASEs under 200 simulations of training and testing data for the high-dimensional example. Lower left triangle: scatterplot matrix of ASEs for the three methods; the 45 degree lines are indicated by red dashes. Upper right triangle: the difference ASE_v − ASE_h, with the corresponding standard error given in parentheses, where ASE_v is for the method on the vertical axis and ASE_h for the method on the horizontal axis.