Forecasting in Large Macroeconomic Panels Using Bayesian Model Averaging

This paper considers the problem of forecasting in large macroeconomic panels using Bayesian model averaging. Practical methods for implementing Bayesian model averaging with factor models are described. These methods involve algorithms that simulate from the space defined by all possible models. We explain how these simulation algorithms can also be used to select the model with the highest marginal likelihood (or highest value of an information criterion) in an efficient manner. We apply these methods to the problem of forecasting GDP and inflation using quarterly U.S. data on 162 time series. Our analysis indicates that models containing factors do outperform autoregressive models in forecasting both GDP and inflation, but only narrowly and at short horizons. We attribute these findings to the presence of structural instability and the fact that lags of the dependent variable seem to contain most of the information relevant for forecasting.


Introduction
There has been a great deal of work in recent years involving large macroeconomic panels. The models used and the macroeconomic issues addressed differ across papers. However, the basic structure of these models is that there are one or a few variables of interest and a huge number of other variables which may have explanatory power for the variable(s) of interest. The information in these other variables is extracted using factor analysis.
For instance, Stock and Watson (2002a) carry out a forecasting exercise for a few key variables (i.e. industrial production, personal income, etc.) using up to 215 predictors. Most of the information in these predictors is extracted into a small number of factors which they call diffusion indexes. Another example is Bernanke, Boivin and Eliasz (2002) who take a VAR involving a standard set of variables (i.e. the log differences of prices and industrial production as well as the federal funds rate) and augment it with factors based on 120 other macroeconomic time series to create a so-called Factor Augmented VAR or FAVAR.
This model is used to estimate the effects of monetary policy. The motivation for such a model rises from the fact that monetary shocks should reflect the actions of the Fed and Fed decisionmaking is based on information sets covering many variables other than those in a standard VAR. The fact that standard VARs often yield "price puzzles" (e.g. a finding that a contractionary monetary shock causes prices to rise) is thought to arise from such omitted variables and FAVARs help surmount this problem.
These two examples illustrate how many of these things empirical macroeconomists want to do (e.g. forecast or identify monetary policy shocks) involve a large number of potential explanatory variables.
A key empirical finding is that these potential explanatory variables tend to be highly correlated with one another and, hence, a few factors can extract most of the information contained in them (see, e.g., Giannone, Reichlin and Sala, 2002). A theoretical literature has emerged which discusses the statistical properties of various factor-based estimators (see, among many others, Bai and Ng, 2002, Boivin and Ng, 2002, Forni, Hallin, Lippi and Reichlin, 2000, Knox, Stock and Watson, 2002and West, 2002. The purpose of the present paper is to implement alternative approaches to modeling with large macroeconomic panels by drawing on ideas from the Bayesian model averaging (BMA) literature. The empirical work we do involves macroeconomic forecasting, but the basic methods described in this paper have relevance for any similar macroeconometric problem.
The standard approach in the relevant literature is to choose a single model and present empirical results based on this model. For instance, of the dozens or hundreds of factors created using the large number of available predictors, it is common to choose only a few using sequential testing procedures or information criteria. There are two potential problems with this approach. First, when the researcher selects a single model, statistical evidence from other plausible models is ignored. In the context of sequential hypothesis testing procedures, the pre-test problem is well-understood (see, e.g., Poirier, 1995, pages 519-523). Draper (1995) and Hodges (1987) are also important references which discuss the importance of proper treatment of model uncertainty for areas of policy analysis such as forecasting. We do not intend to survey this literature here. Suffice it to note that there are theoretical and practical reasons (some of which are discussed below) for basing inference not on a single model, but on averages across models. These have motivated an explosion of papers which use Bayesian model averaging in many fields of applied statistics (see the Bayesian Model Averaging website, http://www.research.att.com/~volinsky/bma.html, for links to many applications). However, there have been relatively few papers in econometrics which adopt Bayesian model averaging (Fernandez, Ley and Steel, 2001b is a notable exception) and, to our knowledge, none in the field of macroeconomic forecasting using large macroeconomic panels. Hence, the main purpose of the present paper is to draw on ideas from the statistical literature on Bayesian model averaging and apply them to the problem of macroeconomic forecasting with factor-based models.
The second problem with the traditional approach is that evaluating an information criterion for every possible model can be computationally prohibitive since, if K is the number of factors and models are defined by the inclusion or exclusion of each factor, then 2 K possible models exist. For K > 20 or so, direct computation of an information criterion for every model is cumbersome or impossible. Hence, it is common to order factors according to the size of their eigenvalues and just consider models where all of the first q factors are included. This reduces the size of the model space to K (and typically much less since the maximum value for information criteria often occurs for small values of q). However, in factor analysis, the size of eigenvalues is related to the amount of information extracted from the explanatory variables (not the dependent variable) and it is possible that some factors associated with large eigenvalues have no explanatory power while some with small eigenvalues do have explanatory power for the dependent variable.
By searching over models defined by the first q factors, the researcher risks including irrelevant factors and missing important ones associated with small eigenvalues. Thus, a search over models which allow for nonsequential factors is potentially important. One contribution of the present paper is to adapt ideas from the Bayesian model selection literature to develop a simulation algorithm which efficiently searches over such high-dimensional model spaces to find the model with the highest marginal likelihood (or the highest information criteria). Intuitively, posterior simulation methods, which take draws of the parameters from the posterior density, are popular in Bayesian econometrics. Model selection can be done using analogous methods which simulate from model space instead of parameter space. The algorithm is constructed so as to focus on models of high probability. Thus, it is not necessary to evaluate the marginal likelihood (or an information criteria) for every model. This simulation algorithm can also be used to implement Bayesian model averaging.
We apply our Bayesian model averaging and selection methods to the problem of forecasting GDP and inflation using quarterly data on 162 time series from 1959Q1 through 2001Q1. We compare the real time forecasting performance of our methods to forecasts provided by an AR(p) and a model which simply includes the first q factors (where q is selected using marginal likelihoods). For both GDP and inflation, we find that the models which contain factors do out-forecast the AR(p), but only by a relatively small amount at short horizons. We attribute these findings to the presence of structural instability and the fact that lags of dependent variable seem to contain most of the information relevant for forecasting. Relative to the small forecasting gains provided by including factors, the gains provided by using Bayesian model averaging over forecasting methods based on a single model are appreciable. However, we draw attention to some important issues regarding the issue of prior elicitation over model space.

The Model
The basic model considered in this paper is: (2.1) for t = 1, .., T where y t is a scalar dependent variable, w t is a k w vector of explanatory variables and α (L) and γ (L) are polynomials in the lag operator of dimension p 1 and p 2 . In many macroeconomic applications, standard methods for statistical inference in (2.1) are inappropriate since the number of explanatory variables is so high. In (2.1) we have p 1 + p 2 × k w explanatory variables. In the application in Stock and Watson (2002a), k w = 215, in the application in Boivin and Ng (2002) k w is as high as 147 (although this latter paper shows how setting k w as low as 40 actually leads to better forecasting performance). Other applications have a similarly high number of explanatory variables. In such cases, directly estimating (2.1) will yield very imprecise estimates since many of the explanatory variables will be insignificant. Further, sequential testing procedures, designed to reduced the dimensionality of the problem, will be particularly unattractive due to the enormous number of tests which must be done. In frequentist statistical language, presenting results from one final model based on a series of preliminary tests will run into serious pre-testing problems and apparently significant empirical results may merely be due to data mining. In Bayesian language, such a strategy ignores model uncertainty and features such as posterior standard deviations will provide the researcher with a spurious over-confidence in empirical results. Put more informally, it is unwise to use sequential testing procedures to select a single model for two reasons.
First, the model selected might not be the one which is best (where the definition of "best" depends on the metric chosen by the researcher). Second, even if the model selected is the best one, it is rarely optimal to ignore the evidence from other "not quite so good" models.
The standard approach in the macroeconometric literature (whether Bayesian or non-Bayesian) is to replace the K explanatory variables in (2.1) with a much smaller set of factors. For instance, Stock and Watson (2002a) replace (2.1) with: where f t is an q−vector of factors generate according to: where w it is the i th element of w t (for i = 1, .., k w ) and λ i (L) is a polynomial in the lag operator.
Applications with such a specification show that some increases in forecasting performance can be achieved over ARs or low-dimensional VARs even if only a very small number of factors are used.
In this paper we consider a competitor to the dynamic factor approach described by (2.2) and (2.3). It addresses the pre-test criticisms in a sensible manner using Bayesian model averaging (or BMA, see, e.g., Hoeting, Madigan, Raftery and Volinsky, 1999). In the following sections we describe BMA and show how it can be implemented in the context of macroeconomic forecast using high-dimensional panels.

Bayesian Model Averaging and Selection
The literature on Bayesian model averaging and selection has burgeoned in recent years (see Hoeting et al, 1999, or Chipman, George andMcCulloch, 2001, for recent surveys). The basic idea behind Bayesian model averaging can be explained quite simply: Suppose the researcher is entertaining R possible models, denoted by M 1 , ..., M R , to learn about a quantity to be forecast y T +h . If we treat y T +h and M r as random variables, the rules of conditional expectation imply that: This same logic applies to functions of y T +h so, for instance, we can use: to help us calculate the posterior variance of y T +h , which can then be used to calculate predictive standard deviations and quantify uncertainty about y T +h . For the models considered in this paper, p (M r |Data) can be calculated for r = 1, .., R. As we shall see, it is also straightforward to calculate E (y T +h |Data, M r ) in every model.
An alternative to Bayesian model averaging is Bayesian model selection which involves simply choosing M * which has the maximum value for p (M r |Data). Point forecasts can then be based on E (y T +h |Data, M * ).
This is analogous to the standard frequentist approach of selecting a single model using, e.g., an information criterion and then forecasting using this model.
Bayesian model averaging has been frequently used with the linear regression model where there are a large number of potential explanatory variables. The researcher expects that only a few of these are important, but does not know which these few are. This is precisely the sort of issue which arises in macroeconomic forecasting with large panels. Roughly speaking, papers such as those mentioned in the introduction include virtually every variable which has available data and could conceivably be relevant. Stock and Watson (2002a), for instance, use 26 different real output and income variables, 28 relating to employment, 9 variables relating to retail and manufacturing sales, 22 variables relating to housing starts or sales, etc.. Most of these variables are likely unimportant and/or are highly correlated with other variables in the model. However, the researcher does not know beforehand which of these potential variables is unimportant.
In theory, if you treat models as random variables, model averaging is the correct thing to do in the sense that (3.1) follows from the rules of probability. In addition, papers such as Min and Zellner (1993) and Raftery, Madigan and Hoeting (1997) show how model averaging is optimal for forecasting in decision theory problems. In practice, Bayesian model averaging does not suffer from the criticisms associated with sequential testing procedures, since it formally includes model uncertainty in the statistical procedure.
Furthermore, by putting little weight on implausible models it surmounts the problems that arise when enormous numbers of explanatory variables are directly used in a regression.
Two important issues arise when implementing Bayesian model averaging. First, the prior for the parameters must be a valid probability density function in order to yield meaningful values for p (M r |Data).
This rules out the use of many noninformative priors which tend to be improper. Hence, several papers (see, among many others, Fernandez, Ley and Steel, 2001a) describe proper priors which do not require substantive amounts of subjective prior elicitation by the researcher. In this paper we use such so-called objective or benchmark priors as well as an empirical Bayesian approach which uses the data to estimate a key prior hyperparameter. Second, it is often the case that the number of models, R, is enormous and, hence, it is not possible to evaluate p (M r |Data) and E (y T +h |Data, M r ) for every model. In the present application, we will define models based on the inclusion/exclusion of each explanatory variable. Thus, we have R = 2 K models where K is the number of potential explanatory variables (including lags). For instance, if K = 30 then we have 2 30 > 10 9 models. Even if the computer could analyze each model in 0.001 of a second, it would take almost two years to analyze all the models. In response to this problem, a literature has developed that devises various ways of overcoming this problem through simulation over the space defined by the various models. In this paper we use the algorithm described in Clyde (1999) (which is similar to the importance sampling algorithm described in Clyde, DeSimone and Parmigiani, 1996) which should be particularly well-suited to the problem at hand. Intuitively, this is an algorithm which draws models from p (M r |Data) . In this way, the algorithm attaches more weight to the models with high probability (which are drawn more often) and less weight to implausible models. Even this algorithm is computationally quite intensive and, hence, it is important to stay within a class of models where the marginal likelihood can be calculated analytically. 1 For this reason, in our empirical work we stay within the framework of the Normal linear regression model with natural conjugate prior. The remainder of this section describes the details involved when adopting this approach.
In this paper, we always condition on the first max (p 1 , p 2 ) observations which we denote by 0, −1, .., 1− max (p 1 , p 2 ). With this convention, we define y = (y 2 , .., y T +1 ) 0 , ε = (ε 1 , .., ε T ) 0 , X to be the T × K matrix containing all potential lagged explanatory variables and write (2.1) as: We stress that the t th row of y contains data available at time t + 1 while the t th row of X contains data available at time t. Following standard practice (see, e.g., Stock and Watson, 2002a) all variables are transformed to stationarity (see the Data section and Data Appendix for more details). When forecasting h periods in the future, y is redefined as y = (y 1+h , .., y T +h ) 0 .
In our application, we wish to include variables which are common to every model (i.e. an intercept and lags of the dependent variable). An attractive way of treating such variables (see, e.g., Chipman, George and McCulloch, 2001) is to integrate them out using a noninformative prior. This is equivalent to removing the linear effect of these common variables on the dependent and other explanatory variables. To be precise, if y * and X * are the original dependent and potential explanatory variables and X * * contains a set of 1 The marginal likelihood, which is p (Data|Mr), is a key component of p (Mr|Data) since the rules of probability imply p (Mr) is the prior model probability discussed below.
explanatory variables common to all models, then in (3.3) we work with y = where X i and X * i are the i th columns of X and X * , respectively.
We assume We use a natural conjugate prior: denotes the Gamma distribution with mean s −2 and degrees of freedom ν (see Poirier, 1995, page 100). The prior hyperparameters β, B, s −2 , ν will be discussed below.
The algorithm described in Clyde (1999, section 3) requires the explanatory variables to be orthogonal to one another. Accordingly , we use an orthogonal transformation of (3.3) of the sort used in the factor analysis literature. That is, if we define Z = XW where W is a nonsingular K × K matrix chosen so that the columns of Z are orthogonal we can write (3.3) as There are many ways of choosing a nonsingular W such that Z = XW and the columns of Z are orthogonal. However, in the present application, a very logical choice suggests itself to us. We construct W using principal components, implying a factor structure similar to that used by others in the field (e.g. Stock and Watson, 2002). That is, W is simply the matrix of eigenvectors of X 0 X. With macroeconomic panels, it is common for K ≥ T and the last K − T + 1 columns of Z are equal to zero. If this occurs, then we delete these last columns of Z. In our empirical work, the condition K ≥ T usually holds and, hence, In the remainder of the paper we use K to denote the number of potentially explanatory variables actually included in the model and, thus, it is typically the case that The prior for σ −2 is unaffected by the transformation from (3.3) to (3.7) and the prior for the regression coefficients becomes Formulae for the posterior and marginal likelihood for this model are available in any standard Bayesian textbook (e.g. Poirier, 1995, pages 526 and 543). The predictive density is discussed in the Technical Appendix.
Different models are defined through a K × 1 vector γ with all elements equalling either zero or one. If the j th element of γ is zero, then the j th column of Z is deleted from the model (as is the j th element of α and the j th row and column of A). There are 2 K possible configurations for γ and this defines the set of all possible models. Clyde (1999) shows how p (γ|y) takes a simple form and, thus, Monte Carlo methods used to take posterior draws of γ if σ 2 is known. When σ 2 is unknown a Gibbs sampler can be set up which sequentially draws from p ¡ γ|y, σ 2 ¢ and p ¡ σ 2 |y, γ ¢ . This algorithm (which hinges on the fact that the explanatory variables are orthogonal), is very efficient in that p ¡ γ|y, σ 2 ¢ is directly drawn from. This contrasts with other common algorithms, such as Markov Chain Monte Carlo Model Composition (MC 3 ) ., γ K ¢ 0 and typically produce a highly correlated sequence of drawn models.
The output from our algorithm can be used either to carry out Bayesian model averaging or model selection. In the latter case, the model with the highest value for p (M r |y) (or, equivalently, the highest value for p (γ|y)) is selected. For every drawn model, the predictive mean and second moment are obtained using the formulae in the Technical Appendix. These can then be averaged in the same way as posterior is drawn (for s = 1, .., S) based on Data containing data through period τ, then ill converge to E (y τ +1 |Data) as S goes to infinity.
We do not provide all the technical details here since our implementation is the same as that described in Clyde (1999, section 3). Briefly, an analytical form for p ¡ γ|y, σ 2 ¢ can be derived given a prior, p (γ), using the fact that p ¡ y|γ, σ 2 ¢ is simply the marginal likelihood for the Normal linear regression model defined by γ. p ¡ σ 2 |y, γ ¢ takes the usual inverted-Gamma form for the Normal linear regression model defined by γ.
It remains to specify the prior model probability, p (M r ) (or, equivalently, a prior for γ) and choose β, B, s −2 , ν and W . A standard choice is where θ j for j = 1, .., K is the prior probability that each potential explanatory variable enters the model.
A common noninformative benchmark case sets θ j = 1 2 , a value which implies p (M r ) = 1 R for r = 1, .., R. However, other priors are possible. For instance, one might expect factors corresponding to higher eigenvalues of X 0 X to be more relevant than factors with low eigenvalues. We can incorporate this by allowing θ j to depend on v j , the j th largest eigenvalue of X 0 X. Thus, some of the empirical work we do assumes: (3.10) We also work with what we call a 99.9% prior which sets θ j = 1 2 for j = 1, .., K 99.9 , where 99.9% of the variation in X is contained in the first K 99.9 factors. Thus, this third prior discards the factors associated with very small eigenvalues but is otherwise noninformative.
The remaining prior hyperparameters are chosen using a strategy suggested in, among other places, Fernandez, Ley and Steel (2001a). This is based on the insight that using a noninformative, improper, prior over parameters common to all models is an acceptable practice (see, e.g., Kass and Raftery, 1995, page 783). Hence, we choose a noninformative prior for σ −2 (i.e. ν = 0 and, with this choice, s −2 does not enter the marginal likelihood or posterior). For β and B it is not acceptable to make noninformative choices since Bayes factors are either degenerate or depend on arbitrary normalizing constants. Following Fernandez, Ley and Steel (2001a), we center the prior for these regression coefficients over zero (i.e. β = 0 K ) and use a g-prior form for B (see Zellner, 1986). The g-prior reduces the choice of the K × K prior covariance matrix B to a single scalar hyperparameter g by setting: The use of a g-prior is common in Bayesian model averaging and justification for a prior of this choice is given in many papers (e.g. Steel, 2001a or Zellner, 1986). In essence, this allows prior information to have the same scale as likelihood information. In the present context, the g-prior cannot be used if the original number of explanatory variables exceeds T since then X 0 X is singular. We get around this problem by working directly with a g-prior for (3.7) and, thus, We should digress for a moment to discuss some issues relating to the g-prior (and indirectly the priors over model space given by 3.10 and the 99.9% prior). In the Normal linear regression model, having the prior depend on explanatory variables is acceptable. That is, the likelihood function and posterior are defined conditionally upon Z (and, hence, X), so having X in the prior does not violate the rules of conditional probability. Having the prior depend on y would violate the rules of conditional probability, but our prior does not have this property. Furthermore, the structure of the prior covariance matrix in (3.10) is similar to the observed information matrix (for stationary data). It is common to elicit priors related to the information matrix. Thus, there is a sense in which the g-prior incorporates the time series structure implied by inclusion of lagged explanatory variables. We use a noninformative prior for the lagged dependent variables, which are included in all models.
It remains to specify g. Fernandez, Ley and Steel (2001a) investigate the properties of many possible choices for g and show that some of them yield posterior model probabilities which have properties similar to commonly-used information criteria. In an objective Bayesian spirit, we focus on such values for g. In particular, we consider three different values: and (3.15) Fernandez, Ley and Steel (2001a) show that the values for g in (3.13) and (3.14) yield consistent model selection criteria (i.e. asymptotically the model which maximizes p (M r |y) will be the correct one, insofar as there is a single correct model). Furthermore, logs of Bayes factors obtained using (3.13) and (3.14) behave asymptotically like the Schwarz and Hannan-Quinn (with C HQ = 3) criteria, respectively. The value for g given in (3.15) implies the Risk Inflation Criterion (RIC) of Foster and George (1994) who also provide motivation for this choice. Further motivation for (3.13) can be found in Kass and Wasserman (1995) who refer to a similar prior as a "unit information prior" and relate it to the intrinsic and fractional Bayes factors literature (see Berger and Pericchi, 1996). It is worth stressing that Bayesian model selection done using these values for g is comparable to traditional model selection approaches using information criteria. In fact, a sensible, albeit ad hoc, non-Bayesian model averaging procedure could be done based on information criteria with p (M r |Data) in (3.1) replaced by an information criteria (normalized so that the weights used in the model averaging procedure sum to one).
In addition to such information criteria-based choices for g, we also consider choosing g in a data based manner. Such an empirical Bayesian methodology can be criticized on the grounds that allowing a prior to depend on y violates the rules of conditional probability. Nevertheless, empirical Bayesian methods are popular with many practical econometricians (see, e.g., Knox, Stock and Watson, 2002). There are various approaches which use the label "empirical Bayes". The most common empirical Bayesian methods choose the value of a single prior hyperparameter (here this is g) which maximizes the marginal likelihood. Here we adopt such an approach.

Data
We use 162 U.S. time series from 1959Q1 through 2001Q1 which are essentially the same as those listed in Stock and Watson (2002b). Our paper differs with related work (e.g. Stock and Watson, 2002a) in that we use quarterly data which allows us to forecast the series of most popular interest, GDP. Most previous work has used monthly data and focussed on series such as industrial production or personal income (partly because GDP is difficult to forecast). In addition to forecasting GDP, we also forecast inflation. Note that other researchers have found that it is harder to forecast price variables than real variables, so that the two series we have chosen should be ones that are difficult to forecast. We do this to ensure that our results are not perceived as due to our searching over 162 variables in order to find some to forecast which support our case.
The complete list of variables, along with brief descriptions and DRI acronyms, are given in the Data Appendix. Here suffice it to note that our quarterly GDP measure is GDPQ and we use PUNEW, the CPI (all items), as our price measure. In order to avoid modeling issues relating to unit roots and cointegration (and following standard practice see, e.g., Stock and Watson, 2002a), we induce stationarity in all our variables. The exact transformation used for every series is given in the Data Appendix. For our forecasted series, we take first and second differences, respectively, of the logs of GDPQ and PUNEW. Thus, in our empirical work, including the forecasting exercise, we are working with the GDP growth and the growth of inflation.

Empirical Results
We investigate three types of econometric procedure: Bayesian model averaging, Bayesian model selection using the model which maximizes p (M r |Data) and a conventional model selection procedure where we simply consider models with the first q factors included by choosing the value of q which maximizes the marginal likelihood (we search over values of q up to 20). Given the relationship between marginal likelihoods and information criteria for the benchmark priors used in this paper (see the discussion after equation 3.15), our two model selection procedures are comparable to traditional approaches using information criteria. Note, in particular, that even for the reader uninterested in model averaging, our computational algorithm provides for a very efficient search over models with non-sequential factors (i.e. there will be 2 K such models and evaluating an information criteria for each will be computationally infeasible if K is at all large).
With regards to the prior for the parameters of the models, we use the g-prior of (3.12) with the three values for g given in (3.13) through (3.15). In addition, we choose g using empirical Bayesian methods. We use three priors over model space (see equation 3.9). One is noninformative over all factors (i.e. θ j = 1 2 for j = 1, .., K), one is noninformative over the first q factors, where q is chosen such that 99.9% of the variation in X is included (we refer to this as the 99.9% prior) and the other is informative with probability allocated to the coefficient on each factor proportional to the size of its eigenvalue (i.e. θ j given in equation 3.10 for j = 1, .., K).
Throughout we compare our results to an AR(p) base case (using the standard noninformative prior).
We choose the value of p which yields the lowest forecast root mean squared error. In this way, we are allowing the benchmark AR(p) to perform in the most favorable manner. For both GDPQ and PUNEW we find p = 2. We also use two lags of all potential explanatory variables when we construct the factors.
In the following tables, results relating to GDPQ are presented in tables with "a" suffixes, while results for PUNEW have "b" suffixes.   Note that the results for the case g = 1 K 2 are a little hard to interpret since K is so different across models (i.e. K = T − 1 for two of the Bayesian model averaging cases, whereas K = q for the models where the first q factors are chosen). Note also that, especially for PUNEW and regardless of which prior we use, the relationship between q and the marginal likelihood is quite nonlinear.  Tables   2a and 2b presents the comparable quantity, E ³ P K j=1 γ j |Data´. These tables shed a great deal of light on the results of Tables 1a and 1b. When we use a flat prior over the model space (i.e. θ i = 1 2 ), then, a priori, it is just as likely that a factor associated with a small eigenvalue enters as one with a large eigenvalue.

Estimation and Model Comparison using the Entire Sample
Using either BMA or Bayesian model selection, very many factors are chosen to enter the model. The one exception to this is when g = 1 K 2 . Since K = T − 1, this value implies very large prior variances on the coefficients. It is well-known that (see, e.g., Poirier, 1995), in the Normal linear regression model with natural conjugate prior, as the prior variance of a coefficient goes to infinity, the Bayes factor in favor of the coefficient equalling zero goes to infinity. In other words, by increasing the prior variance, the reward for parsimony is increased. This accounts for the fact that so few factors are chosen when g = 1 K 2 . In case the reader is puzzled by the fact that 100 or more factors are usually chosen when the prior is flat over model space, it is worth mentioning that non-Bayesian results indicate a similar pattern. For instance, a simple OLS regression of GDP growth on the first 100 factors yields t-statistics on 46 coefficients which are greater than one and 15 which are greater than two (in absolute value). A regression containing the first 10 factors has R 2 = 0.232, while the regression containing the first 100 factors has R 2 = 0.777 (even though the ratio of the 100th to the first eigenvalue is 1.04 × 10 −7 ). Thus, it appears that, although most of the information in X is contained in the first few factors, many of the factors associated with lower eigenvalues do have significant in-sample explanatory power for real GDP growth. Presumably, there is so much information contained in X that even a very small amount of it can have useful explanatory power. Similar statements can be made for PUNEW.
The use of the priors over model space given in (3.10) and the 99.9% prior effectively rule out most of the factors associated with small eigenvalues and, hence, the marginal likelihood results of Tables 1a and 1b are more similar to those for a traditional factor model with q selected using, e.g., an information criteria. For instance, for PUNEW with the 99.9% prior and g chosen to maximize the marginal likelihood, the Bayesian model selection procedure chooses six non-sequential factors (the factors associated with the eigenvalues ranked 1, 2, 3, 5, 6 and 11th). The results in Table 1b indicate this model is to be preferred over a model which simply chooses, e.g., the first 6 factors.  In this sub-section, we have shown how the Bayesian model selection procedure can be used to find models with much higher marginal likelihoods than a traditional strategy of simply choosing q. Furthermore, BMA is averaging over models which include many more factors than are traditionally chosen in comparable forecasting exercises. However, in-sample performance does not necessary imply good forecasting performance and it is to this we now turn.

Simulated Forecasting Exercise
The forecasting exercise we carry out has a similar structure to that done by others in the literature (e.g. Stock and Watson, 2002a). That is, we do simulated real time forecasting from 1970Q1 through 2000Q4 of y τ +h for horizons h = 1, 4 and 8. Unlike Stock and Watson (2002a) we focus on the growth rate at period τ + h rather than the cumulated growth from period τ to period τ + h. This allows us to more clearly identify the horizons for which we have forecasting power.
We conduct Bayesian model averaging and selection using the model given in (3.7) for every period from 1970Q1 to the end of the sample. At any point forecast time, τ , the matrix of factors contains data through the time the forecast is made. We will denote this matrix of factors as Z τ where τ is the time the forecast is being made (i.e. the last row of Z τ contains data from period τ). The t th row of y contains the value of GDP growth in period t + h. Thus, for each variable, we carry out nearly 500 BMA exercises for each choice of g (i.e. roughly 160 choices for τ and three choices for h). When calculating the optimal g, we do a grid search at each τ . Although computationally demanding, this is well within the power of modern personal computers. Note that in the rows labelled "Model with First q Factors Selected" we select a potentially different value for q for each value of τ and g.
Tables 3a and 3b present some diagnostics on the point estimates of the forecasts (i.e. the predictive means) using the root mean squared forecast error as a metric: To aid in interpretation, we normalize the forecast errors relative to that provided by an AR(2). Thus, e.g., an entry in the predictive follows a t-distribution with τ degrees of freedom. Since τ is 38 or more, the predictive should be roughly Normal and, hence, a two predictive standard deviation interval should approximate a 95% highest predictive density interval. When Bayesian model averaging is done, the predictive will be a mixture of t-distributions. Tables 5a and 5b present information on the number of factors found to be important (see the discussion of Tables 2a and 2b for additional explanation).      Overall, the forecasting results are less clearly favorable to the Bayesian model averaging and model selection methods that are the focus of this paper than were those obtained using the full sample. In general, none of the models which augment an AR(p) with factors exhibits an enormous improvement in forecasting over the AR(p) for either GDPQ or PUNEW. At medium to long forecast horizons (i.e. h = 4 or 8), there is virtually no evidence that a factor augmented model can beat an AR(2). As we shall discuss in more detail below, this is likely due to the facts that lags of the dependent variable contain most of the information relevant for forecasting our variables and that substantial time variation in the regression coefficients occurs. Thus, at longer forecasting horizons, this instability in coefficients comes to dominate the small amount of information in the factors and the factor-augmented models cannot beat the AR(2).
In the short run (i.e. h = 1), however, there is evidence that incorporation of the information in the factors allows for some moderate improvement in forecasting performance. For this reason, we will focus most of the following discussion on results for h = 1.
If we focus on short run forecasting and the 99.9% prior, Table 3a indicates that incorporating the information in the factors can reduce RMSEs for GDPQ by roughly 5% while  Tables 3a and 3b where BMA yields RMSEs which tend to be at least 1% lower than the model selection methods (a decrease which is substantive when one considers that all the information in the factors is only improving RMSEs by 5% or 10%). It can also be seen in Tables 4a   and 4b where BMA predictive intervals (i.e. the predictive mean +/-two predictive standard deviations) exhibit slightly better coverage than predictive intervals based on a single model. The reason for this is that, by incorporating model uncertainty as well as parameter uncertainty, predictive standard deviations are slightly larger for BMA than with model selection methods.
It is also worth noting that forecasting results are fairly insensitive to the choice of g with the computationally demanding empirical Bayesian methodology performing roughly as well as simpler approaches.
Hence, at least for h = 1 and the 99.9% prior, our findings are quite encouraging for Bayesian model averaging. It produces point forecasts which are more accurate than those produced using model selection methods and predictive intervals that have better coverage. However, all these findings hold for the prior over model space which is noninformative over the subset of the factors which contain 99.9% of the information in X. The other priors over model space yield worse forecasting performance. The prior given in (3.10) performs only somewhat worse than the 99.9% prior, however the forecasting performance of the completely noninformative can be dire. It is to this issue of sensitivity to prior over model space that we now turn.
Tables 5a and 5b shed light on the poor forecasting performance which occurs when noninformative priors are used. Many factors associated with very small eigenvalues are included when the priors with θ j = 1 2 and g = 1 T or g = 1 [ln(T )] 3 are used, and these have dire consequences for forecasting. Results in Tables 1a and 1b show that including many factors greatly improves measures of in-sample fit (such as the marginal likelihood). Before seeing the forecast results, we had thought this plausible. That is a few factors associated with small eigenvalues might be useful in explaining rare events (e.g. business cycle turning points). However, this good in-sample performance and plausible story do not translate into good forecasting performance. The only models which forecast better than an AR(2) are those which rule out many of the factors. Note that the factors associated with smaller eigenvalues can either be ruled out by directly attaching low prior weight to their entering the model (i.e. through the 99.9% prior or a prior such as 3.10), or by using a relatively flat prior for the regression coefficients (i.e. through choosing g = 1 K 2 , a value which implies a strong reward for parsimony) or by simply not including these factors (i.e. in the conventional approach where we simply include the first few factors and ignore the rest).
The use of empirical Bayesian methods which, for every forecast horizon, choose the prior which maximizes the marginal likelihood does not substantively improve forecast performance. This finding supports the story that successful in-sample performance (as measured by marginal likelihoods) does not map into successful out-of-sample forecasting performance.
Our results are consistent with those of Knox, Stock and Watson (2002) who consider forecasting using a large number of monthly time series using various estimation methods, including empirical Bayesian methods of a different sort from those used in our paper. These methods include many factors as explanatory variables and a data-based prior to carry out Bayesian inference. Knox, Stock and Watson (2002) find very good in-sample performance of their empirical Bayesian methods, but relatively poor forecasting performance in a simulated real-time forecasting exercise similar to that carried out in this paper. Despite the fact that they are using very different data from us, this pattern of good in-sample performance and bad out-of-sample performance holds in both of our exercises. Knox, Stock and Watson (2002) argue that parameter instability is probably the reason for this.
Another way of linking our findings to others is through the concept of shrinkage. There are many ways of shrinking forecasts (see, e.g., Giacomini, 2002, who finds so-called "Bayesian shrinkage estimators" to perform best in a forecasting exercise). In our framework, shrinkage can either occur through priors on the parameters (i.e. through g) or through priors on model space. In the latter case, omitting an explanatory variable is the ultimate way of shrinking its effect. An advantage of Bayesian model averaging is that it is much more flexible in the way it can handle this second shrinkage effect. That is, BMA can put weights on factors between zero and one (see equation 3.1) which effectively shrinks their effect. Hence, one way of looking at the poor results using the noninformative prior over model space, is that it simply does not shrink forecasts enough.
The one case where the noninformative prior does yield reasonable forecasts is when g = 1 K 2 (i.e. the most noninformative value we consider). This case illustrates an important point regarding the role of g in shrinkage. In the context of a single model, decreasing g will make the prior for the coefficients less informative and, thus, decrease shrinkage of the forecasts. However, in a BMA exercise, decreasing g will also increase shrinkage due to the reward for parsimony associated with less informative priors. In Tables 5a and 5b, when g = 1 K 2 it can be seen that very few factors are included (which shrinks forecasts relative to the case where more factors are included). However, the fact that g is so small means that the coefficients are not shrunk and will be very close to OLS estimates. Thus, decreasing g can either increase or decrease shrinkage. The general point we draw from this is that, unlike Bayesian estimation in the context of a single model, priors can matter a great deal and must be carefully selected. The specific point we draw is that it is best to choose the prior on model space (i.e. choose θ j ) to reflect prior information about the likely number of factors in the model and g to reflect prior information about the degree of shrinkage of regression coefficients (or use empirical Bayesian methods to estimate g). To attempt to use a single hyperparameter, g, to control both types of shrinkage is impossible. Results with the model-space prior given in 3.10 are much better than with the completely flat prior, but worse than the 99.9% prior. This is due to the former prior requiring the first factor to always be included and downweighting some factors associated with very small eigenvalues. Since including the first factor does not always improve forecasts, while including some factors with very small eigenvalues sometimes does, the prior given in 3.10 does not do quite as well as the 99.9% prior.
The previous discussion has focussed on short forecasting horizons since parameter instability precludes accurate forecasting at longer horizons. However, it is worth briefly noting that, when we use an informative prior over model space, the average number of factors included in each forecasting model tends to be very small. For instance, with the 99% prior the models selected using our Bayesian model selection procedure tend to have, on average, much less than one factor included. In other words, at many or most of forecasting points, no factors at all are included. This is getting very close to simply providing a trivial forecast of zero at every point in time (remember, our data have been de-meaned and stationarity induced). We take this as additional evidence that past information in our data set is not very useful in forecasting, especially when h = 4 or 8. The most likely reason for this is parameter instability.

Conclusions and Directions for Future Research
In this paper, we have used Bayesian model averaging to address the problem of forecasting in large macroeconomic panels. We have provided both theoretical and empirical justifications for such an approach, as opposed to selecting a single model, are given. We have shown how BMA can be implemented in factor models using algorithms which simulate from the space defined by all possible models. Such simulation algorithms can be used either to do Bayesian model averaging or Bayesian model selection in an efficient manner. We applied these methods to the problem of forecasting GDP and inflation using quarterly U.S. data on 162 time series. For both GDP and inflation, we found that the models which contain factors do out-forecast an AR(p), but only by a relatively small amount and only at short horizons. These findings can be attributed to the presence of structural instability and the fact that lags of dependent variable seem to contain most of the information relevant for forecasting. Relative to the small forecasting gains provided by including factors, the gains provided by using Bayesian model averaging over forecasting methods based on a single model are appreciable.
Our findings suggest several avenues for future research. We have found strong evidence of structural instability and/or nonlinearity in the time series we have considered. One way that this problem can partially be addressed is by using rolling forecast windows. The methods of Bayesian model averaging and selection described in this paper can be used directly to handle this modification. Furthermore, simple extensions of our framework could be done which would allow us to move towards addressing some of these data features (e.g. including dummy variables for structural breaks or including nonlinear transformations of the factors). However, it would be desirable to construct models which allow for the structural instabilities and nonlinearities which may exist. State space models provide a natural framework to deal with such issues. For instance, the class of latent factor regression models described in West (2002) is an attractive one when working with large macroeconomic panels. Indeed, the use of a latent factor representation (instead of using the actual factors as we have done), may help mitigate some of the problems we have found when doing Bayesian model averaging with the noninformative prior over model space (i.e. the latent factor representation removes some of the idiosyncratic variation in the factors and, hence, should reduce over-fitting problems). Extending the latent factor regression model to allow for structural instabilities and/or nonlinearities is relatively straightforward. However, in such a class of models, it would be difficult or impossible to carry out Bayesian model averaging over 2 K possible models since analytical expressions for predictive moments and marginal likelihoods do not exist. For this reason, in the present paper we have stayed within the framework of the Normal linear regression model with natural conjugate prior, despite its failure to properly model some key data features.
where Y τ = (y 1 , .., y τ ) 0 . Note that our orthogonalization strategy implies A is a diagonal matrix and, thus, A τ is a diagonal matrix of a simple form. This, and the fact we have assumed α = 0, simplifies computation.
The remaining feature to be evaluated is s 2 τ which is given by: These formulae hold for the full model with γ = ι K . Results for other values of γ are obtained by deleting columns/rows/elements of all vectors/matrices as appropriate.

Data Appendix
This Appendix provides a list of all variables used in the analysis. All data are quarterly from 1959Q1-2001Q1. With a few minor exceptions, we use the same variables and transformations as Stock and Watson (2002b). The original variables were taken from the DRI Basic Economics database and we use the DRI acronyms for these. Some transformations of these original variables are taken as noted below.