A Scalable Frequentist Model Averaging Method

Abstract Frequentist model averaging is an effective technique to handle model uncertainty. However, calculation of the weights for averaging is extremely difficult, if not impossible, even when the dimension of the predictor vector, p, is moderate, because we may have candidate models. The exponential size of the candidate model set makes it difficult to estimate all candidate models, and brings additional numeric errors when calculating the weights. This article proposes a scalable frequentist model averaging method, which is statistically and computationally efficient, to overcome this problem by transforming the original model using the singular value decomposition. The method enables us to find the optimal weights by considering at most p candidate models. We prove that the minimum loss of the scalable model averaging estimator is asymptotically equal to that of the traditional model averaging estimator. We apply the Mallows and Jackknife criteria to the scalable model averaging estimator and prove that they are asymptotically optimal estimators. We further extend the method to the high-dimensional case (i.e., ). Numerical studies illustrate the superiority of the proposed method in terms of both statistical efficiency and computational cost.


Introduction
Model averaging is a popular and effective approach in dealing with model uncertainty and improving prediction accuracy.Instead of picking a single "best" model according to some model assessment criterion in traditional model selection, the model averaging approach advocates the pooling of predictions by giving higher weights to better models.The approach often reduces the risk in regression analysis, as "betting" on multiple models prevent the case of a singly selected model being poor (Leung and Barron 2006).
In the existing literature, Bayesian model averaging has been well studied, and there is a large amount of literature on this approach; see Hoeting et al. (1999), Raftery et al. (1997), and the references therein.As an alternative, frequentist model averaging on which this article focuses has been gaining increasing attention.Buckland et al. (1997) suggested using weights based on exponential Akaike information criterion (AIC) (Akaike 1973) to combine estimates from different candidate models.Yang (2001) and Yuan and Yang (2005) proposed a mixing estimator.Hjort and Claeskens (2003) provided an asymptotic analysis of model average estimators in the likelihood-based framework.Hansen (2007) proposed selecting the optimal weights for model averaging by minimizing a Mallows criterion.Liang et al. (2011) developed a procedure for selecting optimal weights such that the resultant estimator has the minimum mean-squared CONTACT Hua Liang hliang@gwu.eduDepartment of Statistics, George Washington University, Washington, DC.Supplementary materials for this article are available online.Please go to www.tandfonline.com/UBES.
error.Hansen and Racine (2012) proposed a jackknife model averaging method using leave-one-out cross-validation.
When there are p predictors, there are 2 p candidate models to consider.Thus, the size of the candidate model set is usually enormous, which causes critical issues for obtaining optimal weights by minimizing a criterion that requires to estimate all candidate models.For example, given 20 predictors, there are more than one million candidate models.It would be computationally difficult to estimate one million models and to obtain the optimal weights for averaging over them.Calculating so many weights may also cause the loss of prediction efficiency due to computational error.Attempts have been made to reduce this burden in the literature.One approach is to consider a subset of all possible models by assuming a nested structure among candidate models (Hansen 2007;Hansen and Racine 2012).However, this nested model structure is not applicable in some practical problems.For example, in labor economics, it is not suitable to assume that possible predictors can be expressed in a nested way (Wooldridge 2003).
In this article, we develop a scalable frequentist model averaging method based on singular value decomposition (SVD), from which, we obtain the left singular vectors of the predictor matrix, fit regression models on these singular vectors separately to get the estimators, and then average these estimators to obtain a model averaging estimator.This strategy enables us to find optimal weights by considering at most p candidate models instead of 2 p candidate models, and greatly advocates the applicability of frequentist model averaging.Magnus and Durbin (1999) proposed the weighted-average least squares estimator which transformed covariates based on the relation between the least squares estimators of interested parameters and that of nuisance parameters.The idea of orthogonalization was also used in Clyde et al. (1996) for Bayesian model mixing.They developed a Bayesian framework for orthogonalized design matrix assuming that the normal linear regression model is a correct model.Similar strategy can also be found in Charkhi et al. (2016), Jolliffe (1982), andPark (1981).We adopt a pure frequentist approach and we do not assume that the full model is correct.Compared with the traditional model averaging based on the original covariates, the improvement of prediction efficiency from the scalable model averaging, as well as computational gain from orthogonalization, are remarkable.
We prove that the minimum loss of the scalable model averaging estimator is asymptotically equal to that of the traditional model averaging estimator.This indicates that the best performance of the scalable model averaging estimator is asymptotically as good as the best performance of the traditional model averaging estimator.
Furthermore, we use Mallows (Hansen 2007) and Jackknife (Hansen and Racine 2012) weight selections methods to illustrate the proposed scalable model averaging, and prove that the resulting scalable Mallows/Jackknife model averaging estimators are asymptotically optimal, an oracle property established in the literature (Hansen 2007;Hansen and Racine 2012;Ando and Li 2014).
Another contribution of our scalable model averaging is its applicability to high-dimensional data (p ≥ n, with n being the sample size).Claeskens (2012) suggested averaging estimators from penalization-based estimation approaches, but she did not investigate how to average these estimators.Ando and Li (2014) developed a model averaging approach for high-dimensional regression.They took an average over candidate models that are formed by grouping the covariates according to marginal correlations.In this article we also apply our strategy to the high-dimensional case, and provide three practical approaches.The first procedure is to reduce the dimensionality by screening the original covariates; the second procedure is to reduce the number of left singular vectors by removing the ones with small singular values; and the third procedure is to reduce the number of left singular vectors by the sure independent ranking and screening (SIRS) method (Zhu 2011).We thank an anonymous reviewer for suggesting the third idea.A comparison with Ando and Li's (2014) approach indicates that our method achieves a higher estimation efficiency with remarkably lower computational cost.
The reminder of the article is organized as follows.Section 2 proposes our scalable model averaging method, and show the asymptotic equivalence in the minimum loss to the traditional model averaging.Section 3 uses the Mallows and Jackknife criteria to illustrate the scalable model averaging and establishes the asymptotic optimality of the resulting estimators.Section 4 extends the scalable model averaging method to the highdimensional case.Sections 5 and 6 present numerical evidences from simulated and real datasets.Proofs of the Theorems and additional numerical results are presented in the supplementary materials.

Model Averaging Estimators
Let {(y i , x i ), i = 1, . . ., n} be a random sample, where y i is the response and x i = (x i1 , . . ., x ip ) is the predictor vector.Our working model is the linear regression model: where θ = (θ 1 , . . ., θ p ) is the regression coefficient vector and e i 's are uncorrelated and heteroscedastic model errors such that E(e i |x i ) = 0 and E(e 2 i |x i ) = σ 2 i .In matrix notation, where y = (y 1 , . . ., y n ) , μ = (μ 1 , . . ., μ n ) , e = (e 1 , . . ., e n ) , and X = (x 1 , . . ., x n ) is the n × p predictor matrix.In this and next two sections, we focus on the case of n p.When we are not sure which of the p predictor variables should be included in a model, there are up to 2 p candidate models to consider.The method of model averaging is to average all estimates from each individual candidate model.Let M be the number of candidate models to be considered for averaging.Since 2 p may be too large to handle computationally, all 2 p possible candidate models are not always included for averaging, for example, the nested model structure (Hansen 2007;Hansen and Racine 2012) only considered a small proportion of all the possible candidate models.Thus, in traditional model averaging methods, it is typically assumed that M 2 p .If all possible candidate models are included for averaging, then M = 2 p .For m ≤ M, let X m be the predictors matrix corresponding to the mth candidate model.The ordinary least square (OLS) estimator of θ is θm = (X m X m ) −1 X m y, and the estimator of μ from the mth candidate is μm = X m θ m .Let w = ( w1 , . . ., wM ) be the weight vector in the unit simplex in R M : ( 2 ) A model averaging estimator of μ is To choose the weight vector w, Mallows model averaging (Hansen 2007) and Jackknife model averaging (Hansen and Racine 2012) are commonly used and are proved to be efficient.Regardless of Mallows model averaging and Jackknife model averaging, the target function to optimize is a quadratic function of w, for which the numerical solutions have been thoroughly studied and algorithms are widely available.

Scalable Model Averaging Estimators
Now we introduce the scalable model averaging method to address the computational challenge caused by the exponential size of the candidate model set.The idea is to use SVD to convert the predictor matrix into a column-orthogonal predictor matrix.
Denote the SVD of matrix X as X = UDV , such that U U = V V = I p , where U is an n × p column-orthogonal matrix, D is a p × p rectangular diagonal matrix with nonnegative real numbers on the diagonal, V is a p×p orthogonal matrix, and I p is the p-dimensional identity matrix.The original model (1) can be represented as where β = DV θ .Let u ij be ijth element of U.All the information in the original predictors x xj 's for the responses is preserved in u ij 's.We can see this from the angle of mean prediction through the hat matrix, which is defined as H = X(X X) −1 X for the original predictors.This matrix converts the observed responses to the estimated mean responses.The hat matrix from U is identical to (5) Thus, the prediction from the transformed model ( 4) is equivalent to that from the original model (1).
The model ( 4) can be rewritten as Notice there are still 2 p candidate models in the transformed model (6).Let U m be the predictor matrix corresponding to the mth candidate model in the transformed model ( 6), and denote the predictor set of the mth candidate model as The OLS estimator βm of the regression coefficient from any candidate model with U m consists of the OLS estimators from univariate regressions of y on each column of U m .To see this explicitly, let βj be the OLS estimator of β j from the model with a single predictor u (j) , the jth column of U. We have On the other hand, the OLS estimator βm from the mth model with where j 1 , . . ., j p m are elements of the set S m , and p m is the size of S m .Since columns of U are orthogonal, the marginal estimator for each β j remains the same as the jth component of the full model estimator.Note that the transform between β and θ in ( 4) is one-on-one when X is column full rank, so we can transform regression coefficient estimates based on U m to that based on the original predictors to interpret the results.From ( 7), the OLS estimator of any candidate model consists of { βj : j ∈ S m }.Thus, the prediction based on the mth candidate model For a model averaging estimator, it can be written as a linear combination of these u (j) βj 's, namely, where I S m (j) is the indicator function of the set S m , that is, I S m (j) = 1 if j ∈ S m and 0 otherwise.From Equation ( 8), if we define to be the weight, we can obtain the model averaging estimators under the framework of model ( 4) by combining (averaging) the predictions from the p univariate regression models.Let w = (w 1 , . . ., w p ) be the weight vector in the simplex The constraint p j=1 w j = 1 is not needed in the converted problem because p j=1 w j = M m=1 p j=1 wm I S m (j) is typically not 1.Since the estimator of μ in the jth univariate regression model is μ(j) = u (j) βj , the scalable model averaging estimate of μ is A direct benefit is that we just need to choose the weights of size p rather than those of size 2 p in traditional model averaging.Since the left singular vector matrix has the same prediction ability as the original predictor matrix as shown in ( 5), our scalable model averaging may keep the efficiency of the traditional model averaging using much less computational cost.Meanwhile, reducing the size of the weights from 2 p to p may improve the efficiency of frequentist model averaging for practical application because the reduced weight size may reduce the numerical error in optimization.These are observed in numerical studies with synthetic and real datasets.
It is worthwhile to mention that an element of w is the weight of a candidate model in the original predictors, but an element of w does not have this interpretation.Unlike the case of the regression coefficient, the linear transformation from w to w in (9) is not invertible; we cannot obtain w from w.Thus, we lose the intuitive interpretability of the estimated weight on the original candidate models.However, such a loss is not a concern when the focus is on the accuracy of prediction or estimator.

Asymptotical Equivalence of the Minimum Loss
Intuitively, one may expect to pay a price in terms of predictive efficiency for the huge computational gain.Interestingly, our method does not pay this price in the large sample sense.We prove that the minimum loss of our scalable model averaging estimator is asymptotically equal to that of the model averaging estimator based on the 2 p candidate models with the original covariates.In contrast, existing model averaging methods often assume that the number of candidate models is much smaller than 2 p for concerning computational burden, but it is unclear whether this causes any inflation in the minimum loss.The minimum loss of model averaging estimators in a reduced candidate model space is in general larger unless excluded models are truly redundant.
Let L( w) = μ( w) − μ 2 , where w ∈ H, be the sum of squared loss of the model-averaging estimator μ in (3) using the original covariates, and L(w) = μ(w) − μ 2 , where w ∈ H, be the sum of squared loss of a scalable model averaging estimator μ(w) in (10).Denote ξ n = inf w∈H E{L(w)}.We require the following conditions.For some constants c 1 , c 2 , and c 3 , in probability, as n → ∞, where H2 p is the weight set defined in (2) with M = 2 p .
Theorem 1 shows that the minimum loss for the scalable model averaging estimator in ( 10) is asymptotically the same as that for the traditional model averaging estimator in (3) using the original covariates with 2 p candidate models.It provides a theoretical foundation on asymptotical comparison of the scalable model averaging estimator and the traditional model averaging estimator.For the traditional model averaging estimator, if M = 2 p , that is, all possible candidate models are included for averaging, we need to estimate 2 p candidate models and calculate 2 p weights, while for our scalable model averaging estimator we just need to fit p univariate regressions and calculate a weight vector of size p instead of 2 p .The only extra cost is to perform a SVD on the predictor matrix X.Thus, our method greatly reduces the computational cost.Furthermore, this can be done without paying any penalty in terms of the estimation efficiency.This result is intuitive: on the one hand, the left singular vectors of the SVD on the design matrix are linear combinations of predictors and there is no information loss in the transformation as shown in (5); on the other hand, the orthogonality of the singular vectors greatly reduces the weight size from 2 p to p.

Scalable MMA and JMA
Theorem 1 in the previous section shows that the scalable model averaging estimator and the traditional model averaging estimator achieve the same minimum loss asymptotically.We often obtain the weight vector w in practice by minimizing some criterion C(w), that is ŵMA = arg min w∈H C(w). ( We consider two specific choices of C(w) to illustrate our scalable estimator: the Mallows model averaging (MMA) (Hansen 2007) and the Jackknife Model Averaging (JMA) (Hansen and Racine 2012).We summarize the scalable model averaging approach in Algorithm 1.
The MMA is based on the Mallows criterion.Since the number of covariates used in univariate regression model is 1, the Mallows criterion of our scalable model averaging is where • 2 stands for the Euclidean norm and σ 2 = (n − p) −1 y − UU y 2 .From (13), the weight vector is obtained as ŵMMA = arg min w∈H M(w).
The JMA is based on the leave-one-out cross validation criterion.Let H j = u (j) u (j) , and h j,ii be the ith diagonal element of H j , that is, h j,ii = u 2 ij .Define D j to be the diagonal matrix with (1 − h j,ii ) −1 being its ith diagonal element, and let H JMA j = D j (H j −I n )+I n and H JMA (w) = p j=1 w j H JMA j .Following Hansen and Racine (2012), the Jackknife criterion is From ( 14), the weight vector is obtained as ŵMMA = arg min w∈H M(w).
Algorithm 1 Scalable model averaging 1. Obtain U, the matrix of left-singular vectors of X, by SVD decomposition: X = UDV .2. For each column of U, u (j) , calculate the least square prediction μj = u (j) βj 3. Calculate the scalable model averaging estimator by averaging the predictions μj , j = 1, . . ., p, where ŵ = ( ŵ1 , . . ., ŵp ) is the weight vector obtained from data.Two specific choices are ŵMMA or ŵJMA obtained from Mallows' model averaging or Jackknife model averaging.
Our scalable Mallows/Jackknife model averaging estimators are optimal in the sense that they asymptotically achieve the minimum loss of the infeasible best possible model averaging estimator using the original model ( 1).This can be seen by combining Theorem 1 (establishing the equivalence of the minimum losses from the original and scalable model averaging) and existing results that MMA and JMA achieve the minimum loss for a given set of candidate models (Hansen 2007;Hansen and Racine 2012;Ando and Li 2014).Nevertheless, the orthonormality of the left singular vectors can be used to simplify the proof and weaken the required assumptions specified below.
Remark.C.3 means that no individual element of u ij 's dominates all the others.Since n i=1 u 2 ij = 1 for all j, this assumption is also reasonable.C.4 is quite mild since the equation of in probability, as n → ∞.

Extension to High-Dimensional Data
In this section, we extend the application of the scalable model averaging method to the high-dimensional setting with p ≥ n.In order to extend the scalable model averaging method to high-dimensional data, we propose three practical procedures to address the curse of dimensionality.The first procedure is to reduce the dimension by screening the original predictors using the sure independent ranking and screening (SIRS) method (Zhu et al. 2011).The second procedure is to reduce the number of left singular vectors by removing the ones corresponding to small singular values.The third procedure is to perform the SIRS on left singular vectors.For our first procedure with highdimensional data, we prefer the SIRS over the sure independence screening (SIS) proposed by Fan and Lv (2008) because we assume that all candidate models are wrong, which is the case for most practical problems.The SIRS allows us to assume that no linear candidate model is correct, and we just need to assume that the distribution of the response depends only on some of the covariates (active covariate) and does not depend on other covariates (inactive covariates).Following Zhu et al. (2011), the SIRS screens the covariates based on the magnitude of the following statistics instead of the marginal correlation, Derivation and interpretation of this statistics can be found in Zhu et al. (2011).This procedure is similar to the SIS used in the first step of Ando and Li (2014).Theorems 2 and 3 in Zhu et al. (2011) indicate that with probability approaching one, the SIRS can reduce the dimensionality without losing any active covariate.Thus, using SIRS preprocessing allows us to keep all candidate models that involve any activate covariates in a reduced dimension.Once the dimension is reduced, the asymptotic results for the lowdimensional case in Section 2 hold.
The second procedure for high-dimensional data is to drop some left singular vectors with small singular values such that we keep k left singular vectors with largest singular values.We do so by following the factor model setting where the columns of U with small singular values are assumed to be independent on the response (Bai 2003).For real high-dimensional data, it is almost always that rank(X) = n, but it is also often true that many nonzero singular values are small or even close to 0. Hence, it is possible, though probably hard, to establish the asymptotic equivalence of the minimum loss for model averaging with the original covariates under the factor model setting, but it is out of the scope of this article and is a future investigation topic.Empirically, we show that this procedure works well for finite sample sizes in Sections 5 and 6.
In the last procedure for high-dimensional data, we screen left singular vectors by the SIRS and drop some left singular vectors that are least relevant to y.Again, it is difficult to establish asymptotic equivalence of the minimum loss, because we would have to guarantee that the loss due to SIRS on U is asymptotically zero.Empirically, we will show that this procedure works well in Sections 5 and 6.
We extend Algorithm 1 to account for high-dimensional data, and summarize it as Algorithm 2.
Algorithm 2 A scalable model averaging for high-dimensional data Step 1 (a) Choose k covariates by SIRS to get the subset X k from X, and then obtain the left singular vectors from the SVD of X k ; or (b) obtain U from a SVD and keep the its k columns with the largest k singular values; or (c) obtain U from a SVD and choose k columns of it using the SIRS method.
Step 2 For each left singular vector from Step 1, say u (j) , calculate the least square prediction μj = u (j) βj .Calculate the scalable model averaging estimator by averaging the predictions μj , j = 1, . . ., k, where ŵ = ( ŵ1 , . . ., ŵp ) is the weight vector obtained from data.Two specific choices are ŵMMA and ŵJMA obtained from MMA and JMA, respectively.

Simulation Studies
In this section, we assess the numerical performance of our scalable model averaging using three simulation experiments.First, we consider a case when the full model is correct.Note that this case violates Condition C.2.We aim to investigate the performance of our scalable model averaging when some theoretical conditions do not hold.Second, we consider a case that all candidate models are misspecified.Third, we evaluate the performance of the proposed method with high-dimensional data.We compare five model averaging methods: (a) smooth In the high-dimensional setting, we compare our scalable model averaging estimators with the approach proposed by Ando and Li (2014) as well as the high-dimensional model selection method.Their performance is evaluated in terms of the risk under the squared loss function L = μ − μ 2 .
Example 1 (Full model is correct).Data are generated from the model y i = p j θ j x ji + i .Rows of the n by p predictor matrix X is generated from a multivariate normal distribution with mean 0 and covariance matrix with the (i, j)th element σ ij = ρ |i−j| .The value of ρ is set to 0.6.The sample size n is set to 50 (small sample size), 100 (moderate sample size), and 500 (large sample size), and the dimension is set to p = 5.The true coefficients θ = (θ 1 , . . ., θ p ) are generated from a uniform distribution of (0, 2) and then fixed.For the errors i , two settings are considered: one is homoscedastic, with i ∼ N(0, σ 2 ); the other is heteroscedastic, with i ∼ 0.5N(0, σ 2 ) + 0.5N(0, 3σ 2 ), which means that for half data i 's are from N(0, σ 2 ) and for the other half i 's are from N(0, 3σ 2 ).The value of σ 2 is selected to control the R 2 to vary on a grind between 0.1 and 0.9, in which the R 2 is defined as To evaluate each estimator, we compute the empirical risk (expected squared error) by calculating the average loss across B = 1000 simulations.Each risk is rescaled by the risk of the full model.We present the results in Figure 1.From these figures, it is seen that, for both homoscedastic and heteroscedastic settings, SMMA and SJMA are much better than MMA and JMA for most R 2 values.Meanwhile, the performances of SMMA and SJMA get closer as n increases.The difference between the scalable and original model averaging estimators gets smaller as R 2 increases, and the scalable estimators sometimes are slightly worse than the original ones when R 2 is very high, because when R 2 is very high (especially when n is large as well), the original JMA and MMA methods can reach the corresponding optimal risk easily.In this situation, our method can also easily reach its optimal risk, and both risks are close to each other, as shown in Theorem 1.
We also consider the case that covariates contain both discrete and continuous variables to check the impact of different types predictor distributions.We put the results in Figure A.1 of the supplementary materials for similarity and saving space.
Computational cost.The computational times (in seconds) form different model averaging estimators with various values of p are reported in Table 1.From the table, the computational costs of JMA and MMA are growing exponentially as the predictor dimension p increases, and when p > 13, the computer could not handle the optimization due to the size of the weighting vectors.Even for S-AIC, the computer cannot handle the calculation with dimension p > 20 due to estimating so many candidate models.However, for our scalable methods, SJMA and SMMA, the required computational times are very short (close to 0 sec).Even when n = 500 and p = 50, the required The computation is not affordable for JMA and MMA when p > 13 and for S-AIC when p > 20.The optimization is performed by "solve.QP"function from "quadprog" package in R language on a Macbook Pro with 3 GHz intel i7 processor and 8 GB memory running OS X operation system."-"means that the calculation is computationally infeasible.
Figure 2. The risk performance Example 2. The homoscedastic setting is considered here.We normalize the risk by dividing it by the risk of the maximal model.Note the R 2 here is also normalized because of the existing modeling bias.
times of SJMA and SMMA are less than 0.35 sec and 0.18 sec, respectively.
Example 2 (Model misspecification).In this example, we consider the case of model misspecification, that is, all candidate models are wrong models.The model setup in this example is similar to that in Example 1, except that the true model used to generate data contains an additional variable.To be specific, data are generated from a linear model with p = 9 using the same setup as in Example 1.However, only the first eight predictors are used in candidate model construction and the ninth variable is not used in any working model.The coefficient a of the ninth variable is set to be 1 and 3 to represent different degrees of misspecification.Results for the homoscedastic setting are presented in Figure 2. The results for the heteroscedastic setting are similar and thus are omitted to save space.We normalize the risk by dividing it by the risk of the maximal model.From the figure, SMMA and SJMA are much better than MMA and JMA.
As a increases, the advantages of SMMA and SJMA become less significant, but they are still uniformly better than MMA and JMA, respectively.These observations show that our scalable methods have good performance for misspecified models.
Example 3 (High-dimensional data).In this example, we consider the case of high-dimensional data, and compare our scalable model averaging method with the approach proposed by Ando and Li (2014) (AL).We implement the three approaches discussed in Section 4. For the SJMA and SMMA after screening X by the SIRS, we refer them as SIRS-SJMA and SIRS-SMMA, respectively; for the SJMA and SMMA after screening U by the SIRS, we refer them as SIRSu-SJMA and SIRSu-SMMA, respectively; for the SJMA and SMMA by removing left singular vectors with small singular values, we refer them as α-SJMA and α-SMMA, respectively.Here α stands for the fact that we keep the singular vectors such that the summation of their singular values is 100α% of the summation of all singular values.We also consider the performance of the high-dimensional model selection methods, LASSO, MCP (Zhang 2010), and SCAD (Fan and Li 2001).The results for these three methods are similar so we report the results for the LASSO and omit results for the other two.We use R package program "ncvreg" (Breheny and Huang 2011) to implement the LASSO, and perform 5-fold cross-validation to select the penalty parameter.
We adopt the same linear model setting used in Ando and Li (2014).To be specific, p = 2000 predictors are generated from the multivariate normal distribution with mean 0 and covariance matrix with the (i, j)th element σ ij = ρ |i−j| and ρ = 0.6.Among the p = 2000 predictors, 50 of them are active predictors (with nonzero regression coefficients), and they are spaced evenly.That is, the true predictors are X j for j = 40(s − 1) + 1 with s = 1, 2, . . ., 50.The coefficients θ j 's for the true predictors are generated from a normal distribution with mean 0 and standard deviation 0.5.The sample size n is set to 50.Ando and Li (2014) fixed the variance of the error term to be σ 2 = 0.2 2 .Different from theirs, we choose multiple values of σ 2 to let R 2 vary on a grind between 0.1 and 0.9.
In this example, for α-SMMA and α-SJMA, α = 0.99 is used so that the sum of singular values for singular vectors to use is about 99% of the sum of all singular values.The number of predictors by SIRS is set to k = 48.We show the performance of α-SMMA, α-SJMA, SIRS-SMMA, SIRS-SJMA, SIRSu-SMMA, SIRSu-SJMA, and AL in term of risk for different values of R 2 in Figure 3.It is seen that the risks of all scalable model averaging estimators are much smaller than those of AL for all cases.Meanwhile, our scalable model averaging also outperforms the LASSO as R 2 > 0.5.It indicates that our scalable model averaging outperforms the AL approach in the highdimensional setting, and screening X or U, and dropping left singular vectors with small singular values are effective strate-gies to apply scalable model averaging in high-dimensional data.In addition, from the perspective of the computational costs, the AL approach, on average, takes 0.5 sec in this highdimensional setting, while our scalable approach, on average, just takes 0.024 sec.Therefore, we observe from this simulated high-dimensional experiment that our scalable model averaging methods has desirable performance for high-dimensional data.In addition, for JMA, the strategy by screening U performs very similar to that by screening X.For MMA, this strategy outperforms that by screening X when R 2 is large, and vice versa.

Real Data Examples
In this section, we analyze two real datasets to further evaluate the performances of our scalable model averaging method for both the p < n and p > n settings.The two datasets considered here are the engineer wage dataset (ENGIN) and the firm-level data dataset (CEOSAL2) in Wooldridge (2003).The performances of the proposed methods are evaluated on the testing set in terms of the average prediction error under the squared loss function L = n −1 t μ − y 2 , where n t is the size of the testing set, and μ and y are predicted values and observed values in the testing set, respectively.
Example 4 (ENGIN).The response variable is the log of monthly salary ("lwage"), and the features include dummy variables male, highgrad, college, grad, polytech, highdrop, and non-dummy variables educ, swage, exper, pexper, expersq, lswage, pexpersq, mleeduc, and mleeduc0.The intercept term is added in the model, so the dimension of the predictor vector is p = 16.Totally, there 403 observations in the dataset.
We consider five model averaging methods, S-AIC, MMA, JMA, SMMA, and SJMA.To evaluate the prediction performance of these estimators, we randomly select a training set of sample size n = 350 for model fitting and use the rest of n t = 403−n observations as testing data.We repeat this process 100 times to obtain the average performances.Although the dimension of the predictor vector is only moderate (p = 16), it is still computationally infeasible to obtain the weights for MMA and JMA without the nested-model assumption.Thus, to implement the MMA and JMA methods, candidate models are constructed with the nested structure by correlation ranking suggested in Ando and Li (2014).Figure 4 reports the results of these model averaging estimators by plotting the prediction errors under the squared loss in boxplots.It shows that our scalable model averaging estimators, SMMA and SJMA, are much better than the existing model averaging estimators.This example suggests that the scalable model averaging estimators are more efficient in prediction for moderate dimension.
Example 5 (CEOSAL2).We use the CEOSAL2 dataset to evaluate the performance of our scalable model averaging estimators in high-dimensional setting.The CEOSAL2 dataset is a dataset about firm's profits.The dependent variable is the profits as % of sales, and the predictors include dummy variables collage grad, and non-dummy variables salary, age, comten, centen, sales, mktval, lsalary, lsales, lmktval, comtensq, and cetensq.The original dataset contains 177 observations and 13 raw features.We add lsalary 2 , lsales 2 and lmktval 2 to extend the number of predictors.We use the 16 raw features to create p = 121 features by adding interaction terms and an intercept and removing the almost dependent terms (the correlation more than 0.995).
We compare our scalable model averaging estimators with AL (Ando and Li 2014), and the high-dimensional model selection method, LASSO.Note we implement MCP, SCAD as well and omit the results here because of their similarity to that of the LASSO.For this dataset, we consider the training data of sample size n = 100, 120, 150 and use the rest n t = 177−n observations as the testing data.We repeat this process 100 times to obtain the average performance.We adopt the three strategies discussed in Section 4. For α-SJMA and α-SMMA, we set α to be 0.95 and 0.99.For the SIR screening, we set the number of variables to keep by the SIRS to be k that corresponds to α = 0.99.The performances of the scalable model averaging estimators are reported in Table 2.
From the table, the scalable model averaging estimators produce smaller prediction errors than AL and LASSO in all cases.It suggests that the scalable model averaging estimators are more efficient in estimation for high-dimensional data.It seems sufficient to set α ≥ 0.95.We also observe that both strategies of screening X and screening U are efficient.Therefore, similar to the simulated experiment, this real data example also shows the applicability of the scalable model averaging method to highdimensional data.

Conclusion and Discussion
In this article, we have proposed a scalable frequentist model averaging method.This method is computationally efficient, as it only needs to average p candidate models instead of 2 p candidate models.Moreover, the computational efficiency is achieved without sacrificing the prediction efficiency.Actually, the empirical prediction efficiency is even improved due to the decreased size of weights.We have rigorously proved that the minimum loss of the scalable model averaging estimator is asymptotically equal to that for the traditional model averaging estimator, which may involve averaging up to 2 p candidate models.We further have established the asymptotic optimality of the scalable Mallows/Jackknife model averaging estimators.It is worthy to mention that our scalable model averaging estimators can easily be applied into high-dimensional data.Compared with existing methods, the advantages of our method in terms of computation and estimation efficiency are also demonstrated via extensive simulations and real data examples.Specifically speaking, the empirical results indicate that our method not only saves much computational time, but also produces more accurate predictions.For the high-dimensional case, a comparison with Ando and Li's (2014) approach indicates that our method achieves a higher estimation efficiency with remarkably lower computational cost.
There are important questions about the scalable model averaging method that remain for future research.In highdimensional setting, we propose three procedures to apply our scalable model averaging: screening the covariates, discarding singular vectors with small singular values, and screening the singular vectors.Empirically, these procedures have desirable performance.However, theoretically, whether the minimum loss from these approaches are equal to the minimum loss from the traditional model averaging with all candidate models is unclear.Ando and Li's investigation has this theoretical limitation as well, because the asymptotic optimality of their approach was established based on grouped variables rather than on considering all possible candidate models.This is a research topic for future investigation.Another interesting topic is how to extend the idea to more general settings such as generalized linear models and other nonlinear models.

Figure 1 .
Figure 1.The risk in Example 1 with dimension p = 5.We normalize the risk by dividing it by the risk of the full model.

Figure 3 .
Figure 3. Example 3: The risk performances.We normalize the risk by dividing it by the risk of the full model.We split the results into two figures for better presentation.

Figure 4 .
Figure 4. Performances of model averaging estimators for the ENGIN dataset.We randomly split the dataset as a training set of size n = 350 and a testing set of n t = 403 − n observations.The losses are summarized from 100 replications.

Table 1 .
Computation costs (seconds) of different frequentist model averaging estimators with different feature dimension p.

Table 2 .
Performances of model averaging estimators for the CEOSAL2 data.We randomly split the dataset as a training set of size n and a testing set of n t = 177 − n observations.The prediction errors are the arithmetic means from 100 replications.