Optimal Model Averaging of Mixed-Data Kernel-Weighted Spline Regressions

Abstract: Model averaging has a rich history dating from its use for combining forecasts from time-series models (Bates and Granger 1969) and presents a compelling alternative to model selection methods. We propose a frequentist model averaging procedure defined over categorical regression splines (Ma, Racine, and Yang 2015) that allows for mixed-data predictors, as well as nonnested and heteroscedastic candidate models. We demonstrate the asymptotic optimality of the proposed model averaging estimator and develop a post-averaging inference theory for it. Theoretical underpinnings are provided, finite-sample performance is evaluated, and an empirical illustration reveals that the method is capable of outperforming a range of popular model selection criteria in applied settings. An R package is available for practitioners (Racine 2017).


Introduction
When conducting applied econometric and statistical analysis, the presence of model uncertainty is ignored at the practitioner's peril (Leamer 1983). The two dominant approaches for tackling this source of uncertainty when conducting regression analysis involve either model selection or model averaging (Claeskens and Hjort 2008). Model selection deals with model uncertainty by selecting one model from among a set of candidate models using a model selection criterion such as Akaike's information criterion (Akaike 1974, AIC) or the Schwarz-Bayes information criterion (Schwarz 1978, BIC), by way of illustration. That is, model selection takes a set of candidate models and assigns weight 1 to one model and 0 to all others. An alternative is to apply model averaging, which deals with model uncertainty by instead averaging over the candidate models using a model averaging criterion that assigns a vector of weights to the set of candidate models, typically resulting in nonzero weights for each of the candidate models. The goal in model averaging is to control misspecification bias while reducing estimation variance.
Model averaging incorporates the model selection uncertainty into inference (Buckland, Burnham, and Augustin 1997), and its ability to improve upon model selection has been demonstrated (Peng and Yang 2022).
Practitioners who adopt model averaging methods often construct a weighted average defined over a set of parametric candidate models, and interest typically lies in one or more parameters common to all models. Despite their popularity and ease of interpretation, model averaging over parametric candidate models can produce unsatisfactory forecasting results due to the rigid, unforgiving nature of the parametric candidates. In light of this, we adopt a nonparametric model averaging approach which instead averages over nonparametric candidate models. By allowing for a flexible nonparametric relationship between the response and the predictors, nonparametric model averaging may yield improved prediction accuracy and lower model misspecification risk (Li et al. 2020). Nonparametric model averaging therefore serves as a complement to parametric model averaging.
We also want our approach to be useful for practitioners and, in this spirit, to admit a range of predictor types, including continuous and categorical and, for the latter, both unordered and ordered data types. For these reasons we consider averaging over a recently proposed regression spline technique that admits mixed-data predictors, that is, both categorical and continuous data (Ma and Racine 2013; Ma, Racine, and Yang 2015). The use of regression splines is appealing from a variety of perspectives. From the practical perspective, regression splines deliver global approximations that are computationally attractive, since they require nothing more than solving a simple weighted least squares problem. From the theoretical perspective, the B-spline basis is the maximally differentiable spline basis, and its approximation capabilities have been widely studied and are well established. Furthermore, for those accustomed to least squares estimation methodology and the use of polynomials for modeling nonlinearities in a regression setting, regression splines present a familiar yet powerful alternative to classical parametric candidate models.
In addition to estimating an unknown regression function, researchers often need to conduct statistical inference. However, post-averaging inference is largely neglected in the model averaging literature, particularly when averaging over nonparametric models. In general, this inference problem is especially difficult because the model weights and the candidate models are estimated from the same sample, and the resulting noise terms are embedded in these statistics in a very complicated manner. In this article, we provide a post-averaging inference result for our approach.
There is a longstanding literature on Bayesian model averaging; see Hoeting et al. (1999) for a comprehensive review. The past two decades have seen a rapidly growing literature on frequentist model averaging that includes model averaging for least squares regression (Hansen 2007; Wan, Zhang, and Zou 2010; Liang et al. 2011; Hansen and Racine 2012; Liu 2015; Xie 2015), for constructing optimal instruments (Kuersteiner and Okui 2010), for quantile regression (Lu and Su 2015), for time series prediction (Hansen 2008; Cheng and Hansen 2015), and for treatment effect estimation (Kitagawa and Muris 2016), among others. In addition to the parametric model averaging approaches listed above, semiparametric and nonparametric model averaging have recently become much more popular. For example, Li, Linton, and Lu (2015) establish model averaging over nonparametric kernel estimators for time series forecasting, Liu (2018) proposes averaging over local constant and local linear kernel estimators, while Li et al. (2020) and Fang, Li, and Xia (2022) both consider semiparametric model averaging for discrete response models. As well, Li et al. (2018) and Zhang and Wang (2019) develop Mallows model averaging estimators for varying coefficient and partially linear models, respectively, while Zhu et al. (2019) construct a model averaging method for varying-coefficient partially linear models.
In this article, we establish a model averaging method for nonparametric spline candidate regression models allowing for mixed-data predictors. The proposed model averaging estimator is proven to be asymptotically optimal in the sense of minimizing predictive squared loss in large samples. Our theoretical results, which are based on the Mallows criterion (Mallows 1973), apply both to nested and nonnested candidate models, and we also allow for conditional heteroscedasticity. Liu and Okui (2013) and Liu, Okui, and Yoshimura (2016) consider model averaging for linear regression candidate models and generalized least squares models, respectively; the former allows for both nested and nonnested candidate models and the latter assumes the candidate models are nested. Hansen and Racine (2012) propose a jackknife model averaging procedure for linear regression candidates that is robust to heteroscedastic errors and nonnested models. There is also a related literature on model averaging of series candidate models. Liu and Tao (2016) study model averaging for nonparametric instrumental variable models based on sieve candidate models. Hansen (2014, 2015) considers averaging (nested) sieve regression models and establishes asymptotic optimality in terms of integrated mean squared error, while in this article we use a Mallows criterion to choose optimal weights and achieve asymptotic optimality with respect to predictive squared loss. It is worth noting that, by way of comparison, our article allows for nonnested candidate models and requires a weaker assumption on the underlying smoothing parameters (see Assumption 3 in Section 2), while methods based on integrated mean squared error are computationally more costly than ours.
This article also contributes to the literature on post-averaging inference for model averaging methods. In a parametric regression framework, Zhang and Liu (2019) study the problem of inference for nested least squares averaging estimators. Moreover, Zhang et al. (2020) investigate the asymptotic distribution of least squares averaging based on a modified Mallows $C_p$ criterion, where the candidate models are also nested. Recently, Yu et al. (2021) study post-averaging inference in generalized linear models and establish the corresponding asymptotic theory. These aforementioned studies are all based on parametric candidate models and assume the candidate models are nested while, as noted above, in this article the nonparametric candidate models are allowed to be nonnested.
The rest of this article proceeds as follows. Section 2 presents the nonparametric approach of Ma, Racine, and Yang (2015) that underlies our candidate models and then derives an optimal weighting scheme for averaging across this set of candidates. Section 3 provides post-averaging inference theory for our model averaging scheme. Section 4 considers the finite-sample behavior of the proposed approach relative to popular model selection criteria along with an illustrative application, while Section 5 presents some concluding remarks. Detailed proofs and a range of simulation results appear in the supplementary materials. An R package that implements the proposed approach is available for the practitioner (Racine 2017).
Model Averaging Estimation

Consider the nonparametric regression model

$$Y_i = g(X_i, Z_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $\mu_i = g(X_i, Z_i) = E(Y_i \mid X_i, Z_i)$ is the conditional expectation of the dependent variable and $g(\cdot, \cdot)$ is an unknown smooth function. Letting $Z_{il}$ denote the $l$th component of $Z_i$, we assume that $Z_{il}$ takes $c_l$ different values in $\mathcal{D}_l = \{0, 1, \ldots, c_l - 1\}$, where $c_l$ is a finite positive integer. We assume that each $X_{il}$ is distributed on a compact interval $[a_l, b_l]$ and, without loss of generality, we take all intervals to be $[0, 1]$.

Our goal is to approximate $g(x, z)$, which is particularly useful for prediction and is also a typical object of interest in the literature on model averaging estimation (e.g., Hansen 2007; Lu and Su 2015). To this end, we use $S$ candidate nonparametric regression models to approximate Equation (1). For $s = 1, \ldots, S$, let the $s$th candidate model be

$$Y_i = g_{(s)}(X_{i,(s)}, Z_{i,(s)}) + e_{i,(s)},$$

where $X_{i,(s)}$ is a $q_s$-dimensional vector of continuous predictors, $Z_{i,(s)}$ is an $r_s$-dimensional vector of categorical predictors, $g_{(s)}(X_{i,(s)}, Z_{i,(s)}) = E(Y_i \mid X_{i,(s)}, Z_{i,(s)})$, and $e_{i,(s)} = \mu_i - g_{(s)}(X_{i,(s)}, Z_{i,(s)}) + \epsilon_i$ represents the approximation error in the $s$th model. We use $\mathcal{M}_{X,(s)} = [0, 1]^{q_s}$ and $\mathcal{M}_{Z,(s)} = \prod_{l=1}^{r_s} \mathcal{D}_l$ to denote the supports of $X_{i,(s)}$ and $Z_{i,(s)}$, respectively. Let $\mathcal{M}_X = [0, 1]^q$ and $\mathcal{M}_Z = \prod_{l=1}^{r} \mathcal{D}_l$ be the supports of $X_i$ and $Z_i$, respectively.
To provide an optimal weighting scheme, we first need to estimate each candidate model. To handle the presence of categorical predictors, we follow Ma, Racine, and Yang (2015) and estimate $g_{(s)}$ by tensor-product polynomial splines weighted by categorical kernel functions. Let $L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)})$ be a product categorical kernel function and let $B_{(s)}(x_{(s)})$ be the vector of tensor-product polynomial spline basis functions, both of which are defined below. The function $g_{(s)}(x_{(s)}, z_{(s)})$ can then be approximated by $B_{(s)}(x_{(s)})' \beta_{(s)}(z_{(s)})$, where $\beta_{(s)}(z_{(s)})$ is estimated by minimizing the following weighted least squares criterion:

$$\hat{\beta}_{(s)}(z_{(s)}) = \arg\min_{\beta} \sum_{i=1}^{n} \left[ Y_i - B_{(s)}(X_{i,(s)})' \beta \right]^2 L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)}), \qquad (4)$$

where $L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)})$ is defined in Equation (6). Thus, $g_{(s)}(x_{(s)}, z_{(s)})$ can be estimated by $\hat{g}_{(s)}(x_{(s)}, z_{(s)}) = B_{(s)}(x_{(s)})' \hat{\beta}_{(s)}(z_{(s)})$.

With the following specifications of $L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)})$ and $B_{(s)}(x_{(s)})$, one can write $\hat{\beta}_{(s)}(z_{(s)})$ defined in Equation (4) as a linear function of $Y$. First, define the univariate kernel function $l(Z_{il,(s)}, z_{l,(s)}, \lambda_{l,(s)})$ to be $\lambda_{l,(s)}$ if $Z_{il,(s)} \neq z_{l,(s)}$, and 1 otherwise, where $\lambda_{l,(s)} \in [0, 1]$ for all $l$ and $s$. The product kernel function can then be expressed as

$$L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)}) = \prod_{l=1}^{r_s} l(Z_{il,(s)}, z_{l,(s)}, \lambda_{l,(s)}) = \prod_{l=1}^{r_s} \lambda_{l,(s)}^{1(Z_{il,(s)} \neq z_{l,(s)})}, \qquad (6)$$

where $\lambda_{(s)} = (\lambda_{1,(s)}, \ldots, \lambda_{r_s,(s)})'$ is the vector of bandwidths for the categorical predictors and $1(\cdot)$ is the indicator function.

Next, we specify the tensor-product polynomial splines. Let $\mathcal{G}_{l,(s)}$ be the space of polynomial splines of order $m_{l,(s)}$ and pre-select an integer $N_{l,(s)} = N_{n,l,(s)}$. Divide $[0, 1]$ into $(N_{l,(s)} + 1)$ subintervals $I_{j_l,l,(s)} = [t_{j_l,l,(s)}, t_{j_l+1,l,(s)})$, where $\{t_{j_l,l,(s)}\}_{j_l=1}^{N_{l,(s)}}$ is a sequence of equally spaced points, called interior knots. $\mathcal{G}_{l,(s)}$ contains functions that are polynomials of degree $m_{l,(s)} - 1$ on each subinterval $I_{j_l,l,(s)}$ and that are $m_{l,(s)} - 2$ times continuously differentiable on $[0, 1]$. Let $K_{l,(s)} = N_{l,(s)} + m_{l,(s)}$, where $N_{l,(s)}$ is the number of interior knots and $m_{l,(s)}$ is the spline order, and let $B_{l,(s)}(x_{l,(s)}) = \{B_{j_l,l,(s)}(x_{l,(s)}) : 1 - m_{l,(s)} \leq j_l \leq N_{l,(s)}\}'$ be a B-spline basis system of the space $\mathcal{G}_{l,(s)}$. We define the space of tensor-product polynomial splines as $\mathcal{G}_{(s)} = \otimes_{l=1}^{q_s} \mathcal{G}_{l,(s)}$, with basis vector $B_{(s)}(x_{(s)}) = \otimes_{l=1}^{q_s} B_{l,(s)}(x_{l,(s)})$ of dimension $K_{(s)} = \prod_{l=1}^{q_s} K_{l,(s)}$.

With these specifications, the minimizer of Equation (4) can be written as a linear function of $Y$; that is,

$$\hat{\beta}_{(s)}(z_{(s)}) = \left[ \mathcal{B}_{(s)}' \mathcal{L}_{z_{(s)}} \mathcal{B}_{(s)} \right]^{-1} \mathcal{B}_{(s)}' \mathcal{L}_{z_{(s)}} Y,$$

where $\mathcal{B}_{(s)} = [B_{(s)}(X_{1,(s)}), \ldots, B_{(s)}(X_{n,(s)})]'$ and $\mathcal{L}_{z_{(s)}} = \mathrm{diag}\{L(Z_{1,(s)}, z_{(s)}, \lambda_{(s)}), \ldots, L(Z_{n,(s)}, z_{(s)}, \lambda_{(s)})\}$. Furthermore, we can estimate $\mu_{i,(s)}$ by $\hat{\mu}_{i,(s)} = B_{(s)}(X_{i,(s)})' \hat{\beta}_{(s)}(Z_{i,(s)})$. We can rewrite this expression in matrix form as $\hat{\mu}_{(s)} = P_{(s)} Y$, where $\hat{\mu}_{(s)} = (\hat{\mu}_{1,(s)}, \ldots, \hat{\mu}_{n,(s)})'$ and $P_{(s)}$ is a square matrix of dimension $n$ with the $i$th row vector being

$$B_{(s)}(X_{i,(s)})' \left[ \mathcal{B}_{(s)}' \mathcal{L}_{Z_{i,(s)}} \mathcal{B}_{(s)} \right]^{-1} \mathcal{B}_{(s)}' \mathcal{L}_{Z_{i,(s)}}. \qquad (9)$$

Let the weight vector $w = (w_1, \ldots, w_S)'$ belong to the set

$$\mathcal{W} = \left\{ w \in [0, 1]^S : \sum_{s=1}^{S} w_s = 1 \right\}, \qquad (10)$$

and define the model averaging estimator of $\mu = (\mu_1, \ldots, \mu_n)'$ as $\hat{\mu}(w) = \sum_{s=1}^{S} w_s \hat{\mu}_{(s)} = P(w) Y$, where $P(w) = \sum_{s=1}^{S} w_s P_{(s)}$.
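For concreteness, the following minimal R sketch fits a single candidate model with one continuous predictor and one categorical predictor via the weighted least squares formula above; the function name and the fixed tuning parameters are illustrative assumptions on our part, not the authors' implementation (cross-validated tuning and the general mixed-predictor case are handled by the R package noted in the Introduction).

    # Minimal sketch: kernel-weighted B-spline fit for one candidate model,
    # with one continuous predictor x in [0,1] and one categorical predictor z.
    library(splines)

    fit_candidate <- function(y, x, z, lambda = 0.5, degree = 3, nknots = 4) {
      # B-spline basis B_(s): K = degree + 1 + nknots columns
      knots <- seq(0, 1, length.out = nknots + 2)[-c(1, nknots + 2)]
      B <- bs(x, degree = degree, knots = knots, intercept = TRUE)
      fitted <- numeric(length(y))
      for (i in seq_along(y)) {
        L <- ifelse(z == z[i], 1, lambda)   # categorical kernel weights L(Z, z_i, lambda)
        W <- sqrt(L)
        # beta_hat(z_i) = (B'LB)^{-1} B'Ly via weighted least squares
        beta <- solve(crossprod(B * W), crossprod(B * W, y * W))
        fitted[i] <- drop(B[i, ] %*% beta)  # mu_hat_i = B(x_i)' beta_hat(z_i)
      }
      fitted                                # equals P_(s) %*% y
    }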

Weight Choice Criterion
So far, the weight vector $w$ in $\hat{\mu}(w)$ has not been specified. Motivated by the Mallows criterion for model averaging estimators (Hansen 2007), we propose a similar method for choosing the weight vector $w$.
Define the average squared loss by

$$L_n(w) = n^{-1} \left\| \hat{\mu}(w) - \mu \right\|^2$$

and the conditional expected loss by $R_n(w) = E[L_n(w) \mid X, Z]$, where $X = (X_1, \ldots, X_n)'$ and $Z = (Z_1, \ldots, Z_n)'$. Let the Mallows-type criterion function be

$$C_n(w) = n^{-1} \left\| Y - \hat{\mu}(w) \right\|^2 + 2 n^{-1} \mathrm{tr}\left( P(w) \Omega \right), \qquad (13)$$

where $\Omega = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$ with $\sigma_i^2 = E(\epsilon_i^2 \mid X_i, Z_i)$. It is easy to show that

$$E\left[ C_n(w) \mid X, Z \right] = R_n(w) + n^{-1} \mathrm{tr}(\Omega),$$

which suggests that, for the optimal choice of $w$ in the sense of minimizing $L_n(w)$, we can choose $w$ to minimize $C_n(w)$, noting that $n^{-1} \mathrm{tr}(\Omega)$ does not depend on $w$. Supposing for the moment that $\Omega$ is known, the optimal choice of the weight vector is given by

$$\hat{w} = \arg\min_{w \in \mathcal{W}} C_n(w), \qquad (14)$$

where $C_n(w)$ is defined in Equation (13). Then the optimal model averaging estimator of $\mu$ is $\hat{\mu}(\hat{w}) = P(\hat{w}) Y$.
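As a sketch of how the criterion can be evaluated in practice, the helper below computes $C_n(w)$ for a given weight vector from the candidate projection matrices and the diagonal of $\Omega$; the names are illustrative assumptions, not part of the authors' package.

    # Sketch: evaluate the Mallows-type criterion C_n(w) of Equation (13),
    # given a list P of the n x n candidate projection matrices P_(s) and
    # the diagonal omega_diag of Omega.
    mallows_cn <- function(w, y, P, omega_diag) {
      n  <- length(y)
      Pw <- Reduce(`+`, Map(`*`, P, w))     # P(w) = sum_s w_s P_(s)
      mu <- drop(Pw %*% y)                  # model average fit mu_hat(w)
      sum((y - mu)^2) / n + 2 * sum(diag(Pw) * omega_diag) / n
    }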

Asymptotic Optimality
In this section, we provide the theoretical results for the asymptotic optimality of our proposed model averaging estimator. Denote by $\delta_{\max}(A)$ the largest singular value of a matrix $A$. To prove asymptotic optimality, an essential step is to bound $\delta_{\max}(P_{(s)})$ and $\mathrm{tr}(P_{(s)})$. In least squares model averaging (e.g., Hansen 2007; Wan, Zhang, and Zou 2010), $P_{(s)} = X_{(s)}(X_{(s)}' X_{(s)})^{-1} X_{(s)}'$ is the conventional projection matrix, so $\delta_{\max}(P_{(s)}) = 1$ and $\mathrm{tr}(P_{(s)}) = q_s$. When considering more general or complex settings in model averaging, common treatments in the literature involve either directly assuming an upper bound for $\delta_{\max}(P_{(s)})$ and $\mathrm{tr}(P_{(s)})$ (e.g., Hansen and Racine 2012; Liu and Tao 2016; Sun et al. 2022) or imposing high-level assumptions on the projection matrix $P_{(s)}$ (e.g., Li et al. 2018; Zhang and Wang 2019). In this article, we provide a sharp bound on $\delta_{\max}(P_{(s)})$ and $\mathrm{tr}(P_{(s)})$ under some primitive and commonly made assumptions in nonparametric analysis. In particular, we show (in supplementary appendix A) that $\delta_{\max}(P_{(s)}) = 1$ unconditionally, and that $\mathrm{tr}(P_{(s)}) = O(K_{(s)})$ under some regularity conditions. These bounds resemble the least squares case. Moreover, the form of our projection matrix (9) nests those of several existing nonparametric and semiparametric models, and thus could be applied for model averaging of those models.
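These bounds are straightforward to probe numerically; the toy sketch below assembles $P_{(s)}$ row by row ($i$th row $= B_{(s)}(X_{i,(s)})' [\mathcal{B}_{(s)}' \mathcal{L}_{Z_{i,(s)}} \mathcal{B}_{(s)}]^{-1} \mathcal{B}_{(s)}' \mathcal{L}_{Z_{i,(s)}}$) on simulated data and checks its largest singular value and trace; the data and tuning constants are arbitrary illustrations.

    # Sketch: numerically probe delta_max(P_(s)) = 1 and tr(P_(s)) = O(K_(s)).
    library(splines)
    set.seed(1)
    n <- 100
    x <- runif(n); z <- rbinom(n, 1, 0.5); lambda <- 0.5
    B <- bs(x, degree = 3, df = 8, intercept = TRUE)   # K_(s) = 8 basis functions
    P <- t(sapply(seq_len(n), function(i) {
      L <- ifelse(z == z[i], 1, lambda)                # kernel weights at z_i
      drop(B[i, ] %*% solve(crossprod(B, B * L), t(B * L)))
    }))
    max(svd(P)$d)   # largest singular value: 1 per Lemma A.1 (up to rounding)
    sum(diag(P))    # trace of order K_(s)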
For the varying coefficient model averaging approach in Li et al. (2018) and the time-varying model averaging approach in Sun et al. (2022), our discrete kernel weighted diagonal matrix $\mathcal{L}_{z_{(s)}}$ corresponds to their continuous kernel weighted matrices $K[z_{(s)}]$ and $K_t$, respectively. In addition, our projection matrix is similar to that of the categorical semiparametric varying coefficient model in Ouyang, Li, and Racine (2013). Next, we introduce some notation to facilitate the assumptions and discussions that follow. Recall that $q_s$ is the dimension of the continuous covariate $X_{i,(s)}$. Let $g_{(s)}(x_{(s)}, z_{(s)})$ be $m_s$ times differentiable with respect to $x_{(s)}$ for every fixed $z_{(s)} \in \mathcal{M}_{Z,(s)}$ (see Newey 1997); specifically, for any $q_s$-dimensional vector of integers $k = (k_1, \ldots, k_{q_s})$ with $\sum_{d=1}^{q_s} k_d = m_s$, we require that the partial derivative $\partial^{m_s} g_{(s)} / \partial x_1^{k_1} \cdots \partial x_{q_s}^{k_{q_s}}$ exists and is continuous. Let $w_s^o$ be an $S \times 1$ vector in which the $s$th element is one and all others are zeros. Denote $\xi_n = \inf_{w \in \mathcal{W}} [n R_n(w)]$ and $\xi_n^0 = \max_{1 \leq s \leq S} n R_n(w_s^o)$.

Assumption 1 (Model Averaging).
(i) For some fixed integer $1 \leq G < \infty$, $E(\epsilon_i^{4G} \mid X_i, Z_i) \leq \kappa < \infty$ almost surely for some constant $\kappa$. (ii) $\xi_n \to \infty$ and $S \, \xi_n^{-2G} \sum_{s=1}^{S} [n R_n(w_s^o)]^G \to 0$ as $n \to \infty$.

Assumption 1 is typical in the literature on model averaging estimation (see, e.g., Hansen 2007; Hansen and Racine 2012; Wan, Zhang, and Zou 2010; Ando and Li 2014, by way of illustration). Assumption 1(i) imposes a finite bound on the conditional moment function. Assumption 1(ii) regulates the candidate model set in terms of the minimum and maximum expected squared loss. First, it requires $\xi_n \to \infty$, implying that there is no finite approximating model with zero bias. As mentioned in Hansen (2007), this is a conventional assumption in nonparametric regression such as ours. Second, the minimum risk $\xi_n$ should explode fast enough, whereas the maximum risk $\xi_n^0$ should not diverge too quickly. One way to satisfy Assumption 1(ii) is to remove some badly misspecified models, that is, to constrain the divergence rate of $\xi_n^0$ (Wan, Zhang, and Zou 2010).

Assumption 2 (Error Variance). There exists a constant $\underline{\sigma}^2 > 0$ such that $\sigma_i^2 \geq \underline{\sigma}^2$ almost surely for all $i$.

Assumption 2 requires that the conditional variance be bounded below by a positive constant. This condition is commonly used in the Mallows-based model averaging literature (Hansen and Racine 2012; Zhao, Zhang, and Gao 2016) as well as for models with heteroscedasticity (Andrews 1991).
Theorem 1. Under Assumptions 1 and 2, letting $\hat{w}$ be defined as in Equation (14), we have

$$\frac{L_n(\hat{w})}{\inf_{w \in \mathcal{W}} L_n(w)} \to 1$$

in probability as $n \to \infty$.

Theorem 1 shows that the practitioner may do as well as if they knew the true $\mu_i$; that is, the weight vector $\hat{w}$ is asymptotically optimal in the sense that the average loss with $\hat{w}$ is asymptotically equal to that with the infeasible optimal weight vector.
Theorem 1 considers the case where the error variance matrix $\Omega$ is known. In practice, however, $\Omega$ is often unknown, and we now turn our attention to this important case. We consider two different ways of dealing with the unknown $\Omega$. In the first, the estimator of $\Omega$ depends on $w = (w_1, \ldots, w_S)'$, while in the second, the estimator depends only on the largest model, that is, the model with the largest number of predictors $q_s + r_s$.
For the first estimator, we estimate the unknown $\Omega$ based on residuals from the model averaging estimation by

$$\hat{\Omega}(w) = \mathrm{diag}\{\hat{\epsilon}_1^2(w), \ldots, \hat{\epsilon}_n^2(w)\},$$

where $\hat{\epsilon}_i(w) = Y_i - \hat{\mu}_i(w)$. Replacing $\Omega$ with $\hat{\Omega}(w)$ in $C_n(w)$, we obtain the feasible criterion

$$C_n^{\dagger}(w) = n^{-1} \left\| Y - \hat{\mu}(w) \right\|^2 + 2 n^{-1} \mathrm{tr}\left( P(w) \hat{\Omega}(w) \right).$$

Correspondingly, the new optimal weights are defined as

$$\hat{w}^{\dagger} = \arg\min_{w \in \mathcal{W}} C_n^{\dagger}(w). \qquad (15)$$

The asymptotic optimality of the weight vector $\hat{w}^{\dagger}$ requires the following assumptions.
Assumption 3 (Regularity for Spline and Kernel). For every $s = 1, 2, \ldots, S$: (i) the marginal density of $X_{i,(s)}$ is bounded away from zero and infinity on $\mathcal{M}_{X,(s)}$; (ii) $K_{(s)} \to \infty$ and $K_{(s)}^2 / n \to 0$ as $n \to \infty$; and (iii) $\lambda_{l,(s)} \to 0$ as $n \to \infty$ for each $1 \leq l \leq r_s$.

Assumption 3 is analogous to the standard conditions for establishing asymptotic properties of spline regression and discrete kernel estimators. Though at the current stage we do not explicitly require statistical properties such as consistency and asymptotic normality of the kernel-weighted spline regression estimators, we call on these conditions in order to achieve the bounds on $\mathrm{tr}(P_{(s)})$ needed for Theorems 2 and 3 below. In particular, Assumption 3(i) is used to derive upper and lower bounds on the eigenvalues of $n^{-1} \mathcal{B}_{(s)}' \mathcal{L}_{z_{(s)}} \mathcal{B}_{(s)}$ (Ma, Racine, and Yang 2015), which are in turn used to bound $\mathrm{tr}(P_{(s)})$. Assumption 3(ii) places a restriction on the rate of $K_{(s)}$ that is similar to that required for the consistency of spline regression estimators (Newey 1997); it is also analogous to condition (12) in Wan, Zhang, and Zou (2010) and condition (C.3) in Zhao, Zhang, and Gao (2016). Assumption 3(iii) is a mild requirement on the smoothing parameters of the categorical predictors that can be found in Ma, Racine, and Yang (2015).
Theorem 2. Under Assumptions 1–3, letting $\hat{w}^{\dagger}$ be defined as in Equation (15), we have

$$\frac{L_n(\hat{w}^{\dagger})}{\inf_{w \in \mathcal{W}} L_n(w)} \to 1$$

in probability as $n \to \infty$.
The second strategy for estimating $\Omega$ is based on the largest model, that is, the model that includes all predictors (see, e.g., the estimation of the variance of homoscedastic error terms in Hansen (2007)). Taking this approach, we estimate the unknown $\Omega$ based on residuals from the largest model indexed by $s^*$; that is, $\hat{\Omega}^* = \mathrm{diag}\{\hat{\epsilon}_{1,(s^*)}^2, \ldots, \hat{\epsilon}_{n,(s^*)}^2\}$, where $\hat{\epsilon}_{(s^*)} = (I_n - P_{(s^*)}) Y$ and $s^* = \arg\max_{1 \leq s \leq S} (q_s + r_s)$. The new Mallows criterion function becomes

$$C_n^{*}(w) = n^{-1} \left\| Y - \hat{\mu}(w) \right\|^2 + 2 n^{-1} \mathrm{tr}\left( P(w) \hat{\Omega}^* \right), \qquad (16)$$

and the new optimal weights are defined as

$$\hat{w}^{*} = \arg\min_{w \in \mathcal{W}} C_n^{*}(w). \qquad (17)$$

Theorem 3. Under Assumptions 1–3, letting $\hat{w}^{*}$ be defined as in Equation (17), we have

$$\frac{L_n(\hat{w}^{*})}{\inf_{w \in \mathcal{W}} L_n(w)} \to 1$$

in probability as $n \to \infty$.
The first approach, based on Equation (15), has the advantage of not putting too much confidence in a single model. However, noting that $C_n^{\dagger}(w)$ is a cubic function of $w$ since $\hat{\Omega}(w)$ contains $w$, the optimization problem is computationally costly. The second approach, based on Equation (17), is a standard quadratic programming problem because $\hat{\Omega}^*$ is independent of $w$. It can therefore be solved in a computationally efficient manner using off-the-shelf quadratic programming algorithms and would accordingly be more appealing to practitioners from a practical perspective. In this article we investigate both approaches and, based on simulation studies, we recommend the second approach; in our experience it appears to perform slightly better on mean squared error grounds and is computationally cheaper (see Table 5 in supplementary appendix G for relative mean squared error performance). Proofs of Theorems 1–3 can be found in supplementary appendix A.
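To illustrate the second approach, the sketch below casts the minimization of $C_n^{*}(w)$ over the weight simplex as a quadratic program and solves it with the quadprog package; the inputs mu_hat (the $n \times S$ matrix of stacked candidate fits $\hat{\mu}_{(s)}$) and tvec (the vector of $\mathrm{tr}(P_{(s)} \hat{\Omega}^*)$ terms) are assumed precomputed, and the small ridge jitter is a numerical safeguard of ours rather than part of the procedure.

    # Sketch: w* = argmin_{w in W} C*_n(w) as a standard quadratic program.
    library(quadprog)

    optimal_weights <- function(y, mu_hat, tvec) {
      n <- length(y); S <- ncol(mu_hat)
      D <- 2 * crossprod(mu_hat) / n + diag(1e-8, S)  # quadratic term (jittered)
      d <- drop(2 * (crossprod(mu_hat, y) - tvec) / n)
      A <- cbind(rep(1, S), diag(S))                  # sum(w) = 1; w_s >= 0
      solve.QP(Dmat = D, dvec = d, Amat = A,
               bvec = c(1, rep(0, S)), meq = 1)$solution
    }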

Post-Model-Averaging Inference
When the true model is nested within the candidate model set, we label it model $\bar{s}$, in which case $g_{(\bar{s})}(x_{(\bar{s})}, z_{(\bar{s})}) = g(x, z)$.
A natural question in this instance is how to conduct inference about $g_{(\bar{s})}(x_{(\bar{s})}, z_{(\bar{s})})$ based on the model averaging estimator $\hat{g}(\tilde{w}, x, z) = \sum_{s=1}^{S} \tilde{w}_s \hat{g}_{(s)}(x_{(s)}, z_{(s)})$, where $\tilde{w}$ is an estimator for $w$ proposed in Section 2. The study of this type of problem is referred to as post-averaging inference and is known to be quite difficult in general. The difficulty arises because $\tilde{w}$ and $\hat{\beta}_{(s)}(z_{(s)})$ are estimated from the same sample, and the noise terms (i.e., the $\epsilon_i$'s) are embedded in these statistics in a very complicated manner.
In this section, we aim to establish the asymptotic theory for $\hat{g}(\tilde{w}, x, z)$. Letting $p_s = m_s + 1$, we introduce the following regularity conditions. Assumption 4.
Assumption 4(i) places some requirements on the number of knots, whereas Assumption 4(ii) is a standard assumption in the literature (see, e.g., Assumption A.4 and Theorem 2 of Ma, Racine, and Yang 2015).
We introduce the notion of the quasi-true value as follows.
Definition 1. For each $z_{(s)} \in \mathcal{M}_{Z,(s)}$, the quasi-true value of $\beta_{(s)}$ is

$$\beta_{(s)}^{*}(z_{(s)}) = \arg\min_{\beta} \sum_{i=1}^{n} \left[ \mu_i - B_{(s)}(X_{i,(s)})' \beta \right]^2 L(Z_{i,(s)}, z_{(s)}, \lambda_{(s)}).$$

From Definition 1, it is readily seen that $\beta_{(s)}^{*}(z_{(s)}) = [\mathcal{B}_{(s)}' \mathcal{L}_{z_{(s)}} \mathcal{B}_{(s)}]^{-1} \mathcal{B}_{(s)}' \mathcal{L}_{z_{(s)}} \mu$. In light of this, we can also define the corresponding quasi-true value for the mean function of the $s$th model as $\mu_{(s)}^{*}(x_{(s)}, z_{(s)}) = B_{(s)}(x_{(s)})' \beta_{(s)}^{*}(z_{(s)})$. It then follows immediately from Lemma A.9 of Ma, Racine, and Yang (2015) that, under Assumptions 1(i), 3(i), and 4, for any given $x_{(s)}$ and $z_{(s)}$, the difference between $\hat{g}_{(s)}(x_{(s)}, z_{(s)})$ and $\mu_{(s)}^{*}(x_{(s)}, z_{(s)})$ vanishes asymptotically. Denote $\mu_{(s)}^{*} = [\mu_{(s)}^{*}(X_{1,(s)}, Z_{1,(s)}), \ldots, \mu_{(s)}^{*}(X_{n,(s)}, Z_{n,(s)})]'$; it is seen that $\mu_{(s)}^{*} = P_{(s)} \mu$. In Lemma A.1, we show that $\delta_{\max}(P_{(s)}) = 1$, which indicates that $\| \mu_{(s)}^{*} \| \leq \| \mu \|$ for $1 \leq s \leq S$. In this article we say that model $s$ is correctly specified if

$$\sup_{(x, z) \in \mathcal{M}_X \times \mathcal{M}_Z} \left| \mu_{(s)}^{*}(x_{(s)}, z_{(s)}) - g(x, z) \right| = O\left( \sum_{l=1}^{q_s} h_{l,(s)}^{m_s} + \sum_{l=1}^{r_s} \lambda_{l,(s)} \right), \qquad (21)$$

where $h_{l,(s)} = N_{l,(s)}^{-1}$. If (21) does not hold, we say that model $s$ is misspecified. For example, if model $s$ omits some important relevant regressors, this usually leads to biased estimation results, and the bias will not disappear even when $n \to \infty$. By the above definition, both the true model $\bar{s}$ and any overspecified model, whose regressor set contains $(x_{(\bar{s})}, z_{(\bar{s})})$ as a proper subset, are correctly specified models, because an overspecified model also leads to an asymptotically unbiased and consistent estimator of $g(x_{(\bar{s})}, z_{(\bar{s})})$. Now define

$$C_n^{0}(w) = n^{-1} \left\| Y - \hat{\mu}(w) \right\|^2 + 2 n^{-1} \mathrm{tr}\left( P(w) \hat{\Omega} \right),$$

where $\hat{\Omega}$ is an estimator of $\Omega$ that satisfies $\| \hat{\Omega} - \Omega \|_{\infty} = O_{a.s.}(1)$, and $\| \cdot \|_{\infty}$ represents the maximum absolute row sum norm. Then $C_n^{\dagger}(w)$ and $C_n^{*}(w)$ proposed in Section 2 can be viewed as special cases of $C_n^{0}(w)$ under different choices of $\hat{\Omega}$. Define $\tilde{w} = \arg\min_{w \in \mathcal{W}} C_n^{0}(w)$; then $\tilde{w}$ nests $\hat{w}^{\dagger}$ and $\hat{w}^{*}$ proposed in Section 2 as special cases. In the remainder of this section, we work with $C_n^{0}(w)$ to study the properties of $\tilde{w}$. We first present a few lemmas that provide some intermediate results.
Lemma 1. Under Assumptions 1(i), 3(i), and 4, we have: The proof of Lemma 1 is provided in supplementary appendix C.
We have the following results regarding $A_n$ and $b_n$.
Lemma 2. Under Assumptions 1(i), 3(i), and 4, we have: The proof of Lemma 2 is provided in supplementary appendix C.
With Lemmas 1 and 2, we are ready to investigate the asymptotic properties of $\tilde{w}$. To this end, without loss of generality, we assume that the first $\bar{s} - 1$ models are misspecified and that models $\bar{s}, \ldots, S$ are correctly specified. Denote by $\eta_{\min}(A)$ the smallest eigenvalue of a matrix $A$. We need the following assumption.
Assumption 5 places some restrictions on the correctly specified models and on the misspecified models, and basically requires that the difference between the misspecified models and the $S$th model (a correctly specified model) is asymptotically nonignorable. A similar condition has been imposed in Yu et al. (2021). We have the following result.
Lemma 3. If Assumptions 1(i), 3(i), 4, and 5 are satisfied, then the weights $\tilde{w}_s$, $1 \leq s \leq \bar{s} - 1$, assigned to the misspecified models converge to zero in probability. The proof of Lemma 3 is provided in supplementary appendix C. Lemma 3 shows that the weights corresponding to the misspecified models converge to zero at a rate faster than the standard nonparametric estimation rate of Ma, Racine, and Yang (2015). This property plays a pivotal role in the derivation of the asymptotic distribution of $\hat{g}(\tilde{w}, x, z)$. Now, we introduce some additional notation. For each fixed $x_{(s)}$ and $z_{(s)}$, denote by $p_{x_{(s)}, z_{(s)}}$ the $n \times 1$ vector such that $\hat{g}_{(s)}(x_{(s)}, z_{(s)}) = p_{x_{(s)}, z_{(s)}}' Y$. Moreover, denote $\rho_{i,j}^{(s,t)}$ and $p_{x_{(s)}, z_{(s)}, i}$ as the $(i, j)$th entry of $P_{(s)} P_{(t)}$ and the $i$th entry of $p_{x_{(s)}, z_{(s)}}$, respectively. We need the following assumption. Assumption 6. For every $1 \leq s, t \leq S$:
(i) There is a constant $c_1$ such that $\max_{1 \leq i \leq n} |\rho_{i,i}^{(s,t)}| \leq c_1 n^{-1} \mathrm{tr}(P_{(s)} P_{(t)})$. (ii) For every $x_{(s)}$ and $z_{(s)}$, there is a constant $c_1$ such that $\max_{1 \leq i \leq n} p_{x_{(s)}, z_{(s)}, i}^2 \leq c_1 n^{-1} \sum_{i=1}^{n} p_{x_{(s)}, z_{(s)}, i}^2$. Assumption 6 regulates the behavior of the spline basis and excludes situations of extremely unbalanced designs, where a single observation remains relevant asymptotically (Hansen and Racine 2012). Similar assumptions have also been employed and discussed in the literature (e.g., in Andrews 1991).
Assumption 7. For each $1 \leq s \leq S$, the density function of $\epsilon_i$ and the probability mass function of $Z_{i,(s)}$ are bounded below by $c_0$ on their respective supports, where $c_0$ is a positive constant.
Assumption 7 imposes some mild requirements on the probability (density) functions of $\epsilon_i$ and $Z_{i,(s)}$.
Lemma 4. Suppose that for some fixed integer $G \geq 1$, $E(\epsilon_i^{4G} \mid X_i, Z_i) \leq c_1$ almost surely for some positive constant $c_1$. Furthermore, assume that $m_{1,n} \to m_{0,1}$ and $m_{2,n} \to m_{0,2}$, where $m_{0,1}$ and $m_{0,2}$ are $S^* \times 1$ constant vectors, and that $\Omega_n \to \Omega_0$ for some positive definite constant matrix $\Omega_0 \in \mathbb{R}^{S^* \times S^*}$. Under Assumptions 3(i) and 4–7, we have The proof of Lemma 4 is provided in supplementary appendix C. The proof of the lemma relates closely to a new central limit theorem for stochastic linear-quadratic forms established in supplementary appendix D.
Theorem 4. Under the assumptions that guarantee Lemma 4, we have, for each given $x$ and $z$, as $n \to \infty$, where $\nu_{s,(\bar{s})}^{*}(\gamma_0)$ is the $s$th entry of $\nu_{(\bar{s})}^{*}(\gamma_0)$ and $\gamma_{0,s}$ is the $s$th entry of $\gamma_0$ defined in Lemma 4.
The proof of Theorem 4 is provided in supplementary appendix C. The theoretical framework adopted in our proof differs from those considered in Zhang and Liu (2019) and Yu et al. (2021): in those studies the parameter dimension is fixed, whereas in the current study the dimension $K_n$ of the regression coefficient vector tends to infinity as $n \to \infty$. In Theorem 4, the limiting distribution obtained for the model averaging estimator is nonstandard. This is because, when there are multiple correct models, the model averaging procedure tends to impose additional uncertainty on the resulting estimator.
Furthermore, consider the important situation in which there is a unique correctly specified model (i.e., model $\bar{s}$, based on $(X_{(\bar{s})}, Z_{(\bar{s})})$). In this scenario we have the following standard asymptotic normality theory for the model averaging estimator.
Corollary 1. Suppose that Assumptions 1(i), 3(i), 4, and 5 are satisfied, together with a further regularity condition given in supplementary appendix E. Then, for each given $x$ and $z$, the confidence interval

$$\left[ \hat{g}(\tilde{w}, x, z) - z_{1-\alpha/2} \, \hat{v}_{(\hat{s})}^{1/2}, \; \hat{g}(\tilde{w}, x, z) + z_{1-\alpha/2} \, \hat{v}_{(\hat{s})}^{1/2} \right]$$

has asymptotic coverage probability $1 - \alpha$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2) \times 100\%$ quantile of the standard normal distribution, $\hat{v}_{(\hat{s})}$ is a consistent estimator of $v_{(\bar{s})}$ under model $\hat{s}$, and $\hat{s} = \arg\max_{1 \leq s \leq S} \tilde{w}_s$.
The proof of Corollary 1 is given in supplementary appendix E.
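Under the single-correct-model condition of Corollary 1, the interval is simple to assemble in practice; the sketch below assumes ghat is the model averaging fit at a given (x, z) and vhat is a consistent variance estimate obtained under the model receiving the largest weight.

    # Sketch: pointwise (1 - alpha) confidence interval per Corollary 1.
    ci_mma <- function(ghat, vhat, alpha = 0.05) {
      z <- qnorm(1 - alpha / 2)
      c(lower = ghat - z * sqrt(vhat), upper = ghat + z * sqrt(vhat))
    }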

Monte Carlo Assessment of Finite-Sample Performance
We consider two Monte Carlo simulation experiments designed to assess the finite-sample performance of the proposed methods relative to a representative set of competing methods, including equally weighted model averaging, a set of popular model selection techniques, and selection of the largest model (i.e., the model with the largest number of predictors).

Case (I)
Case (I) is a setting in which the candidate models are underspecified. For Case (I), data are simulated from a data generating process (DGP) in which $x_1$, $x_2$, and $x_4$ are continuously distributed as $U[-1, 1]$, while $x_3$ has discrete support, generated as a binomially distributed random variate drawn from three independent Bernoulli trials with probability of success $\pi = 1/2$. The conditional mean of the DGP is identical to that of Case (II) below except that it includes additional terms involving $x_4$, which does not appear in any candidate model. The error $\epsilon$ is iid, and the variance of $\epsilon$ is set so that the signal/noise ratio is such that the expected $R^2$ for a correctly specified model would be (0.95, 0.80, 0.50, 0.20), which corresponds to $\sigma$ = (0.25, 0.50, 1.00, 2.00) times the standard deviation of the systematic component of the DGP. Results are summarized in Table 1 (model average weights are summarized in Table 3 in supplementary appendix F).
For Case (I) we estimate six candidate models, each defined over a subset of the predictors $x_1$, $x_2$, and $x_3$ (note that the model $y_i = g_3(x_{3i}) + \epsilon_i$ is not included, since $x_3$ is discrete and we need at least one continuous predictor in the kernel-weighted spline specification). For each of these six models, we use cross-validation to select the degree of the tensor spline and the smoothing parameter for the discrete predictor, when present, and then estimate the model by nonparametric series methods using kernel-weighted B-splines as outlined above. For the proposed approach we average these six estimators by assigning weights $w_1, \ldots, w_6$ chosen via the Mallows criterion outlined above. In particular, we choose weights to minimize the objective function in Equation (17), in which the unknown $\Omega$ is estimated based on residuals from the largest model.

Case (II)
Case (II) is a setting in which the candidate models contain the true model, which also coincides with the largest model. The DGP is given by $y = x_1 + x_2 + x_1 x_2 + x_1^2 + x_2^2 + x_3 + x_1 x_3 + x_2 x_3 + \epsilon$, which is identical to Case (I) except that $x_4$ no longer appears, and the residual variance is set per Case (I). We use the same set of candidate models as per Case (I), and for each of the six models we use the delete-one cross-validation method to select the spline degree and kernel bandwidth, when present; we then average over these estimates and select weights to minimize the Mallows objective function as per Case (I). Results are summarized in Table 2 (model average weights are summarized in Table 4 in supplementary appendix F).
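For concreteness, the following sketch draws one sample from the Case (II) DGP; setting $\sigma$ equal to a multiple of the standard deviation of the systematic component delivers the expected $R^2$ values quoted for Case (I).

    # Sketch: one draw from the Case (II) DGP.
    sim_case2 <- function(n, sigma_mult = 1.0) {
      x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
      x3 <- rbinom(n, size = 3, prob = 0.5)        # three Bernoulli trials
      g  <- x1 + x2 + x1 * x2 + x1^2 + x2^2 + x3 + x1 * x3 + x2 * x3
      y  <- g + rnorm(n, sd = sigma_mult * sd(g))  # sigma = sigma_mult x sd(g)
      data.frame(y, x1, x2, x3)
    }
    dat <- sim_case2(n = 400, sigma_mult = 0.5)    # expected R^2 of about 0.80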

Discussion
From Table 1 we observe that our proposed model averaging approach has the smallest estimation MSE for all cases considered, with the minor exception of equally weighted model averaging (EMA) when both the sample size and the signal/noise ratio are small (i.e., n < 400 and σ > 0.5). In general, the gain from our model averaging method is more substantial for smaller sample sizes with smaller signal/noise ratios.
Furthermore, Table 2 shows that, when the true model is included in the set of candidate models, the use of model averaging can outperform model selection in small-sample settings, as Case (II) reveals.

An Empirical Illustration
We consider a panel of annual observations for six U.S. airlines for the 15-year period 1970 to 1984, taken from the Ecdat R package (Croissant and Graves 2020), which consists of 90 observations in total. The variables in the panel include airline (airline), year (year), the logarithm of total cost in $1000 (lcost), the logarithm of an output index in revenue passenger miles (loutput), the logarithm of the price of fuel (lpf), and the load factor, that is, the average capacity utilization of the fleet (lf). These data have been used to model a historical cost function for the U.S. airline industry, and we direct the interested reader to Greene (2003), Table F7.1, for further details.
The six candidate models that we use in the analysis below are as follows (we provide the R formulas fed to each candidate model, where the variable to the left of the tilde is the dependent variable and the variables to the right are the predictors used in the nonparametric candidate spline-based procedure described above):

Model 1: lcost ~ loutput
Model 2: lcost ~ year + lpf
Model 3: lcost ~ loutput + lpf
Model 4: lcost ~ lf + lpf + loutput + airline
Model 5: lcost ~ year + lf + lpf + airline
Model 6: lcost ~ year + loutput + lpf + lf + airline

We treat airline as an unordered factor (i.e., $z_{(s)}$); thus, candidate models that include the unordered factor airline are essentially nonparametric fixed effects models where the fixed effects are modeled via an unordered categorical kernel. The remaining predictors are treated as continuous (i.e., $x_{(s)}$).
We are interested in the predictive performance of model averaging and model selection based on the above candidate models. We therefore fit each of the candidate models on data spanning the years 1970 through 1982 and evaluate predictive performance on the years 1983 and 1984 (78 estimation observations and 12 evaluation observations). The evaluation data were not used to fit the models and hence serve as a useful yardstick by which to measure predictive performance.
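As a sketch of this estimation/evaluation split, the code below constructs the logged variables from the raw Ecdat Airline columns, holds out the final two years, fits one of the six candidates (Model 4) with the crs package (which implements the categorical regression splines of Ma, Racine, and Yang), and computes its predictive squared error; this is an illustration under our assumptions about the raw column names, not the paper's replication code.

    # Sketch: airline panel, hold out the last two years (1983-1984), fit one
    # candidate model, and compute its predictive squared error.
    library(Ecdat)   # Airline panel (Croissant and Graves 2020)
    library(crs)     # categorical regression splines

    data(Airline)
    Airline$lcost   <- log(Airline$cost)
    Airline$loutput <- log(Airline$output)
    Airline$lpf     <- log(Airline$pf)
    Airline$airline <- factor(Airline$airline)

    last2 <- tail(sort(unique(Airline$year)), 2)    # evaluation years
    train <- subset(Airline, !(year %in% last2))    # 78 estimation observations
    test  <- subset(Airline, year %in% last2)       # 12 evaluation observations

    m4  <- crs(lcost ~ lf + lpf + loutput + airline, data = train)
    pse <- mean((test$lcost - predict(m4, newdata = test))^2)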
The model average weights accorded to each of the six models outlined above by the proposed nonparametric Mallows model averaging (MMA) method are 0.002395, 0.0006264, 0.1259, 0.3092, 0.5149, and 0.04684, respectively. The models selected via AIC, BIC, CV, and $C_p$ are models 5, 4, 4, and 5, respectively.
Predictive squared error was computed as $n_2^{-1} \sum_{i} (y_i - \hat{y}_i)^2$, where $n_2$ is the size of the evaluation dataset, $y_i$ denotes the lcost values from the evaluation dataset, and $\hat{y}_i$ is either the predicted value for the evaluation dataset generated from one candidate model fitted on the estimation data using the predictors from the evaluation dataset, or the predicted value from a model averaging procedure constructed in the same manner.
The relative predictive squared error for the equally weighted model average (EMA) approach is 13.32; for the AIC, BIC, CV, and $C_p$ model selection procedures it is 1.976, 4.164, 4.164, and 1.976, respectively; and for the largest model (L) it is 10.27 (numbers > 1 indicate poorer predictive squared error performance relative to the proposed MMA approach). By way of comparison, the relative predictive squared error values for each of the six candidate models are 89.16, 555.6, 11.29, 4.164, 1.976, and 10.27. Though the reader may feel that it makes little sense to, say, run a nonparametric analysis on models with a large number of covariates with such a small sample size, it is evident that model averaging is able to outperform each of the candidate models and its peers, which is of course the point we wish to make.
Based on these results, it is evident that the proposed method is capable of outperforming model selection methods, which is consistent with our Monte Carlo simulation results.

Summary and Possible Extensions
We propose a novel model averaging approach in which the candidate models are based on a recently proposed kernel-weighted spline regression method that admits both continuous and categorical predictors (Ma, Racine, and Yang 2015). We provide theoretical underpinnings for optimal model average weights in this setting, develop the corresponding post-averaging inference theory, assess finite-sample performance, and present an empirical illustration. The proposed model averaging approach is capable of outperforming a range of popular model selection strategies, and may therefore be of interest to practitioners who wish to confront model uncertainty in applied settings. An R package that implements the proposed approach is available for practitioners (Racine 2017).
In this article we only consider the case where both $X_i$ and $Z_i$ are finite-dimensional vectors. One possible extension is to allow the dimension of $X_i$ (or $Z_i$) to grow slowly as $n$ increases. Letting $S$ be the number of models we average over, $S$ then also diverges as $n \to \infty$. We conjecture that the main results of the article remain valid when allowing $S$ to diverge as $n$ increases, provided that $S$ diverges sufficiently slowly. A rigorous verification of this conjecture, however, lies beyond the scope of this article. In Theorem 4, we establish the inference theory of the proposed model averaging estimator, where the asymptotic distribution is nonstandard. This nonstandard distribution also relies on the unknown index $\bar{s}$. Therefore, implementation of the confidence interval based on Theorem 4 demands further effort. In Corollary 1 we describe a special case of Theorem 4 that admits an easily implementable computation of the confidence interval. However, this result requires that there is only one correctly specified model, ruling out the case where both the true model and over-specified models are nested within the set of candidate models. In a recent study focusing on generalized linear models, Yu et al. (2021) proposed a simulation-based algorithm for confidence interval estimators and investigated its asymptotic properties. However, due to the complexity of kernel-weighted spline regression, how to extend the result of Yu et al. (2021) to our context remains terra incognita, and we identify this as a promising direction for future research.

Table 1. Relative MSE, Case (I); numbers > 1 indicate inferior MSE performance relative to the proposed model averaging approach.

Table 2. Relative MSE, Case (II); numbers > 1 indicate inferior MSE performance relative to the proposed model averaging approach.