Nonparametric instrument model averaging

We present a new nonparametric model averaging approach to instrumental variable (IV) regression in which the effects of multiple instruments on the endogenous variable are modelled as nonparametric functions in the reduced form equations. Even when individual IVs have weak and nonlinear relevance to the exposure, the proposed model averaging is able to combine their effects with optimal weights to produce valid inference. Our analysis covers both the case in which the number of IVs is fixed and the case in which the dimension of the IVs diverges with the sample size. This framework can be especially beneficial in practical situations involving weak IVs, since many recent observational studies collect a large number of instruments whose quality ranges from poor to strong. Numerical studies are carried out to compare the proposed method with a wide range of existing alternatives.


Introduction
Theory and methods for estimating causal effects in the presence of unmeasured confounders from observational studies are abundant and still growing (cf. Austin 2011; McCaffrey et al. 2013; Imai, Tingley, and Yamamoto 2013; Austin and Stuart 2015, among others). An attractive solution to the endogeneity issue is to adopt a so-called instrumental variable (IV), for which a two-stage least squares (2SLS) estimation procedure is usually employed. The inference of 2SLS crucially depends on the quality of the available instruments. In many modern datasets we may collect a large number of potentially useful IVs, but their quality can vary widely. How to conduct satisfactory inference with many weak IVs is an important research question in statistics. We propose a nonparametric model averaging approach to contribute towards this topic.
A qualified IV should satisfy the relevance condition, with a moderate to high degree of correlation with the endogenous variables (Greenland 2000). However, when facing a large number of IVs this assumption may be violated for some instruments. We stress that the use of strong instruments is essential for valid inference using 2SLS (Bound, Jaeger, and Baker 1995; Kline 1998). Weak IVs are known to undermine causal inference and to produce misleading results (Nelson and Startz 1990). See Donald and Newey (2001), Hall and Peixe (2003), Hall, Inoue, Jana, and Shin (2007) and Hansen, Hausman, and Newey (2008) for some related developments. All of these works examine only the linear association between the IVs and the endogenous variable. In this work we relax the linearity assumption and consider a more general functional dependence in the reduced form equation. In fact, a relevant IV with a strong nonlinear effect on the endogenous variable may appear weak when mistakenly used in a linear model. Our methodology thus offers greater flexibility for IV regression.
Strengthening instruments often comes at the cost of discarding sample data (Small and Rosenbaum 2008; Baiocchi, Small, Lorch, and Rosenbaum 2010; Zubizarreta, Small, Goyal, Lorch, and Rosenbaum 2013; Keele and Morgan 2016). In comparison, model averaging does not remove data points directly and can therefore incorporate weak IVs more efficiently. Compared to model selection, model averaging is less sensitive to model mis-specification and structural assumptions, and it avoids the need to defend the choice of a single 'best' model. A key step for model averaging is to estimate the optimal weights assigned to sub-models (Sloughter, Gneiting, and Raftery 2010; Koop and Korobilis 2012; Wang, Ma, Wei, and Wu 2016). Some authors have also considered model averaging for the traditional 2SLS setting. For example, within the linear instrumental variable regression framework, Martins and Gabriel (2014) proposed obtaining the model averaging weights via a direct smoothing of information criteria attained in the two-stage least squares estimation procedure. Other related approaches include Clyde, Ghosh, and Littman (2011), Koop, Leon-Gonzalez, and Strachan (2012), Yu, Wang, Chen, and Wei (2014), Zeugner and Feldkircher (2015) and Burgess, Zuber, Gkatzionis, and Foley (2018), among many others. Nonparametric model averaging has only become popular in recent years (Li, Linton, and Lu 2015; Li, Xia, Wong, and Nott 2018; Huang and Li 2018; Fang, Li, and Xia 2022; Li, Lv, Wan, and Liao 2022). To our knowledge, however, there has been no work applying nonparametric model averaging to IV regression.
Our proposed nonparametric model averaging method can accommodate high-dimensional IVs. The recent use of genetic biomarkers, typically single nucleotide polymorphisms (SNPs), as instruments in Mendelian randomisation studies (Didelez and Sheehan 2007; Wehby, Ohsfeldt, and Murray 2008; Lawlor, Harbord, Sterne, Timpson, and Davey Smith 2008; Lin, Feng, and Li 2015) has boosted research on IVs of very high dimension. Belloni, Chernozhukov, and Hansen (2014) offered an overview of how model selection methods can be adapted to provide better inference on causal parameters. Belloni, Chen, Chernozhukov, and Hansen (2012) proposed the post-Lasso estimator to alleviate the shrinkage bias. The adaptive Lasso method has also been adopted to select instruments in the first stage of 2SLS (Caner and Fan 2010). Fan and Liao (2014) considered the endogeneity issue with a diverging number of covariates. Lin et al. (2015) proposed regularisation methods for high-dimensional IVs and applied their approaches in genomics. Guo, Kang, Tony Cai, and Small (2018) showed that the Durbin-Wu-Hausman (DWH) test maintains the correct size with high-dimensional covariates and further proposed a specification test to improve the power. Seng and Li (2022) recently considered model averaging for structural equation models and offered some empirical findings under a high-dimensional IV setting. However, none of these authors considered nonparametric functional dependence on the IVs. With a diverging number of IVs, we explicitly allow the number of nonparametric IV models to grow with the sample size. A regularised step is adopted to determine the model weights, and a few familiar penalty functions are recommended for this purpose. Our procedure is relatively easy to implement and also enjoys solid theoretical support for analysis and inference.

Methods
We suppose that all variables are centralised and therefore leave intercept terms out in this paper. The true model for explaining the causal effect of the covariate X on the response variable Y is
Y = Xβ + ε,   (1)
where X is an endogenous variable correlated with the model error ε. The relationship could be fully characterised by the conditional expectation if we assumed E(ε | X) = 0. However, in numerous structural econometric models and observational studies in biostatistics, the conditional expectation function is not the parameter of interest. The structural or causal parameter β ∈ R describes the relation between Y and X, where X is endogenous and so E(ε | X) ≠ 0. The coefficient β is of interest as a major measure of the causal effect of X on Y in the presence of unmeasured confounding effects from ε. We do not involve additional exogenous variables, to simplify the notation and presentation. Such covariates can be handled easily by adding a projection step in practice.
When the unobserved error term ε is correlated with the endogenous variable X, estimating β by ordinary least squares (OLS) from a linear regression of Y on X may lead to biased estimates. The two-stage least squares (2SLS) method can be used to obtain a consistent estimator of β. The standard IV methodology applies a traditional linear regression to X. In this paper we instead consider nonparametric regression of the endogenous variable X on the instruments. Thus, suppose there are d instrumental variables Z_1, Z_2, . . ., Z_d, and consider a fully nonparametric reduced form equation for X:
X = m(Z_1, . . ., Z_d) + e,   (2)
where m : R^d → R is an unknown function and e is the random error term.
The following model conditions are assumed to hold in this paper.
(C1) Z and X are dependent, i.e. m is not a constant function.
(C2) The instruments are exogenous with respect to the structural model (1), i.e. E(ε | Z) = 0, so that Z affects Y only through X.
(C3) E(e | Z) = 0 in the reduced form equation (2).
The joint conditions (C1) and (C2) specify the relationship between Z and the variables in model (1). When they hold, Z is a relevant predictor of Y whose effect operates through X only, which implies the exogeneity of the instruments relative to the structural model (1). These are the standard requirements for 2SLS results to be valid. Condition (C3) implies that the exogeneity of Z holds in the reduced form, so that standard nonparametric regression can be applied to fit the first-stage model.
We stress that these conditions should be checked as far as possible in a real-world study. Specifically, to verify (C1), standard statistics such as the Pearson correlation for a marginal association, the first-stage F-statistic, or the 'partial R²' of the IVs in the first stage can serve as useful indicators of IV quality. A common test for (C2) is the well-known J-test of overidentifying restrictions, and there are many recent developments on this issue in the literature; see, for example, Kang, Zhang, Cai, and Small (2016). (C3) is usually assumed so that the IVs are not related to unmeasured variables that affect the exposure. It is also part of IV validity and can be checked in a similar fashion to (C2). However, unlike (C1), the validity assumption often cannot be completely tested, and sensitivity analyses have to be carried out case by case.
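As a brief illustration of how the relevance condition (C1) can be screened in practice, the following R sketch computes the first-stage F-statistic and the partial R² of a single candidate IV. The data are simulated stand-ins and the object names (x, z) are illustrative only.

```r
set.seed(1)
n <- 500
z <- rnorm(n)                      # candidate instrument
x <- 0.2 * z + rnorm(n)            # exposure with weak-to-moderate linear relevance

fit0 <- lm(x ~ 1)                  # first-stage model without the IV
fit1 <- lm(x ~ z)                  # first-stage model with the IV

# First-stage F-test of H0: the IV has no effect on the exposure
anova(fit0, fit1)

# Partial R^2: share of the exposure variation explained by the IV
partial_r2 <- 1 - sum(residuals(fit1)^2) / sum(residuals(fit0)^2)
partial_r2
```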

Nonparametric instrument model: single IV
In this part we consider the simple case where d = 1. We give results for the single-IV case first and proceed to the case with multiple IVs in the next subsection. Suppose that we are given independent observations of sample size n on the response, the exposure and the instrument, (Y_i, X_i, Z_i), i = 1, . . ., n. Let 𝒵 be the support of the instrumental variable Z and assume that 𝒵 is a compact subset of R. We use a J_n-dimensional equispaced B-spline basis on 𝒵, denoted by B(z) = (B_1(z), . . ., B_{J_n}(z))ᵀ, to approximate the function m(z), that is, m(z) ≈ B(z)ᵀθ, where θ = (θ_1, . . ., θ_{J_n})ᵀ is a vector of coefficients of B(z). See Schumaker (2007) for the definition and properties of the B-spline basis.
To establish inference results, we need to assume the following conditions.
(C4) Let s > 0 be an integer, α ∈ (0, 1] be a constant such that r = s + α > 2, and L > 0 be a positive constant. We use H(r, L) to denote the set of functions on 𝒵 such that, for every h ∈ H(r, L), the sth derivative of h, denoted by h^(s), exists and satisfies the following Hölder condition of order α:
|h^(s)(z_1) − h^(s)(z_2)| ≤ L|z_1 − z_2|^α for all z_1, z_2 ∈ 𝒵.
We assume that Z is a random variable on 𝒵 with density function bounded away from 0 and infinity, and that m ∈ H(r, L).
(C5) We take J_n = ⌊c n^{1/(2r+1)}⌋ for some positive constant c, where ⌊t⌋ is the largest integer no greater than t.
In matrix notation, denote m = (m(Z_1), . . ., m(Z_n))ᵀ and B = (B(Z_1), . . ., B(Z_n))ᵀ. From Equation (2), we have the linear model X = m + e ≈ Bθ + e, where e = (e_1, e_2, . . ., e_n)ᵀ. We then obtain the predicted value m̂ = Bθ̂ in the first stage of 2SLS, where θ̂ = (BᵀB)⁻¹BᵀX is the OLS estimator, and m̂ is used in place of X in the second stage when fitting the linear model (1). After a standard derivation, the 2SLS estimator for β is given by
β̂ = (m̂ᵀm̂)⁻¹ m̂ᵀY.
Now we are ready to state the consistency and asymptotic normality of β̂ under the preceding conditions. We denote the normal distribution with mean η and covariance Σ by N(η, Σ); by '→_P' we mean convergence in probability, and by '→_D' convergence in distribution. Further assume conditions (C6) to (C8) below.
Assumptions (C6)-(C8) are standard regularity assumptions that ensure the asymptotic results in Theorem 2.1. Given that m ∈ H(r, L) and Z_1, . . ., Z_n are independent, (C6) can be shown to hold via standard concentration arguments; see, for example, Lemma A.2 of Huang, Wu, and Zhou (2004). (C7) and (C8) hold when the noises ε_1, . . ., ε_n are independently normally distributed.
We summarise our results below.
Theorem 2.1: Under conditions (C1) to (C8), as n → ∞, we have β̂ →_P β and √n(β̂ − β) →_D N(0, u⁻¹σ²).
In practice the asymptotic variance n⁻¹u⁻¹σ² can be estimated by a plug-in method and computed as (m̂ᵀm̂)⁻² m̂ᵀε̂ε̂ᵀm̂, where ε̂ = Y − Xβ̂.
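To make the single-IV procedure concrete, here is a minimal R sketch of the estimator and the plug-in variance described above, using splines::bs for the B-spline basis. The simulated data, the choice c = 2 and r = 2 in (C5), and all object names are illustrative assumptions rather than the authors' settings.

```r
library(splines)

set.seed(2)
n   <- 500
z   <- runif(n, -1, 1)
e   <- rnorm(n)
eps <- 0.8 * e + sqrt(1 - 0.8^2) * rnorm(n)   # structural error correlated with e (endogeneity)
x   <- sin(3 * z) + e                          # nonparametric reduced form (2)
y   <- 1 * x + eps                             # structural model (1) with beta = 1
x <- x - mean(x); y <- y - mean(y)             # variables centred, as assumed in the paper

# First stage: B-spline regression of X on Z with J_n = floor(c * n^{1/(2r+1)})
Jn   <- floor(2 * n^(1 / 5))
B    <- bs(z, df = Jn, intercept = TRUE)       # J_n-dimensional B-spline basis on the support of Z
mhat <- fitted(lm(x ~ B - 1))                  # first-stage fit of m

# Second stage: regress Y on the first-stage fit
beta_hat <- sum(mhat * y) / sum(mhat^2)

# Plug-in variance as stated after Theorem 2.1
eps_hat <- y - x * beta_hat
var_hat <- sum(mhat * eps_hat)^2 / (sum(mhat^2))^2
c(beta = beta_hat, se = sqrt(var_hat))
```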

Nonparametric instrument model averaging: Low-dimensional IV
Now we consider the case with d > 1 and propose the nonparametric instrument model averaging (NIMA) method. As the function m in Equation (2) is hard to estimate directly, we adopt a model averaging approach. The idea is to use least squares model averaging (Hansen 2007) to compute a weighted average of the predicted values of X from a set of submodels in the first stage, and then use the predicted X to estimate β in the second stage.
At the first stage, consider d distinct submodels for predicting X, one for each Z_k, k = 1, . . ., d. Specifically, for the kth submodel, we regress X on Z_k in a fully unspecified functional form:
X = m_k(Z_k) + e_k,   (4)
where m_k is an unknown univariate function and e_k is the corresponding error term. In what follows, when we say that condition (C4) holds, it means that each m_k, k = 1, . . ., d, satisfies the Hölder condition in (C4). We may then adopt estimation methods similar to those in the preceding subsection. In the kth submodel, we again use a B-spline basis to approximate the function m_k, that is, m_k(z) ≈ B_k(z)ᵀθ_k, where B_k(z) = (B_{k1}(z), . . ., B_{kJ_n}(z))ᵀ are the B-spline basis functions and θ_k = (θ_{k1}, . . ., θ_{kJ_n})ᵀ is a vector of coefficients of the basis.
We denote Z_{ki} to be the ith observation of the kth IV, and write B_k = (B_k(Z_{k1}), . . ., B_k(Z_{kn}))ᵀ and m_k = (m_k(Z_{k1}), . . ., m_k(Z_{kn}))ᵀ. Then for the kth submodel, we have m_k ≈ B_kθ_k. Applying the OLS leads to the sample estimator θ̂_k = (B_kᵀB_k)⁻¹B_kᵀX. Next we compute the model-based predicted value as m̂_k = B_kθ̂_k for each submodel. Substituting θ̂_k, we have m̂_k = B_k(B_kᵀB_k)⁻¹B_kᵀX, from which we can see that the prediction for the ith subject is m̂_k(Z_{ki}) = B_k(Z_{ki})ᵀθ̂_k. To construct a model average estimator of X, we seek weights to minimise E{X − Σ_{k=1}^d ω_k m_k(Z_k)}², where ω = (ω_1, . . ., ω_d)ᵀ are the weights corresponding to the d models. Define the optimal weight to be ω* = argmin_ω E{X − Σ_{k=1}^d ω_k m_k(Z_k)}². It can be easily shown that ω* = U⁻¹v. The optimal weights ω* can be estimated by minimising the empirical least squares function Σ_{i=1}^n {X_i − Σ_{k=1}^d ω_k m̂_k(Z_{ki})}². Denote M = (m_1, . . ., m_d), M̂ = (m̂_1, . . ., m̂_d) and ω̂ = (ω̂_1, . . ., ω̂_d)ᵀ. Then the closed-form solution for the estimated weights is
ω̂ = (M̂ᵀM̂)⁻¹ M̂ᵀX.   (5)
To establish the consistency of the estimated weights, we need to modify condition (C6) given in the preceding section and impose an additional condition.
(C6') With probability tending to 1, n⁻¹ m_kᵀm_l → u_{kl} for all k, l = 1, . . ., d, and U = (u_{kl})_{1≤k,l≤d} is positive definite such that there exists an infinitesimal ρ_n (ρ_n → 0 as n → ∞) with ρ_n ≤ λ_1(U) ≤ λ_d(U) ≤ ρ_n⁻¹, where λ_1(U) and λ_d(U) are the smallest eigenvalue and the largest eigenvalue of U, respectively.
(C9) With probability tending to 1, n⁻¹ m_kᵀX → v_k for all k = 1, . . ., d, where v = (v_1, . . ., v_d)ᵀ.
(C6') is a multivariate version of (C6), with an additional boundedness constraint on U to ensure its estimability. In particular, the infinitesimal ρ_n in the condition indirectly allows the true m_k(·) functions to converge to zero at a slow rate, corresponding to the case of weak instruments. The order of ρ_n is chosen so that the smallest eigenvalues of U and U⁻¹ have a higher order than the estimation error. By the weak law of large numbers, (C9) holds when m_k ∈ H(r, L) and (Z_1, X_1), . . ., (Z_n, X_n) are independent.
Theorem 2.2: Under conditions (C1) to (C5), (C6') and (C9), as n → ∞, we have ω̂ →_P ω*.
The proof is given in the supplementary material. In the model averaging literature, justification of weight optimality is essential. By Theorem 2.2, our weight estimates are consistent for the true yet unattainable optimal weights.
Using the estimated weights ω̂, the model averaging prediction for X is X̂ = Σ_{k=1}^d ω̂_k m̂_k = M̂ω̂. We then proceed to the second stage and regress Y on X̂ to obtain the final NIMA estimate β̂, given by β̂ = (X̂ᵀX̂)⁻¹ X̂ᵀY. To establish the consistency and asymptotic normality of the model average estimator, we need the following conditions, modified from the preceding subsection.
Theorem 2.3: Under conditions (C1) to (C5), (C6') to (C8') and (C9), as n → ∞, the NIMA estimator β̂ is consistent and asymptotically normal.
Theorem 2.3 reduces to Theorem 2.1 when d = 1. Chamberlain (1987) and Chen, Chen, and Lewis (2020) studied combinations of instruments that improve the estimation efficiency of β. Our paper focuses instead on finding the 'best' linear combination of models (for predicting X) based on different functional forms of the IVs. Notice that the true model weights are given by ω* = U⁻¹v. Consequently, Theorem 2.3 yields the asymptotic variance of β̂. With some simple calculation, it can be shown that when an additional IV is included in 2SLS, the asymptotic variance of β̂ becomes smaller, leading to a more efficient estimator. In the next subsection, we address the case where the number of IVs d grows to infinity. To deal with the high dimensionality, a penalised approach is introduced to obtain a regularised weight estimator at the model averaging step.
In practice the asymptotic variance of β̂ can again be estimated by a plug-in method and computed as (X̂ᵀX̂)⁻² X̂ᵀε̂ε̂ᵀX̂, where ε̂ = Y − Xβ̂.
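The low-dimensional NIMA procedure can be written out in a few lines. The R sketch below builds the matrix of submodel fits, computes the closed-form weights ω̂ = (M̂ᵀM̂)⁻¹M̂ᵀX, and forms the second-stage estimate together with its plug-in variance. The data-generating step and all object names are illustrative assumptions, not the authors' code.

```r
library(splines)

# Spline-fit X on each IV separately and collect the fitted values (columns of Mhat)
make_Mhat <- function(x, Z, Jn) {
  sapply(seq_len(ncol(Z)), function(k) {
    B <- bs(Z[, k], df = Jn, intercept = TRUE)
    fitted(lm(x ~ B - 1))
  })
}

set.seed(3)
n <- 500; d <- 5
Z   <- matrix(rnorm(n * d), n, d)
e   <- rnorm(n)
eps <- 0.8 * e + sqrt(1 - 0.8^2) * rnorm(n)
x   <- sin(Z[, 1]) + 0.5 * Z[, 2]^2 + e        # only Z1 and Z2 are relevant here
y   <- x + eps                                  # structural model with beta = 1
x <- x - mean(x); y <- y - mean(y)              # centred, as assumed in the paper

# First stage: submodel fits and closed-form weights
Mhat  <- make_Mhat(x, Z, Jn = 6)
w_hat <- solve(crossprod(Mhat), crossprod(Mhat, x))   # (Mhat'Mhat)^{-1} Mhat'X

# Model-averaged prediction of X and second-stage NIMA estimate
x_hat    <- as.vector(Mhat %*% w_hat)
beta_hat <- sum(x_hat * y) / sum(x_hat^2)

# Plug-in variance as described above
eps_hat <- y - x * beta_hat
var_hat <- sum(x_hat * eps_hat)^2 / (sum(x_hat^2))^2
c(beta = beta_hat, se = sqrt(var_hat))
```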

Nonparametric instrument model averaging: high-dimensional IV
In this subsection we extend the proposed NIMA method to the case where d → ∞. It is known that the performance of the 2SLS method deteriorates drastically, or the method becomes inapplicable, as the dimension of the instruments increases. We therefore adopt a regularisation step to cope with the high dimensionality at the model averaging stage of the NIMA procedure. We use the same spline approximation method as in the preceding section to obtain the d nonparametric models {m̂_k : k = 1, . . ., d}. To obtain the weights at the model averaging stage, a regularised weight estimator ω̂ is obtained by minimising the empirical squared loss Σ_{i=1}^n {X_i − Σ_{k=1}^d ω_k m̂_k(Z_{ki})}² plus the penalty term Σ_{k=1}^d p_λ(|ω_k|), where p_λ(·) is a sparsity-inducing penalty function to be specified below and λ > 0 is a tuning parameter that controls the strength of the regularisation. We consider the following three common choices of the penalty function p_λ(t) for t ≥ 0: (a) the L_1 penalty or Lasso (Tibshirani 1996), p_λ(t) = λt; (b) the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), defined through its derivative p′_λ(t) = λ{I(t ≤ λ) + (aλ − t)_+ I(t > λ)/((a − 1)λ)} with a > 2; and (c) the minimax concave penalty (MCP) (Zhang 2010), p_λ(t) = λt − t²/(2a) for t ≤ aλ and p_λ(t) = aλ²/2 for t > aλ, with a > 1. The SCAD and MCP penalties have an additional tuning parameter a that controls the shape of the function. Following the original literature, we choose a = 3.7 for the SCAD penalty and a = 3 for the MCP penalty. These penalty functions have been widely used in high-dimensional sparse modelling and their properties are well understood in ordinary regression models (e.g. Fan and Lv 2010). Moreover, the fact that these penalties belong to the class of quadratic spline functions on [0, ∞) allows a closed-form solution of the corresponding penalised least squares problem in each coordinate, leading to very efficient implementation via coordinate descent (e.g. Mazumder, Friedman, and Hastie 2011).
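The regularised weighting step amounts to a penalised least squares regression of X on the columns of M̂, so off-the-shelf solvers can be used. The R sketch below is one possible implementation (not the authors' code), reusing the Mhat and x objects from the previous sketch; note that glmnet and ncvreg scale the loss internally, so their λ values need not match the parameterisation in the text.

```r
library(glmnet)   # Lasso
library(ncvreg)   # SCAD and MCP

# Penalised weight estimates, lambda chosen by cross-validation on the squared loss.
# Intercepts are suppressed or ignored since all variables are centred.
fit_lasso <- cv.glmnet(Mhat, x, alpha = 1, intercept = FALSE)
w_lasso   <- as.numeric(coef(fit_lasso, s = "lambda.min"))[-1]

fit_scad <- cv.ncvreg(Mhat, x, penalty = "SCAD", gamma = 3.7)
w_scad   <- coef(fit_scad)[-1]          # drop the (near-zero) intercept entry

fit_mcp  <- cv.ncvreg(Mhat, x, penalty = "MCP", gamma = 3)
w_mcp    <- coef(fit_mcp)[-1]

# Index sets of submodels receiving nonzero weight under each penalty
S_lasso <- which(w_lasso != 0)
S_scad  <- which(w_scad  != 0)
S_mcp   <- which(w_mcp   != 0)
```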
In this paper, we assume that the optimal weights for combining the marginal regressions are sparse, i.e. most elements of ω* are zero. Denote the index set of the nonzero weights by S and its complement by S^c, i.e. ω*_i = 0 if and only if i ∈ S^c. We assume that the cardinality of S is |S| = p. We remark that the sparsity assumption with exact zeros can be further relaxed to an approximate sparsity assumption in which the weights outside S are not exactly zero but suitably small. Under such an approximate sparsity assumption, the convergence results provided in Theorem 2.4 remain valid; see the technical discussion after the proof of Theorem 2.4 in the supplementary material.
For any two index sets S_1, S_2 ⊆ {1, . . ., d} and a d × d matrix A, we use A_{S_1,S_2} to denote the submatrix with rows indexed by S_1 and columns indexed by S_2. To establish the consistency of ω̂, we make the following regularity assumptions.
(C5') We take J_n = ⌊c(n/log(dn))^{1/(2r+1)}⌋ for some positive constant c, where ⌊t⌋ is the largest integer no greater than t.
(C6'') Let U_{S,S} be the submatrix of U = E(MᵀM) with rows and columns indexed by S. We assume that U_{S,S} is positive definite such that there exists an infinitesimal ρ_{d,n} (ρ_{d,n} → 0) with ρ_{d,n} ≤ λ_1(U_{S,S}) ≤ λ_d(U_{S,S}) ≤ ρ_{d,n}⁻¹, where λ_1(U_{S,S}) and λ_d(U_{S,S}) are the smallest eigenvalue and the largest eigenvalue of U_{S,S}, respectively.
(C10) Let e_k be defined as in (4). We assume that for all k = 1, . . ., d, given Z_k, E(e_k) = 0, Var(e_k) ≤ σ² for some σ² < ∞, and there exists a constant a > 0 such that E|e_k|^t ≤ σ² t! a^{t−2}/2 holds for all t ≥ 2. Similarly, we assume that there exist positive constants ã and σ̃² such that, given X, E|ε|^t ≤ σ̃² t! ã^{t−2}/2 holds for all t ≥ 2.
(C11) There exists a constant κ ∈ (0, 1) such that ‖U_{S^c,S}(U_{S,S})⁻¹‖_∞ ≤ 1 − κ.
Notice that, different from the fixed-dimensional case, we introduce an additional log(dn) term in the order of J_n in (C5') to balance the estimation accuracy under the high-dimensional setting. (C6'') is the analogue of (C6') for the variables in the active set S. The moment condition (C10) controls the tail behaviour of the noises and thereby ensures the validity of classical concentration results (cf. Bernstein's inequality in Lin and Bai 2011), which play an important role in establishing uniform convergence under high dimensionality. (C11) is an irrepresentability condition for model selection consistency; see, for example, Theorem 2.1 in Zou (2006) for further discussion. Denote the ith element of ω̂ as ω̂_i. The following theorem establishes the consistency of ω̂ in the sparse case.
Theorem 2.4 provides an explicit estimation error bound for the high-dimensional case, which distinguishes it from Theorem 2.2. We remark that the additional assumptions of Theorem 2.4 also hold in the fixed-dimensional case, and hence the error bound derived in Theorem 2.4(ii) applies there as well. Theorem 2.4(i) indicates that the zero weights can be consistently identified, and Theorem 2.4(ii) yields an upper bound for the overall estimation error of the weights. In practice, the tuning parameter is usually chosen by cross-validation based on the squared loss. Without loss of generality, suppose we have selected p̂ instrumental variables with nonzero weights, Z_{d_1}, . . ., Z_{d_p̂}. From Theorem 2.4 we have that, with probability tending to 1, p̂ = p. Following an idea similar to the hybrid Lasso, we can further construct the least squares estimator of the weights using these selected instrumental variables, i.e.
ω̂_p = (M̂_pᵀM̂_p)⁻¹ M̂_pᵀX, which is obtained by replacing M̂ by M̂_p = (m̂_{d_1}, . . ., m̂_{d_p̂}) in (5). The model averaging prediction for X is then given by X̂_p = M̂_p ω̂_p. We then proceed to the second stage and regress Y on X̂_p to obtain the final estimate β̂_p, given by
β̂_p = (X̂_pᵀX̂_p)⁻¹ X̂_pᵀY.   (8)
The following theorem provides an upper bound for the estimation error of β̂_p.
Theorem 2.5: Under the assumptions of Theorem 2.4, assume additionally that condition (9) holds, and let β̂_p be defined as in (8). Then, given p̂, the estimation error of β̂_p satisfies the upper bound in (10).
Condition (9) ensures that the signal strength ‖ω*‖_2 has a higher order than the estimation error in Theorem 2.4(ii). Interestingly, the upper bound in (10) suggests that the estimation error of β̂_p is proportional to ‖ω*‖_2⁻¹. Intuitively, a larger ‖ω*‖_2 reflects the strength of the combined instrumental variables and hence leads to more accurate estimation of β. Assume now that the size of the active set, p, is a fixed constant. Similarly to Theorem 2.3, given the selected instrumental variables, we can establish the consistency and asymptotic normality of the regularised model average estimator. Let v_p and U_p be defined as v and U in Conditions (C9) and (C6'), respectively, with the instrumental variables replaced by Z_{d_1}, . . ., Z_{d_p̂}. We have the following asymptotic results.
Corollary 2.1: Assume that the assumptions of Theorem 2.3 hold with the selected instrumental variables Z_{d_1}, . . ., Z_{d_p̂}. Then the conclusions of Theorem 2.3 hold for β̂_p, with U and v replaced by U_p and v_p.
Using this corollary, we can estimate the asymptotic variance of β̂_p with (X̂_pᵀX̂_p)⁻² X̂_pᵀε̂ε̂ᵀX̂_p, where ε̂ = Y − Xβ̂_p.
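Continuing the sketches above, the refitting step and the final estimator β̂_p can be coded as follows; S_mcp is the hypothetical index set selected in the previous step, and the plug-in variance follows the formula given after Corollary 2.1.

```r
# Unpenalised (hybrid) refit of the weights on the selected submodels
Mhat_p <- Mhat[, S_mcp, drop = FALSE]
w_p    <- solve(crossprod(Mhat_p), crossprod(Mhat_p, x))

# Model-averaged prediction and final estimate beta_hat_p as in (8)
x_hat_p    <- as.vector(Mhat_p %*% w_p)
beta_hat_p <- sum(x_hat_p * y) / sum(x_hat_p^2)

# Plug-in variance as in Corollary 2.1
eps_hat <- y - x * beta_hat_p
var_p   <- sum(x_hat_p * eps_hat)^2 / (sum(x_hat_p^2))^2
c(beta = beta_hat_p, se = sqrt(var_p))
```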

Simulation
We next conduct simulation studies to examine the performance of the proposed nonparametric model averaging estimation method. In the following cases, Y is generated according to Equation (1) with β = 1, and X is generated according to the different nonparametric models given below. We generate the error terms (ε, e)ᵀ ∼ N((0, 0)ᵀ, Σ_e), where Σ_e has unit variances and covariance 0.8. The IVs (Z_1, . . ., Z_d)ᵀ are generated from N(0, Σ), where Σ is a compound-symmetric matrix with variance 1 and covariance ρ.
Case 1. Linear model (d = 5): X is a linear combination of Z_1, . . ., Z_5 plus the error e.
Case 2. Nonparametric additive model (d = 5): X = 0.08 Z_1³ e^{sin(50Z_1)} + 0.06 e^{Z_2} cos(50Z_2) + 0.05 Z_3³ e^{Z_3} + 0.04(e^{−2Z_4} + e^{2Z_4}) + 0.08 Z_5³ e^{Z_5} + e.
Case 3. Nonparametric additive model with interaction terms (d ≥ 5): X = 0.08 Z_1³ e^{sin(50Z_1)} + 0.06 e^{Z_2} cos(50Z_2) + 0.05 Z_3³ e^{Z_3} + 0.04(e^{−2Z_4} + e^{2Z_4}) + 0.08 Z_5³ e^{Z_5} + 0.06 Z_5³ cos(50Z_1) + 0.05 e^{2Z_1 + sin(50Z_2)} + e.
Case 4. Nonparametric additive model with interaction terms and high-dimensional covariates (d > 5): X = 0.08 Z_1³ e^{sin(50Z_1)} + 0.06 e^{Z_2} cos(50Z_2) + 0.05 Z_3³ e^{Z_3} + 0.04(e^{−2Z_4} + e^{2Z_4}) + 0.08 Z_5³ e^{Z_5} + 0.06 Z_5³ cos(50Z_1) + 0.05 e^{2Z_1 + sin(50Z_2)} + · · · + e, where the additional IVs Z_6, . . ., Z_d have only close-to-zero effects on X.
In all these cases, the strengths of the IVs are designed to be weak to moderate relative to the magnitude of the error. Such a specification may be more plausible in real high-dimensional IV studies. For example, when single nucleotide polymorphisms (SNPs) are used as IVs in biomedical research, it is commonly observed that a massive number of SNPs each has only a weak association with the exposure. Comparisons are made with various competing methods. In Cases 1 and 2, we also consider the effect of applying different splines in the first stage and examine whether our procedure is sensitive to these choices of functional basis. In Case 3, we include the linear model averaging method of Martins and Gabriel (2014) for comparison. We consider two different ways of model screening for their approach, namely 'trimming' and 'split-sample' screening, to reduce the number of candidate models for averaging. In terms of the criteria used to select models and to compute the weights of the candidate models, the relevant moment selection criteria (RMSC), model selection criteria (MSC) and generalised R-squared (GR²) as defined in Martins and Gabriel (2014) are applied. The AIC and BIC penalty terms are used to compute the RMSC and MSC, respectively, in the comparison. This method, however, is infeasible for comparison when d is very large, owing to the huge number of candidate models. Due to the endogeneity issue, the naive method performs badly in most cases, with large bias and low coverage probability.
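For concreteness, the following R function generates one dataset from Case 2 as written above (compound-symmetric IVs, correlated errors with covariance 0.8, β = 1). It is an illustrative reading of the simulation design rather than the authors' exact code.

```r
library(MASS)   # mvrnorm

gen_case2 <- function(n = 500, d = 5, rho = 0.3, beta = 1) {
  # IVs from N(0, Sigma): compound symmetric with unit variances and covariance rho
  Sigma_z <- matrix(rho, d, d); diag(Sigma_z) <- 1
  Z <- mvrnorm(n, mu = rep(0, d), Sigma = Sigma_z)

  # Error terms (eps, e) with unit variances and covariance 0.8
  E   <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.8, 0.8, 1), 2, 2))
  eps <- E[, 1]; e <- E[, 2]

  # Case 2 reduced form: weak-to-moderate nonlinear IV effects
  X <- 0.08 * Z[, 1]^3 * exp(sin(50 * Z[, 1])) +
       0.06 * exp(Z[, 2]) * cos(50 * Z[, 2]) +
       0.05 * Z[, 3]^3 * exp(Z[, 3]) +
       0.04 * (exp(-2 * Z[, 4]) + exp(2 * Z[, 4])) +
       0.08 * Z[, 5]^3 * exp(Z[, 5]) + e

  Y <- beta * X + eps
  list(Y = Y, X = X, Z = Z)
}

dat <- gen_case2(rho = 0.3)
str(dat)
```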
In Cases 1 and 2, different spline bases perform quite similarly and all provide satisfactory estimation results for the proposed nonparametric model averaging. Thus, for all subsequent cases we only report results from B-splines. The results for Case 1 are included in the supplementary file. In both cases, the parametric model averaging with linear submodels performs quite well, with small bias and satisfactory coverage, especially in Case 1 where the true model is linear. However, in Case 2 (Table 1), the estimated causal parameter has much larger standard errors under the linear method than under our proposed NIMA method. Even if we increase the sample size, the efficiency loss from using parametric model averaging remains nonnegligible. The traditional approach is thus less favourable than our proposed nonparametric model averaging when the true IV effects are nonlinear.
In Case 3 we compare our NIMA method with additional parametric MA methods in Table 2 when d = 5. In such a low-dimensional setting, regularisation is not needed and all IVs can be included in the two-stage regression. The RMSC_t, MSC_t, GR²_t, RMSC_s, MSC_s and GR²_s methods all lead to very small estimation bias. Nonetheless, our method yields relatively smaller standard errors, and the coverage rate of the 95% confidence interval is closer to the nominal level.
We then increase d to 50, 100 and 200 in Case 3 and report the estimation results of NIMA with the various penalty functions in Table 3. With the sample size fixed at n = 500, the estimation performance deteriorates as d increases. In all cases, MCP appears to be slightly better than the other two penalty methods, with a smaller bias and a slightly higher coverage. MCP also tends to select a smaller number of IVs than the other methods.
We then examine the most difficult setting, Case 4, in Table 4, where a large number of very weak IVs are present in the model. Our proposed nonparametric model averaging performs quite well, producing consistent estimates of the causal parameter. The parametric model averaging with linear submodels using only the first 5 IVs does not perform as well, with slightly larger bias and lower coverage. The model selection method using Lasso can also achieve consistency of the parameter estimates; however, its standard errors are always much greater than those from our proposed model averaging. An efficiency gain is again observed in this high-dimensional case.
We plot the estimated weights for Case 4 in Figure 1. In this case, only the first 5 IVs are associated with the exposure X with non-degenerating effects, while all other IVs have close-to-zero effects. We randomly sample 5 IVs from {Z_k : 5 < k ≤ d} and also report their weights in Figure 1. These estimated weights are distributed in a narrow neighbourhood of zero, so the weights estimated by our NIMA procedure are indeed very sensible in this example.

Home price data analysis
To illustrate the proposed nonparametric model averaging method, we consider a real case study of the effect of federal appellate court decisions regarding eminent domain on home prices.The data has recently been analysed in Belloni et al. (2012) and Seng and Li (2022).
Since this dataset also contains many confounders, we estimate a structural model of the form Y = Xβ + X_oᵀβ_o + ε, where Y denotes the log of the Case-Shiller home price index, X denotes the number of pro-plaintiff appellate takings decisions, and X_o includes whether there were any cases in that circuit-year, the number of takings appellate decisions, controls for the distribution of characteristics of federal circuit court judges in a given circuit-year, circuit-specific effects, time-specific effects, and circuit-specific time trends. An appellate court decision is pro-plaintiff if the court ruled that a taking was unlawful and overturned the government's seizure of the property in favour of the private owner. The number of pro-plaintiff decisions is thus a proxy for how protective a regime is of individual property rights. We are interested in estimating β, which represents the effect of an additional decision upholding individual property rights on home prices. The analysis of the effect of takings law is complicated by the possible endogeneity between takings decisions and home prices, as suggested by Chen and Yeh (2012), Belloni et al. (2014), Belloni et al. (2012) and Seng and Li (2022). The instrumental variables strategy relies on the random assignment of judges to federal appellate panels, which renders the exact identity of the judges and their demographics potential instruments, since they are randomly assigned conditional on the distribution of characteristics of federal circuit court judges in a given circuit-year. We obtain a total of d = 147 potential instruments in this data analysis. The exogenous variables are included only in stage two of our analysis. The total sample size is n = 183. The data for Y, X, Z and X_o are all standardised to have mean zero and standard deviation one. We now implement our proposed methods and compare them with the other models examined in the simulation studies. For the proposed nonparametric model averaging method, we use the Lasso, SCAD and MCP penalties to regularise the high-dimensional model weights. Specifically, the following methods are compared.
pLasso: 2SLS using the IVs selected by Lasso in stage one, implemented by the R package glmnet;
pEL_a: 2SLS using the IVs selected by the elastic net in stage one with elastic net mixing parameter a, implemented by the R package glmnet;
pAL: 2SLS using the IVs selected by the adaptive Lasso in stage one, implemented by the R package glmnet;
pSCAD: 2SLS using the IVs selected by SCAD in stage one, implemented by the R package ncvreg;
pMCP: 2SLS using the IVs selected by MCP in stage one, implemented by the R package ncvreg;
SIS: 2SLS using the IVs selected by the sure independence screening (SIS) method (Fan and Lv 2008) in stage one, implemented by the R package SIS;
npLasso: 2SLS using the IVs selected by Lasso with a generalised additive model in stage one, implemented by the R package mgcv;
npSCAD: 2SLS using the IVs selected by SCAD with a generalised additive model in stage one, implemented by the R package mgcv;
npMCP: 2SLS using the IVs selected by MCP with a generalised additive model in stage one, implemented by the R package mgcv.
In addition to these approaches, we also tried other R packages, including sam, hgm and npreg, which are designed for high-dimensional variable selection. Unfortunately, none of these packages produced selection results for this dataset, so they are not reported in this paper.
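As one example of the competing procedures above, the sketch below outlines the pLasso strategy (Lasso selection of instruments in stage one, then 2SLS with the exogenous controls). The data objects here are simulated stand-ins with the dimensions of the real data, the control and instrument names are hypothetical, and AER::ivreg is used only as a convenient 2SLS routine, not necessarily the implementation used in the original analysis.

```r
library(glmnet)
library(AER)    # ivreg() for two-stage least squares

# Stand-in data with the dimensions of the application (n = 183, d = 147 instruments)
set.seed(4)
n <- 183; d <- 147
Z  <- matrix(rnorm(n * d), n, d, dimnames = list(NULL, paste0("z", 1:d)))
Xo <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, paste0("w", 1:3)))  # controls
x  <- 0.3 * Z[, 1] + rnorm(n)                 # endogenous exposure (illustrative)
y  <- as.vector(x + Xo %*% rep(0.2, 3) + rnorm(n))

# Stage one of pLasso: select instruments by Lasso of the exposure on Z
sel   <- which(as.numeric(coef(cv.glmnet(Z, x), s = "lambda.min"))[-1] != 0)
Z_sel <- Z[, sel, drop = FALSE]

# Stage two: 2SLS of y on x with the selected instruments; controls enter both parts
dat <- data.frame(y = y, x = x, Xo, Z_sel)
f   <- as.formula(paste(
  "y ~ x +", paste(colnames(Xo), collapse = " + "),
  "|", paste(c(colnames(Z_sel), colnames(Xo)), collapse = " + ")))
summary(ivreg(f, data = dat))
```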
In Table 5, we report the estimation results for two different cases. In Case I, we estimate the effect of takings law without adjusting for the other confounding covariates; such a marginal association study may sometimes be of interest to practitioners. The naive method produces a positive but insignificant estimate. In contrast, our proposed model averaging method and all of the existing model selection methods give negative yet insignificant estimates. After adjusting for the confounding covariates in Case II, the estimated effect of takings law becomes positive for all methods. Our penalised nonparametric model averaging method produces relatively stronger effect estimates than most other methods. In particular, the estimator from NIMA with a Lasso penalty is significant at the 0.05 level. These findings are consistent with the earlier results in Belloni et al. (2012), Belloni et al. (2014) and Seng and Li (2022), where parametric IV regression was conducted. In contrast, none of the other model selection methods examined in our analysis produces statistically significant results.
The estimated weights from our nonparametric model averaging and the corresponding nonparametric models for the IVs are plotted in Figures 1 and 2 of the supplementary file. These fitted models suggest that the IV effects may be quite nonlinear in this case, as the curves differ substantially from a straight line. Lasso assigns nonzero weights to 6 variables

Figure 1. The boxplot of partial weights corresponding to the IVs in Case 4.

Table 4. Results for Case 4 (with ρ = 0, 0.3, 0.8 and n = 500). Notes: Bias, standard deviation (sd), standard error (se) and coverage probability (CP) of the β estimates. Number selected (NS) is the number of instrumental variables Z_k, 5 < k ≤ d, which are selected in the high-dimensional case. True value β = 1.