Bayesian Nonparametric Instrumental Variables Regression Based on Penalized Splines and Dirichlet Process Mixtures

We propose a Bayesian nonparametric instrumental variable approach under additive separability that allows us to correct for endogeneity bias in regression models where the covariate effects enter with unknown functional form. Bias correction relies on a simultaneous equations specification with flexible modeling of the joint error distribution implemented via a Dirichlet process mixture prior. Both the structural and the instrumental variable equation are specified in terms of additive predictors comprising penalized splines for nonlinear effects of continuous covariates. Inference is fully Bayesian, employing efficient Markov chain Monte Carlo simulation techniques. The resulting posterior samples not only provide us with point estimates, but also allow us to construct simultaneous credible bands for the nonparametric effects, including data-driven smoothing parameter selection. In addition, improved robustness properties are achieved due to the flexible error distribution specification. Both these features are challenging in the classical framework, making the Bayesian one advantageous. In simulations, we investigate small sample properties, and an analysis of the effect of class size on student performance in Israel illustrates the proposed approach, which is implemented in the R package bayesIV. Supplementary materials for this article are available online.


INTRODUCTION
One frequently encountered problem in regression analysis of observational data is endogenous regressors, that is, explanatory variables that are correlated with the unobservable error term. Sources of this correlation include omitted variables that are associated with both regressors and response (confounding), measurement error, reverse causality, and sample selection. Neglecting the resulting endogeneity bias by using standard regression techniques can lead to severely misleading inference. Two-stage least squares (2SLS) and generalized method of moments (GMM) estimators in combination with instrumental variables, that is, additional covariates that are uncorrelated with the error term but reasonably strongly associated with the endogenous covariate, are therefore routinely applied in the parametric regression context (see, e.g., Wooldridge 2002). These approaches do not necessarily impose distributional assumptions on the error term (for point estimation) but intrinsically rely on linearity of all effects, which is frequently not justified by subject-matter considerations. Thus, in recent years an increasing number of approaches to nonparametric instrumental variable regression have appeared; see Blundell and Powell (2003) for an excellent survey. Horowitz (2011b) made a strong case for nonparametric estimation in a discussion of implications for inference in misspecified parametric models. However, nonparametric instrumental variable regression methods are still rarely used in practice, partly due to a lack of easily available implementations and the need for user assistance in choosing specific tuning parameters. This article addresses these issues by providing a Bayesian framework which routinely allows for the automatic choice of tuning parameters and the construction of simultaneous credible bands for the quantification of the uncertainty of function estimates. Simultaneous credible bands are the Bayesian analogue to simultaneous confidence bands, which are important to assess the uncertainty of an entire curve estimate and to study the relevance of an effect, for example. Pointwise confidence bands, which are almost exclusively used for this purpose, understate this uncertainty and can thus lead to erroneous identifications of nonlinear effects.
In general, there are two approaches to instrumental variables regression, based either on a control function or on regularization.
The control function approach (Newey, Powell, and Vella 1999; Pinkse 2000; Su and Ullah 2008) is directly related to the simultaneous equations literature. For simplicity, we consider a model with a single endogenous covariate for the remainder of the introduction, that is,

y_2 = f_2(y_1) + ε_2,   y_1 = f_1(z_1) + ε_1,   (1)

with response y_2, covariate y_1, and instrumental variable z_1 with effects of unknown functional form f_2 and f_1, respectively, and random errors ε_2 and ε_1. Endogeneity bias arises if E(ε_2 | ε_1) ≠ 0. Then, assuming the identification restrictions E(ε_1 | z_1) = 0 and E(ε_2 | ε_1, z_1) = E(ε_2 | ε_1) yields

E(y_2 | y_1, z_1) = f_2(y_1) + E(ε_2 | ε_1, z_1) = f_2(y_1) + v(ε_1),   (2)

where v(ε_1) is a function of the unobserved error term in the first equation. This has motivated the following two-stage estimation scheme: In a first step, estimated residuals ε̂_1 = y_1 − f̂_1(z_1) are determined using any nonparametric estimation technique for the nonlinear function f_1(z_1). In a second step, an additive model (Hastie and Tibshirani 1990) with response variable y_2,

y_2 = f_2(y_1) + v(ε̂_1) + ξ,   (3)

is estimated, where in addition to y_1 the estimated residuals ε̂_1 are included as a further covariate. Disadvantages of this two-stage approach include the difficulty of incorporating the uncertainty introduced by estimating the parameters in the first step when constructing confidence bands in the second step. In particular, to the best of our knowledge, no approach for simultaneous confidence bands that accounts for this uncertainty has been proposed to date. In addition, automatic smoothing parameter selection for the control function v(ε_1) is difficult since common selection criteria like cross-validation or plug-in estimators focus on minimizing the error in predicting the response variable y_2, while we are interested in achieving a precise estimate of v(ε_1) to yield full control for endogeneity. Finally, outliers and extreme observations in ε_1 may severely affect the endogeneity correction and, therefore, some sort of robustness correction (such as trimming of the residuals) might be necessary (Newey, Powell, and Vella 1999).
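To make the two-stage scheme concrete, the following sketch fits it on simulated data. The data-generating process, the variable names, and the use of mgcv::gam with its default smoothness selection are illustrative assumptions of this example, not the estimator evaluated later in the article.

```r
## Two-stage control function sketch on simulated data (illustrative only).
library(mgcv)
set.seed(1)
n  <- 400
z1 <- runif(n)                          # instrument
e1 <- rnorm(n)
e2 <- 0.7 * e1 + rnorm(n)               # E(eps2 | eps1) != 0: endogeneity
y1 <- sin(2 * pi * z1) + e1             # first (instrumental variable) equation
y2 <- y1^2 + e2                         # structural equation, f2(y1) = y1^2
## Stage 1: estimate f1 nonparametrically and form residuals eps1-hat
eps1_hat <- residuals(gam(y1 ~ s(z1)))
## Stage 2: additive model with the control function v(eps1-hat) as extra smooth
fit <- gam(y2 ~ s(y1) + s(eps1_hat))
plot(fit, pages = 1)                     # s(y1) approximates f2 after correction
```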
In the Bayesian framework, approaches based on representing the model as simultaneous equations are related to the control function approach (see also Kleibergen and Zivot 2003 and Lopes and Polson 2013 for reviews of Bayesian parametric methods). All of the corresponding approaches that allow for smooth covariate effects assume bivariate normality of the errors, (ε_1, ε_2) ∼ N(0, Σ) (e.g., Koop, Poirier, and Tobias 2005; Chib and Greenberg 2007; Chib, Greenberg, and Jeliazkov 2009). Then, both equations in (1) are estimated simultaneously in a Gibbs-sampling scheme, facilitating the estimation of smoothing parameters and credible bands. In this formulation, the control function is not explicitly estimated but is given implicitly by the conditional error distribution. However, bivariate normality implies linearity of this conditional expectation since E(ε_2 | ε_1) = (σ_12/σ²_1) ε_1, where σ_12 = cov(ε_1i, ε_2i) and σ²_1 = var(ε_1i). As a consequence, the control function is implicitly restricted to be linear in ε_1. This corresponds to the assumption that a hypothetical (unknown) omitted variable (inducing the endogeneity bias) has a linear effect on the response. This assumption seems rather restrictive, in particular when allowing for effects of unknown functional form for the observed explanatory variables. Another common source of nonnormality of the error terms are outliers, and thus robustness issues of methods relying on bivariate normality are a serious concern. As a consequence, Conley et al. (2008) proposed the application of a Dirichlet process mixture (DPM) prior (Escobar and West 1995) to obtain a flexible error distribution, but they still relied on linear covariate effects.
A completely different strategy is to assume y_2 = f_2(y_1) + ε_2 with E(ε_2 | z_1) = E(y_2 − f_2(y_1) | z_1) = 0, calling for regularization techniques (see Darolles et al. 2011). Here, an ill-posed inverse problem has to be solved, which creates the need for an additional regularization parameter. Data-driven selection of the regularization parameter is an active area of research, with recent progress given in Breunig and Johannes (2011), Fève and Florens (2010), and Horowitz (2011a), among others. Again, construction of simultaneous confidence bands is difficult, with Horowitz and Lee (2012) being the first attempt. Recent publications on regularization methods in a (quasi-)Bayesian framework are given in Florens and Simoni (2012, 2013), Kato (2012), and Liao and Jiang (2011), where the former two also address the data-driven choice of the regularization parameter. To the best of our knowledge, approaches to simultaneous credible bands have not been considered in this context. It is important to note that the regularized model and the model in (1) and (2) are nonnested and cannot be considered as substitutes for one another, while neither of the two is more general (Horowitz 2011b). Thus, we do not contribute to the (Bayesian) regularization literature but rather intend to provide advances in the control function and Bayesian simultaneous equations literature.
We propose a Bayesian approach based on Markov chain Monte Carlo (MCMC) simulation techniques adapting the simultaneous equations framework and extending the approach by Conley et al. (2008) to include flexible covariate effects. To do so, we employ Bayesian P-splines (Lang and Brezger 2004) for the estimation of flexible covariate effects and a DPM prior for the estimation of a flexible joint error distribution. Thus, we neither make an assumption on the functional form of the effects (besides a smoothness condition) nor on the distribution of the error terms. Further, we will allow a more flexible choice of prior distributions than Conley et al. (2008). The Bayesian formulation will enable us to automatically estimate the smoothing parameters in both equations and to construct simultaneous credible bands that do not depend on distributional assumptions. Moreover, through the use of the DPM prior, outliers in the error terms will automatically be downweighted such that improved outlier robustness is provided.
The approach is used to analyze the effect of class size on scholastic achievements of students in Israel following Angrist and Lavy (1999). Here, an unknown nonlinear form of the class size effect seems plausible. Further, a clearly nonnormal bivariate error density warrants nonparametric estimation of the error density to ensure proper endogeneity bias correction and valid confidence bands.
The remainder of the article is organized as follows. In Section 2 the considered model is introduced and prior distributions are discussed. Section 3 describes Bayesian inference including smoothing parameter determination and construction of simultaneous credible bands. In Section 4, small sample properties are explored through simulations and the approach is compared to existing approaches. An application to class size effects on student performance is provided in Section 5 and the article concludes in Section 6.

ADDITIVE SIMULTANEOUS EQUATIONS MODEL
We consider the simultaneous equations model with additively separable error terms

y_2i = m_2(y_1i, v_2i1, ..., v_2il_2) + ε_2i,
y_1i = m_1(v_1i1, ..., v_1il_1) + ε_1i,   (4)

where y_2 denotes the outcome of primary interest affected by one continuous endogenous covariate y_1 and l_2 exogenous variables v_2ℓ, ℓ = 1, ..., l_2. The same model structure applies to the endogenous covariate, which is additionally related to exogenous variables v_1ℓ, ℓ = 1, ..., l_1. m_1(·) and m_2(·) are unknown smooth functions and ε_1 and ε_2 are (possibly correlated) error terms with E(ε_1) = E(ε_2) = 0. Endogeneity bias in the function m_2(·) arises when E(ε_2 | ε_1) ≠ 0. In the simultaneous equations model, identification relies on the instrumental variables v_11, ..., v_1l_1 (with the same identification restrictions as in the control function approach). While a bivariate normal distribution for the error terms (ε_1i, ε_2i) is a convenient model that enables the inclusion of correlated errors to correct for the bias, it implies strong implicit assumptions on the conditional distribution of ε_2 given ε_1 (linearity of E(ε_2 | ε_1)), as discussed in the introduction. We therefore follow Conley et al. (2008) and employ a Dirichlet process mixture prior (Escobar and West 1995) for the joint error distribution, which basically allows us to specify a hyperprior on the space of potential error distributions.
While the model above is very flexible, in practice the sample size needs to get impractically large due to the well-known curse of dimensionality when the dimensions l_1 and l_2 are large, as in most applied cases. Imposing an additive structure on m_1(·) and m_2(·) helps to circumvent this problem while retaining a still flexible structure and attaining the one-dimensional convergence rate. Thus, we define

m_1(v_1i1, ..., v_1il_1) = γ_10 + γ_11 x_1i1 + ... + γ_1q_1 x_1iq_1 + f_11(z_1i1) + ... + f_1p_1(z_1ip_1) + f_1,p_1+1(z_2i1) + ... + f_1,p_1+p_2(z_2ip_2),   (5)
m_2(y_1i, v_2i1, ..., v_2il_2) = γ_20 + γ_21 x_2i1 + ... + γ_2q_2 x_2iq_2 + f_21(y_1i) + f_22(z_2i1) + ... + f_2,p_2+1(z_2ip_2),   (6)

decomposing v_rℓ, r = 1, 2, ℓ = 1, ..., l_r, into q_1 and q_2 variables x_rℓ, r = 1, 2, ℓ = 1, ..., q_r, with linear effects (typically categorical covariates in dummy or effect coding) and p_1 and p_2 continuous variables z_rℓ, r = 1, 2, ℓ = 1, ..., p_r, with effects of unknown, nonlinear form represented by smooth functions f_11(·), ..., f_1,p_1+p_2(·) and f_21(·), ..., f_2,p_2+1(·). To ensure identifiability of the additive model structure, all functions f_rℓ(·) are centered around zero. On the one hand, note that further restricting all f_rℓ(·) to be linear yields the semiparametric model discussed in Conley et al. (2008) as a special case. This is further explored in Section 4.1. On the other hand, note that bivariate smooth regression surfaces f_rℓ(z_rℓ1, z_rℓ2) of two continuous covariates z_rℓ1 and z_rℓ2 or varying coefficients of the form x_rℓ1 f(z_rℓ2) with interaction variable x_rℓ1 and effect modifier z_rℓ2 can easily be incorporated. In both cases, x_rℓ1 and z_rℓ2 may also be replaced by y_1. Finally, note that models with nonseparable error terms have also been considered in the literature; see Matzkin (2008) and references therein. These models relax the additivity assumption on the error term but impose different restrictions and are not necessarily more general than the model we are considering.

Joint Error Distribution
The standard approach in the Bayesian nonparametric simultaneous equations literature is to assume bivariate normal errors (ε_1i, ε_2i) ∼ N(0, Σ), i = 1, ..., n, with constant covariance matrix Σ, which is assumed to be a priori inverse-Wishart distributed, Σ ∼ IW(s_Σ, S_Σ), where IW denotes the inverted-Wishart distribution parameterized such that E(Σ) = S_Σ^{-1}/(s_Σ − 3). An obvious first relaxation is to use a finite mixture of K** Gaussian components with mixing proportions π_1, ..., π_K** and component-specific (nonconstant) means and covariances μ_c, Σ_c, c = 1, ..., K**:

(ε_1i, ε_2i) | π_1, μ_1, Σ_1, ..., π_K**, μ_K**, Σ_K** ∼ π_1 N(μ_1, Σ_1) + ... + π_K** N(μ_K**, Σ_K**).

Though already quite flexible, this model introduces the problem of selecting the number of mixture components K**. As a remedy, we consider a Gaussian Dirichlet process mixture (Escobar and West 1995), which can be interpreted as the limiting case of a finite mixture model as K** → ∞ (Neal 2000). More specifically, we assume an infinite mixture model with the following hierarchy:

(ε_1i, ε_2i) | μ_i, Σ_i ∼ N(μ_i, Σ_i),
(μ_i, Σ_i) | G ∼ G,
G ∼ DP(α, G_0),

with normal-inverse-Wishart base measure G_0, given by μ | Σ ∼ N(μ_0, Σ/τ) and Σ ∼ IW(s_Σ, S_Σ). In this specification, the mixture components are assumed to be iid draws from the base measure G_0 of the Dirichlet process (DP), while the mixture weights are generated in a stick-breaking manner based on a Beta distribution depending on the concentration parameter α > 0 of the Dirichlet process. The concentration parameter α determines the strength of belief in the base distribution G_0, which is the expectation of the Dirichlet process and around which more mass will be concentrated for large α.
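To illustrate the stick-breaking construction, the following sketch draws error pairs from a truncated version of such a DPM prior. For readability, the base measure is simplified to a fixed covariance Σ_c = σ0² I_2 rather than a full inverse-Wishart draw; this simplification, the truncation level, and all numerical values are assumptions made only for this illustration.

```r
## Draws from a DPM prior via truncated stick breaking (simplified G0).
set.seed(1)
rdpm_prior <- function(n, alpha, C = 50, tau = 0.036, sigma0 = 0.5) {
  v  <- rbeta(C, 1, alpha)                    # stick-breaking proportions
  pi <- v * cumprod(c(1, 1 - v[-C]))          # pi_c = v_c * prod_{j<c}(1 - v_j)
  mu <- matrix(rnorm(2 * C, 0, sigma0 / sqrt(tau)), C, 2)  # mu_c ~ N(0, Sigma/tau)
  cl <- sample(C, n, replace = TRUE, prob = pi)            # component labels
  mu[cl, ] + matrix(rnorm(2 * n, 0, sigma0), n, 2)         # (eps1, eps2) draws
}
eps <- rdpm_prior(400, alpha = 1)
plot(eps, xlab = "eps1", ylab = "eps2")       # clustered, possibly multimodal errors
```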
To emphasize the capability of the prior to model means and covariances varying with observations, we can also express the implied hierarchy via the stick-breaking representation (Sethuraman 1994),

G = Σ_{c=1}^∞ π_c δ_{(μ_c, Σ_c)},   π_c = v_c ∏_{j<c} (1 − v_j),   v_c ∼ Be(1, α),

where δ_θ is a unit point mass at θ and the (μ_c, Σ_c) are iid draws from G_0. Although we are dealing with an infinite mixture, there can be at most n components affiliated with data and, therefore, most components will in fact be empty and only determined by the prior. More precisely, in a specific dataset, errors will be clustered together into K* ≤ n clusters with means μ_l = (μ_1l, μ_2l)^t and covariances Σ_l. This can be nicely seen by considering the so-called Pólya urn scheme (Blackwell and MacQueen 1973). Let θ_1 = (μ_1, Σ_1), θ_2 = (μ_2, Σ_2), ... be an (infinite) sequence of iid draws from G. Then, the predictive distribution of a new θ_{k+1} conditional on the previous values θ_1, ..., θ_k, marginalizing out G, is given by

θ_{k+1} | θ_1, ..., θ_k ∼ (1/(α + k)) Σ_{i=1}^k δ_{θ_i} + (α/(α + k)) G_0,   (7)

with δ_{θ_i} denoting a unit point mass at θ_i. That is, θ_{k+1} equals any of the k previous θ_1, ..., θ_k with probability 1/(α + k) each and is drawn from the base distribution G_0 with probability α/(α + k). Moreover, Equation (7) can also be reexpressed in terms of the distribution of the distinct values, known as the Chinese restaurant process. By doing so, it can be shown that a new θ_{k+1} equals some distinct value θ_l with probability n_l/(α + k), with n_l the number of draws already corresponding to θ_l; that is, the probability is proportional to the cluster size. Besides the clustering property of the Dirichlet process, these probability expressions also demonstrate the important role of the concentration parameter α: The expected number of components for a given sample size n is approximately given by E(K* | α, n) ≈ α log(1 + n/α) (Antoniak 1974). Thus, the concentration parameter α is directly related to the number K* of unique pairs (μ_l, Σ_l) in the data. To avoid fixing K*, we therefore estimate α from the data and consequently have to assign a prior to it. The standard conjugate prior for α is a Gamma prior, α ∼ Ga(a_α, b_α). Alternatively, a discrete prior on K* as in Conley et al. (2008) can be used (which is equally supported by our software); see Conley et al. (2008) for details.
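The clustering behavior and the role of α can be checked by simulating the Chinese restaurant process directly. The sketch below compares the simulated number of clusters K* with the approximation E(K* | α, n) ≈ α log(1 + n/α); the chosen values of α and n are arbitrary.

```r
## Chinese restaurant process: simulate K* and compare with Antoniak's formula.
set.seed(1)
crp_K <- function(n, alpha) {
  counts <- 1                                  # first draw opens a cluster
  for (k in 2:n) {                             # k-th draw sees k - 1 previous ones
    p <- c(counts, alpha) / (alpha + k - 1)    # join cluster l w.p. n_l/(alpha + k - 1)
    j <- sample(length(p), 1, prob = p)
    if (j > length(counts)) counts <- c(counts, 1) else counts[j] <- counts[j] + 1
  }
  length(counts)                               # K*: number of occupied clusters
}
alpha <- 1; n <- 400
mean(replicate(200, crp_K(n, alpha)))          # close to ...
alpha * log(1 + n / alpha)                     # ... the approximation, about 6.0
```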
Since our model includes constants γ_10 and γ_20, we have to ensure that E(ε_1i, ε_2i) = 0 for identifiability. Though centered Dirichlet process mixtures could generally be applied for this purpose, we opt to achieve this by choosing μ_0 = (0, 0)^t and constraining Σ_{i=1}^n μ_1i = Σ_{i=1}^n μ_2i = 0. This simple solution allows us to use efficient algorithms for estimation. Note that an a priori zero mean μ_0 = (0, 0)^t alone does not imply that G has a posterior zero mean. Also note that Conley et al. (2008) do not center the error distribution, such that the global intercepts are not identified.
With respect to priors on the parameters of the base distribution G_0, Conley et al. (2008) proposed to keep the parameters μ_0, τ, s_Σ, and S_Σ fixed to reduce the computational burden. They argue that, given y_1 and y_2 are standardized beforehand, zero means μ_0 = (0, 0)^t, a diagonal S_Σ, and parameters s_Σ and τ chosen such that components of Σ_c and μ_c may take even extreme values introduce negligible prior information. However, as Escobar and West (1995) emphasized, the prior variance τ^{-1} (which is closely linked to the bandwidth in kernel density estimation in the case of a constant Σ) has a strong impact on the degree of smoothness of the density. For a given number of distinct mixture components in the data (K*), a small value of τ allows the means (μ_1l, μ_2l), l = 1, ..., K*, to vary more strongly, resulting in a greater chance of multimodality in the error term distribution for fixed Σ_l. Also, τ may have an effect on the downweighting of outliers in the conditional mean E(ε_2i | ε_1i) and thus on the influence of outliers on the endogeneity bias correction, as we will see in Section 3.2. To express uncertainty about τ, Escobar and West (1995) therefore proposed to choose a conjugate prior τ ∼ Ga(a_τ/2, b_τ/2). Finally, the choice of an inverse Wishart prior on S_Σ, S_Σ ∼ IW(s_S, S_S), might be desirable. Our method allows us to flexibly choose between fixed and uncertain hyperparameters.

Parametric Effects
For parametric effects γ_rℓ, r = 1, 2, ℓ = 0, ..., q_r, we use diffuse priors p(γ_rℓ) ∝ const in the case of complete lack of prior knowledge. Note that there is abundant literature showing that flat priors in combination with very weak instrumental variables can lead to identification problems (see, e.g., Kleibergen and Zivot 2003), in which case the use of the Jeffreys prior is recommended. However, when using Dirichlet process mixtures for the joint error distribution, the Jeffreys prior no longer takes the well-known form that arises in the case of normal error terms. Therefore, we restrict our analyses to flat priors and recommend checking the explanatory power of the instrumental variables in advance or using informative normal priors (such that posteriors will always be proper). Nevertheless, our simulations in Section 4 indicate that our approach works well even in the case of quite weak instruments, confirming simulation results of Conley et al. (2008).

Nonparametric Effects
Since their introduction by Eilers and Marx (1996), penalized splines have become increasingly popular for representing effects of continuous covariates with unknown, nonlinear form under a global smoothness assumption on differentiability. See, for example, Kauermann, Krivobokova, and Fahrmeir (2009), Reiss and Ogden (2009), and Claeskens, Krivobokova, and Opsomer (2009) for their theoretical properties.
We will consider the Bayesian analogue to penalized splines as introduced by Lang and Brezger (2004). Hence, we assume that each of the smooth functions f_rℓ(x) of some covariate x ∈ {y_1, z_11, ..., z_1p_1, z_21, ..., z_2p_2} can be approximated by a suitable spline function s_rℓ(x) in the space of spline functions S(d_rℓ, κ_rℓ) of degree d_rℓ with knots κ_rℓ = {x_min < κ_1 < κ_2 < ... < κ_K_rℓ < x_max}, that is, s_rℓ(x) ∈ S(d_rℓ, κ_rℓ). Since S(d_rℓ, κ_rℓ) is a (K_rℓ + d_rℓ + 1)-dimensional vector space (a subspace of the (d_rℓ − 1)-times continuously differentiable functions), s_rℓ(x) can then be represented as a linear combination of suitable basis functions B_k(x), that is,

s_rℓ(x) = Σ_{k=1}^{K_rℓ + d_rℓ + 1} β_rℓk B_k(x).

Due to their simplicity and numerical stability, we will use B-spline basis functions in the following, although clearly any other basis, such as truncated polynomials or Fourier series expansions, could be used.
Although the global smoothness properties are determined by the degree of the spline basis d_rℓ, the variability of the resulting estimates heavily depends on the location and number of knots. Instead of directly aiming at optimizing the number and position of the knots in a data-driven manner, the penalized spline approach relies on using a generous number of equidistant knots in combination with a penalty that avoids overfitting. Claeskens, Krivobokova, and Opsomer (2009) showed that the theoretical properties of penalized spline estimators are similar to either those of regression splines or those of smoothing splines, depending on the assumptions concerning the growth of the basis dimension with the sample size. Between the two scenarios, there is a clear breakpoint concerning the growth rate of the number of knots. If the number of knots grows fast enough, the shrinkage bias (due to the penalization term) dominates the average MSE and the approximation bias (due to the approximation of a general smooth function by a spline basis) becomes negligible. More specifically, using a penalty on the integrated squared second derivative, the average squared approximation bias is of order O(K^{-4}) with number of knots K ∼ C n^{ν/5} for some constants C and ν > 1, whereas the average squared shrinkage bias is of order O(n^{-4/5}). We assume that the number of knots is taken large enough such that the approximation bias is negligible and we can replace f_rℓ(x) by s_rℓ(x). A common rule of thumb is to choose K_rℓ = min(n/4, 40).
In the frequentist framework, Eilers and Marx (1996) proposed to penalize the squared qth-order differences of adjacent basis coefficients, thereby approximating the integrated squared qth derivative of the spline function. In the Bayesian framework, this corresponds to assigning a random walk prior to the spline coefficients, with β_rℓk = β_rℓ,k−1 + u_k or β_rℓk = 2β_rℓ,k−1 − β_rℓ,k−2 + u_k for first- and second-order random walks, iid u_k ∼ N(0, τ²_rℓ), and noninformative priors for the initial parameters. In this specification, the random walk variance τ²_rℓ acts as an inverse smoothing parameter, with small values corresponding to heavy smoothing while large values allow for considerable variation in the estimated function. In the limiting case τ²_rℓ → 0, the estimated function approaches a constant or a linear effect for first- and second-order random walk priors, respectively. From the random walk specification, the joint prior distribution of the coefficient vector β_rℓ can be derived as a partially improper multivariate Gaussian distribution,

p(β_rℓ | τ²_rℓ) ∝ exp( −(1/(2τ²_rℓ)) β_rℓ^t Λ_rℓ β_rℓ ),

where Λ_rℓ is the penalty matrix given by the cross-product of a difference matrix D_rℓ of appropriate order, that is, Λ_rℓ = D_rℓ^t D_rℓ.
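As a minimal illustration of these building blocks, the following sketch sets up a cubic B-spline design matrix and the second-order difference penalty Λ = D^t D; for fixed variances τ² and σ², the conditional posterior mode of the basis coefficients is the familiar penalized least squares estimate. The smooth test function, the data, and the fixed variance values are assumptions of the example.

```r
## B-spline design and random walk (difference) penalty for one smooth effect.
library(splines)
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
K <- 20                                        # interior knots, cf. min(n/4, 40)
knots <- seq(min(x), max(x), length.out = K + 2)[-c(1, K + 2)]
B <- bs(x, knots = knots, degree = 3, intercept = TRUE)   # n x (K + d + 1)
D <- diff(diag(ncol(B)), differences = 2)      # second-order difference matrix
P <- crossprod(D)                              # penalty matrix Lambda = D^t D
tau2 <- 0.01; sigma2 <- 0.09                   # fixed variances for illustration
beta <- solve(crossprod(B) + (sigma2 / tau2) * P, crossprod(B, y))
plot(x, y); points(x, B %*% beta, col = 2, pch = 16)   # penalized spline fit
```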
To complete the fully Bayesian prior specification, a prior on τ²_rℓ has to be assigned. We choose a conjugate inverse-gamma distribution with shape and scale parameters a_τrℓ and b_τrℓ, that is, τ²_rℓ ∼ IG(a_τrℓ, b_τrℓ).

Hyperparameter Choices
From the properties of the inverse Wishart distribution (see, e.g., Link and Barker 2005 for a related discussion), it follows that the residual variances (the diagonal elements of Σ_l) are a priori inverse gamma distributed, σ²_rl ∼ IG((s_Σ − 1)/2, S_rr/2), r = 1, 2, with S_rr the rth diagonal element of S_Σ. Further, given that S_Σ is diagonal, it follows that the correlation coefficient ρ_l in component l is a priori beta-distributed, (ρ_l + 1)/2 ∼ Be((s_Σ − 1)/2, (s_Σ − 1)/2). Thus, the prior of the correlation coefficient has a density that is symmetric around 0 (since the beta distribution parameters are equal), and consequently choosing a diagonal S_Σ results in a zero prior mean for the correlation, E(ρ_l) = 0. However, the prior distribution of ρ_l also depends on s_Σ. For s_Σ = 3, we obtain a Be(1, 1) distribution, which is the uniform distribution; for s_Σ < 3 we obtain a U-shaped distribution; and for s_Σ > 3 a unimodal distribution. Conley et al. (2008) used as default specification s_Σ = 2.004 and thus a prior on ρ_l with a U-shaped density. Thus, although in their prior choice errors are uncorrelated in the mean, more probability mass is assigned to correlations close to −1 and 1 than to values close to 0. To avoid such prior information, we rather choose s_Σ = 3 such that the prior on ρ_l is uniform over [−1, 1]. Alternatively, in certain situations one might want to choose s_Σ > 3 such that the prior on ρ_l is unimodal and symmetric around zero to a priori favor no endogeneity in the case of only weak information in the data (and thereby stabilize estimation similar to regularization techniques).
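The implied correlation prior can be inspected directly: since (ρ_l + 1)/2 ∼ Be((s_Σ − 1)/2, (s_Σ − 1)/2), a change of variables gives the density on [−1, 1]. The short sketch below plots it for the default s_Σ = 3 and for the s_Σ = 2.004 used by Conley et al. (2008).

```r
## Prior density of rho_l implied by a diagonal S_Sigma: change of variables
## from (rho + 1)/2 ~ Be((s - 1)/2, (s - 1)/2), Jacobian 1/2.
rho_prior <- function(rho, s) 0.5 * dbeta((rho + 1) / 2, (s - 1) / 2, (s - 1) / 2)
curve(rho_prior(x, s = 3), from = -1, to = 1, ylim = c(0, 1.5),
      xlab = "rho", ylab = "prior density")         # uniform on [-1, 1]
curve(rho_prior(x, s = 2.004), add = TRUE, lty = 2) # U-shaped, mass near +-1
```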
Given s_Σ = 3, we obtain σ²_rl ∼ IG(1, S_rr/2) as prior on the residual variances. Taking into account that the responses are centered and standardized, we choose a diagonal S_Σ with equal elements such that the inverse gamma prior introduces only weak information on the residual variances. To choose these elements, we follow Conley et al. (2008) and choose the default S_rr such that p(0.25 < σ_rl < 3.25) = 0.8 based on the inverse gamma distribution of σ²_rl, keeping in mind that y_1 and y_2 were standardized beforehand. With s_Σ = 3 we obtain S_Σ = 0.2 I_2 and thus σ²_rl ∼ IG(1, 0.1) as a weakly informative default. Note that with s_Σ = 2.004 and S_Σ = 0.17 I_2, Conley et al. (2008) chose as default an IG(0.502, 0.085) prior on the residual variances. Although imposing an IW prior on S_Σ instead is conceptually and computationally straightforward, the associated hyperparameter choice is unclear; this option is, therefore, not pursued in the remainder of the article.
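This calibration is easy to verify numerically: using that σ² ∼ IG(a, b) is equivalent to 1/σ² ∼ Ga(a, rate = b), the sketch below computes the prior probability p(0.25 < σ_rl < 3.25) under the IG(1, 0.1) default.

```r
## Check of the weakly informative default: sigma^2 ~ IG(1, 0.1).
pinvgamma <- function(x, shape, rate) 1 - pgamma(1 / x, shape, rate = rate)
pinvgamma(3.25^2, 1, 0.1) - pinvgamma(0.25^2, 1, 0.1)  # about 0.79, i.e. ~0.8
```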
Given the possible impact of τ on the smoothness of the density and the weighting of outliers, we might want to impose a hyperprior on τ, τ ∼ Ga(a_τ/2, b_τ/2). We will follow Escobar and West (1995) and impose a diffuse gamma prior with default hyperparameters a_τ = 1 and b_τ = 100, in contrast to Conley et al. (2008), who chose a fixed τ. The impact of estimating τ versus fixing it is studied in our simulation study in Section 4.1.
With respect to the concentration parameter α, we follow the recommendation of Ishwaran and James (2002) and choose a Gamma prior with hyperparameters a_α = b_α = 2 as defaults. This allows both small and large values of α, corresponding to few and many mixture components, respectively. For the smoothing parameters τ²_rℓ of the nonparametric effects, we choose the standard noninformative prior τ²_rℓ ∼ IG(0.001, 0.001) in the following.

BAYESIAN INFERENCE

Estimation
Assuming the structure given in (5) and (6), both equations in (4) can be written in the generic form y_r = η_r + ε_r, r = 1, 2, with predictors

η_r = V_r γ_r + X_r1 β_r1 + ... + X_rp̃_r β_rp̃_r,

where all parametric effects (including the intercept) in each equation are combined in the design matrix V_r with regression coefficients γ_r, whereas the nonparametric effects are represented using B-spline design matrices X_rℓ with corresponding basis coefficients β_rℓ, and p̃_1 = p_1 + p_2, p̃_2 = p_2 + 1.
Estimation is carried out using posterior means from Gibbs sampling steps in an efficient Markov chain Monte Carlo implementation. Specifically, given the parameters of the error distribution, the full conditionals for the covariate effect parameters in each equation resemble those of a normal heteroscedastic regression model, and the sampling techniques proposed in Lang and Brezger (2004) (with heteroscedastic errors) can be applied. On the other hand, given the parameter vectors β_rℓ, τ²_rℓ, ℓ = 1, ..., p̃_r, and γ_r, r = 1, 2, the components of the error distribution can be obtained using any algorithm for Bayesian nonparametric estimation of bivariate densities based on DPM priors (see Neal 2000 for an overview). Thus, our software allows us to choose efficiently implemented algorithms that are called on top of our sampler. More precisely, we use the implementation provided by the R package DPpackage (Jara et al. 2011) of two Gibbs sampling algorithms with auxiliary variables given in Neal (2000). In addition, the implementation accompanying Conley et al. (2008) is integrated. Full details on all full conditionals are given in the online appendix in the supplementary materials.

Smoothing Parameter Estimation
In general, all nonparametric smoothing techniques involve some smoothing parameter controlling the roughness of the fit. This parameter has a strong impact on the estimate and has to be chosen carefully in finite samples. In the control function approach, smoothing parameter choice for the control function E(ε_2 | ε_1) and for the covariate functions has to be addressed differently. Smoothing parameter choice is particularly delicate here since the smoothness of the functions in the first stage and of the control function influences the endogeneity bias correction for f_21(y_1). Thereby, the major problem is to find the smoothing parameter for the control function; given that this smoothing parameter is chosen correctly, it seems plausible that the remaining ones can be found using common criteria like cross-validation. Newey, Powell, and Vella (1999) minimized the cross-validation (CV) criterion over a multidimensional grid and thus treated the control function in the same way as f_21(y_1). That is, the MSE of the additive predictor as a whole is (asymptotically) minimized instead of the MSE of f_21(y_1) given E(ε_2 | ε_1). Marra and Radice (2011) took the same route using penalized splines with quadratic roughness penalties and minimized a multivariate version of generalized cross-validation (GCV). In Section 4.2, we show that this can lead to a confounded estimate of f_21(y_1) due to inappropriate choices of the smoothing parameter of the control function. Choosing the smoothing parameter by a global optimization criterion often induces insufficient smoothness, although situations with oversmoothing may also occur. In general, global optimization criteria are not suitable for determining smoothing parameters that minimize the MSE of f_21(y_1). Su and Ullah (2008) proposed a "plug-in" estimator for the smoothing parameter in a multidimensional function f(y_1, ε_1) (in the model with q_1 = q_2 = p_2 = 0, p_1 = 1), where f(·, ·) is estimated by kernel regression with a product kernel with a single bandwidth, together with a pilot bandwidth for estimating f̂_1(z_1). Here, the choice of the pilot bandwidth and the assumption of a single bandwidth for f(y_1, ε_1) might be problematic.
Our Bayesian approach is closely related to the control function approach. For comparison with Equation (3), consider the conditional distribution of y_2i given y_1i and the DPM parameters,

y_2i | y_1i, · ∼ N( η_2i + E(ε_2i | ε_1i), ξ_i ),
E(ε_2i | ε_1i) = μ_2i + (σ_12,i/σ²_1,i)(ε_1i − μ_1i),   ξ_i = σ²_2,i − σ²_12,i/σ²_1,i.

On the one hand, the covariate effects f_2ℓ(·) are estimated by penalized splines. On the other hand, estimates of the parameters associated with the nuisance components E(ε_2i | ε_1i) and ξ_i result from the DP mixture (see also the online supplementary materials). Note that these parameters and the decomposition into E(ε_2i | ε_1i) and ξ_i follow immediately from the conditional distribution of (mixtures of) bivariate normals. Thereby, the mean and variance components may vary with i, such that E(ε_2i | ε_1i) and (ξ_1, ..., ξ_n) may follow any functional form and distribution, respectively.
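The following sketch evaluates this implied "control function" for a simple two-component bivariate normal mixture (with component values chosen arbitrarily for illustration), showing that E(ε_2 | ε_1) is in general a nonlinear function of ε_1 once the error distribution departs from a single Gaussian.

```r
## Implied control function E(eps2 | eps1) under a bivariate normal mixture.
cond_mean <- function(e1, w, mu, Sig) {
  num <- den <- 0
  for (c in seq_along(w)) {                    # mix the componentwise conditional
    wc <- w[c] * dnorm(e1, mu[[c]][1], sqrt(Sig[[c]][1, 1]))   # posterior weights
    mc <- mu[[c]][2] + Sig[[c]][1, 2] / Sig[[c]][1, 1] * (e1 - mu[[c]][1])
    num <- num + wc * mc
    den <- den + wc
  }
  num / den
}
S <- matrix(c(0.5, 0.2, 0.2, 0.5), 2, 2)
curve(cond_mean(x, w = c(0.5, 0.5), mu = list(c(-1, -1), c(1, 1)),
                Sig = list(S, S)),
      from = -3, to = 3, xlab = "eps1", ylab = "E(eps2 | eps1)")  # S-shaped, not linear
```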
In contrast to parametric frequentist approaches and Bayesian approaches assuming bivariate normality, the factor σ_12,i/σ²_1,i in E(ε_2i | ε_1i) may vary with observation i rather than being constant. This also shows that in the presence of heteroscedasticity (σ_12,i nonconstant), the endogeneity bias correction may fail when bivariate normality with constant variance is assumed.
Compared to nonparametric frequentist approaches, σ_12,i/σ²_1,i acts like a varying coefficient, allowing the degree of endogeneity correction to differ across observations. The nonconstant variances σ²_1,i and means μ_1i shrink the error terms of the first-stage equation toward their (nonconstant) mean and thereby automatically downweight outliers in ε_1i. Here, on the one hand, the "smoothing parameter" is the number of mixture components, governed by the data and the prior on the concentration parameter α. On the other hand, τ plays an important role for the smoothness of the error density. As mentioned before, a small τ allows the μ_1i to vary more strongly around their mean, which translates into a possibly stronger downweighting of outliers in ε_1i depending on τ. Note that control function approaches can be extremely sensitive to outliers in the error distribution if these are not explicitly handled, since they do not account for the high variability of the control function at extreme values of ε_1 (outliers) where observations are scarce. The performance of the DPM approach in the case of residual outliers and its capability of explaining unobserved heterogeneity are investigated in Section 4.2. However, note that there is no such thing as a free lunch, and the downweighting of outliers can also turn into a disadvantage in specific situations. If y_1 or y_2 are discrete and concentrated very strongly on only a few values, rarer measurements may be misinterpreted as outliers, and variability can then be completely explained by the error distribution, leaving no variation to be explained by the covariates (in particular in the case of binary covariates). Note that this issue is not specific to our proposed approach but applies to all regression approaches with a DPM prior on the error density, as in Chib and Greenberg (2010), for example. A rough diagnostic check is to visualize the estimated error density for discreteness.
Note that, in contrast to the frequentist approaches, we do not impose dependencies between values of v(ε_1i) for adjacent ε_1i, and σ_12,i/σ²_1,i is also not a function of ε_1. Also note that the DP prior specification allows "for different degrees of smoothing across the sample space through the use of possibly differing variances" (Escobar and West 1995), and thus the "smoothing parameter" of the conditional mean can be considered locally adaptive. See Escobar and West (1995) for connections between DPM and kernel density estimation with varying bandwidths.

Simultaneous Bayesian Credible Bands
Simultaneous inference is important to assess the estimation uncertainty of an entire curve, allowing us to make statements about the significance of an effect or about feature significance and to perform specification tests.
The construction of simultaneous confidence bands is already difficult in the case of exogenous covariates. The crucial point is to take into account the asymptotic bias that is inherent in nonparametric estimators. Also, the data-driven choice of smoothing (or regularization) parameters introduces extra variability that has to be taken care of. See Krivobokova, Kneib, and Claeskens (2010) for connections between the Bayesian and frequentist frameworks and Wiesenfarth et al. (2012) for simultaneous confidence bands in single-equation additive models.
In the nonparametric instrumental variables framework, the only approach to simultaneous confidence bands we are aware of was given by Horowitz and Lee (2012). They proposed asymptotic uniform confidence bands building on the interpolation approach of Hall and Titterington (1988) and estimated the critical value by bootstrapping in connection with undersmoothing; that is, they chose the regularization parameter smaller than optimal such that the bias term becomes asymptotically negligible. However, the degree of undersmoothing necessary is unknown in practice. Moreover, they did not take into account the variability due to estimating the smoothing and regularization parameters, which may lead to undercoverage in small samples.
We propose the use of Bayesian simultaneous credible bands making use of the sample quantiles obtained from the Monte Carlo output. A simultaneous credible band, as the Bayesian counterpart to a simultaneous confidence band, is defined as a region I_α such that P_{f|Y}(f ∈ I_α) = 1 − α, that is, the posterior probability that the entire true function f lies inside the region, given the data, equals 1 − α. In the instrumental variable regression context, Bayesian credible bands have the considerable advantage of naturally incorporating the uncertainty from the estimation of all unknowns in the model. In particular, this includes the parameters of the "first-stage" equation that explain the endogenous covariate (which is particularly difficult in the frequentist framework). Even the uncertainty due to estimating the corresponding smoothing parameters is taken into account. Moreover, we do not have to make any distributional assumption; that is, asymmetric bands can also be obtained.
We follow Krivobokova, Kneib, and Claeskens (2010) and obtain Bayesian simultaneous credible bands by scaling the pointwise credible intervals derived from the α/2 and 1 − α/2 quantiles of the function samples from the MCMC output with a constant factor until (1 − α)100% of the sampled curves are contained in the credible band. Thereby, the information on the possibly nonnormal error distribution is preserved and the complete variability is taken into account without overly demanding computational effort.
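A minimal sketch of this scaling construction is given below; fsamp is assumed to be an S × G matrix of MCMC draws of the function evaluated on a grid of G points, and the scaling factor is increased until the required fraction of sampled curves lies entirely within the band.

```r
## Simultaneous credible band by scaling pointwise quantile intervals.
simband <- function(fsamp, level = 0.95) {
  fhat <- colMeans(fsamp)                              # posterior mean curve
  lo <- apply(fsamp, 2, quantile, probs = (1 - level) / 2)
  up <- apply(fsamp, 2, quantile, probs = 1 - (1 - level) / 2)
  covered <- function(cf) {                            # fraction of curves inside
    l <- fhat - cf * (fhat - lo); u <- fhat + cf * (up - fhat)
    mean(apply(fsamp, 1, function(f) all(f >= l & f <= u)))
  }
  cf <- 1
  while (covered(cf) < level) cf <- cf + 0.01          # smallest sufficient factor
  list(lower = fhat - cf * (fhat - lo), upper = fhat + cf * (up - fhat))
}
## Example with artificial "function samples":
fs <- matrix(rnorm(500 * 40), 500, 40)
band <- simband(fs)
```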

SIMULATIONS

Parametric Model
In this section, settings with linear covariate effects are simulated to compare the Bayesian approach to the well-established 2SLS estimator, showing that it is capable of correcting endogeneity bias and that, in the cases of outliers in the error distribution and nonlinear conditional means, the Bayesian approach outperforms the 2SLS procedure. Thus, this section supplements the studies of Conley et al. (2008), where normal and log-normal error distributions were simulated. We consider the basic model

y_2 = γ_20 + γ_21 y_1 + γ_22 z_2 + ε_2,   y_1 = γ_10 + γ_11 z_1 + γ_12 z_2 + ε_1,

where z_2 and z_1 are independently uniformly distributed on [0, 1] and all coefficients are equal to 1. We consider four different bivariate distributions for the error terms:

(i) A simple bivariate normal distribution with a quite high degree of endogeneity.

(ii) A mixture of two normal distributions that adds outliers (with very small correlation ρ = 0.1) to (i).

(iii) A mixture of four bivariate normals with weights 0.3, 0.2, 0.3, and 0.2, means (2, 2)^t, (1.5, 0.5)^t, (−0.3, 0)^t, and (−1, −1)^t, all variances (in each mixture component and both equations) equal to 0.1, and correlations 0.5, 0.2, 0.6, and 0.8 between the equations. This setting is an example of (unobserved) heterogeneity with varying degrees of endogeneity in each cluster.

(iv) A symmetric bivariate distribution that is conditionally normal with nonlinear conditional mean, that is, ε_1 | ε_2 ∼ N(4/(ε_2² + 1), 1/(ε_2² + 1)) and vice versa for ε_2 | ε_1. Note that the degree of endogeneity varies over observations.
Densities of example draws from the distributions in (iii) and (iv) are shown in Figure 1. Obviously, the strength of the instruments as well as the degree of endogeneity vary over the settings. In each setting, we simulated 500 Monte Carlo replications with a rather small and a moderately large sample size, n = 100 and n = 400. For our DPM approach, the initial 3000 iterations are discarded as burn-in and every 30th iteration of the subsequent 30,000 iterations is used for inference. As discussed in Section 2.4, we choose a weakly informative prior on the error distribution with s_Σ = 3, S_Σ = diag(0.2, 0.2), μ_0 = (0, 0)^t, and a_α = b_α = 2. Labeled "DPM1," we first consider a fixed τ chosen according to Conley et al.'s (2008) assessment strategy, based on the observation that the errors are marginally t-distributed and thus μ_r ∼ sqrt(S_rr/(τ(s_Σ − 1))) t_{s_Σ−1}. Considering that the data were centered and standardized, τ is then chosen such that p(−10 < μ_r < 10) = 0.8, which results in τ = 0.036 given s_Σ = 3 and S_Σ = 0.2 I_2. Second, we consider a weakly informative gamma distribution for τ with a_τ = 1 and b_τ = 100, labeled "DPM2" in the following. To assess the cost of the DPM modification, we also consider the Bayesian estimator assuming bivariate Gaussian errors with the same parameters for s_Σ, S_Σ, and μ_0, labeled "Gauss." Results for all settings are given in Table 1.
In simulation setting (i) with n = 100, the DPM approach performs better overall than 2SLS, especially in terms of variability of the point estimates. In particular, the RMSEs (evaluated at the design points) are considerably lower for the DPM approach. Note that 6.6% of the 2SLS estimates even had a negative sign (vs. virtually none in the DPM approach, with 0.6% and 0.2%, respectively). In setting (i), the DPM approach with gamma prior on τ performs only slightly better than with fixed τ. This becomes more pronounced in setting (ii) (at least for n = 100), where in the presence of outliers assigning a hyperprior is clearly preferable. While the RMSEs of 2SLS increase in the presence of outliers in setting (ii), this was not the case for the DPM estimator. For the larger sample size n = 400, 2SLS and the DPM approach perform almost identically well and, as expected, the impact of the prior on τ diminishes. Note that in settings (i) and (ii) the instruments are very weak, with a population R² of

R²_pop = var(z_1)/(var(z_1) + var(z_2) + σ²_1) = (1/12)/(1/12 + 1/12 + σ²_1) ≈ 0.07,

and even slightly lower in setting (ii).
In settings (iii) and (iv), we observed strongly increased widths of the 2SLS confidence intervals; the coverage probability of the intervals is, however, still close to the nominal level. Still, this also has an impact on the power of detecting a significant positive effect: On a 5% level, rejection rates of 59% and 32% for the 2SLS estimator in settings (iii) and (iv), respectively, were observed, versus 100% for the DPM estimator. In these two settings, the DPM estimator with fixed τ performed best, since estimation of τ increased the variability in DPM2. Here, also for n = 400, due to the nonlinear conditional means, the DPM approach performs better than 2SLS in terms of efficiency (MSE and IQR) and interval widths. Again, the importance of the prior on τ diminishes with increasing sample size.
In setting (i), the estimator assuming Gaussian errors and the DPM estimator perform almost identically; that is, there is negligible cost for using the more flexible estimator besides the increased computational burden. As the nonnormality becomes more pronounced in settings (ii)-(iv), the benefit of the DPM approach becomes stronger, particularly for small sample sizes.

Nonparametric Model
To allow some comparison with the simulation results given in Su and Ullah (2008), which also cover the approaches by Newey and Powell (2003) and Pinkse (2000), we first replicate their DGPs 1 and 4 and then vary the complexity of the error distribution of their DGP 4.
In settings (b.ii) and (b.iii), the error distribution in (b) is replaced by the distributions in settings (ii) and (iii) of the previous section, respectively. In the final setting, which we call (b.v), the distribution in (b) is replaced by one of the distributions given in Marra and Radice (2011), which exactly resembles the structural assumptions of the control function approach, with errors of the form ε_r = g_r(w) + v_r, where w ∼ U(0, 1), g_1(w) = −exp(−3w) and g_2(w) = −0.5(w + sin(π w^{2.5})) (standardized to have variance one), and iid v_1, v_2 ∼ N(0, 1). Note that in settings (b) and (b.v), w can be considered an omitted variable with linear and nonlinear effects, respectively.
We compare our DPM approach (with hyperparameter settings DPM1 and DPM2 as in the previous subsection) with the two-step control function approach of Newey, Powell, and Vella (1999), using penalized splines and choosing smoothing parameters by GCV as suggested by Marra and Radice (2011). Moreover, results for naive estimation (i.e., without bias correction) using the single-equation Bayesian penalized splines estimator (Lang and Brezger 2004) are given. Note that naive local linear estimation with smoothing parameter chosen by least squares cross-validation (as in Su and Ullah 2008) resulted in mean RMSEs that differed by less than 0.01 from those obtained by naive Bayesian estimation in all settings. As a benchmark, we also give the results for models using the true but unobserved y_2 − E(ε_2 | ε_1) as response with the single-equation Bayesian estimator (labeled "bench"). Finally, results obtained using Tikhonov and Landweber-Fridman regularization approaches (labeled "TK" and "LW," respectively), both modified for local linear kernel regression (with normal kernel) using the R package np (Hayfield and Racine 2008), are given. In both cases, smoothing parameters are chosen by least squares cross-validation (LSCV). The regularization parameter of the former is sequentially chosen by minimizing a penalized sum of squared residuals criterion, while that of the latter is sequentially chosen with respect to an optimal stopping rule (see Darolles et al. 2011; Fève and Florens 2010; and Centorrino, Fève, and Florens 2013).
For all Bayesian approaches, we used a burn-in of 5000 iterations and retained 1000 samples from the subsequent 40,000 iterations for estimation. Further, cubic B-splines based on 25 and 40 knots for sample sizes of 100 and 400, respectively, and a second-order random walk prior were used. The same choices were made for the two-step control function approach, with a corresponding second-order difference penalty.
In Table 2, mean RMSEs and coverage rates of 95% simultaneous credible bands (when available) for DGPs 1 and 4 of Su and Ullah (2008) (settings (a) and (b)) are given. In the case of Tikhonov regularization, the regularization parameter was sometimes chosen too small, such that oscillating function estimates resulted. This inflated the mean RMSEs, which lay between 1.3 and 38.6, and we therefore report median RMSEs instead. For all other estimators, mean and median RMSEs differed only negligibly.
We find RMSEs for all estimators that are considerably smaller than those given in Su and Ullah (2008). Note that we even obtained better results for the naive estimator using LSCV. This is most probably because we used a numerical minimization algorithm with a random starting value to minimize the LSCV criterion, whereas Su and Ullah (2008) (personal communication) chose the bandwidth h according to h = c·sqrt(var(y_1))·n^{−1/5} with a limited grid search over c. Thereby, they obtained RMSEs that changed only slightly with increasing degree of endogeneity, which is rather implausible. This highlights the importance of the data-driven choice of tuning parameters.
While both the control function and the DPM approach decreased the mean RMSE compared to the naive estimator, the DPM approach performed slightly better, with negligible impact of the prior choice. Landweber-Fridman regularization performed similarly to the control function approach for a low degree of endogeneity and slightly worse for an increasing degree of endogeneity. Tikhonov regularization performed slightly worse than Landweber-Fridman regularization.

Table 3 gives results for settings (b.ii), (b.iii), and (b.v). In settings (b.ii) and (b.iii) (outliers and multimodal error density/unobserved heterogeneity), the control function approach is clearly outperformed by the DPM approach. Figure 2 shows the estimated curves in the first 50 simulation runs of setting (b.iii), illustrating that estimates of the control function approach can be seriously confounded when E(ε_2 | ε_1) is not a smooth function. Clearly, this cannot only be attributed to the higher variability of the cross-validated smoothing parameter of f̂_21(y_1). Also in setting (b.v) the DPM approach performs better, although not as pronounced.
RMSEs for the DPM approach were also consistently lower than those of the regularization approaches, particularly in the mixture distribution setting. The regularization approaches seemed to be more robust to outliers than the control function approach, resulting in a better performance in setting (b.ii), while they performed much worse in the case of the mixture distribution.

In all settings, the DPM approach provides simultaneous credible bands with frequentist coverage rates above the nominal level. That is, the credible bands were successful in taking into account all the variability in the estimation. On the other hand, the credible bands are slightly conservative in a frequentist coverage sense, which is unsurprising since this is a well-known property of Bayesian credible bands, also observed by Krivobokova, Kneib, and Claeskens (2010) in the single-equation case. Note that no simultaneous confidence bands are available for the control function approaches.
In summary, the proposed approach outperformed the control function approach based on GCV smoothing parameter selection and, relying on the results given in Su and Ullah (2008), the control function estimators of Pinkse (2000) and Su and Ullah (2008).
Moreover, our Bayesian approach provided us with simultaneous credible bands that performed extremely well even in the case of rather complex error distributions and small sample sizes. We emphasize that comparisons to approaches based on regularization should be treated with caution since they rely on different assumptions and regularity conditions. There may well be situations where they outperform our approach. Our intention is to show that the proposed approach can be a valuable alternative that may be favorable in certain situations.

CLASS SIZE EFFECTS ON STUDENT ACHIEVEMENTS
In a very influential article, Angrist and Lavy (1999) analyzed the effect of class size on fourth and fifth grade students' test scores in Israel. Among others, they consider the model

tscore_ji = γ_20 + γ_21 csize_ji + γ_22 disadv_ji + ν_j + ε_2ji,

where tscore_ji is the class-level average of a reading comprehension test score, csize_ji the number of students, and disadv_ji the fraction of disadvantaged students in class i of school j, respectively. Further, ν_j is a school-specific random effect.
To deal with the endogeneity of csize_ji due to nonrandom assignment of class sizes, they define the predicted class size pcsize_ji of class i in school j as an instrument, given by

pcsize_ji = enrol_j / (int[(enrol_j − 1)/40] + 1),

where enrol_j is the beginning-of-the-year enrollment in school j for a given grade and int(k) is the largest integer less than or equal to k.
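For concreteness, the instrument is just a deterministic function of enrollment and can be computed directly; the sketch below uses floor() for int[·] and shows the characteristic saw-tooth pattern at multiples of 40.

```r
## Maimonides' rule: predicted class size as a function of enrollment.
pcsize <- function(enrol) enrol / (floor((enrol - 1) / 40) + 1)
pcsize(c(40, 41, 80, 81, 120))     # 40.0 20.5 40.0 27.0 40.0
plot(5:200, pcsize(5:200), type = "l",
     xlab = "enrollment", ylab = "predicted class size")
```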
Then, using a sample of 2019 public schools and assuming a first-stage equation

csize_ji = γ_10 + γ_11 pcsize_ji + γ_12 disadv_ji + ε_1ji,

they fit the model using 2SLS and find, for fourth and fifth graders, class size effects of −0.110 and −0.158, respectively, with standard errors of 0.040 each, resulting in the conclusion of a significantly negative effect on the reading comprehension test score. When applying the DPM estimator to this parametric model (with hyperparameter setting "DPM2"), that is, the semiparametric estimator by Conley et al. (2008) modified to allow for the specification of a prior on τ, posterior means of the class size effects of −0.103 and −0.108 are obtained. Hence, we find virtually no difference between fourth and fifth graders and estimates close to the 2SLS estimate for fourth graders.
As a robustness check for the validity of the instrument, Angrist and Lavy (1999) add linear, quadratic, and piecewise linear effects of enrollment to the equations and find that this has quite an impact on the estimated coefficients for class size (ranging between −0.074 and −0.147 and between −0.186 and −0.275 for fourth and fifth graders, respectively). That is, the inclusion of enrol and the functional form of its effect (which is roughly approximated by a few parametric specifications) affect the estimated class size effect. Furthermore, a violation of the linearity assumption on the class size effect cannot be ruled out, and there may be a positive effect for small classes which vanishes for larger classes. This would correspond to a nonlinear effect that could not properly be identified by a simple linear model as the one given above. Thus, we relax the assumption of linear effects and extend the model of Angrist and Lavy (1999) to the following specification:

csize_ji = γ_10 + γ_11 pcsize_ji + f_12(disadv_ji) + f_13(enrol_ji) + ε_1ji,
tscore_ji = γ_20 + f_21(csize_ji) + f_22(disadv_ji) + f_23(enrol_ji) + ε_2ji.   (8)

Note that the inclusion of random school effects ν_rj ∼ N(0, σ²_νr) with inverse gamma priors on the variance parameters, σ²_νr ∼ IG(a_σνr, b_σνr), r = 1, 2, in both equations, capturing within-school correlations of class average scores, did not change the results substantively but basically only increased the widths of the confidence bands slightly; these results are, therefore, not discussed further. Also note that within-school correlations will generally be positive and thus will increase confidence band widths (given that point estimates do not change), such that the given confidence bands will not underestimate estimation precision. Figure 3 shows posterior means of the smooth effects for fourth graders (top panels) and fifth graders (bottom panels) in Equation (8) (solid black lines) jointly with 95% pointwise credible intervals (gray areas) and 95% simultaneous credible bands (areas between black dashed curves). On the left-hand side, class size effects together with 2SLS estimates in the model excluding enrol (gray solid line) and including a linear (gray dashed line) and quadratic (gray dotted line) effect of enrol are given. All results are based on hyperparameter specification "DPM2"; results with "DPM1" were very similar.
Regarding fourth grade students, no significant class size effect is found. This does not mean, however, that there is none; the data (and the instrument) might just not be informative enough. Note that using 2SLS, the functional form specification of the enrollment effect (not included, linear, quadratic, or piecewise linear) has a relatively strong impact on the class size coefficient. In contrast, using the nonparametric DPM approach, the inclusion of a smooth effect of enrollment barely influenced the class size effect; results for the model without enrollment are therefore omitted. As revealed by the simultaneous credible bands, estimation uncertainty is excessively high, particularly for class sizes smaller than 20, casting the interpretability of point estimates into doubt. If, however, one is willing to interpret them, we find indeed a negative relationship between class size and student performance for small class sizes (less than 25 students) and no association as soon as this "threshold" is exceeded.
For fifth grade students, estimation uncertainty is again too high to draw reliable conclusions on the impact of csize on students' performance and on its functional form. Note, however, that we find a significant deviation from the 2SLS fit (in the models including enrollment).
For both grades, the estimated curves f̂_22(disadv) (see Figure 3, middle panels) deviate significantly from the linear estimates obtained by 2SLS (gray straight lines). Such a misspecification of the functional form of the effect of a control variable can, of course, also affect the estimated class size effect. The smooth effects of enrollment are highly nonlinear but not significant for both grades.
We find clearly nonnormal error densities in Figure 4. The error density of the first equation has a distinct peak, while both densities show some slight indication of asymmetry.
It is also interesting to note that using the proposed approach we obtain γ̂_11 ≈ 0.99 in the first-stage equation, which is very close to the theoretically expected coefficient of 1. Angrist and Lavy (1999) obtained coefficient estimates of 0.772 and 0.670 and of 0.702 and 0.542 for fourth and fifth graders, respectively, depending on whether (a linear effect of) enrol was included or not. Thus, they obtained substantially smaller coefficients than expected, leading to a different bias correction. The differences most likely occur due to the different handling of outliers in 2SLS and in the Bayesian model based on the DPM prior. Finally, note that Horowitz (2011b) analyzed the same data with a bivariate smooth function of csize and disadv. He also finds no significant class size effect (though only reporting results for disadv = 1.5).

CONCLUSION
We presented a flexible nonparametric approach for models with one endogenous regressor. Its advantages include the availability of simultaneous credible bands, which naturally incorporate the variability from estimating the instrumental variable equation, and data-driven smoothing parameter selection, which is particularly difficult in two-step frequentist approaches. We do not rely on a normality assumption, such that violations of bivariate normality will not affect estimates, and more efficient interval estimates are provided. In our simulation study, we show that the approach based on the DPM is quite robust in the case of outliers, making the Bayesian approach advantageous even in the parametric context, where 2SLS methods, although consistent, are sensitive to outliers in finite samples.
In our application, we found that without imposing linearity on the effects, no reliable conclusions on the relationship between class size and student performance can be drawn.
Interesting questions for future research include the incorporation of discrete endogenous variables and binary/categorical outcomes of interest, as well as nonparametric sample selection models adapting the error density estimation in Wiesenfarth and Kneib (2010). Our results can also be used for seemingly unrelated regression (SUR) models, extending Lang et al. (2003). The approach is implemented in the user-friendly R package bayesIV.

SUPPLEMENTARY MATERIALS
The online supplementary materials file provides details on the full conditionals for all parameters discussed in Section 2.