Yet another look at the omitted variable bias

Abstract When conducting regression analysis, econometricians often face the situation where some relevant regressors are unavailable in the data set at hand. This article shows how to construct a new class of nonparametric proxies by combining the original data set with one containing the missing regressors. Imputation of the missing values is done using a nonstandard kernel adapted to mixed data. We derive the asymptotic distribution of the resulting semiparametric two-sample estimator of the parameters of interest and show, using Monte Carlo simulations, that it dominates the solutions involving instrumental variables and other parametric alternatives. An application to the PSID and NLS data illustrates the importance of our estimation approach for empirical research.


Introduction
Omission of relevant variables leads to challenging problems in applied work. If the omitted variable is correlated with the included regressors, least squares estimates are biased and inconsistent. This issue has a long history in economics.
A classic example of the missing regressor is an ability measure in Mincer's (1974) wage regression, where estimates suffer from the so-called "ability bias" unless the regressors include a variable representing ability (see, e.g., Card, 1995, for details). Micro-level data sets such as the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID) do not generally contain individual test scores that can be used as (a proxy for) ability.
Another example can be found in the study of the gender wage gap (e.g., Black et al., 1999; Zabalza and Arrufat, 1985). Work experience is an important regressor in the wage regression. Although the General Household Survey (GHS) and the CPS contain wages and other predictors, they do not include a variable for actual work experience.
This article argues that the availability of a suitable auxiliary sample offers an underexploited opportunity. We use that sample to construct a new semiparametric estimator and study its properties. In a nutshell, our two-sample estimation procedure is to run ordinary least squares (OLS) after replacing the missing regressor with a nonparametric estimate of its conditional mean, obtained using a nonstandard kernel function tailored to mixed discrete and continuous data.
Our approach is different from but not unrelated to the work on matched-sample estimators (see, e.g., Abadie and Imbens, 2006, 2011; Hirukawa and Prokhorov, 2018). One key advantage of our estimator is that the imputed proxy does not rely on a combined sample constructed via nearest-neighbor matching (NNM) or another ad hoc matching scheme. Matching-free estimation is also the purpose of some well-known moment-based two-sample methods (e.g., Angrist and Krueger, 1992, 1995; Arellano and Meghir, 1992; Inoue and Solon, 2010; Klevmarken, 1982; Murtazashvili et al., 2015; Pacini and Windmeijer, 2016). Unfortunately, these approaches are not applicable in the setting of a linear regression where some regressors are missing in the primary sample, and two-sample moment-based estimation is infeasible. Our setup is closer to what is considered by Pacini (2019), Chen et al. (2008) and Graham et al. (2016), who also use an auxiliary sample to impute missing regressors. However, Pacini's (2019) main focus is on relaxing identification conditions for two-sample estimation, while Chen et al. (2008) and Graham et al. (2016) study parametric and semiparametric efficiency bounds for GMM.
We establish large-sample properties of the proposed semiparametric two-sample estimator. Its finite-sample properties are examined through extensive Monte Carlo simulations (largely deferred to an online supplement to save space), where we show that our estimator dominates the competitors, in particular, instrumental variable (IV) estimators using linear and nonlinear instruments and a parametric two-sample alternative. In an empirical application, we look for a setting where we are able to obtain estimates based on both the IV estimator and the proposed estimator at the same time. We do so for the return to schooling and offer new insights arising from the use of IVs based on one sample as well as from the use of proxies based on two samples.
The nonparametric imputation we propose is important in several practical dimensions. First, having access to a second sample, empirical economists commonly impute the variables that are necessary for estimation but missing in the first sample. For example, Fang et al. (2008) impute the medical expenditure of Health and Retirement Study (HRS) respondents using information from the Medicare Current Beneficiary Survey (MCBS), and Flavin and Nakagawa (2008) impute the square footage of the homes of PSID respondents using data on housing from the American Housing Survey (AHS). Instead of estimating the conditional mean of the missing variable nonparametrically, these papers adopt a linear projection-based imputation. Of course, if the functional form turns out to be incorrectly specified, then consistency of the estimators can no longer be guaranteed. Furthermore, estimation errors associated with imputation need to be explicitly accounted for: imputed values cannot be treated as error-free, and adjustments need to be made to the standard errors. Remarkably, our nonparametric imputation approach performs no worse even under correct parametric specification. The result on inconsistency of the linear projection approach and the circumstances under which it still works may be of independent interest. We explore this in the Monte Carlo simulations of this article and in the online supplement.
Second, our nonparametric imputation method is important in the context of the literature on two-step semiparametric regression estimation with a plug-in first-step nonparametrically generated regressor (see, e.g., Hsu et al., 2022; Newey, 2009; Rilstone, 1996; Stengos and Yan, 2001). This literature explores statistical properties of regression estimators when regressors that cannot be observed directly are estimated by nonparametric methods, similarly to Pagan's (1984) classical fully parametric case. However, these approaches are not directly applicable when imputation needs to be based on two samples or when the two samples have no common observations.
Finally, we employ a new uniform consistency result for estimators based on a nonstandard kernel and mixed data, which is of independent interest. A key aspect of the result is compactness of the support of the kernel function, a feature that often arises in economics and finance either by construction or as a theoretical requirement. For example, economic and financial variables such as expenditure and budget shares, unemployment rates and default and recovery rates are typically bounded from above and below by construction. Theoretical compactness requirements are used in the context of partial linear regression models (see, e.g., Yatchew, 1997), NNM (see, e.g., Abadie and Imbens, 2006), and estimation of first-price auctions (see, e.g., Guerre et al., 2000), to name just a few examples. Interested readers may consult Hirukawa et al. (2022) for further details.
The remainder of this article is organized as follows. Section 2 describes the model, defines the proposed estimator, and derives and discusses its convergence properties. Section 3 presents results of selected Monte Carlo simulations to compare finite-sample properties of the proposed estimator with the IV and other alternatives. As an empirical example, in Section 4, we apply the two-sample estimator to a version of Mincer's (1974) wage regression. Both in simulations and the application, we compare our estimator (a) with the IV approach, (b) with an approach where the missing variable is imputed using a fully parametric method, and (c) with the approach of Hirukawa and Prokhorov (2018) based on NNM. Section 5 concludes with a few questions for future research. Technical proofs are provided in the Appendix. The online supplement contains details of the alternative estimators (a)-(c), comprehensive simulation results for a variety of designs, and additional discussions on the empirical example.
The article adopts the following notational conventions: $\|A\| = \{\mathrm{tr}(A'A)\}^{1/2}$ is the Euclidean norm of matrix $A$; $1\{\cdot\}$ denotes an indicator function; $0_{p \times q}$ signifies the $p \times q$ zero matrix, where the subscript may be suppressed if $q = 1$; $B(p, q) = \int_0^1 y^{p-1}(1-y)^{q-1}\,dy$ for $p, q > 0$ is the beta function; and the symbol $>$ applied to matrices means positive definiteness.

Model
We consider the following linear regression model:

$$Y = \beta_0 + X_1'\beta_1 + X_2'\beta_2 + X_{3I}'\beta_3 + u, \qquad (1)$$

where $X_1 \in \mathbb{R}^{d_1}$, $X_2 \in \mathbb{R}^{d_2}$ and $X_{3I} \in \mathbb{R}^{d_{3I}}$ (the distinction between these regressors will be made clear shortly). Throughout, we assume that either $\beta_1$ or $\beta_3$ is the parameter of primary interest. Let $d := d_1 + d_2 + d_{3I}$. When $(1, X_1', X_2', X_{3I}')' \in \mathbb{R}^{d+1}$ is exogenous and a single random sample of $(Y, X_1, X_2, X_{3I})$ is available, the OLS estimator of $\beta = (\beta_0, \beta_1', \beta_2', \beta_3')'$ is consistent under the usual assumptions.
Let $S_1$ denote the data set at hand and let $S_2$ denote a second data set which will be required for two-sample estimation. We assume that $S_1 = (Y, X_1, X_3)$ and $S_2 = (X_2, X_3)$, where $X_3 := (X_{3I}, X_{3E})$ is the vector of common variables across the two samples, partitioned into those included ($X_{3I}$) in and those excluded ($X_{3E}$) from regression (1). We note that $X_2$ is missing in $S_1$ although it is assumed to be relevant, i.e., $\beta_2 \neq 0$. This setup was recently considered by Pacini (2019), but his focus was on relaxing the identification assumptions involving the joint distribution of $(Y, X_1, X_2, X_{3I})$. The distinction between $X_1$ and $X_3$ will become important later. For now, we can think of $X_3$ as variables that are common across $S_1$ and $S_2$, enter as regressors, and are used, due to their presence in both samples, for imputation of the missing regressors in $X_2$. On the other hand, when a common variable serves merely as a regressor but is not used for imputation, it is classified as $X_1$. The vector $X_1$ can be empty; the vector $X_3$ cannot.
Even though $S_1$ and $S_2$ contain the common variables $X_3$, this does not mean that they need to have common observations, i.e., the two samples do not have to overlap. Finally, denote $d_3 := d_{3I} + d_{3E}$, where $d_3 > 0$ must be the case, and either $d_{3I}$ or $d_{3E}$ is allowed to be zero.
An example of this sampling arrangement is when a micro-level data set (e.g., CPS or PSID) can be complemented by an auxiliary data set which is focused on some important variables that are not available elsewhere, e.g., a psychometric data set with ability measures.
In such cases it is common to instrument for the unobserved variable (see Section 1.1 of the online supplement for details of IV estimation). However, IV estimation is unavailable if there are no valid instruments in the same sample, and it may be desirable to avoid IV estimation for a host of other reasons discussed in Section 2.5. Denote the primary sample of size $n$ by $S_{1n}$ and the auxiliary sample of size $m$ by $S_{2m}$. We now discuss how we construct a proxy for $X_2$ using $S_{2m}$.

Additional notation
We start by introducing some additional notation that will help define the semiparametric two-sample estimator and derive its asymptotic properties. First, we allow the vector of common variables $X_3 = (X_{3I}, X_{3E})$ to consist of both continuous (C) and discrete (D) variables, so that $X_3 = (X_{3C}, X_{3D})$ with $d_{3C} + d_{3D} = d_3$, where $d_{3C} > 0$ must be the case for the subsequent asymptotic analysis. This distinction will be needed for the application of kernel smoothing to mixed continuous and discrete data. Second, let $\mathcal{X}_3 := \mathcal{X}_{3C} \times \mathcal{X}_{3D}$, where $\mathcal{X}_{3C} := \mathrm{supp}(X_{3C})$ and $\mathcal{X}_{3D} := \mathrm{supp}(X_{3D})$. Third, denote $g_2(X_3) := E(X_2 \mid X_3)$ and $\eta_2 := X_2 - g_2(X_3)$ with $E(\eta_2 \mid X_3) = 0$. The conditional expectation function $g_2(\cdot)$ and the error term $\eta_2$ will determine the asymptotic adjustments required to account for the use of nonparametric proxies. Finally, for a pair of vectors $(Z, W)$, let $\Phi_{Z,W} := E(Z_i W_i')$, $S_{n,Z,W} := (1/n)\sum_{i=1}^n Z_i W_i'$ and $S_{m,Z,W} := (1/m)\sum_{j=1}^m Z_j W_j'$, where the summations for $S_{n,Z,W}$ and $S_{m,Z,W}$ are assumed to be taken within $S_{1n}$ and $S_{2m}$, respectively, and where the subscripts $n$ and $m$ emphasize the relevant sample sizes.

Estimation of $\beta$ and $g_2(\cdot)$
We reformulate regression (1) as

$$Y = X'\beta + \varepsilon,$$

where $X := (1, X_1', g_2(X_3)', X_{3I}')'$ and $\varepsilon := u + \eta_2'\beta_2$. In essence, we wish to obtain an estimator of the unknown function $g_2(X_3)$ using $S_2$ and construct a proxy for $X_2$ using some distance measure between the common variables $X_3$ in $S_1$ and $S_2$. We will use a kernel function (as opposed to NNM or another matching scheme) to achieve that.
Let $\hat{g}_2(\cdot)$ be some consistent nonparametric estimator of $g_2(\cdot)$. Then the estimation procedure we propose consists of the following two steps.

Step 1: Regard $\{(X_{2j}, X_{3j})\}_{j=1}^m$ in $S_2$ and $\{X_{3i}\}_{i=1}^n$ in $S_1$ as $m$ data points and $n$ design points, respectively, and obtain the $n$ nonparametric estimates $\{\hat{g}_2(X_{3i})\}_{i=1}^n$.

Step 2: Run OLS for the regression of $Y_i$ on $\hat{X}_i := (1, X_{1i}', \hat{g}_2(X_{3i})', X_{3Ii}')'$.

The estimator of $\beta$ is then

$$\hat\beta_{PILS} := \Big(\sum_{i=1}^n \hat{X}_i \hat{X}_i'\Big)^{-1}\sum_{i=1}^n \hat{X}_i Y_i.$$

Because this estimator is the OLS with $\hat{g}_2(X_{3i})$ plugged in place of the missing regressor $X_{2i}$, we call it the plug-in least squares (PILS) estimator hereinafter.
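To fix ideas, the two-step procedure above can be sketched in a few lines of code. This is a minimal illustration under our own toy data-generating process and variable names: it uses a plain Gaussian Nadaraya-Watson smoother in place of the mixed-data kernel discussed below, and sets $d_{3I} = 0$ so that the common variable is used only for imputation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DGP (ours, not the article's): Y = b0 + b1*X1 + b2*X2 + u, with X2
# missing in the primary sample S1 but E[X2 | X3] estimable from an
# auxiliary sample S2.  X3 is excluded from the regression and serves
# only for imputation (an exclusion restriction).
b0, b1, b2 = 1.0, 0.5, -0.8
n, m = 2000, 3000

def g2(x3):                          # true conditional mean E[X2 | X3]
    return 1.0 + 2.0 * x3

# Primary sample S1 = (Y, X1, X3); X2 itself is never used below.
x3_1 = rng.uniform(0, 1, n)
x1 = rng.normal(size=n)
x2_latent = g2(x3_1) + 0.3 * rng.normal(size=n)
y = b0 + b1 * x1 + b2 * x2_latent + 0.2 * rng.normal(size=n)

# Auxiliary sample S2 = (X2, X3), drawn independently of S1.
x3_2 = rng.uniform(0, 1, m)
x2_2 = g2(x3_2) + 0.3 * rng.normal(size=m)

# Step 1: Nadaraya-Watson estimate of g2 at the n design points of S1,
# using the m data points of S2 (undersmoothed bandwidth).
h = 0.5 * m ** (-1 / 3)
w = np.exp(-0.5 * ((x3_1[:, None] - x3_2[None, :]) / h) ** 2)
g2_hat = (w @ x2_2) / w.sum(axis=1)

# Step 2: OLS of Y on (1, X1, g2_hat(X3)) -- the plug-in least squares step.
X = np.column_stack([np.ones(n), x1, g2_hat])
beta_pils, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_pils, 2))        # roughly (1.0, 0.5, -0.8)
```

The sketch recovers the coefficients despite $X_2$ never being observed in $S_1$, which is the essence of the two-sample plug-in idea.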
The part of $X_3$ that is used for imputation but excluded from the original model can be used to form exclusion restrictions $E(X_{3E}'u) = 0$. If $g_2(X_3)$ is not linear in $X_{3E}$, these restrictions can, in principle, be used to improve the precision of the estimation of $\beta$ (or to construct specification tests). However, since our main goal is to use $X_{3E}$ in the construction of a proxy for $X_2$, we do not pursue the issues of efficiency or specification testing here.
At first glance, the PILS approach, just like its fully parametric counterpart (see Section 1.2 of the online supplement for details), may be viewed as a variant of generated regressors, which have been extensively studied by many researchers. However, two samples and nonparametric imputation raise new and nontrivial issues, which require careful consideration. We study these issues in the next section.
A remaining task is to deliver a consistent estimator of $g_2(\cdot)$. Taking into consideration that $g_2(\cdot)$ may depend on both continuous and discrete covariates, we choose the kernel regression smoother for mixed continuous and categorical data proposed by Racine and Li (2004). In this sense, PILS may be viewed as a partial mean estimator of Newey (1994), although Newey (1994) does not consider the situation where design and data points for kernel smoothing come from different data sources.
A key part of the estimator is the construction of a multivariate kernel. Let $K(t_{3C}; x_{3C}, h)$ and $L(t_{3D}; x_{3D}, k)$ denote product kernels for the continuous and discrete components of $X_3$, respectively, where $t_{3\cdot}$ and $x_{3\cdot}$ are data points and design points, and $h$ and $k$ are vectors of bandwidths. We provide details of how these kernels are constructed in Appendix A.1. Then, the product kernel for $X_3$ is $K(t_{3C}; x_{3C}, h)L(t_{3D}; x_{3D}, k)$. It follows that a nonparametric estimator of $g_2(\cdot)$ can be defined as

$$\hat{g}_2(x_3) := \frac{\sum_{j=1}^m X_{2j}\, K(X_{3Cj}; x_{3C}, h)\, L(X_{3Dj}; x_{3D}, k)}{\sum_{j=1}^m K(X_{3Cj}; x_{3C}, h)\, L(X_{3Dj}; x_{3D}, k)}.$$

While there are many conventional options for the specific univariate kernel applied to the continuous variables, the requirement of compactness of $\mathcal{X}_{3C}$ (see Assumption 2 below) narrows the choice. Assuming without loss of generality that the compact set is the $d_{3C}$-dimensional unit hypercube $[0,1]^{d_{3C}}$, we propose to employ Chen's (1999) univariate beta kernel with support on $[0,1]$ as an attractive alternative to standard symmetric kernels. Our application of this kernel is motivated by the uniform convergence result on a compact support which we prove in a companion paper (Hirukawa et al., 2022) and which is, as far as we know, new. This asymmetric kernel is free of boundary bias by construction, and its shape varies across design points even under a fixed value of the smoothing parameter $b$. The latter property implies that the amount of smoothing by this kernel changes in an adaptive manner.
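For concreteness, the two kernel building blocks can be implemented in a few lines. The sketch below follows Chen's (1999) beta-kernel formula and uses the Aitchison-Aitken kernel as one common choice for unordered discrete components; the function names and the use of scalar smoothing parameters are our own simplifications, not the article's notation.

```python
import numpy as np
from math import lgamma, exp

def beta_kernel(t, x, b):
    """Chen's (1999) beta kernel: the Beta(x/b + 1, (1 - x)/b + 1) density
    evaluated at data point(s) t, for a design point x in [0, 1]."""
    p, q = x / b + 1.0, (1.0 - x) / b + 1.0
    log_beta = lgamma(p) + lgamma(q) - lgamma(p + q)   # log B(p, q)
    return t ** (p - 1.0) * (1.0 - t) ** (q - 1.0) / exp(log_beta)

def aitchison_aitken(t, x, lam, c):
    """Discrete kernel for an unordered variable with c categories:
    weight 1 - lam on a match, lam/(c - 1) on a mismatch."""
    return np.where(np.asarray(t) == x, 1.0 - lam, lam / (c - 1.0))

def product_kernel(t3c, x3c, b, t3d, x3d, lam, c):
    """Product kernel for mixed X3: beta kernels over the continuous
    components times discrete kernels over the categorical ones."""
    cont = np.prod([beta_kernel(t, x, b) for t, x in zip(t3c, x3c)])
    disc = np.prod([aitchison_aitken(t, x, lam, c) for t, x in zip(t3d, x3d)])
    return cont * disc

# The beta kernel integrates to one over [0, 1], and its shape adapts to
# the design point x even for a fixed smoothing parameter b.
grid = np.linspace(0.0, 1.0, 20001)
vals = beta_kernel(grid, 0.3, 0.1)
dt = grid[1] - grid[0]
integral = ((vals[1:] + vals[:-1]) / 2.0).sum() * dt
print(round(float(integral), 3))   # close to 1.0
```

Because the beta kernel is a proper density on $[0,1]$ for every design point, no boundary correction of the usual symmetric-kernel type is needed.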

Regularity conditions
Now we explore asymptotic properties of the PILS estimator as both n and m diverge. For this purpose, the following regularity conditions are imposed.
Assumption 2. $X_{3C}$ is continuously distributed with a convex and compact support $\mathcal{X}_{3C}$, and its density is bounded and bounded away from zero on $\mathcal{X}_{3C}$.

Assumption 3.
Assumption 6. Sequences of smoothing parameters $h_p (= h_p(m) > 0)$, $b_p (= b_p(m) > 0)$ and $k_q (= k_q(m) \in (0,1))$ and boundary parameters $h_p (= h_p(m) > 0)$, for $p = 1, \dots, d_{3C}$ and $q = 1, \dots, d_{3D}$, satisfy one of the following conditions as $m \to \infty$: (a) for a symmetric $\nu$th-order kernel, $h_p, k_q \to 0$ and $\log m/(m\prod_{p=1}^{d_{3C}} h_p) \to 0$; and (b) for the beta kernel, $b_p, k_q, h_p \to 0$, $b_p/h_p \to 0$ and $\log m/(m\prod_{p=1}^{d_{3C}} b_p^{1/2}) \to 0$.

The first three assumptions are often encountered in matching-based estimation (see, e.g., Hirukawa and Prokhorov, 2018) and are natural in our setting. In particular, it follows from Assumption 3(i) that even when some part of $X_3$ does not enter regression (1) due to an exclusion restriction, $X_1$ and $X_3$ are exogenous in the full model. Nonlinearity of $g_2(\cdot)$ in Assumption 3(iii) is required only when all continuous common variables are included as regressors in (1). Otherwise, excluded continuous common variables introduce additional randomness in $g_2(\cdot)$, which helps identify $\beta$.
Assumption 3(ii) is also standard for the proxy variable literature. It implies that $E(\eta_2 \mid X_1, X_3) = E(\eta_2 \mid X_3) = 0$. It is sometimes possible to use a weaker assumption, e.g., conditional mean independence $E(\eta_2 \mid X_1, X_3) = E(\eta_2 \mid X_3)$ (see, e.g., Wooldridge, 2010, Section 4.3.2). However, we focus on conditional independence for two reasons. First, it implies orthogonality of $\eta_2$ and $X_1$, which is a key requirement for consistency of PILS in Theorem 1. Second, it simplifies the covariance estimation in Section 2.4.3. A similar assumption can be found in Inoue and Solon (2010, Assumption c) and Hirukawa and Prokhorov (2018, Assumption 4). Moreover, this assumption involves quantities that are either unobserved ($\eta_2$) or not observed in the same sample ($\eta_2$ and $X_1$). Shah and Peters (2020) have recently demonstrated that (even if they were observed) a uniformly valid conditional independence test does not exist. This may look like a disadvantage of our method. However, as discussed by Wooldridge (2010, p. 68), a conditional independence assumption is routinely made or implied in all proxy-based estimators, and our approach is no different. As we argue in Section 4.3, the assumption is easier to justify given enough observables in $X_3$.
Along with compactness of $\mathcal{X}_{3C}$ in Assumption 2, Assumptions 4-6 are required to establish weak uniform consistency of the nonparametric regression estimator $\hat{g}_2(\cdot)$ on $\mathcal{X}_3$. Similar conditions can be found, for instance, in Li and Ouyang (2005), Hansen (2008) and Su et al. (2013) for a symmetric kernel, and Hirukawa et al. (2022) for the beta kernel. Notice that the beta kernel is nonnegative, so that the order $\nu = 2$ is the case in Assumption 4(i).
Strictly speaking, uniform consistency holds on $\mathcal{X}_3$ only for a symmetric kernel. In contrast, for the beta kernel, the boundary parameters $h_1, \dots, h_{d_{3C}}$ play a key role in the uniform consistency of $\hat{g}_2(\cdot)$: they shrink to zero more slowly than the smoothing parameters $b_1, \dots, b_{d_{3C}}$, so that a trimmed set $\mathcal{S} \times \mathcal{X}_{3D}$ is expanding to $\mathcal{X}_3$ as $m \to \infty$. On the other hand, because the regression estimator $\hat{g}_2(\cdot)$ is a Nadaraya-Watson-type estimator, it suffers from the so-called boundary bias when a symmetric kernel is employed. When the beta kernel is used, $\hat{g}_2(\cdot)$ is free of this issue by construction. We note that Assumption 5(a) allows for higher-order kernels (i.e., $\nu > 2$). In theory, such kernels could be a remedy for the curse of dimensionality implied by Corollary 1 below. In practice, however, it is well known that higher-order kernels are unlikely to dominate nonnegative ones for moderate sample sizes, and that the advantage, if any, is typically marginal (see, e.g., Marron, 1994; Marron and Wand, 1992). In fact, our simulation results indicate poor finite-sample performance of PILS using higher-order kernels.

Consistency and asymptotic normality of PILS
The next two theorems establish consistency and asymptotic normality of $\hat\beta_{PILS}$. In particular, asymptotic normality of $\hat\beta_{PILS}$ is obtained after correcting for $B_{g_2}$, the bias term due to kernel smoothing. Each theorem holds regardless of the number of common variables and regardless of the divergence patterns in $(n, m)$.
Theorem 1. If Assumptions 1-6 hold, then $\hat\beta_{PILS} \to_p \beta$ as $n, m \to \infty$.

Theorem 2. If Assumptions 1-6 hold, then, as $n, m \to \infty$, $\sqrt{n}(\hat\beta_{PILS} - \beta - B_{g_2}) \to_d N(0_{(d+1)\times 1}, V)$, where $V := \Phi_{X,X}^{-1}(\Omega_1 + \Omega_2)\Phi_{X,X}^{-1}$.

As derived in the proof of Theorem 1, there is the asymptotically negligible bias term $B_{g_2}$ due to kernel smoothing. There are also two $\sqrt{n}$-asymptotically normal terms. One comes from the sampling error in regression (1), and it has the asymptotic variance $\Phi_{X,X}^{-1}\Omega_1\Phi_{X,X}^{-1}$. The other is due to the approximation error of $g_2$, and its asymptotic variance is $\Phi_{X,X}^{-1}\Omega_2\Phi_{X,X}^{-1}$. As a consequence, the asymptotic variance $V$ in Theorem 2 has a structure similar to that of the double kernel estimator of Stengos and Yan (2001, Theorem 1) and the two-step series estimator of Hsu et al. (2022, Theorem 2.1). A difference from Stengos and Yan (2001) and Hsu et al. (2022) is that the two terms are uncorrelated: each of them can be obtained within one of the two independent samples $S_1$ and $S_2$, and thus $V$ is free of covariance terms.

Covariance estimation
Covariance estimation is essential for inference. The problem of estimating $V$ consistently boils down to proposing consistent estimators of $\Omega_1$ and $\Omega_2$. For the PILS residuals $\hat\varepsilon_i := Y_i - \hat{X}_i'\hat\beta_{PILS}$, a natural candidate is $\hat\Omega_1 := (1/n)\sum_{i=1}^n \hat\varepsilon_i^2 \hat{X}_i\hat{X}_i'$. Consistency of $\hat\Omega_1$ can be established in the same manner as in the proofs of Theorems 1 and 2.
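To illustrate the sandwich structure numerically, the sketch below computes the sampling-error component $\hat\Omega_1$ and the resulting standard errors for a plain OLS fit on simulated data. The DGP and variable names are ours; the $\hat\Omega_2$ component, which requires the auxiliary sample, is omitted from this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy heteroskedastic regression (ours, not the article's design) to show
# the sandwich form S^{-1} Omega1 S^{-1} used in the PILS covariance.
n = 5000
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 0.5, -0.8])
y = X @ beta + rng.normal(size=n) * (1 + 0.5 * np.abs(x[:, 0]))

b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b_hat

S = (X.T @ X) / n                              # sample analog of Phi_{X,X}
Omega1 = (X * resid[:, None] ** 2).T @ X / n   # robust middle matrix
V_hat = np.linalg.inv(S) @ Omega1 @ np.linalg.inv(S)
se = np.sqrt(np.diag(V_hat) / n)               # standard errors
print(np.round(se, 3))
```

In the PILS case, the same outer matrices $S_{n,\hat X,\hat X}^{-1}$ apply, but the middle matrix also needs the second component estimated from the $S_2$ residuals.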
On the other hand, a sample analog built on $\hat\beta_{2,PILS}$ and the residuals $\{\hat\eta_{2i}\}_{i=1}^n$ might work as a consistent estimator for $\Omega_2$, where $\hat\beta_{2,PILS}$ is the PILS estimator of $\beta_2$ and $\hat\eta_2 := X_2 - \hat{g}_2(X_3)$. In reality, it turns out to be difficult to obtain $\{\hat\eta_{2i}\}_{i=1}^n$ because of the absence of $X_2$ in $S_1$. Instead, we can consider an alternative estimator $\hat\Omega_2$ that replaces the unavailable residuals with a nonparametric estimate $\hat{R}_2(\cdot)$ of the conditional second moment of $\eta_2$; the nonparametric regression residuals $\{\hat\eta_{2j}\}_{j=1}^m = \{X_{2j} - \hat{g}_2(X_{3j})\}_{j=1}^m$ needed for this purpose can be obtained within $S_2$. The estimation procedure for $\hat{R}_2(\cdot)$ is inspired by the algorithm in Section 2.4 of Fan and Yao (1998). To show consistency of $\hat\Omega_2$, a similar argument to the proof of Lemma A1 in Appendix A.2 establishes uniform consistency of $\hat\eta_{2j}$ for $\eta_{2j}$, which in turn leads to uniform consistency of $\hat{R}_2(\cdot)$. In conclusion, $V$ can be consistently estimated by the sample analog $\hat{V} := S_{n,\hat X,\hat X}^{-1}(\hat\Omega_1 + \hat\Omega_2)S_{n,\hat X,\hat X}^{-1}$.

$\sqrt{n}$-consistency of PILS

Theorems 1 and 2 jointly imply that although PILS is consistent, its convergence is affected by the bias term $B_{g_2}$ generated by kernel smoothing. For a large $d_{3C}$, the order of magnitude of the bias term dominates and the convergence rate of $\hat\beta_{PILS}$ becomes inferior. It appears that this problem is unavoidable in a regression that uses a nonparametric component. A similar phenomenon arises in the context of imputation bias correction (see, e.g., Hirukawa and Prokhorov, 2018). However, $\sqrt{n}$-consistency of $\hat\beta_{PILS}$ automatically holds for small values of $d_{3C}$. To illustrate such cases, we modify Assumption 6 in order to control the bias and variance convergence of $\hat{g}_2(\cdot)$ more easily.
Assumption 6'. Sequences of the smoothing and boundary parameters satisfy one of the following conditions as $m \to \infty$: (a) for a $\nu$th-order symmetric kernel, there is a bandwidth $h (= h(m) > 0)$ so that $h_1, \dots, h_{d_{3C}} \propto h$, $k_1, \dots, k_{d_{3D}} \propto h^{\nu}$, $h \to 0$, and $\log m/(mh^{d_{3C}}) \to 0$; and (b) for the beta kernel, there are a smoothing parameter $b (= b(m) > 0)$ and a boundary parameter $h (= h(m) > 0)$ so that $b_1, \dots, b_{d_{3C}} \propto b$, $h_1, \dots, h_{d_{3C}} \propto h$, $b, h \to 0$, $b/h \to 0$, and $\log m/(mb^{d_{3C}/2}) \to 0$.

The corollary below describes the cases in which $\hat\beta_{PILS}$ becomes $\sqrt{n}$-asymptotically normal.
Corollary 1. Let $h \propto (\log m/m)^a$ and $b \propto (\log m/m)^{2a}$ for some constant $a > 0$. Also suppose that one of the following divergence patterns in $(n, m)$ is true: (i) $n/m \to \kappa \in (0, \infty)$; (ii) $n/m \to 0$; or (iii) $n/m \to \infty$ and $n/m^{4a} \to 0$. Then, under Assumptions 1-5 and 6', as $n, m \to \infty$, $\sqrt{n}(\hat\beta_{PILS} - \beta) \to_d N(0_{(d+1)\times 1}, V)$ holds either (a) if a $\nu$th-order symmetric kernel is employed and $d_{3C} \le 2\nu - 1$, or (b) if the beta kernel is employed and $d_{3C} \le 3$.

Corollary 1 presents an upper bound on the number of continuous common variables for PILS to be $\sqrt{n}$-consistent. This is recognized as the curse of dimensionality in continuous common variables, and it can be relaxed, in theory, by using higher-order kernels. In particular, when employing a nonnegative kernel (which is either a symmetric pdf or the beta kernel), we must limit the number of continuous common variables to 3 or less to make PILS $\sqrt{n}$-consistent. To see this in more detail, note that for the most realistic case $n/m \to \kappa$, the shrinkage rates of the smoothing parameters are $h \propto (\log m/m)^a$ and $b \propto (\log m/m)^{2a}$ for some $a \in (1/4, 1/d_{3C})$ (see Appendix A.4). These rates are faster than what yields Stone's (1982) optimal global rate of convergence, i.e., $h^* \propto (\log m/m)^{1/(4+d_{3C})}$ and $b^* \propto (\log m/m)^{2/(4+d_{3C})}$, which can be obtained by balancing the squared bias and variance of $\hat{g}_2(\cdot)$. This is because, to attain $\sqrt{n}$-consistency, we should keep the convergence rate of the dominant bias term in $\hat{g}_2(\cdot)$ sufficiently fast by undersmoothing, or more specifically, by setting the exponent $a$ within the aforementioned range. We give some examples of $h$ and $b$ that fulfill the rate requirement in Sections 3 and 4.
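As a quick numerical illustration of the undersmoothing requirement, the snippet below compares the shrinkage of a bandwidth with exponent $a$ chosen inside $(1/4, 1/d_{3C})$ against Stone's optimal rate; the specific values $d_{3C} = 2$ and $a = 0.3$ are our own assumptions for illustration.

```python
import numpy as np

# Undersmoothing for root-n consistency versus Stone's optimal rate,
# for an assumed d_3C = 2 continuous common variables.
d3c = 2
a = 0.3                              # any a in (1/4, 1/d3c) = (0.25, 0.5)
for m in (500, 5000, 50000):
    r = np.log(m) / m
    h_pils = r ** a                  # undersmoothed: h proportional to (log m / m)^a
    h_star = r ** (1 / (4 + d3c))    # Stone-optimal exponent 1/(4 + d3c)
    print(f"m={m:6d}  h_pils={h_pils:.4f}  h_star={h_star:.4f}")
```

Since $a > 1/(4 + d_{3C})$ here, the undersmoothed bandwidth is always the smaller of the two, which keeps the bias of $\hat{g}_2(\cdot)$ asymptotically negligible at the $\sqrt{n}$ rate.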
Finally, it may be the case that the missing regressor $X_2$ can only be observed with measurement error. This issue occurs, for example, when an ability measure is unreliable in the wage regression. Suppose that we can observe $X_2^* = X_2 + t$ at best, where $t$ is the measurement error. As long as $E(t \mid X_3) = 0_{d_2 \times 1}$, the effect of $t$ is filtered out via kernel smoothing and $E(X_2^* \mid X_3) = g_2(X_3)$ still holds. In the end, the consistency of PILS is maintained.

Comparison of PILS with competing estimators
We conclude this section by comparing PILS with competing estimators. The competing estimators we consider are IV estimators using linear and nonlinear instruments, a fully parametric alternative to PILS (called PARA hereinafter), and the matched-sample indirect inference (MSII) and fully-modified MSII (MSII-FM) estimators. PARA is often a method of choice (see, e.g., Fang et al. 2008). MSII(-FM) was proposed by Hirukawa and Prokhorov (2018) as an alternative to the inconsistent OLS estimator using the matched sample (called MSOLS hereinafter). Definitions and statistical properties of these estimators are reviewed in the online supplement. Finite-sample properties of all the estimators will be assessed in comparison with PILS in the next section.

Comparison with IV
The use of IVs has been associated with at least two problems. First, instruments that are not strongly correlated with the endogenous variables-weak instruments-lead to large inconsistencies in the IV estimators even if a weak correlation exists between the instruments and the structural equation error, i.e., even if the instruments only slightly violate the validity assumption (see, e.g., Bound et al., 1995). When we are ready to assume that the IVs are valid, the weak instrument problem results in large asymptotic and finite-sample biases and, when many such instruments are available, the distribution of IV estimators is known to deviate substantially from the large-sample approximation (see, e.g., Bekker, 1994;Chao and Swanson, 2005). Second, finite-sample properties of IV estimators are known to be generally poor, especially when the instruments are weak (see, e.g., Flores-Lagunes, 2007). Various bias reduction techniques have had limited success and even very inventive uses of instruments have, for these reasons, been criticized in terms of their validity, strength and finite-sample performance.
Furthermore, valid and strong instruments may be unavailable in the same sample. A number of ingenious methods have been proposed to combine data from more than one source in such cases (see, e.g., Angrist and Krueger, 1992, 1995; Arellano and Meghir, 1992; Inoue and Solon, 2010; Klevmarken, 1982; Murtazashvili et al., 2015). However, the two-sample IV estimators inherit the same problems as their single-sample counterparts (see, e.g., Choi et al., 2018).
The identification strategy for our estimator is different. The central identification conditions are that $g_2(\cdot)$ either depends on additional exogenous variables $X_{3E}$ or is nonlinear. If there are no additional exogenous variables, then identification relies on nonlinearity in the first step. A weak-identification issue, similar to weak IV, can appear in our setting if $g_2(\cdot)$ is linear and depends only weakly on $X_{3E}$, so that $g_2(\cdot)$ is close to being a linear function of $X_{3I}$ alone. We consider these cases in the extensive simulation comparisons.
There is a rich literature proposing ingenious methods such as integrated regression functions, basis functions, sieves and local smoothing, to incorporate nonlinearities in an IV estimation based on conditional moment restrictions (see, e.g., Dominguez and Lobato, 2004; Kitamura et al., 2004; Lavergne and Patilea, 2013; Mammen et al., 2016). For example, Mammen et al. (2016) provide a general "three-step" estimation framework that uses a nonparametric estimator in each step. A full account of nonlinearity would involve one of these methods. However, our use of several functions of instruments is quite close to how these nonlinearities are commonly handled in practice.
Let $Z$ denote the instruments. The validity of $Z$ as instruments can be checked. Note that $X_2$ admits the reduced form $X_2 = g_2(X_3) + \eta_2 = g_2(X_{3I}, X_{3E}) + \eta_2$. If $X_{3I}$ is non-empty (i.e., some common variables are included as regressors) and some elements of $Z$ are relevant, then those elements are either a part of $X_{3I}$ or correlated with $X_{3I}$. As a result, $Z$ and $X_2$ are also correlated, and the IV estimator of the omitted variable regression becomes inconsistent. We may therefore test the null of consistency of the IV estimation indirectly by testing the null hypothesis $H_0: g_2(X_{3I}, X_{3E}) = g_2(X_{3E})$ a.e. against the alternative $H_1: g_2(X_{3I}, X_{3E}) \neq g_2(X_{3E})$ on a non-empty subset of $\mathrm{supp}(X_{3I})$. Several versions of the test of significance in nonparametric regressions have been proposed in the literature. Examples include Fan and Li (1996) and Racine (1997) for continuous regressors and Lavergne (2001) and Racine et al. (2006) for discrete or categorical regressors, to name a few.

Comparison with PARA
PARA imputes a parametric, linear predictor of $X_2$. It seems attractive, as it is not subject to the curse of dimensionality in continuous common variables, unlike PILS. However, it requires $\dim(X_{3E}) > 0$ for identification because otherwise the imputed predictor of $X_2$ is a linear function of $X_{3I}$. In contrast, as in Assumption 3(iii), PILS achieves identification by nonlinearity even if $X_{3E}$ is empty.
PARA shares similar properties with the two-sample (TS) two-stage least squares (2SLS) estimator of Inoue and Solon (2010). An interpretation of PARA in the context of TS2SLS is that the missing regressor $X_2$ in the former plays the role of an endogenous variable in the latter. In particular, in the special case when $X_1$ is empty, PARA is numerically equivalent to TS2SLS.¹ As demonstrated in Proposition S1 of the online supplement, consistency of PARA depends on whether $X_1$ is present in (1). It is necessarily consistent when $X_1$ is empty. When $X_1$ is nonempty, which is the most common setting in practice, consistency of PARA fails if $E(X_1X_2') \neq E(X_1X_3')A'$, where $A$ is the coefficient matrix in the linear projection of $X_2$ on $X_3$.² On the other hand, consistency of PILS fails if $E(X_1X_2' \mid X_3) \neq E(X_1 \mid X_3)E(X_2' \mid X_3)$, which is a violation of an implication of Assumption 3(ii). Since $E(X_1X_2') \neq E(X_1X_3')A'$ does not imply $E(X_1X_2' \mid X_3) \neq E(X_1 \mid X_3)E(X_2' \mid X_3)$, PILS can be consistent when $E(X_1X_2') \neq E(X_1X_3')A'$ (i.e., PARA is inconsistent) but $E(X_1X_2' \mid X_3) = E(X_1 \mid X_3)E(X_2' \mid X_3)$ still holds (i.e., Assumption 3(ii) is valid). This could explain the significant differences observed between the PILS and PARA estimates of the return to schooling in Section 4.

Comparison with MSII(-FM)
While PILS and MSII(-FM) are both two-sample semiparametric estimators, the former has three novel features relative to the latter. First, while both MSII(-FM) and PILS are √n-consistent and asymptotically normal two-sample estimators, their approaches to restoring consistency differ. Consistency of MSII(-FM) is established by imputing X_2 from S_2 and then eliminating the nonvanishing bias caused by the imputation. The bias correction requires a consistent, nonparametric estimate of R_2, and that the estimation error in R̂_2 is O_p(n^{−1/2}). As a consequence, the asymptotic variance of MSII(-FM) tends to be large and highly complicated because of multiple asymptotically normal terms with the same O_p(n^{−1/2}) rate.
In contrast, consistency of PILS comes from directly imputing a consistent estimate of g_2(X_3) = E(X_2 | X_3) in place of the missing X_2. By construction, it does not require the bias correction that is a key ingredient in MSII(-FM).[3] The asymptotic variance of PILS also comes from two O_p(n^{−1/2}) terms, namely, the sampling error as in OLS and the approximation error from replacing the unobservable g_2 with its kernel estimate ĝ_2. Although it is hard to compare the asymptotic variances of PILS and MSII(-FM) analytically, Monte Carlo simulations below indicate that the former tends to be smaller than the latter.
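A minimal sketch of the PILS idea, imputing a Nadaraya-Watson estimate of g_2(X_3) from the auxiliary sample and then running OLS, is given below. The nonlinear design is our own illustrative assumption (a linear projection would be misspecified here, while the kernel first step is not); the bandwidth follows the shrinkage rate used later in the paper.

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * (1.0 - t ** 2) * (np.abs(t) <= 1.0)

def g2_hat(x_eval, x_aux, x2_aux, h):
    """First step: kernel estimate of g2(x) = E(X2 | X3 = x) from the auxiliary sample."""
    w = epanechnikov((x_eval[:, None] - x_aux[None, :]) / h)
    return (w @ x2_aux) / np.maximum(w.sum(axis=1), 1e-12)

rng = np.random.default_rng(2)
n, m = 2000, 1000
# auxiliary sample S2: (X2, X3) observed; g2 is nonlinear in X3
x3_aux = rng.uniform(-2, 2, size=m)
x2_aux = np.sin(x3_aux) + 0.3 * rng.normal(size=m)
# main sample S1: (Y, X1, X3) observed, X2 missing
x3 = rng.uniform(-2, 2, size=n)
x2 = np.sin(x3) + 0.3 * rng.normal(size=n)
x1 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.6 * x2 + rng.normal(size=n)

# second step: OLS with the plug-in estimate of g2(X3) in place of X2
h = np.std(x3_aux) * (np.log(m) / m) ** 0.3   # shrinkage rate of the form used in the paper
X = np.column_stack([np.ones(n), x1, g2_hat(x3, x3_aux, x2_aux, h)])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta.round(2))
```

Because the imputation error ε_2 = X_2 − g_2(X_3) is mean-independent of the regressors here, no bias correction is needed, which is the point of the comparison above.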
Second, the asymptotic analysis we develop for PILS explicitly incorporates discrete matching variables. This is in contrast to the asymptotic analysis of Hirukawa and Prokhorov (2018), who do not accommodate discrete variables explicitly but argue that, similarly to the treatment effect literature (see, e.g., Abadie and Imbens, 2006), the inclusion of discrete matching variables with a finite number of support points does not affect convergence rates of MSII(-FM).
Third, we clarify the role of excluded continuous matching variables for identification. Hirukawa and Prokhorov (2018) maintain the assumption that all common variables enter the regression and are used for both estimation and matching. The price to pay for this assumption is that we need to impose nonlinearity of the conditional mean of the missing regressor given the matching variables in order to achieve identification. This article relaxes the assumption so that some common variables may be employed only for imputation. The existence of such common variables allows for linearity in the conditional mean.

[1] We thank an anonymous referee for pointing this out to us.
[2] The coefficient matrix A, as opposed to the one given in Proposition S1 of the online supplement, is based on the assumption that the linear projection has no intercept. Our argument below is not affected by whether the projection includes an intercept.
[3] From the viewpoint of imputation, the relationship between MSII(-FM) and PILS may be compared to the one between the NNM-based estimators studied by Abadie and Imbens (2006, 2011) and the kernel-based matching estimators studied by Heckman et al. (1998).

Monte Carlo setup
In this section we present selected results of an extensive Monte Carlo study comparing finite-sample properties of PILS with those of the alternative estimators. The design of the Monte Carlo study reflects two scenarios, depending on whether or not X_1 is empty. Due to space limitations, we focus on the more realistic case in which X_1 is non-empty. Monte Carlo designs and results when X_1 is empty are available in the online supplement.
The Monte Carlo study is designed to mimic important aspects of the return-to-schooling application in Section 4. These are: (a) allowing for both included and excluded continuous matching variables (educ and age in the application); (b) maintaining n/m = 2 and considering the sample sizes (n, m) = (2000, 1000); (c) permitting the included continuous matching variable X_{3IC} (educ) to become endogenous when X_2 (abil) is omitted; and (d) choosing as an instrument Z a variable that differs from the matching variable (fatheduc) and setting corr(X_{3IC}, Z) and corr(X_{3IC}, Z²) roughly equal to 0.4.[4] Of primary interest is estimation of β_3, the coefficient on X_{3IC} (educ).
Our design is largely inspired by the study on nonlinear instruments of Dieterle and Snell (2016). Consider the two forms of our regression, where all random variables entering the design are mutually independent. Next, X_1 and X_{3IC} are generated with ρ_1 = corr(X_{3IC}, Z) and ρ_1 ∈ {0.1, 0.4}. The cases with ρ_1 = 0.1 and 0.4 correspond to weak and strong instruments, respectively. The specification of X_{3IC} ensures compactness of its support. The missing regressor X_2 is generated according to one of two specifications, Model A or Model B. The constant C for Model B is the variance of E{(X_{3IC}/(2ρ_1))³ | Z}; hence, the term 12√C in the denominator of g_2 is intended to make Var(g_2) invariant to ρ_1. Also observe that X_2 is linear in X_3 = (X_{3IC}, X_{3EC}, X_{3ED}) for Model A. This guarantees consistency of PARA. On the other hand, X_2 is nonlinear and even additively non-separable in X_3 for Model B. As a result, PARA becomes inconsistent in this case.

[4] Sample correlations between educ and fatheduc and between educ and fatheduc² are 0.434 and 0.420, respectively.
The above procedure provides us with two observable samples, S_1 and S_2. The complete sample S* is the sample that would not be observed in practice. Finally, the matched sample S = {Y_i, X_{1i}, X_{2j_1(i)}, ...} is constructed via the NNM with respect to X_3 = (X_{3IC}, X_{3EC}, X_{3ED}) as in Hirukawa and Prokhorov (2018). The NNM is based on the Mahalanobis distance, and the number of matches is set to K = 1 (single match), the most common choice. Sample sizes of S_1 and S_2 are n ∈ {1000, 2000} and m = n/2. The number of replications is 1000. There are three options to estimate β_3 consistently. The first option is to estimate regression (S) using S_1 only. Although E(v) = E(X_1 v) = 0 holds, E(X_{3IC} v) ≠ 0 (i.e., X_{3IC} is endogenous in (S)), so OLS for (S) is inconsistent. However, E(v | Z) = 0 and ρ ≠ 0. Hence, we can estimate β_3 consistently by using Z and its functions as instruments for X_{3IC}. The remaining two options rely on both S_1 and S_2 to estimate regression (L).
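The Mahalanobis-distance NNM step used to build the matched sample S can be sketched as follows; the data here are illustrative stand-ins, and a brute-force distance computation replaces the tree-based search typically used in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 100
x3_main = rng.normal(size=(n, 3))       # matching variables X3 in S1
x3_aux = rng.normal(size=(m, 3))        # matching variables X3 in S2
x2_aux = x3_aux.sum(axis=1) + 0.1 * rng.normal(size=m)   # donor values of X2

# Mahalanobis nearest-neighbour matching with a single match (K = 1)
cov_inv = np.linalg.inv(np.cov(np.vstack([x3_main, x3_aux]).T))
diff = x3_main[:, None, :] - x3_aux[None, :, :]            # (n, m, 3) pairwise differences
dist2 = np.einsum('nmi,ij,nmj->nm', diff, cov_inv, diff)   # squared Mahalanobis distances
j = dist2.argmin(axis=1)                                   # nearest donor in S2 for each unit in S1
x2_matched = x2_aux[j]                                     # imputed X2 values for S1
print(x2_matched.shape)
```

Averaging over the K nearest donors instead of taking a single one would correspond to K > 1 matches.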
Option two is to construct the matched sample S from S_1 and S_2 and run MSII or MSII-FM using S. Option three is to employ PILS using both S_1 and S_2. Based on the above estimation strategies, we compare the following estimators of β_3: (i) the infeasible OLS estimator for regression (L) using S* [OLS*];[5] (ii) the inconsistent OLS estimator for regression (S) using S_1 only [OLS-S]; (iii) the IV estimator for (S) using Z as an instrument for X_{3IC} [IV1-S]; (iv) the IV estimator for (S) using Z² as an instrument for X_{3IC} [IV2-S]; (v) the two-step GMM estimator for (S) using (Z, Z²) as instruments for X_{3IC} [GMM-S], along with the initial 2SLS estimator [2SLS-S]; (vi) the inconsistent MSOLS estimator for (L) using S with K = 1 [MSOLS]; (vii) the MSII-FM estimator for (L) using S with K = 1 and a second-order polynomial [MSII-FM], along with the first-step MSII estimator [MSII]; (viii) the PARA estimator for (L) using S_1 and S_2 [PARA]; (ix) the PILS estimator for (L) using S_1, S_2 and the Epanechnikov kernel K_E(t) = (3/4)(1 − t²)1{|t| ≤ 1} [PILS-E];[6] and (x) the PILS estimator for (L) using S_1, S_2 and the beta kernel [PILS-B].

[5] OLS* is consistent because E(u) = E(X_1 u) = E(X_2 u) = E(X_{3IC} u) = 0.
Implementing the two PILS estimators requires a choice of the smoothing parameters. We use ĥ = σ̂_X (log m/m)^{0.3} for the Epanechnikov kernel, b̂ = σ̂_U (log m/m)^{0.6} for the beta kernel, and λ̂ = (log m/m)^{0.6} for the discrete kernel, where σ̂_X and σ̂_U are sample standard deviations of the continuous common variable in the original scale X (= X_{3IC}, X_{3EC}) and in the transformed scale U := (X + 2)/4 ∈ [0, 1], respectively. These shrinkage rates fulfill the requirements in Section 2.4.4.
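The smoothing parameter choices above amount to a few lines of arithmetic; the uniform sample below is an illustrative stand-in for a continuous common variable.

```python
import numpy as np

m = 1000
x = np.random.default_rng(4).uniform(-2, 2, size=m)   # a continuous common variable
u = (x + 2) / 4                                       # rescaled to [0, 1] for the beta kernel

rate = np.log(m) / m
h_epanechnikov = x.std(ddof=1) * rate ** 0.3   # bandwidth for the Epanechnikov kernel
b_beta = u.std(ddof=1) * rate ** 0.6           # smoothing parameter for the beta kernel
lam_discrete = rate ** 0.6                     # bandwidth for the discrete kernel
print(round(h_epanechnikov, 3), round(b_beta, 4), round(lam_discrete, 4))
```

Note that the faster rate (log m/m)^{0.6} shrinks the beta and discrete bandwidths more aggressively than the Epanechnikov one.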
For each estimator of β_3, the following performance measures are computed: (i) Mean (simulation average of the parameter estimate); (ii) SD (simulation standard deviation of the parameter estimate); (iii) RMSE (root mean-squared error of the parameter estimate); (iv) SE (simulation average of the standard error); and (v) CR (coverage rate of the nominal 95% confidence interval). Formulae for standard errors are as follows: (a) heteroskedasticity-robust standard errors of Eicker (1963) and White (1980) are calculated for OLS* and IV-S; (b) the formula for MSII-FM follows Theorem 4 of Hirukawa and Prokhorov (2018); (c) the formula for PILS appears in Theorem 2; and (d) the formula for PARA is given in Proposition S3 of the online supplement. Because of their inconsistency, we do not compute SE or CR for OLS-S and MSOLS. Those of the first-step MSII are also omitted due to its nonparametric convergence rate. Likewise, we present SE and CR of PARA only when it is consistent, i.e., for Model A.
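The five performance measures can be computed as follows; the helper name and the stand-in simulation draws are our own illustrative assumptions, taking a true β_3 of 1 and normal-based 95% confidence intervals.

```python
import numpy as np

def performance(estimates, std_errors, true_value):
    """Simulation summaries for one estimator (hypothetical helper)."""
    estimates, std_errors = np.asarray(estimates), np.asarray(std_errors)
    lo = estimates - 1.96 * std_errors
    hi = estimates + 1.96 * std_errors
    return {
        'Mean': estimates.mean(),                                  # simulation average
        'SD': estimates.std(ddof=1),                               # simulation standard deviation
        'RMSE': np.sqrt(np.mean((estimates - true_value) ** 2)),   # root mean-squared error
        'SE': std_errors.mean(),                                   # average standard error
        'CR': np.mean((lo <= true_value) & (true_value <= hi)),    # 95% coverage rate
    }

rng = np.random.default_rng(5)
R, true_b3 = 1000, 1.0
est = true_b3 + 0.05 * rng.normal(size=R)   # stand-in for 1000 replication estimates
se = np.full(R, 0.05)                       # stand-in standard errors
stats = performance(est, se, true_b3)
print({k: round(v, 3) for k, v in stats.items()})
```

For an unbiased estimator RMSE coincides with SD, and CR should be close to the nominal 0.95 when SE tracks SD.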
We also consider a variation of the above design which examines how the estimators behave near identification failure. Identification of two-sample estimators relies heavily on the functional form of g_2(·) and the existence of the excluded matching variable X_{3E}. In particular, identification failure occurs if g_2(·) reduces to a linear function of the included matching variable X_{3I} only. Therefore, we allow for weaker identification by changing how we generate X_2, using modified specifications Model A′ and Model B′, where the weight w ∈ {1.0, 0.1} controls the strength of identification in two-sample estimators. We fix ρ_1 = 0.4 and (n, m) = (2000, 1000), and the number of replications is 1000. All other aspects of the simulation setup remain unchanged. Therefore, w = 1.0 corresponds to the results for ρ_1 = 0.4 in the original setup. Because the purpose is to check how the quality of two-sample estimates deteriorates as w shrinks, we only report Mean, SD and RMSE for each estimator.

[6] In addition to the second-order Epanechnikov kernel, we consider the fourth- and sixth-order Epanechnikov kernels by Hansen (2005).

Results
Table 1 presents the results from the alternative specifications of X_2. It is immediately clear that OLS* outperforms all other estimators, as would be expected. However, OLS* is infeasible, and its performance can serve only as a benchmark. Instead, realistic comparisons can be made among the other estimators. The performance of IV1-S and IV2-S is largely governed by the strength of the instruments, although the huge biases and variability due to weak instruments are ameliorated by a larger sample size. While the performance of 2SLS-S and GMM-S improves due to additional moment restrictions, their efficiency does not surpass that of the two PILS estimators.
The instrument strength also influences the performance of the two-sample estimators. MSII-FM is a prominent example: the estimator is severely biased and unstable in each of the two weak-instrument cases.
The performance of PARA depends heavily on specification of X 2 . PARA performs exceptionally well in the more favorable case (i.e., Model A), as expected. On the other hand, the estimator generates substantial bias and variability for Model B. The poor performance may be attributed to identification failure due to difficulty in the linear projection of the product of periodic and binary random variables.
The two PILS estimators appear to be more robust against changes in the strength of instruments and/or the specification of X_2. An inspection reveals that PILS-B outperforms PILS-E in terms of RMSE, except in the case of Model B with ρ_1 = 0.1.

Table 2 presents the results for the variation addressing near identification failure. The qualities of OLS*, OLS-S and all IV-based estimates remain unchanged throughout this table, as expected. In contrast, the performance of the two-sample estimators is influenced by both the functional form of g_2(·) and the value of w.
For Model A′, g_2(·) reduces to a linear function of the included matching variable X_{3IC} as w → 0, and identification failure is expected. Indeed, as w decreases, the finite-sample distribution of MSII-FM appears to be off-centered from the true value of 1 and highly dispersed. A potential explanation is that as w decreases, the excluded matching variables (X_{3EC}, X_{3ED}) lose their importance in estimating the correction terms for the non-negligible and matching-discrepancy biases. Nonetheless, using the full set of X_3 = (X_{3IC}, X_{3EC}, X_{3ED}) makes their estimates quite imprecise due to additional noise, whereas in reality it may suffice to estimate the terms using X_{3IC} only. Likewise, the finite-sample distribution of PARA for Model A′ also becomes more dispersed as w decreases. This may be attributed to near multicollinearity in the regression: the proxy of X_2 is almost a linear function of X_{3IC} for a very small w, and imputing it as an extra regressor makes PARA unstable. Large biases in PILS-E and PILS-B are thought to be due to the same identification failure. On the other hand, g_2(·) for Model B′ collapses to a cubic function of X_{3IC} as w → 0. In this case, the role of (X_{3EC}, X_{3ED}) is less important for identification, and the performance of all two-sample estimators is only marginally affected by w. In particular, PILS-B seems to be more robust against changes in w than any other two-sample estimator.
4. An application: Estimation of return to schooling

Earnings function
Estimation of the causal link between education and earnings has been a focus of labor economists over the last several decades. Card (2001) suggested that endogeneity of education in the earnings equation might at least in part be responsible for the continuing interest in uncovering the causal effect of education on labor market outcomes. As Griliches (1977) discusses in detail, empirical labor economists have long speculated that the primary reason why education is endogenous when estimating the returns to schooling is some omitted explanatory variable(s): unobservable factors that influence education, such as ability and motivation, which are likely to have a direct effect on individual earnings and wages. The aim of this section is to estimate the rate of return to additional schooling using the alternative estimators available to us. These include the traditional one-sample OLS and linear and nonlinear IV estimators as well as MSII, PARA, and PILS. A more comprehensive comparison can be found in the online supplement.
Following the classical framework of the human capital earnings/wage function of Mincer (1974), we assume additivity and use US data to estimate the causal effect of education on earnings with the following model:

log(earnings) = β_0 + β_1 education + β_2 ability + β_3 experience + β_4 experience² + β_5 married + β_6 black + β_7 south + β_8 urban + u,   (2)

where log(earnings) is the natural logarithm of the individual's total annual labor income, education is the individual's completed years of education, ability is the individual's ability/skills (with zero mean), experience is work experience of the individual, married is an indicator for whether the individual is married, black is an indicator for whether the individual is black, south is an indicator for whether the individual currently lives in the southern geographical region, urban is an indicator of the individual's urban residence while growing up,[7] and u is an idiosyncratic error. The main issue with estimating Eq.
(2) is that econometricians are typically unable to observe the individual's skills. Because of that, the (one-sample) OLS estimator of the return to schooling, β_1, from Eq. (2) with ability excluded is likely to suffer from an "ability bias." Two textbook solutions to this problem are (i) to find within the same data set a valid proxy for the unobservable skills and use OLS estimation, and (ii) to find within the same data set a valid instrumental variable for the individual's educational level and use the IV approach. Since, in addition to these two approaches, we advocate two-sample estimation, we would like to be able to apply the one- and two-sample approaches to the same applied task in practice.

[7] The publicly available part of the PSID survey that we use does not provide information on current urban residence.

Constructing the samples
To estimate earnings and wage equations for the US, labor economists frequently use such micro-data sets as CPS, PSID and the National Longitudinal Survey (NLS). While CPS has a larger sample and is more representative of the entire demographic composition of the US population than the other two surveys, it does not typically collect important wage determinants such as actual work experience and ability. The PSID data set does provide information on actual work experience but, similar to CPS, generally does not collect data on ability. NLS routinely reports ability measures and contains data on actual and potential work experience but it is less representative of the US labor force. Given the features of these surveys, PSID and NLS seem most suitable as our main and auxiliary data sets, respectively.
We employ the 1972 wave of PSID as our main data set. We choose this wave so that we can estimate Eq. (2) using not only IV and PILS but also using a proxy for unobserved skills and ability. While PSID generally does not contain any measures of unobserved ability, the 1972 wave is among the few PSID waves that do include an ability measure.
In the 1972 wave, the PSID respondents were administered a particular assessment of abstract thinking, the Lorge-Thorndike Intelligence Test (LTIT). In essence, this is a test of verbal skills, a sentence completion test, that Veroff et al. (1971, p. 26) describe as "a feasible, reasonably valid assessment of what psychologists have labeled intelligence." Furthermore, Veroff et al. (1971, p. 26) advise that this test "seems to correlate well with most different kinds of tests of intelligence, well enough to suggest using it singly without going to multiple measurement." The LTIT administered to the PSID respondents contained 13 questions, each of which received a point for the correct response. Thus, the score for the entire sentence completion test can range from zero (the worst outcome) to 13 (the best outcome). We call the variable containing the total test score IQ score, and we demean it.
In addition to the information on the individual's ability discussed above, we also gather information on the individual's actual work experience, completed years of education, age, whether the person is black, married, and lives in the southern geographical region. 8 Furthermore, we include a dummy variable for whether the person grew up in an urban area and for whether the person grew up in the southern region. Finally, in our PSID sample, we also get information on the completed years of education by the individual's father. The last variable-father's educational level-is used as an instrument for the individual's educational attainment when no measure of ability is available. Our main sample contains 2430 men who reported positive labor income in 1971. Panel A of Table 3 reports summary statistics for the variables in our PSID sample. The sample correlation between an individual's education and the father's education is 0.434.
To exploit the two-sample approaches, we need a second sample where the ability measure is available and/or potentially more reliable. We employ the CARD data set provided with Wooldridge (2013). This data set contains observations from the 1977 wave of the Young Men Cohort of NLS. We exploit observations with positive wages in 1976.
As an ability measure we employ the results of the "Knowledge of the World of Work" (KWW score) test that the NLS respondents were administered during their interviews in 1966. The KWW score from NLS is arguably a better measure of unobserved ability than the IQ score from the 1972 wave of PSID. A higher KWW score indicates higher intelligence. We demean this measure before using it to estimate Eq. (2).
Panel B of Table 3 provides summary statistics for the NLS sample of m = 1102 respondents when education, married, black, south, and urban are used as the included common variables, X_{3I}, and age (the individual's age, in years) and south while growing up (an indicator for whether the person resided in the southern region while growing up) are used as the excluded common variables, X_{3E}. Note that experience and experience², which are not observed in the NLS sample, represent X_1 in regression (2). The summary statistics for experience are provided in Panel A of Table 3.

Estimation approaches
We report estimation results in Table 4.[9] Columns (1)-(3) report the parametric estimation results based on just the PSID sample. Column (1) reports the OLS estimates using the ability proxy from the 1972 wave. Column (2) reports the OLS results that do not account for unobserved ability. In most real-life settings, the results in column (1) are infeasible and those in column (2) are subject to the "ability bias." Column (3) provides the IV estimates where the father's education is used as an IV for an individual's educational level.
Columns (4)-(7) report two-sample estimates based on both the PSID and NLS samples. Columns (4), (5), and (7) contain the results based on semiparametric two-sample approaches, while column (6) reports the only estimates based on a fully parametric two-sample procedure. Specifically, column (4) contains the OLS estimates when the two samples are combined using hot-deck imputation and no bias correction is applied. As discussed by Hirukawa and Prokhorov (2018), this estimator is inconsistent. Column (5) provides the MSII-FM results that account for the imputation biases. In essence, the MSII approach is a bias-corrected version of the MSOLS approach. Column (6) exhibits the PARA results as yet another two-sample estimator. It is worth emphasizing that PARA is not guaranteed to be consistent in this case. Finally, column (7) reports the PILS results using the beta kernel, chosen on the basis of the simulation results in the previous section.

[8] We follow the 1979 NLS of Youth definition of the southern region. The southern region includes Alabama, Arkansas, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, and West Virginia.
[9] GAUSS codes implementing the IV and two-sample estimators as well as the data sets are available at https://sites.google.com/site/artembprokhorovv0/papers/SimulationAndApplicationFiles.zip.
The two-sample estimators use five variables common to both samples (education, married, black, south, and urban) as included variables and two variables (age and south while growing up) as excluded variables. Note that since in our case both the main (PSID) and auxiliary (NLS) samples contain information on the individual's educational level, we treat education as a part of X_3. We choose not to employ experience as a common variable because experience in PSID represents actual experience, while experience in NLS represents potential experience, which is quite different. As a consequence, our entire set of common variables contains education, married, black, south, urban, age, and south while growing up, where education and age are treated as continuous.
In the context of Assumption 3(ii), the PILS results assume that experience and ability are related only through education, married, black, south, urban, age, and south while growing up. This seems plausible given the weak empirical link between experience and ability and the long list of observables we condition on. In the proxy literature, this is known as redundancy or ignorability of the proxy variable in the structural equation, akin to ignorability of selection in selectivity models (see, e.g., Wooldridge, 2010, p.67).
The NNM used by MSOLS and MSII-FM adopts a single match (K = 1) based on the Mahalanobis metric. We average the (demeaned) KWW score over ties in our second (NLS) sample and assign this average as the unique value of the ability measure for the respondent with the given values of our common variables. As a consequence, m = 1102 respondents remain in our S_2. A second-order polynomial is used in MSII-FM.
Our choice of a kernel for PILS reflects the favorable performance of the beta kernel in simulations. For PILS, each continuous common variable X is converted from its original scale to a variable U := (X − m_X)/(M_X − m_X) ∈ [0, 1], where M_X and m_X are the maximum and minimum values in the pooled sample constructed from observations of X in S_1 and S_2. The reason for using the maximum and minimum of the pooled sample is that the ranges of education and age are quite different between S_1 and S_2. We use b̂ = σ̂_U (log m/m)^{0.6} and λ̂ = (log m/m)^{0.6} for the beta and discrete kernels, respectively, where σ̂_U is the sample standard deviation of a converted continuous common variable U in S_2.[10]

Notes to Table 4: The dependent variable is the log of total annual labor earnings. Education, married, black, south, and urban are used as the included common variables, and age and south while growing up are used as the excluded common variables. The (demeaned) IQ score and KWW score variables are used as ability measures in the PSID and NLS samples, respectively.
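The pooled min-max transformation can be sketched as follows; the two uniform samples are illustrative stand-ins for education in S_1 and S_2.

```python
import numpy as np

rng = np.random.default_rng(6)
x_s1 = rng.uniform(8, 17, size=50)    # e.g. years of education in the main sample
x_s2 = rng.uniform(2, 18, size=30)    # the same variable in the auxiliary sample

# pooled min and max, so both samples map into the same [0, 1] scale
pooled = np.concatenate([x_s1, x_s2])
m_x, M_x = pooled.min(), pooled.max()
u_s1 = (x_s1 - m_x) / (M_x - m_x)
u_s2 = (x_s2 - m_x) / (M_x - m_x)
print(u_s1.min() >= 0 and u_s2.max() <= 1)
```

Scaling each sample by its own range instead would place the same value of education at different points of [0, 1] in the two samples, defeating the matching on a common scale.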

[10] One potential concern about PILS is its sensitivity to the choice of smoothing parameter values in the nonparametric estimation of g_2(·). We confirm that even after the original values are doubled or halved, the estimation results are qualitatively similar. Detailed results are available in the online supplement.

Empirical findings
It is immediately clear from Table 4 that the PARA estimation results differ substantially from the rest. First, contrary to the corresponding OLS* estimate, the PARA estimate of the rate of return to education is negative. Second, the PARA estimate of the coefficient on ability is positive, as expected, but extraordinarily large when compared to the OLS* estimate. Third, the PARA estimates are quite unstable in that the standard error of each coefficient estimate is generally much larger than those of the corresponding estimates from other methods. Because it is hard to give the PARA results a meaningful interpretation, we exclude them from further discussion. One lesson from the PARA estimates is that parametric imputation methods for missing regressors should be used with caution. In this respect, MSII(-FM) and PILS adopt nonparametric imputation methods, and thus they are more robust than PARA.

Table 4 shows that the signs of the coefficient estimates on all regressors except ability are as expected, and most of the estimates are significant. The MSOLS estimator yields a negative and insignificant estimate of the ability effect, whereas the MSII-FM estimator results in a positive but insignificant estimate. Interestingly, OLS* and PILS are the only two estimators that produce positive (as one would expect) and statistically significant estimates of the ability effect. Finally, we note that the PILS standard errors tend to be smaller than those of MSII-FM, as predicted in a previous section.
Next we focus on the estimates of the rate of return to education. The estimates reported in Table 4 provide evidence that the father's education is unlikely to be a valid instrument for an individual's education. If the father's education were a valid IV, the OLS-S and MSOLS estimators of β_1 would be upward-inconsistent, so the IV estimate should lie below them; instead, we observe that the IV estimate is the largest in magnitude. Furthermore, if this instrument were valid and there were no measurement error in the PSID ability measure, then OLS* and IV would both be consistent. However, the OLS* and IV estimates are noticeably different, suggesting again that the two estimators are unlikely to both be consistent.
The infeasible OLS estimate, OLS*, is the second smallest (conceding in magnitude only to PILS). In addition, if there is no measurement error in the PSID ability measure, then OLS* is consistent, but OLS-S and MSOLS are still upward-inconsistent. This possibility fits well with our estimates in Table 4. Indeed, the feasible one-sample estimate of β_1 with omitted ability, OLS-S, lies between the IV and infeasible OLS* estimates. In fact, the MSOLS and MSII-FM estimates of β_1 are also between the one-sample IV and infeasible OLS* estimates. Therefore, there are grounds to view the infeasible OLS* as a benchmark result in our analysis.
Importantly, the only estimate of β_1 that is smaller than the infeasible OLS* is PILS. We note that PILS is also the second closest (in absolute value) to the infeasible OLS* estimate, the first closest being MSII-FM. This would be expected given that only MSII-FM and PILS are consistent approaches when the instrument is invalid. Furthermore, the possibility that PILS (and not OLS*) is closer to the true return to education is not unrealistic. If the PSID ability measure is error-ridden, OLS* is upward-inconsistent, suggesting that PILS is likely the estimator that delivers the estimate closest to the true population value.[11]

Concluding remarks
When some regressors are unavailable for a regression analysis, econometricians often resort to IV estimation. They typically make a considerable effort to find and justify valid instruments for the regressors that are suspected to be endogenous due to their possible correlation with the omitted regressors. In this article, we have explored two-sample alternatives to this conventional approach.
We have developed the PILS estimation procedure for such models. The procedure first uses an auxiliary sample to obtain a nonparametric estimator of the conditional mean of the regressor which is missing in the main sample. Variables that are common in both samples are used in the conditioning set. In the second step of the procedure, we simply run OLS using a plug-in estimate of the conditional mean.
We establish asymptotic normality of PILS. Attractive finite-sample properties of PILS are confirmed in a Monte Carlo study. In particular, simulations demonstrate numerically that PILS is generally more efficient than IV and it dominates other available estimators, e.g., MSII(-FM), in terms of mean squared error. The results also indicate superior finite-sample properties of PILS using the beta kernel.
However, in order for PILS to attain the parametric rate of convergence, the number of continuous common variables must be three or less (provided that nonnegative kernels are employed). In this respect, the curse of dimensionality plays out similarly to the NNM-based estimation considered by Hirukawa and Prokhorov (2018).
A natural extension is to adopt a dimension reduction method when using multiple common variables. This can be done using a sparsity-based algorithm, e.g., the lasso, in the first step, or using a propensity score algorithm. We leave these ideas for future research.
where t_{3C} := (t_{3C,1}, ..., t_{3C,d_{3C}}), x_{3C} := (x_{3C,1}, ..., x_{3C,d_{3C}}) and h := (h_1, ..., h_{d_{3C}}) are vectors of data points, design points and bandwidths, respectively. Next, we construct a kernel for the discrete component. While a variety of discrete kernels can be applied for this component, our focus is on those given by Aitchison and Aitken (1976). In what follows a product of their discrete kernels is exclusively considered. Each of the d_{3D} discrete variables X_{3D,q}, q = 1, ..., d_{3D}, is assumed to take r_q (≥ 2) different values, i.e., X_{3D,q} ∈ {0, 1, ..., r_q − 1}. In addition, each discrete variable is classified as either unordered or ordered, because the kernels employed for the two types of categorical variables differ slightly. The univariate discrete kernel for an unordered variable is

l(t_{3D,q}; x_{3D,q}, λ_q) := 1 − λ_q if t_{3D,q} = x_{3D,q}, and λ_q/(r_q − 1) otherwise,

where t_{3D,q}, x_{3D,q} and λ_q ∈ (0, 1) are the data point, design point and bandwidth, respectively. The univariate discrete kernel for an ordered variable takes the form

ℓ(t_{3D,q}; x_{3D,q}, λ_q) := C(r_q, |t_{3D,q} − x_{3D,q}|) (1 − λ_q)^{r_q − |t_{3D,q} − x_{3D,q}|} λ_q^{|t_{3D,q} − x_{3D,q}|},

where C(r, d) denotes the binomial coefficient. If there are q_1 (≤ d_{3D}) unordered discrete variables, then the product kernel for all d_{3D} discrete variables is

L(t_{3D}; x_{3D}, λ) := ∏_{q=1}^{q_1} l(t_{3D,q}; x_{3D,q}, λ_q) ∏_{q=q_1+1}^{d_{3D}} ℓ(t_{3D,q}; x_{3D,q}, λ_q),

where t_{3D} := (t_{3D,1}, ..., t_{3D,d_{3D}}), x_{3D} := (x_{3D,1}, ..., x_{3D,d_{3D}}) and λ := (λ_1, ..., λ_{d_{3D}}). For the continuous component, we also apply an alternative nonstandard kernel. Taking compactness of X_{3C} into account (see Assumption 2), we employ the beta kernel of Chen (1999) in place of the univariate symmetric kernel. For a design point x ∈ [0, 1] and smoothing parameter b, the beta kernel is defined as

K_{x/b+1, (1−x)/b+1}(t) := t^{x/b} (1 − t)^{(1−x)/b} / B(x/b + 1, (1 − x)/b + 1), t ∈ [0, 1],

where B(·, ·) is the beta function. It turns out that convergence properties of nonparametric estimators based on this kernel family have not been established.
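Under the definitions above, the two discrete kernels and the beta kernel can be implemented directly; the sketch below follows the standard Aitchison and Aitken (1976) and Chen (1999) forms, with our own argument names.

```python
import math

def aitchison_aitken(t, x, lam, r):
    """Unordered discrete kernel of Aitchison and Aitken (1976)."""
    return 1.0 - lam if t == x else lam / (r - 1)

def ordered_kernel(t, x, lam, r):
    """Ordered discrete kernel: binomial weights decaying in |t - x|."""
    d = abs(t - x)
    return math.comb(r, d) * (1.0 - lam) ** (r - d) * lam ** d

def beta_kernel(t, x, b):
    """Chen's (1999) beta kernel at data point t in (0, 1) for design point x in [0, 1]."""
    a1, a2 = x / b + 1.0, (1.0 - x) / b + 1.0
    log_beta = math.lgamma(a1) + math.lgamma(a2) - math.lgamma(a1 + a2)
    return math.exp((a1 - 1.0) * math.log(t) + (a2 - 1.0) * math.log(1.0 - t) - log_beta)

# the beta kernel adapts its shape to the design point, which avoids boundary
# bias on [0, 1]: near x = 0 it is one-sided rather than symmetric
print(round(beta_kernel(0.05, 0.0, 0.1), 3), round(beta_kernel(0.5, 0.5, 0.1), 3))
```

A product over coordinates of these univariate pieces then yields the mixed-data kernel described above.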
In a companion paper (Hirukawa et al., 2022), we provide a careful proof of weak and strong uniform convergence with rates of nonparametric estimators smoothed by the beta kernel.

A.2. Proof of Theorem 1
The proof requires a lemma on uniform consistency of the kernel regression estimator ĝ_2(·). Let g_{2k}(·) be the kth element of g_2(·) for k = 1, ..., d_2. In order to differentiate the continuous kernels used in the nonparametric estimation of g_{2k}(·), we denote the estimates obtained using the symmetric and beta kernels by ĝ_{2k}^S(·) and ĝ_{2k}^B(·), respectively. The lemma below establishes weak uniform consistency, with rates, of ĝ_{2k}^S(·) and ĝ_{2k}^B(·).

Lemma A1. If Assumptions 1-6 hold, then, for k = 1, ..., d_2, as m → ∞,

A.2.1. Proof of Lemma A1
The first two statements, on ĝ_{2k}^S(·), are implied by Theorem 2.1 of Li and Ouyang (2005). The remaining two statements, on ĝ_{2k}^B(·), are established by Theorem 7 of Hirukawa et al. (2022) with r_n = O(1), due to uniform boundedness from below of f(·).