Testing the Unconfoundedness Assumption via Inverse Probability Weighted Estimators of (L)ATT

We propose inverse probability weighted estimators for the local average treatment effect (LATE) and the local average treatment effect for the treated (LATT) under instrumental variable assumptions with covariates. We show that these estimators are asymptotically normal and efficient. When the (binary) instrument satisfies one-sided noncompliance, we propose a Durbin–Wu–Hausman-type test of whether treatment assignment is unconfounded conditional on some observables. The test is based on the fact that under one-sided noncompliance LATT coincides with the average treatment effect for the treated (ATT). We conduct Monte Carlo simulations to demonstrate, among other things, that part of the theoretical efficiency gain afforded by unconfoundedness in estimating ATT survives pretesting. We illustrate the implementation of the test on data from training programs administered under the Job Training Partnership Act in the United States. This article has online supplementary material.


INTRODUCTION
Nonparametric estimation of average treatment effects (ATEs) from observational data is typically undertaken under one of two types of identifying conditions. The unconfoundedness assumption, in its weaker form, postulates that treatment assignment is mean-independent of potential outcomes conditional on a vector of observed covariates. This requirement carries with it considerable identifying power; specifically, it identifies the ATE and the average treatment effect for the treated (ATT) without any additional modeling assumptions. On the other hand, if unobserved confounders exist, then instrumental variables, which are related to the outcome only through changing the likelihood of treatment, are typically used to learn about treatment effects. Without further assumptions, the availability of an instrumental variable (IV) is, however, not sufficient to identify ATE or ATT. In general, the IV will identify only the local average treatment effect (LATE; Imbens and Angrist 1994) and the local average treatment effect for the treated (LATT; Frölich and Lechner 2010; Hong and Nekipelov 2010). If one specializes to binary instruments, as we do in this article, then the LATE and LATT parameters correspond to the ATE over specific subgroups of the population. These subgroups are, however, dependent on the choice of the instrument and are generally unobserved. Partly for these reasons a number of authors have called into question the usefulness of LATE for program evaluation (Heckman 1997; Deaton 2009; Heckman and Urzúa 2010). In most such settings, ATE and ATT are more natural and practically relevant parameters of interest, provided that they can be credibly identified and accurately estimated. (In fairness, some of the criticism in Deaton (2009) goes beyond LATE, and also applies to ATE/ATT as a parameter of interest. See Imbens (2010) for a response to Deaton (2009).)
When using instrumental variables, empirical researchers are often called upon to tell a "story" to justify their validity. As pointed out by Abadie (2003) and Frölich (2007), it is often easier to argue that the relevant IV conditions hold if conditioning on a vector of observed covariates is also allowed. In particular, Frölich (2007) showed that in this scenario LATE is still nonparametrically identified and proposes efficient estimators, based on nonparametric imputation and matching, for this quantity. Given the possible need to condition on a vector of observables to justify the IV assumptions, it is natural to ask whether treatment assignment itself might be unconfounded conditional on the same (or maybe a larger or smaller) vector of covariates. In this article, we propose a formal test of this hypothesis that relies on the availability of a specific kind of binary instrument for which LATT=ATT (so that the latter parameter is also identified). Establishing unconfoundedness under these conditions still offers at least two benefits: (i) it enables the estimation of an additional parameter of interest (namely, ATE) and (ii) it potentially allows for more efficient estimation of ATT than IV methods (we will argue this point in more detail later).
To our knowledge, this is the first test in the literature aimed at this task.
More specifically, the contributions of this work are twofold. First, given a (conditionally) valid binary instrument, we propose alternative nonparametric IV estimators of LATE and LATT. These estimators rely on weighting by the inverse of the estimated instrument propensity score and are computed as the ratio of two estimators that are of the form proposed by Hirano, Imbens, and Ridder (2003), henceforth HIR. While Frölich (2007) conjectured in passing that such an estimator of LATE should be efficient, he did not provide a proof. We fill this (admittedly small) gap in the literature and formally establish the first-order asymptotic equivalence of our LATE estimator and Frölich's imputation/matching-based estimators. We also demonstrate that our LATT estimator is asymptotically efficient, that is, first-order equivalent to that of Hong and Nekipelov (2010).
More importantly, we propose a Durbin-Wu-Hausman-type test for the unconfoundedness assumption. On the one hand, if a binary instrument satisfying "one-sided noncompliance" (e.g., Frölich and Melly 2013a) is available, then the LATT parameter associated with that instrument coincides with ATT, and is consistently estimable using the estimator we proposed. (Whether one-sided noncompliance holds is verifiable from the data.) On the other hand, if treatment assignment is unconfounded given a vector of covariates, ATT can also be consistently estimated using the HIR estimator. If the unconfoundedness assumption does not hold, then the HIR estimator will generally converge to a different limit. Thus, the unconfoundedness assumption can be tested by comparing our estimator of LATT with HIR's estimator of ATT. Of course, if the validity of the instrument itself is questionable, then the test should be more carefully interpreted as a joint test of the IV conditions and the unconfoundedness assumption. We use a battery of Monte Carlo simulations to explore in detail the finite sample properties of our IV estimator and the proposed test statistic. We also provide an application to illustrate how to implement and interpret the test in practice. We use the dataset from Abadie, Angrist, and Imbens (2002) on training programs administered under the Job Training Partnership Act (JTPA) in the United States and highlight a difference between the self-selection process of men versus women into these programs.
The rest of the article is organized as follows. In Section 2, we present the theoretical framework; in Section 3, we describe the estimators for LATE and LATT. The implications of unconfoundedness and the proposed test for it are discussed in Section 4. A rich set of Monte Carlo results is presented in Section 5. In Section 6, we give the empirical application along with an additional, empirically motivated, Monte Carlo exercise. Section 7 summarizes and concludes. The most important proofs and the simulation tables from Section 5 are collected in the Appendix. More detailed proofs and some additional simulations are available as an online supplement.

THE BASIC FRAMEWORK AND IDENTIFICATION RESULTS
The following IV framework, augmented by covariates, is now standard in the treatment effect literature; see, for example, Abadie (2003) or Frölich (2007) for a more detailed exposition. For each population unit (individual), one can observe the value of a binary instrument Z ∈ {0, 1} and a vector of covariates X ∈ R^k. For Z = z, the random variable D(z) ∈ {0, 1} specifies individuals' potential treatment status, with D(z) = 1 corresponding to treatment and D(z) = 0 to no treatment. The actually observed treatment status is then given by D ≡ D(Z) = D(1)Z + D(0)(1 − Z). Similarly, the random variable Y(z, d) denotes the potential outcome that would obtain if one were to set Z = z and D = d exogenously. The following assumptions, taken from Abadie (2003) and Frölich (2007) with some modifications, describe the relationships between the variables defined above and justify Z being referred to as an instrument. Assumption 1(i) ensures the existence of the moments we will work with. Part (ii) states that, conditional on X1, the instrument is exogenous with respect to the first and second moments of the potential outcome and treatment status variables. This is satisfied, for example, if the value of the instrument is completely randomly assigned or the instrument assignment is independent of V conditional on X1. Nevertheless, part (ii) is weaker than the full conditional independence assumed in Abadie (2003) and Frölich (2007), but is still sufficient for identifying LATE and LATT. Part (iii) precludes the instrument from having a direct effect on potential outcomes. Part (iv) postulates that the instrument is (positively) related to the probability of being treated and implies that the distributions X1 | Z = 0 and X1 | Z = 1 have common support.
Finally, the monotonicity of D(z) in z, required in part (v), allows for three different types of population units with nonzero mass: compliers [D(0) = 0, D(1) = 1], always takers [D(0) = 1, D(1) = 1], and never takers [D(0) = 0, D(1) = 0] (see Imbens and Angrist 1994). Of these, compliers are actually required to have positive mass; part (iv) rules out P[D(1) = D(0)] = 1. In light of these assumptions, it is customary to think of Z as a variable that indicates whether an exogenous incentive to obtain treatment is present or as a variable signaling "intention to treat." Given the exclusion restriction in part (iii), one can simplify the definition of the potential outcome variables as Y(d) ≡ Y(1, d) = Y(0, d), d = 0, 1. The LATE (≡ τ) and LATT (≡ τ_t) parameters associated with the instrument Z are defined as

τ = E[Y(1) − Y(0) | D(1) > D(0)]  and  τ_t = E[Y(1) − Y(0) | D(1) > D(0), D = 1].

LATE, originally due to Imbens and Angrist (1994), is the ATE in the complier subpopulation. The LATT parameter was considered, for example, by Frölich and Lechner (2010) and Hong and Nekipelov (2010). LATT is the ATE among those compliers who actually receive the treatment. Of course, in the subpopulation of compliers the condition D = 1 is equivalent to Z = 1, that is, LATT can also be written as τ_t = E[Y(1) − Y(0) | D(1) > D(0), Z = 1]. In particular, if Z is an instrument that satisfies Assumption 1 unconditionally (say, Z is assigned completely at random), then LATT coincides with LATE. Our interest in LATT is motivated mainly by the fact that it can serve as a bridge between the IV assumptions and unconfoundedness (this connection will be developed shortly).
Under Assumption 1 one can also interpret LATE/LATT as the ATE/ATT of Z on Y divided by the ATE/ATT of Z on D:

τ = E[Y(D(1)) − Y(D(0))] / E[D(1) − D(0)],    (1)
τ_t = E[Y(D(1)) − Y(D(0)) | D = 1] / E[D(1) − D(0) | D = 1].    (2)

The quantities on the right-hand side of (1) and (2) are nonparametrically identified from the joint distribution of the observables (Y, D, Z, X1). We denote the conditional probability P(Z = 1 | X1) by q(X1) and refer to it as the "instrument propensity score" to distinguish it from the conventional use of the term propensity score (the conditional probability of being treated). Under Assumption 1, the following identification results are implied, for example, by Theorem 3.1 in Abadie (2003):

Γ ≡ E[ZY/q(X1) − (1 − Z)Y/(1 − q(X1))],    (3)
Δ ≡ E[ZD/q(X1) − (1 − Z)D/(1 − q(X1))],    (4)
Γ_t ≡ E[(Z − q(X1))Y/(1 − q(X1))],    (5)
Δ_t ≡ E[(Z − q(X1))D/(1 − q(X1))].    (6)

That is, τ = Γ/Δ and τ_t = Γ_t/Δ_t. The unconfoundedness assumption, introduced by Rosenbaum and Rubin (1983), is also known in the literature as selection-on-observables, conditional independence, or ignorability. We say that treatment assignment is unconfounded conditional on a subset X2 of the vector X if the following condition holds.

Assumption 2 (Unconfoundedness). Y(1) and Y(0) are mean-independent of D conditional on X2, that is, E[Y(d) | X2, D] = E[Y(d) | X2] for d = 0, 1.

Assumption 2 is stronger than Assumption 1 in that it rules out unobserved factors related to the potential outcomes playing a systematic role in selection to treatment. Furthermore, it permits the identification of additional parameters. For example, ATE and ATT can be identified by expressions analogous to (3)-(4) and (5)-(6), respectively: replace Z with D and q(X1) with p(X2) ≡ P(D = 1 | X2), the (conventional) propensity score. (Mean-independence of Y(0) alone is actually sufficient for identifying ATT.) As mentioned above, ATE and ATT are often of more interest to decision makers than local treatment effects, but are not generally identified under Assumption 1 alone. A partial exception is when the instrument Z satisfies a strengthening of the monotonicity property called one-sided noncompliance (see, e.g., Frölich and Melly 2013a).

Assumption 3 (One-sided noncompliance). P[D(0) = 0] = 1.

Assumption 3 dates back to (at least) Bloom (1984), who estimated the effect of a program in the presence of "no shows."
The stated condition means that those individuals for whom Z = 0 are excluded from the treatment group, while those for whom Z = 1 generally have the option to accept or decline treatment. Hence, there are no always takers; noncompliance with the intention-to-treat variable Z is only possible when Z = 1. More formally, for such an instrument D = ZD(1), and so D = 1 implies D(1) = 1 (the treated are a subset of the compliers). Therefore,

E[Y(1) − Y(0) | D = 1] = E[Y(1) − Y(0) | D(1) > D(0), D = 1].    (7)

Thus, under one-sided noncompliance, ATT = LATT. The ATE parameter, on the other hand, remains generally unidentified under Assumptions 1 and 3 alone.
In Section 4, we will show how one can test Assumption 2 when a binary instrument, valid conditional on X 1 and satisfying one-sided noncompliance, is available. Frölich and Lechner (2010) also considered some consequences for identification of the IV assumption and unconfoundedness holding simultaneously (without one-sided noncompliance), but they did not discuss estimation by inverse probability weighting, propose a test, or draw out implications for efficiency.

Inverse Propensity Weighted Estimators of LATE and LATT
Let {(Yi, Di, Zi, X1i)}, i = 1, . . . , n, denote a random sample of observations on (Y, D, Z, X1). The proposed inverse probability weighted (IPW) estimators for τ and τ_t are based on sample analog expressions for (3)-(6):

τ̂ = Σi [ZiYi/q̂(X1i) − (1 − Zi)Yi/(1 − q̂(X1i))] / Σi [ZiDi/q̂(X1i) − (1 − Zi)Di/(1 − q̂(X1i))],

τ̂_t = Σi [(Zi − q̂(X1i))Yi/(1 − q̂(X1i))] / Σi [(Zi − q̂(X1i))Di/(1 − q̂(X1i))],

where q̂(·) is a suitable nonparametric estimator of the instrument propensity score function. If there are no covariates in the model, then both τ̂ and τ̂_t reduce to

(Ȳ1 − Ȳ0)/(D̄1 − D̄0),

where Ȳz and D̄z denote the sample means of Y and D over the observations with Zi = z; this is the LATE estimator developed in Imbens and Angrist (1994) and is also known as the Wald estimator, after Wald (1940). Following HIR, we use the series logit estimator (SLE) to estimate q(·). The first-order asymptotic results presented in Section 3.2 do not depend critically on this choice; the same conclusions could be obtained under similar conditions if other suitable estimators of q(·) were used instead. For example, Ichimura and Linton (2005) used local polynomial regression and Abrevaya, Hsu, and Lieli (2013) used higher order kernel regression. We opt for the SLE for three reasons: (i) it is automatically bounded between zero and one; (ii) it allows for a more unified treatment of continuous and discrete covariates in practice; (iii) the curse of dimensionality affects the implementability of the SLE less severely. (We do not say that the SLE is immune to dimensionality; however, if one uses, say, higher order local polynomials and the dimension of X1 is large, then one either needs a very large bandwidth or an astronomical number of observations just to be able to compute each q̂(X1i), at least if the kernel weights have bounded support. A sufficiently restricted version of the SLE is always easy to compute.) We implement the SLE using power series. Let λ = (λ1, . . . , λr) ∈ N0^r be an r-dimensional vector of nonnegative integers, where r is the dimension of X1. Define a norm for λ as |λ| = Σj λj, and for x1 ∈ R^r, let x1^λ = Πj x1j^λj. Given a sequence {λ(k)} ordered so that |λ(k)| is nondecreasing, define R_K(x1) = (x1^λ(1), . . . , x1^λ(K))′ as a K-vector of power functions. Let Λ(a) = exp(a)/(1 + exp(a)) be the logistic cdf.
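As a concrete illustration, the ratio estimators just described can be sketched in a few lines of Python; the function name and the use of a generic vector q_hat of fitted instrument propensity scores are our own choices, not the article's:

```python
import numpy as np

def ipw_late_latt(y, d, z, q_hat):
    """IPW (ratio) estimators of LATE and LATT given fitted instrument
    propensity scores q_hat -- a sketch, not the article's exact code."""
    # LATE: ratio of HIR-type estimators of the effects of Z on Y and on D
    w = z / q_hat - (1 - z) / (1 - q_hat)
    tau = np.mean(w * y) / np.mean(w * d)
    # LATT: ATT-type weighting by (Z - q)/(1 - q)
    wt = (z - q_hat) / (1 - q_hat)
    tau_t = np.mean(wt * y) / np.mean(wt * d)
    return tau, tau_t
```

With a completely random instrument and a constant treatment effect, both estimates agree (up to sampling noise) with the simple Wald estimator.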
The SLE for q(x1) is defined as q̂(x1) = Λ(R_K(x1)′π̂_K), where

π̂_K = arg max_π Σi { Zi ln Λ(R_K(X1i)′π) + (1 − Zi) ln[1 − Λ(R_K(X1i)′π)] }.

The asymptotic properties of q̂(x1) are discussed in Appendix A of HIR.
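A minimal sketch of the SLE, with two implementation choices of our own (a total-degree power basis and Newton-Raphson for the logit MLE):

```python
import numpy as np
from itertools import product

def power_series(x, max_deg):
    """R_K(x): all monomials x^lam with total degree |lam| <= max_deg."""
    n, r = x.shape
    cols = [np.prod(x ** np.array(lam), axis=1)
            for lam in product(range(max_deg + 1), repeat=r)
            if sum(lam) <= max_deg]
    return np.column_stack(cols)

def series_logit(x, z, max_deg=2, iters=25):
    """Series logit estimator of q(x) = P(Z=1|X=x): logit MLE on a power
    series basis, fitted here by Newton-Raphson (an illustrative sketch)."""
    R = power_series(x, max_deg)
    pi = np.zeros(R.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-R @ pi))   # current fitted probabilities
        grad = R.T @ (z - p)                # score of the logit log-likelihood
        hess = (R * (p * (1 - p))[:, None]).T @ R
        pi += np.linalg.solve(hess + 1e-8 * np.eye(len(pi)), grad)
    return 1.0 / (1.0 + np.exp(-(R @ pi)))
```

By construction the fitted values lie strictly between zero and one, which is reason (i) for preferring the SLE.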

First-Order Asymptotic Results
We now state conditions under which τ̂ and τ̂_t are √n-consistent, asymptotically normal, and efficient.
Assumption 4 (Distribution of X 1 ). (i) The distribution of the r-dimensional vector X 1 is absolutely continuous with probability density f (x 1 ); (ii) the support of X 1 , denoted X 1 , is a Cartesian product of compact intervals; (iii) f (x 1 ) is twice continuously differentiable, bounded above, and bounded away from 0 on X 1 .
Though standard in the literature, this assumption is restrictive in that it rules out discrete covariates. This is mostly for expositional convenience; after stating our formal result, we will discuss how to incorporate discrete variables into the analysis.
Next, we impose restrictions on various conditional moments of Y, D, and Z. We define m_z(x1) ≡ E[Y | X1 = x1, Z = z] and μ_z(x1) ≡ E[D | X1 = x1, Z = z] for z = 0, 1.

Assumption 5 (Conditional Moments of Y and D). m_z(x1) and μ_z(x1) are continuously differentiable over X_1 for z = 0, 1.
The last assumption specifies the estimator used for the instrument propensity score function.
If q(x1) is instead estimated by local polynomial regression or higher order kernel regression, then it is sufficient to assume less smoothness; specifically, q(x1) must be continuously differentiable only of an order greater than r for the theoretical results to hold.
The first-order asymptotic properties ofτ andτ t are stated in the following theorem.
Theorem 1 (Asymptotic properties of τ̂ and τ̂_t). Suppose that Assumption 1 and Assumptions 4 through 7 are satisfied. Then:

(a) √n(τ̂ − τ) →d N(0, V) and √n(τ̂_t − τ_t) →d N(0, V_t), where V = E[ψ(Y, D, Z, X1)²] and V_t = E[ψ_t(Y, D, Z, X1)²], with the functions ψ and ψ_t given by

ψ(y, d, z, x1) = (1/Δ){ z(y − m1(x1))/q(x1) − (1 − z)(y − m0(x1))/(1 − q(x1)) + m1(x1) − m0(x1) − τ[ z(d − μ1(x1))/q(x1) − (1 − z)(d − μ0(x1))/(1 − q(x1)) + μ1(x1) − μ0(x1) ] },

ψ_t(y, d, z, x1) = (1/Δ_t) [(z − q(x1))/(1 − q(x1))] { y − m0(x1) − τ_t[d − μ0(x1)] };

(b) V is the semiparametric efficiency bound for LATE with or without the knowledge of q(x1); (c) V_t is the semiparametric efficiency bound for LATT without the knowledge of q(x1).
Comments. 1. The result on τ̂ is analogous to Theorem 1 of HIR; the result on τ̂_t is analogous to Theorem 5 of HIR. Theorem 1 shows directly that the IPW estimators of LATE and LATT presented in this article are first-order asymptotically equivalent to the matching/imputation based estimators developed by Frölich (2007) and Hong and Nekipelov (2010).
2. Theorem 1 follows from the fact that, under the conditions stated, τ̂ and τ̂_t can be expressed as asymptotically linear with influence functions ψ and ψ_t, respectively:

τ̂ − τ = (1/n) Σi ψ(Yi, Di, Zi, X1i) + o_p(n^−1/2),
τ̂_t − τ_t = (1/n) Σi ψ_t(Yi, Di, Zi, X1i) + o_p(n^−1/2).

These representations are developed in Appendix A, along with the rest of the proof.

3. To use Theorem 1 for statistical inference, one needs consistent estimators for V and V_t. Such estimators can be obtained by constructing uniformly consistent estimates of ψ and ψ_t and then averaging the squared estimates over the sample observations. To be more specific, let m̂1(x1) and m̂0(x1) be the series estimators for m1(x1) and m0(x1), obtained by regressing Yi on R_K(X1i) over the subsamples with Zi = 1 and Zi = 0, respectively, where R_K(x1) is the same power series as in the SLE. As in HIR and Donald and Hsu (2013), it is true that sup_{x1∈X_1}|m̂1(x1) − m1(x1)| = o_p(1) and sup_{x1∈X_1}|m̂0(x1) − m0(x1)| = o_p(1). In addition, let μ̂1(x1) and μ̂0(x1) be defined by replacing Yi with Di in these regressions, and let Ê(Z) = Σi q̂(X1i)/n. Construct the functions ψ̂(y, d, z, x1) and ψ̂_t(y, d, z, x1) by replacing m0(x1), m1(x1), μ0(x1), μ1(x1), q(x1), τ, τ_t, Δ, Δ_t, and E(Z) with their estimators. (The estimators for Δ and Δ_t are the denominators of τ̂ and τ̂_t, respectively.) Finally, let

V̂ = (1/n) Σi ψ̂(Yi, Di, Zi, X1i)²  and  V̂_t = (1/n) Σi ψ̂_t(Yi, Di, Zi, X1i)².

It is straightforward to show that V̂ →p V and V̂_t →p V_t.

4. In part (b), the semiparametric efficiency bound for τ is given by Frölich (2007) and Hong and Nekipelov (2010). In particular, the bounds are the same with or without the knowledge of the instrument propensity score function q(x1). If the instrument propensity score function is known, τ can also be consistently estimated by using the true instrument propensity score throughout in the formula for τ̂. Denote this estimator by τ̂*. Using Remark 2 of HIR, the asymptotic variance of τ̂*, say V*, can be derived, and it can be shown that V ≤ V*. Therefore, the IPW estimator for τ based on the true instrument propensity score function is (weakly) less efficient than that based on the estimated instrument propensity score function. 5.
In part (c), the semiparametric efficiency bound for τ_t without knowledge of the instrument propensity score function is derived in Hong and Nekipelov (2010). The efficiency bound for τ_t with knowledge of the instrument propensity score function has not yet been given in the literature. However, by Corollary 1 and Theorem 5 of HIR, a semiparametrically efficient IPW estimator for τ_t under knowledge of q(x1) can be constructed; denote this estimator by τ̂*_t,se and its asymptotic variance by V*_t,se. It is true that V*_t,se ≤ V_t, that is, knowledge of the instrument propensity score allows for more efficient estimation of τ_t. On the other hand, if the instrument propensity score function is known, then τ_t can also be consistently estimated by using the true q(X1i) throughout. It follows from the discussion after Corollary 1 of HIR that τ̂_t and the latter estimator cannot generally be ranked in terms of efficiency.

6. Suppose that X1 contains discrete covariates whose possible values partition the population into s̄ subpopulations or cells. Let the random variable S ∈ {1, . . . , s̄} denote the cell a given unit is drawn from. For each s one can define q(x̃1, s) = P(Z = 1 | X̃1 = x̃1, S = s), where X̃1 denotes the continuous components of X1, and estimate this function by SLE on the corresponding subsample. Then LATE and LATT can be estimated in the usual way by using q̂(X̃1i, Si) as the instrument propensity score (this is equivalent to computing the weighted average of the cell-specific LATE/LATT estimates). Under suitable modifications of Assumptions 4-7, the LATE estimator so defined possesses an influence function ψ(y, d, z, x̃1, s) that is isomorphic to ψ(y, d, z, x1); one simply replaces the functions of x1 with the corresponding functions of (x̃1, s). (See Abrevaya, Hsu, and Lieli (2013), Appendix D, for a formal derivation in a similar context.) However, this approach may not be feasible when the number of cells is large, which is the case in many economic applications. See Comment 8 for restricted estimators more easily implementable in practice.
7. The changes in the regularity conditions required by the presence of discrete variables are straightforward. For example, Assumption 4 needs to hold for the conditional distribution X̃1 | S = s for any s. The functions m_z(x̃1, s), etc., need to be continuously differentiable in x̃1 for any s. Finally, in Assumptions 6 and 7, r is to be redefined as the dimension of X̃1.
8. It is easy to specify the SLE so that it implements the sample splitting estimator described in Comment 6 above in a single step. Given a vector R_K(x̃1) of power functions in x̃1, use (1{s=1}R_K(x̃1)′, . . . , 1{s=s̄}R_K(x̃1)′)′ in the estimation, that is, interact the powers of x̃1 with each subpopulation dummy. However, if s̄ is large, the number of observations available from some of the subpopulations can be very small (or zero) even for large n. The SLE is well suited for bridging over data-poor regions using functional form restrictions. For example, one can use, for some L ≤ K, the common terms R_K(x̃1) augmented with cell-specific lower order terms 1{s=j}R_L(x̃1), j = 1, . . . , s̄ (dropping redundant columns), that is, one only lets lower order terms vary across subpopulations. Alternatively, suppose that X1 = (X̃1, I1, I2), where I1 and I2 are two indicators (so that s̄ = 4). Then, one may implement the SLE with (R_K(x̃1)′, R_K(x̃1)′I1, R_K(x̃1)′I2), but without R_K(x̃1)′I1I2. This constrains the attributes I1 and I2 to operate independently from each other in affecting the probability that Z = 1. Of course, the two types of restrictions can be combined. The asymptotic theory is unaffected if the restrictions are relaxed appropriately as n becomes large (e.g., L → ∞ when K → ∞ in the required manner). Furthermore, as restricting the SLE can be thought of as a form of smoothing, results by Li, Racine, and Wooldridge (2009) suggest that there might be small sample MSE gains relative to the "sample splitting" method (unless of course the misspecification bias is too large).
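The plug-in variance estimator of Comment 3 can be sketched for the LATE case as follows, using the efficient influence function for LATE (Frölich 2007); R is the basis matrix used for the series regressions, and all names are ours:

```python
import numpy as np

def late_se(y, d, z, q_hat, R):
    """Plug-in standard error for the IPW LATE estimator via its estimated
    influence function (a sketch of Comment 3; names are ours)."""
    def series_fit(target, mask):
        # least-squares series regression of `target` on R over a subsample,
        # evaluated at every observation
        b, *_ = np.linalg.lstsq(R[mask], target[mask], rcond=None)
        return R @ b
    m1, m0 = series_fit(y, z == 1), series_fit(y, z == 0)
    mu1, mu0 = series_fit(d, z == 1), series_fit(d, z == 0)
    w = z / q_hat - (1 - z) / (1 - q_hat)
    tau = np.mean(w * y) / np.mean(w * d)
    delta = np.mean(w * d)               # estimator of the denominator Delta
    psi = (z * (y - m1) / q_hat - (1 - z) * (y - m0) / (1 - q_hat) + m1 - m0
           - tau * (z * (d - mu1) / q_hat - (1 - z) * (d - mu0) / (1 - q_hat)
                    + mu1 - mu0)) / delta
    return tau, np.sqrt(np.mean(psi ** 2) / len(y))
```

The returned standard error is the square root of V̂/n, so a 95% confidence interval is the point estimate plus or minus 1.96 standard errors.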

The Proposed Test Procedure
If treatment assignment is unconfounded conditional on a subset X2 of X, then, under regularity conditions, one can consistently estimate ATT using the estimator proposed by HIR:

β̂_t = Σi [(Di − p̂(X2i))Yi/(1 − p̂(X2i))] / Σi Di,

where p̂(x2) is the series logit estimator of p(x2) = P(D = 1 | X2 = x2), the propensity score function. More generally, let β_t ≡ E[ρ1(X2) − ρ0(X2) | D = 1], where ρ_d(x2) ≡ E[Y | X2 = x2, D = d]. If the unconfoundedness assumption holds, then ρ_d(X2) = E[Y(d) | X2], and β_t reduces to ATT. Given a binary instrument that is valid conditional on a subset X1 of X, one-sided noncompliance implies ATT = LATT, and hence ATT can also be consistently estimated by τ̂_t. On the other hand, if the unconfoundedness assumption does not hold, then τ̂_t is still consistent for ATT, but β̂_t is in general not consistent for ATT. Hence, we can test the unconfoundedness assumption (or at least a necessary condition of it) by comparing τ̂_t with β̂_t. In particular, let

φ_t(y, d, x2) = (1/p)[ d(y − ρ1(x2)) − (1 − d)p(x2)(y − ρ0(x2))/(1 − p(x2)) + d(ρ1(x2) − ρ0(x2) − β_t) ],

where p = P(D = 1). Let Assumption 4′ be the analog of Assumption 4 stated for X2; Assumption 5′ be the analog of Assumption 5 stated for ρ_d(x2), d = 0, 1; Assumption 6′ be the analog of Assumption 6 stated for p(x2); and Assumption 7′ be the analog of Assumption 7 stated for p(x2), where by "analog" we mean that all parameters are object-specific, that is, Assumption 4′ is not meant to impose, say, dim(X2) = dim(X1). The asymptotic properties of the difference between τ̂_t and β̂_t are summarized in the following theorem.
Theorem 2. Suppose that Assumptions 1 and 3 hold, along with Assumptions 4 through 7 and their analogs 4′ through 7′. Then √n[(τ̂_t − β̂_t) − (τ_t − β_t)] →d N(0, σ²), where σ² = E[(ψ_t(Y, D, Z, X1) − φ_t(Y, D, X2))²].

As shown by HIR, under Assumptions 4′ through 7′ the asymptotic linear representation of β̂_t is

β̂_t − β_t = (1/n) Σi φ_t(Yi, Di, X2i) + o_p(n^−1/2).

Theorem 2 follows directly from this result and Theorem 1. Let ψ̂_t(·) and φ̂_t(·) be uniformly consistent estimators of ψ_t and φ_t obtained, for example, as in Comment 3 after Theorem 1. A consistent estimator for σ² can then be constructed as

σ̂² = (1/n) Σi [ψ̂_t(Yi, Di, Zi, X1i) − φ̂_t(Yi, Di, X2i)]².

Thus, if one-sided noncompliance holds, one can use a simple z-test with the statistic √n(τ̂_t − β̂_t)/σ̂ to test unconfoundedness via the null hypothesis H0: τ_t = β_t. Since the difference between τ_t and β_t can generally be of either sign, a two-sided test is appropriate.
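The two ingredients of the test, the HIR-type ATT estimator and the z-statistic, can be sketched as follows; psi_t_hat and phi_t_hat hold per-observation estimates of the influence functions, and all names are ours:

```python
import numpy as np

def hir_att(y, d, p_hat):
    """HIR-type IPW estimator of ATT under unconfoundedness; p_hat holds
    fitted propensity scores P(D=1|X2) (a sketch)."""
    return np.mean((d - p_hat) * y / (1 - p_hat)) / np.mean(d)

def dwh_stat(tau_t_hat, beta_t_hat, psi_t_hat, phi_t_hat):
    """z-statistic for H0: tau_t = beta_t; sigma^2 is estimated by averaging
    squared differences of the estimated influence functions."""
    sigma2 = np.mean((psi_t_hat - phi_t_hat) ** 2)
    return np.sqrt(len(psi_t_hat)) * (tau_t_hat - beta_t_hat) / np.sqrt(sigma2)
```

H0 is rejected at the 5% level when the absolute value of the statistic exceeds 1.96.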
For the z-test to "work," it is also required that σ 2 > 0. It is difficult to list all cases where σ 2 = 0, but here we give one case that is easy to verify in practice. The proof is available in the online supplement.
Further comments. 1. The result in Theorem 2 holds without unconfoundedness, providing consistency against violations for which β_t ≠ τ_t. Nevertheless, as suggested by Lemma 1, unconfoundedness might be violated even when H0 holds. For example, mean-independence of Y(0) alone, that is, E[Y(0) | X2, D] = E[Y(0) | X2], is actually sufficient to identify and consistently estimate ATT. Therefore, our test will not have power against violations of unconfoundedness that involve Y(1) only. 2. Similarly, the test does not rely on one-sided noncompliance per se; it can be applied in any other situation where ATT and LATT are known to coincide. The empirical Monte Carlo exercise presented in Section 6 states another set of conditions on Z and D(0) under which this is true. Nevertheless, we still consider one-sided noncompliance as a leading case in practice (e.g., because it is easily testable).
3. The proposed test is quite flexible in that it does not place any restrictions on the relationship between X 1 and X 2 . The two vectors can overlap, be disjoint, or one might be contained in the other. The particular case in which X 2 is empty corresponds to testing whether treatment assignment is completely random.
4. If the instrument is not entirely trusted (even with conditioning), then the interpretation of the test should be more conservative; namely, it should be regarded as a joint test of unconfoundedness and the IV conditions. Thus, in case of a rejection one cannot even be sure that LATE and (L)ATT are identified.
5. Our test generalizes to one-sided noncompliance of the form P[D(1) = 1] = 1, that is, where all units with Z = 1 get treatment and only part of the units with Z = 0 can get treatment. To this end, define LATE for the nontreated as LATNT ≡ τ_nt ≡ E[Y(1) − Y(0) | D(1) > D(0), D = 0]. Similarly to (7), we have LATNT = ATNT when P[D(1) = 1] = 1. We can estimate τ_nt by

τ̂_nt = Σi [(Zi − q̂(X1i))Yi/q̂(X1i)] / Σi [(Zi − q̂(X1i))Di/q̂(X1i)].

The corresponding estimator for β_nt, denoted as β̂_nt, has the same form as τ̂_nt with Di replacing Zi and p̂(X2i) replacing q̂(X1i). The logic of the test remains the same: (L)ATNT can be consistently estimated by τ̂_nt as well as β̂_nt under the unconfoundedness assumption. However, if unconfoundedness does not hold, then τ̂_nt is still consistent for ATNT, but β̂_nt is generally not. The technical details are similar to the previous case and are omitted. 6. If P[D(0) = 0] = 1 and P[D(1) = 1] = 1 both hold, then Z = D and instrument validity and unconfoundedness are one and the same. Furthermore, in this case √n(τ̂_t − β̂_t) = o_p(1).
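A sketch of the mirrored estimator (our own helper; the weighting (Z − q̂)/q̂ is the ATT-type form applied to the Z = 0 side):

```python
import numpy as np

def ipw_latnt(y, d, z, q_hat):
    """IPW estimator of LATNT when P[D(1)=1]=1: ATT-type weighting applied
    to the nontreated side (a sketch; names are ours)."""
    w = (z - q_hat) / q_hat
    return np.mean(w * y) / np.mean(w * d)
```

Under P[D(1) = 1] = 1 the nontreated are exactly the compliers with Z = 0, which is why this weighting recovers the nontreated complier effect.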

The Implications of Unconfoundedness
What are the benefits of (potentially) having the unconfoundedness assumption at one's disposal in addition to IV conditions? An immediate one is that the ATE parameter also becomes identified and can be consistently estimated, for example, by the IPW estimator proposed by HIR or by nonparametric imputation as in Hahn (1998).
A more subtle consequence has to do with the efficiency of β̂_t and τ̂_t as estimators of ATT. If an instrument satisfying one-sided noncompliance is available, and the unconfoundedness assumption holds at the same time, then both estimators are consistent. Furthermore, the asymptotic variance of τ̂_t attains the semiparametric efficiency bound that prevails under the IV conditions alone, and the asymptotic variance of β̂_t attains the corresponding bound that can be derived from the unconfoundedness assumption alone. The simple conjunction of these two identifying conditions does not generally permit an unambiguous ranking of the efficiency bounds even when X1 = X2. Nevertheless, by taking appropriate linear combinations of β̂_t and τ̂_t, one can obtain estimators that are more efficient than either of the two. This observation is based on the following elementary lemma.
Lemma 2. Let A0 and A1 be two random variables with var(A0) < ∞, var(A1) < ∞, and var(A1 − A0) > 0. Then var(aA1 + (1 − a)A0) is uniquely minimized over a ∈ R at ā = [var(A0) − cov(A0, A1)]/var(A1 − A0).

For a ∈ R, define β̂_t(a) ≡ aτ̂_t + (1 − a)β̂_t. Lemma 2 implies that for any a ∈ R, β̂_t(a) is consistent for τ_t and is asymptotically normal with asymptotic variance var(aψ_t + (1 − a)φ_t). The optimal weight ā can be obtained as

ā = [var(φ_t) − cov(φ_t, ψ_t)] / var(ψ_t − φ_t).

In other words, β̂_t(ā) will be the most efficient estimator among all linear combinations of β̂_t and τ̂_t. Although ā is unknown in general, it can be consistently estimated by

â = [v̂ar(φ̂_t) − ĉov(φ̂_t, ψ̂_t)] / v̂ar(ψ̂_t − φ̂_t),

where the sample variances and covariances are computed from the estimated influence functions ψ̂_t and φ̂_t. The asymptotic equivalence lemma, for example, Lemma 3.7 of Wooldridge (2010), implies that √n(β̂_t(â) − τ_t) has the same asymptotic distribution as √n(β̂_t(ā) − τ_t). If var(φ_t) = cov(φ_t, ψ_t), then ā = 0, which implies that β̂_t itself is more efficient than τ̂_t (or any linear combination of the two). We give sufficient conditions for this result.
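Given per-observation estimates of the two influence functions, the optimal weight and the combined estimator can be computed as follows (a sketch with our own names; the weight a multiplies the IV-based estimator):

```python
import numpy as np

def combine_att(tau_t_hat, beta_t_hat, psi_t_hat, phi_t_hat):
    """Variance-minimizing combination a*tau_t_hat + (1-a)*beta_t_hat, with
    a estimated from the influence functions (Lemma 2, sketched)."""
    psi_c = psi_t_hat - psi_t_hat.mean()
    phi_c = phi_t_hat - phi_t_hat.mean()
    var_phi = np.mean(phi_c ** 2)
    cov = np.mean(psi_c * phi_c)
    var_diff = np.mean((psi_c - phi_c) ** 2)
    a_hat = (var_phi - cov) / var_diff        # weight on the IV estimator
    return a_hat * tau_t_hat + (1 - a_hat) * beta_t_hat, a_hat
```

Note that a_hat = 0 when the sample analog of var(φ_t) = cov(φ_t, ψ_t) holds, in which case the combination collapses to the HIR estimator.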
Theorem 3. Suppose that Assumption 1 parts (i), (iii), (iv), (v) and Assumption 3 are satisfied, and let

E[Y(d) | X2, Z, D] = E[Y(d) | X2],  d = 0, 1.    (12)

Then var(φ_t) = cov(φ_t, ψ_t), so that ā = 0.

The proof of Theorem 3 is provided in Appendix B. The conditions of Theorem 3 are stronger than those of Theorem 2. The latter theorem only requires that the IV assumption and unconfoundedness both hold at the same time, which in general does not imply the stronger joint mean-independence conditions given in (12). If the null of unconfoundedness is not rejected because (12) actually holds, then β̂_t itself is the most efficient estimator of ATT in the class {β̂_t(a) : a ∈ R}.
The theoretical results discussed in this subsection are qualified by the fact that in practice one needs to pretest for unconfoundedness, while the construction ofβ t (ā) takes this assumption as given. Deciding whether or not to take a linear combination based on the outcome of a test will erode some of the theoretically possible efficiency gain when unconfoundedness does hold, and will introduce at least some bias through Type 2 errors. (A related problem, the impact of a Hausman pretest on subsequent hypothesis tests, was studied recently by Guggenberger 2010.) We will explore the effect of pretesting in some detail through the Monte Carlo simulations presented in the next section.

MONTE CARLO SIMULATIONS
We employ a battery of simulations to gain insight into how accurately the asymptotic distribution given in Theorem 1 approximates the finite sample distribution of our LATE estimator in various scenarios and to gauge the size and power properties of the proposed test statistic. The scenarios examined differ in terms of the specification of the instrument propensity score, the choice of the power series used in estimating it, and the trimming applied to the estimator. All these issues are known to be central for the finite sample properties of an IPW estimator. We also address the pretesting problem raised at the end of Section 4; namely, we examine how much of the theoretical efficiency gain afforded by unconfoundedness is eroded by testing for it, and also the costs resulting from Type 2 errors in situations when the assumption does not hold.
The constant specification (i) represents the benchmark case in which the instrument is completely randomly assigned; it will also serve as an illustration of how covariates that q(·) does not depend on can still be used in estimation to improve efficiency. The linear specifications (ii) and (iii) are functions that are easily estimable by SLE; in fact, here SLE acts as a correctly specified or overspecified parametric estimator. Model (iii) will also pose a challenge to the asymptotic results as q(X) concentrates a lot of mass close to zero. Finally, the rational models (iv) and (v) are intended to be true "nonparametric" specifications in that any implementation of the SLE based on a fixed number of powers is, in theory, only an approximation to the true function (though a quadratic approximation seems already adequate in practice). Of the two, specification (v) stays safely away from zero and one, while (iv) puts a fair amount of weight in a close neighborhood of zero. Clearly, the DGP satisfies one-sided noncompliance as D(0) = 0. The value of the parameter b ∈ [0, 1] governs whether the unconfoundedness assumption is satisfied. In particular, when b = 0 unconfoundedness holds conditional on X 2 = X. Larger values of b correspond to more severe violations of the unconfoundedness assumption. The instrument Z is valid conditional on X 1 = X for any b.
To make a credible case for unconfoundedness in practice, one often needs a large number of theoretically relevant covariates. Here we use five, which is likely insufficient in a lot of applications. Nevertheless, X is large enough to allow us to make a number of salient points without posing an undue computational burden. In Section 6, we will present an additional, smaller Monte Carlo exercise where the setup is based on a real dataset with up to 14 covariates.

The Finite Sample Distribution of the LATE Estimator
In our first exercise, we study the finite sample distribution of the LATE estimator τ̂. The DGP is designed so that the true value of the LATE parameter is independent of q(·) and is approximately equal to τ = −2.73 for b = 0.5, the value chosen for this exercise. Table C.1 shows various statistics characterizing the finite sample distribution of τ̂ and its estimated standard error for n = 250, n = 500, and n = 2500. In particular, we report the bias of τ̂, the standard error of T ≡ √n(τ̂ − τ), the mean estimate of this standard error based on Comment 3 after Theorem 1, and the tail probabilities of the studentized estimator S ≡ (τ̂ − Eτ̂)/s.e.(τ̂) associated with the critical values −1.645 and 1.645. The number of Monte Carlo repetitions is 5000.
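To fix ideas, a minimal sketch of an inverse probability weighted LATE estimator of this general type is given below, taking the estimated instrument propensity score q̂ as given. The weighting scheme shown is the standard one; the paper's exact implementation of τ̂ may differ in details such as normalization of the weights.

```python
import numpy as np

def ipw_late(y, d, z, qhat):
    """IPW estimator of LATE: the ratio of the weighted average outcome
    contrast to the weighted average treatment contrast, where each
    observation receives weight z/qhat - (1-z)/(1-qhat)."""
    w = z / qhat - (1.0 - z) / (1.0 - qhat)
    return np.sum(w * y) / np.sum(w * d)
```

With perfect compliance (D = Z) and a constant instrument propensity score, the estimator reduces to the simple difference in means between the Z = 1 and Z = 0 groups.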
For each specification of q(·), we consider a number of implementations of the SLE. We start with a constant model for the instrument propensity score and then add linear, quadratic, and cubic terms (all powers of W_i and all cross products up to the given order). We use the same power series to estimate all other nonparametric components of the influence function (used in estimating the standard error of τ̂). The choice of the power series in implementing the SLE is an important one; it mimics the choice of the smoothing parameter in kernel-based or local polynomial estimation. To our knowledge, there is no well-developed theory to guide the power series choice in finite samples (though Imbens, Newey, and Ridder 2007 is a step in this direction); hence, a reasonable strategy in practice would involve examining the sensitivity of results to various specifications, as is done in this simulation.
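The "all powers and all cross products up to the given order" construction can be generated mechanically. The helper below is a generic sketch of such a design-matrix builder, not the authors' implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def power_series_terms(X, order):
    """Design matrix with a constant plus all products of the columns of X
    up to total degree `order` (all powers and cross products)."""
    n, k = X.shape
    cols = [np.ones(n)]
    for deg in range(1, order + 1):
        for idx in combinations_with_replacement(range(k), deg):
            term = np.ones(n)
            for j in idx:
                term = term * X[:, j]
            cols.append(term)
    return np.column_stack(cols)
```

For two covariates and order 2, this yields six columns: the constant, x1, x2, x1², x1·x2, and x2². Feeding such a matrix into a logit fit gives the series logit estimator (SLE) at the chosen order.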
When using an IPW estimator in practice, the estimated probabilities are often trimmed to prevent them from getting too close to the boundaries of the [0, 1] interval. Therefore, we also apply trimming to the raw estimates delivered by the SLE. The column "Trim." in Table C.1 denotes the truncation applied to the estimated instrument propensity scores. If the fitted value q̂(X_i) is strictly less than the threshold γ ∈ (0, 1/2), we reset q̂(X_i) to γ. Similarly, if q̂(X_i) is strictly greater than 1 − γ, we reset q̂(X_i) to 1 − γ. We use γ = 0.5% (mild trimming) and, occasionally, γ = 5% (aggressive trimming). The latter is only applied to the Linear 2 and Rational 1 specifications, where the boundary problem is the most severe.
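The trimming rule just described amounts to clipping the fitted values at γ and 1 − γ:

```python
import numpy as np

def trim_pscore(qhat, gamma=0.005):
    """Reset fitted instrument propensity scores below gamma to gamma and
    above 1 - gamma to 1 - gamma (the rule behind the 'Trim.' column)."""
    return np.clip(qhat, gamma, 1.0 - gamma)
```

The default gamma = 0.005 corresponds to the mild 0.5% trimming; gamma = 0.05 gives the aggressive variant.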
Many aspects of the results displayed in Table C.1 merit discussion.
First, looking at the simplest case when neither q nor q̂ depends on X, we see that even for n = 250, the bias of the LATE estimator is very small, its estimated standard error, too, is practically unbiased, and the distribution of the studentized estimator has tail probabilities close to standard normal. Even though the true instrument propensity score does not depend on the covariates, one can achieve a substantial reduction in the standard error of the estimator by allowing q̂ to be a function of X, as suggested by Theorem 3 of Frölich and Melly (2013b). For example, when q̂(X) is linear, the standard error, for n = 2500, falls from about 3.21 to 2.42, roughly a 25% reduction. Nevertheless, we can also observe that if q̂(X) is very generously parameterized (here: quadratic), then in small samples the "noise" from estimating too many zeros can overpower most of this efficiency gain. Specifically, for n = 250 the standard error of the scaled estimator is almost back up to the no-covariate case (3.16 vs. 3.23). Still, the efficiency gains are recaptured for large n.
A second, perhaps a bit more subtle, point can be made about the standard error of τ̂ using the Linear 1 specification for q(X).
Here, the linear SLE acts as a correctly specified parametric estimator while the estimated standard errors are computed under the assumption that q is nonparametrically estimated. Therefore, the estimated standard errors are downward biased, reflecting the fact that even when the instrument propensity score is known up to a finite-dimensional parameter vector, it is more efficient to use a nonparametric estimator in constructing τ̂, as in Chen, Hong, and Tarozzi (2008). Indeed, as the SLE adds quadratic and cubic terms, that is, it starts "acting" more as a nonparametric estimator, the bias vanishes from the estimated standard errors, provided that the sample size expands simultaneously (n = 2500). Furthermore, the asymptotic standard errors associated with the quadratic and cubic SLE (2.72 and 2.93, respectively) are lower than for the linear one (3.11). In cases where the variance of τ̂ is underestimated, the studentized estimator tends to have more mass in its tails than the standard normal distribution (see, e.g., the results for the linear SLE).
Third, as best demonstrated by the Linear 2 model for the instrument propensity score, the limit distribution provided in Theorem 1 can be a poor finite sample approximation when q(X) gets close to zero or one with relatively high probability. This is especially true when the estimator for q(X) is overspecified (quadratic or cubic). For n = 250 and n = 500, the bias of τ̂ ranges from moderate to severe and is exacerbated by more aggressive trimming of q̂. For any series choice, the standard error of the LATE estimator is larger than in the Linear 1 case (the 0.5% vs. 5% trimming does not change the actual standard errors all that much). Furthermore, for q̂ quadratic or cubic, the estimated standard errors are severely upward biased with mild trimming, and still very much biased, though in the opposite direction, with aggressive trimming. Increasing the sample size to n = 2500 of course lessens these problems, though judging from the tail probabilities, the standard normal can remain a rather crude approximation to the studentized estimator. For example, for the cubic SLE with 0.5% trimming the standard error is grossly overestimated and there is evidence of skewness. On the other hand, for the linear and quadratic SLE the estimated asymptotic standard errors display downward bias, presumably due to the "correct parametric specification" issue discussed in the second point above. Somewhat surprisingly, though, the actual standard errors are the smallest for the linear SLE; apparently, even for n = 2500, there is more than "optimal" noise in the quadratic and cubic instrument propensity score estimates.
Fourth, when the instrument propensity score estimator is underspecified, τ̂ is an asymptotically biased estimator of LATE. (Here "underspecified" refers to a misspecified model in the parametric sense or, in the context of series estimation, extending the power series too slowly as the sample size increases.) The bias is well seen in all cases in which the instrument propensity score depends on X but is estimated by a constant. The Rational 1 and Rational 2 models provide further illustration. Here, any fixed power series implementation of the SLE is misspecified if regarded as a parametric model, though the estimator provides an increasingly better approximation to q(·) as the power series expands. For the Rational 1 model, the bias of τ̂ indeed decreases in magnitude as the SLE becomes more and more flexible, with the exception of n = 250. For Rational 2, even the linear SLE removes the bias almost completely and not much is gained, even asymptotically, by using a more flexible estimator. For Rational 1, there is noticeable asymptotic bias in estimating the standard error of τ̂, which would presumably disappear if the sample size and the power series both expanded further. Nevertheless, for both rational models the normal approximation to τ̂ works reasonably well in large samples across a range of implementations of the SLE.
Finally, the results as a whole show the sensitivity of τ̂ to the specification of the power series used in estimating the instrument propensity score q(·). If the power series has too few terms (or expands too slowly with the sample size), then τ̂ may be (asymptotically) biased. On the other hand, using too flexible a specification for a given sample size can cause τ̂ to have severe small sample bias and inflated variance, which is also estimated with bias. More aggressive trimming of the instrument propensity score tends to increase the bias of τ̂ and reduce the bias of s.e.(τ̂), though to an uncertain degree.

Properties of the Test and the Pretested Estimator
We first set b = 0 so that unconfoundedness holds for any specification of q(X) conditional on X_2 = X. All tests are conducted at the 5% nominal significance level and with X_1 = X_2 = X; that is, we drop the cases where q̂ is constant. To further economize on space, we also drop the 5% truncation for the Rational 1 specification. In each of the remaining cases, we consider four estimators of (L)ATT: τ̂_t, β̂_t, their combination β̂_t(â), and a pretested estimator, given by β̂_t(â) whenever the test accepts unconfoundedness and τ̂_t when it rejects it. Trimming is also applied to p̂(·).
In Tables C.2 and C.3, we report, for each estimator, the raw bias, the standard deviation of √n((L)ÂTT − (L)ATT), the mean of the estimated standard deviation, and the mean squared error of √n((L)ÂTT − (L)ATT). We use a naive (but natural) estimator for the standard error of the pretested estimator; namely, we take the estimated standard error of either β̂_t(â) or τ̂_t, depending on which one is used. In addition, we report the actual rejection rates and the average weight across Monte Carlo cycles that the combined estimator assigns to τ̂_t (the mean of â).
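The pretest decision rule and the naive standard error just described can be sketched as follows. The test statistic is taken as given, and the critical value 1.96 corresponds to a two-sided 5% test; the function signature is illustrative, not the authors' notation.

```python
def pretested_att(tau_t, se_tau, beta_comb, se_comb, test_stat, crit=1.96):
    """Pretested estimator: use the combined estimator when the
    unconfoundedness test accepts (|T| <= crit), and the IV-based LATT
    estimator when it rejects. The 'naive' standard error is simply that
    of whichever estimator is used."""
    if abs(test_stat) > crit:
        return tau_t, se_tau        # reject unconfoundedness: fall back on IV
    return beta_comb, se_comb       # accept: use the combined estimator
```

As the simulations below document, this naive standard error tends to understate the true sampling variability of the pretested estimator.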
Again, several aspects of the results are worth discussing. First, there is adequate, though not perfect, asymptotic size control in all cases where the specification of the SLE is sufficiently flexible and there is no excessive trimming. The extent to which the 5% trimming can distort the size of the test in the Linear 2 case is rather alarming; at the very least, this suggests that trimming should be gradually eliminated as the sample size increases.
Second, in almost all cases, the combined estimator has smaller standard errors in small samples than the HIR estimator β̂_t, and the drop is especially large when q̂ is overspecified. While this tends to be accompanied by an uptick in absolute bias, in almost all cases the combined estimator has the lowest finite sample MSE; the only exceptions come from the Linear 2 model with aggressive trimming. As the DGP satisfies the conditions of Theorem 3, the combined estimator puts less and less weight on τ̂_t in larger samples and becomes equivalent to β̂_t unless trimming interferes.
Third, even though the pretested estimator has a higher MSE than β̂_t or the combined estimator, in almost all the cases this MSE is lower than that of τ̂_t. (Again, the only exceptions come in the Linear 2 case with 5% trimming for n = 2500, but here β̂_t itself has a higher MSE than τ̂_t.) Thus, while there is a price to pay for testing the validity of the unconfoundedness assumption, there is still a substantial gain relative to the case where one only has the IV estimator to fall back on. Of course, one would be better off taking unconfoundedness at face value when it actually holds. But as we will shortly see, there is a large cost in terms of bias if one happens to be wrong, and the consistency of the unconfoundedness test helps avoid paying this cost.
Fourth, the naive method described above underestimates the true standard error of the pretested estimator. We briefly examined a bootstrap estimator in a limited number of cases, and the results (not reported) appear to be upward biased. We do not consider these results conclusive, as we took some shortcuts due to computational cost (to study this estimator one has to embed a bootstrap cycle inside a Monte Carlo cycle). We further note that the distribution of the pretested estimator can show severe departures from normality, such as multimodality or extremely high kurtosis.
We now present cases where unconfoundedness does not hold conditional on X. Specifically, we set b = 0.5 again; some additional results for b = 0.25 are available in an online supplement. We focus only on those cases from the previous exercise where size was asymptotically controlled, as power has questionable value otherwise. The results are displayed in Table C.4.
Our first point is that the test appears consistent against these departures from the null-rejection rates approach unity as the sample size grows in all cases examined. Nevertheless, overspecifying the series estimators can seriously erode power in small samples; see the cubic SLE in Table C.4 for q = Linear 1, Rational 1, Rational 2. In fact, in these cases the test is not unbiased. A further odd consequence of overfitting is that power need not increase monotonically with n; see again the cubic SLE in Table C.4 for q = Linear 2.
Second, β̂_t, and hence the combined estimator, is rather severely biased both in small samples and asymptotically (the bias is of course an increasing function of b). Therefore, even though β̂_t generally has a lower standard error than τ̂_t, its MSE, in large enough samples, is substantially larger than that of τ̂_t. As the sample size grows, the pretested estimator behaves more and more similarly to τ̂_t, eventually also dominating β̂_t and β̂_t(â).
Third, in smaller samples the MSE of the pretested estimator is often larger than that of τ̂_t, as the pretested estimator uses β̂_t(â) with positive probability, and β̂_t(â) is usually inferior to τ̂_t due to its bias, inherited mostly from β̂_t. However, there are cases in which the increased bias of the combined estimator is more than offset by a reduction in variance, so that MSE(β̂_t(â)) is lower than MSE(τ̂_t) or MSE(β̂_t) or both. This happens mainly when n = 250 and q̂ is overspecified; see also the cubic SLE for q = Linear 1, Linear 2, Rational 1, Rational 2 in Table C.4. As in these cases power tends to be (very) low, the pretested estimator preserves most of the MSE gain delivered by β̂_t(â) or might even improve on it slightly. This property of the combined estimator mitigates the cost of the Type 2 errors made by the test.

EVALUATING THE UNCONFOUNDEDNESS TEST USING REAL DATA

An Illustrative Empirical Application
We apply our method to estimate the impact of JTPA training programs on subsequent earnings and to test the unconfoundedness of the participation decision. We use the same dataset as Abadie, Angrist, and Imbens (2002), henceforth AAI, publicly available at http://econ-www.mit.edu/faculty/angrist/data1/data/abangim02. As described by Bloom et al. (1997) and AAI, part of the JTPA program (the National JTPA Study) involved collecting data specifically for purposes of evaluation. In some of the service delivery areas, between November 1987 and September 1989, randomly selected applicants were offered a job-related service (classroom training, on-the-job training, job search assistance, etc.) or were denied services and excluded from the program for 18 months (1 out of 3 applicants on average).
Clearly, the random offer of services (Z) can be used, without further conditioning, as an instrument for evaluating the effect of actual program participation (D) on earnings (Y ), measured as the sum of earnings in the 30-month period following the offer. About 36% of those with an offer chose not to participate; conversely, a small fraction of applicants, less than 0.5%, ended up participating despite the fact that they were turned away. Hence, Z satisfies one-sided noncompliance almost perfectly; the small number of observations violating this condition were dropped from the sample. (AAI also ignores this small group in interpreting their results.) The total number of observations is then 11,150; of these, 6067 are females and 5083 are males. We treat the two genders separately throughout.
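In code, dropping the small group of observations that violate one-sided noncompliance (those with Z = 0 but D = 1) amounts to the following; the array names are hypothetical, not those of the AAI dataset.

```python
import numpy as np

def drop_noncompliance_violators(z, d, *arrays):
    """Keep only observations consistent with one-sided noncompliance,
    i.e. drop those who were denied the offer (z == 0) but nonetheless
    participated (d == 1). Any extra arrays (outcomes, covariates) are
    filtered with the same mask."""
    keep = ~((z == 0) & (d == 1))
    return tuple(a[keep] for a in (z, d) + arrays)
```

After this step D(0) = 0 holds in the sample by construction, so the LATT estimate can be read as an ATT estimate under the maintained assumptions.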
The full set of AAI covariates (X) includes "dummies for black and Hispanic applicants, a dummy for high school graduates (including GED holders), dummies for married applicants, 5 age-group dummies, and dummies for AFDC receipt (for women) and whether the applicant worked at least 12 weeks in the 12 months preceding random assignment. Also included are dummies for the original recommended service strategy [. . .] and a dummy for whether earnings data are from the second follow-up survey." (AAI, p. 101) See Table 1 of AAI for descriptive statistics. To illustrate the "sample splitting" method described in Comment 6 after Theorem 1, we also construct a smaller set of controls with dummies for high-school education, minority status (black or Hispanic), and whether the applicant is below age 30.
In Table 1, we present four sets of estimation/test results. In the first exercise, we do not use any covariates in computing τ̂_t and β̂_t. The LATT estimator τ̂_t is interpreted as follows. Take, for example, the value 1916.4 for females. This means that female compliers who actually participated in the program (i.e., were assigned Z = 1) are estimated to increase their 30-month earnings by $1916.4 on average. Since Z is randomly assigned, this number can also be interpreted as an estimate of LATE, that is, the average effect among all compliers. Further, by one-sided noncompliance, $1916.4 is also an estimate of the female ATT. As the difference between τ̂_t and β̂_t = 2146.7 is not statistically significant, the hypothesis of completely random participation cannot be rejected for females. In contrast, β̂_t for males is more than twice as large as τ̂_t, and the difference is highly significant. This suggests that self-selection into the program among men is based partly on factors systematically related to the potential outcomes.
In the next two exercises, we set X 1 and X 2 equal to the restricted set of covariates. First, we split the male and female samples by the eight possible configurations of the three indicators and estimate the instrument propensity score by the subsample averages of Z; then we restrict the functional form to logit with a linear index. The two sets of results are similar both to each other and the results from the previous exercise. In particular, random participation is not rejected for females, while it is still strongly rejected for males. There are factors related to the male participation decision as well as the potential outcomes that are not captured by the set of covariates used.
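The sample-splitting estimator of the instrument propensity score, i.e., the subsample average of Z within each of the 2³ = 8 cells defined by the three binary indicators, can be sketched as follows; this is a generic illustration, not the authors' code.

```python
import numpy as np

def cell_mean_pscore(Z, X_bin):
    """Estimate the instrument propensity score by the subsample average
    of Z within each cell defined by the binary columns of X_bin."""
    # Encode each row of binary indicators as a single cell index.
    codes = X_bin @ (2 ** np.arange(X_bin.shape[1]))
    qhat = np.empty(len(Z), dtype=float)
    for c in np.unique(codes):
        mask = codes == c
        qhat[mask] = Z[mask].mean()
    return qhat
```

With fully discrete covariates this is the fully nonparametric (saturated) estimator, so the logit-with-linear-index specification can be checked against it directly.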
Finally, in the fourth exercise we use the full set of AAI covariates in a linear logit model. Compared with the no-covariate case, the estimated standard errors are slightly lower across the board, but the changes in the point estimates are still within a small fraction of them. Once again, the test does not reject unconfoundedness for females but it does for males.
Since the hypothesis of random treatment participation cannot be rejected for females, β̂_t can also be interpreted as an estimate of ATE. In contrast, β̂_t is likely to be substantially biased as an estimate of male ATE. Furthermore, based on Section 4, one can take a weighted average of τ̂_t and β̂_t to obtain a more efficient estimate of female ATE/ATT. As â ≈ 0 in all cases, the combined estimator is virtually the same as β̂_t and is not reported. Nevertheless, without testing for (and failing to reject) the unconfoundedness assumption, the only valid estimate of female ATT is τ̂_t, which has a much larger standard error than β̂_t.
While the result on male versus female self-selection is robust in this limited set of exercises, one would need to study the program design in more detail before jumping to conclusions about, say, behavioral differences. Understanding how the explicitly observed violations of one-sided noncompliance came about would be especially pertinent, and, as pointed out by a referee, the broader issue of control group substitution documented by Heckman, Hohmann, Smith, and Khoo (2000) would also have to be taken into account. Furthermore, there are potentially relevant covariates (e.g., indicators of the service delivery area) not available in the AAI version of the dataset. In short, the empirical results are best treated as illustrative or as a starting point for a more careful investigation.

An Empirical Monte Carlo Exercise
We supplement our illustrative application with an empirical Monte Carlo exercise. The basic idea, as described by Huber et al. (2013), is to build a data-generating process in which the number of variables, their distributions, and the relationships between them are based on empirical quantities from a relevant dataset (the AAI version of the JTPA data in our case). Of course, in evaluating our test procedure we need to control whether or not the null hypothesis holds, which makes it necessary to introduce some artificial variables and parameters. We focus the exercise on a question left unexplored in our "synthetic" Monte Carlo study; namely, the potential distortions introduced by violations of the one-sided noncompliance assumption. The simulations also provide additional evidence on the size and power properties of the test in a presumably more realistic setting. Similarly to the application, we condition on gender throughout, that is, treat males and females as separate populations.
Mimicking the experimental setup in the National JTPA Study, the instrument Z is a random draw from a Bernoulli(2/3) distribution. Let θ̂ denote the coefficient vector from a logit regression of the observed treatment indicator on a constant and the full set X of the AAI covariates. The potential treatment status indicator D(1) is generated according to D(1) = 1{U ≤ Λ(X'θ̂ + bν)}, where Λ(·) is the logistic cdf, U ∼ uniform[0, 1], ν ∼ N(0, 1), independent of each other, X and Z. The parameter b ≥ 0 governs how strong a role the unobserved variable ν plays in the selection process; D(0) and the potential outcomes will be specified in a way so that unconfoundedness holds if and only if b = 0, that is, ν is the only unobserved confounder. We define D(0) in two different ways: (i) D(0) = D(1) · 1{S ≤ cΦ(ν)} and (ii) D(0) = D(1) · 1{S ≤ c}, where Φ(·) is the standard normal cdf, S is uniform[0, 1], independent of all other variables defined thus far, and c ∈ [0, 1] is a parameter that we calibrate to set π ≡ P[D(0) = 1] equal to various prespecified values. (For example, for c = 0, π = 0 for both specifications.) The multiplicative structure of D(0) ensures that monotonicity is satisfied, that is, there are no defiers in the population. In specification (i) the violation of one-sided noncompliance is due to the confounding variable ν; in case (ii) it is due to a completely exogenous variable. Actual treatment status is D = D(Z).
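The selection process just described can be sketched as follows. Since the displayed equations are not reproduced in this version of the text, the exact functional forms in the sketch (a logit index in X shifted by bν for D(1); a multiplicative indicator for D(0)) are illustrative assumptions consistent with the verbal description, not a transcription of the authors' DGP.

```python
import numpy as np
from math import erf, sqrt

def ncdf(v):
    """Standard normal cdf, elementwise."""
    return np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in v])

def simulate_selection(X, theta_hat, b, c, spec, rng):
    """Sketch of the empirical Monte Carlo selection process. D(1) follows
    a logit index in X shifted by the unobserved confounder nu; D(0)
    multiplies D(1) by a further indicator, which guarantees monotonicity
    (D(0) <= D(1), no defiers)."""
    n = X.shape[0]
    z = (rng.uniform(size=n) < 2.0 / 3.0).astype(int)   # Bernoulli(2/3) offer
    u, s = rng.uniform(size=n), rng.uniform(size=n)
    nu = rng.standard_normal(n)
    d1 = (u <= 1.0 / (1.0 + np.exp(-(X @ theta_hat + b * nu)))).astype(int)
    if spec == "i":                 # violation driven by the confounder nu
        d0 = d1 * (s <= c * ncdf(nu)).astype(int)
    else:                           # violation completely exogenous
        d0 = d1 * (s <= c).astype(int)
    d = np.where(z == 1, d1, d0)    # actual treatment D = D(Z)
    return z, d, d0, d1
```

Setting c = 0 switches off the violation entirely (π = 0), while larger c raises the share of always-takers; b scales the confounding.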
The potential outcome Y(1) is drawn randomly from the empirical distribution of earnings (regardless of treatment status in the data), and Y(0) is specified as a function of Y(1), ν, and an independently generated error ε ∼ N(0, 1), scaled by σ̂, the standard deviation of earnings in the data, with α = −0.47 for males and −0.06 for females. The potential outcome equations were calibrated so that for b = 1 and π = 0, ATT = LATT ≈ 1667 in the male population, and for b = 0 and π = 0, ATT = LATT ≈ 1920 in the female population. These settings match the value of τ̂_t in the last panel of Table 1 as well as the conclusion of the test in the two populations. The actually observed outcome is of course Y = Y(D).
In each Monte Carlo cycle, we draw a sample of size n_M for males and n_F for females. We perform the unconfoundedness test at the 5% nominal significance level, using the specifications given in the bottom panel of Table 1. The chosen sample sizes are n_M = 500; 5000 and n_F = 600; 6000 (the larger figure for either gender matches the application). We report rejection rates for various values of the parameters b and π = P[D(0) = 1]. For females, we restrict attention to the size of the test (b = 0), while for males we study size as well as power (b = 0, 0.5, 0.75, 1, 1.25). We calibrate c to give π = 0, 0.03, 0.06, and 0.12, where π = 0 implies that one-sided noncompliance is satisfied (and specifications (i) and (ii) coincide). When π > 0, we perform the test in two different ways: first we keep individuals that apparently violate one-sided noncompliance ("keep") and then we drop them from the sample ("drop"); the expected proportion of such observations is π/3. The number of Monte Carlo repetitions is 2500 for the smaller values of n_M and n_F, and 1000 for the larger.
The simulation results are presented in Table 2. Under specification (i), violation of one-sided noncompliance causes ATT ≠ LATT. Theory predicts that in this case our test statistic explodes, even for b = 0, as n → ∞. Viewed as a test of the unconfoundedness assumption, this amounts to potentially severe size distortion, while the effect on finite sample power is generally ambiguous (the bias in β̂_t when b > 0 might partially offset the difference between ATT and LATT or add to it).
As shown by panel (i) in Table 2, there is little evidence of size distortion for n M = 500 and n F = 600, even when π = 0.12. However, power is also quite poor, likely because the number of covariates used in estimating the propensity score is fairly large relative to the sample size (dim(X) = 13 for males and 14 for females). This observation accords well with the earlier finding in the traditional Monte Carlo exercise that overspecifying the propensity score estimator can lead to severe reduction in power. For n M = 5000 and n F = 6000, size distortion becomes quite significant. For π = 0.06, actual size is roughly double the nominal size for either gender, while for π = 0.12 it is triple. At the same time, power also increases significantly, and it is interesting to note that even after adjusting for the size distortion, power tends to be larger for π > 0 than for π = 0, at least for smaller positive values of b and the "drop" option.
Specification (ii) for D(0) represents a polar case in which one-sided noncompliance is violated, but the significance level of the test is unaffected, because ATT = LATT. This is a consequence of the fact that Z is completely randomly assigned, and D(0) is completely randomly assigned when D(1) = 1; the common value of the two parameters is then straightforward to derive. Indeed, as shown by panel (ii) of Table 2, the nominal 5% size remains valid even for n_M = 5000 and n_F = 6000. There is, however, evidence that violation of one-sided noncompliance reduces power, suggesting that the bias of β̂_t is a (slightly) decreasing function of π for a given b > 0. Also, the "drop" option seems to result in a small upward shift in the entire finite sample power curve.
Generally speaking, the results suggest that the rejection of the unconfoundedness assumption for males in the empirical exercise is unlikely to be a product of potential size distortions caused by the apparently mild violation of one-sided noncompliance (in the application P [D(0) = 1] ≈ 3 × 0.005 = 0.015).
On the other hand, the nonrejection for females could reflect lack of power against moderate violations of unconfoundedness. In the setup considered above, the unobserved confounder ν must play a significant role in the selection process for unconfoundedness to be rejected with reasonably high probability. For example, for males var(bν)/var(X'θ̂) ≈ 1.5 for b = 0.5, and still power is only about 34% for π = 0 and n_M = 5000.

CONCLUSION
Given a (conditionally) valid binary instrument, nonparametric estimators of LATE and LATT can be based on imputation or matching, as in Frölich (2007), or weighting by the estimated instrument propensity score, as proposed in this article. The two approaches are shown to be asymptotically equivalent; in particular, both types of estimators are √n-consistent and efficient. When the available binary instrument satisfies one-sided noncompliance, the proposed estimator of LATT is compared with the ATT estimator of HIR to test the assumption that treatment assignment is unconfounded given a vector of observed covariates. To our knowledge, this is the first such test in the literature. Acceptance of unconfoundedness allows one to estimate ATE and improve on the asymptotic variance of the IV-based (L)ATT estimator. Simulations show that there are finite sample MSE gains even after the pretesting effect is taken into account. An illustrative application of the test using JTPA data rejects unconfoundedness for males but not for females.

APPENDIX A: PROOFS AND TABLES
A. The Proof of Theorem 1. We only show the proof for τ̂; the treatment of τ̂_t is similar and is available in the online supplement. To simplify notation, we set X_1 = X. Write τ̂ as the ratio of a numerator and a denominator term, each an inverse probability weighted sample average. The asymptotic properties of these two terms are established in the following lemma.
To make use of Lemma A.1, we take a first-order Taylor expansion of the ratio τ̂ around its probability limit, yielding (8). Applying the Lindeberg–Lévy CLT to (8) then shows the asymptotic normality of τ̂. Proof of Lemma A.1. Recall the definition of W(z). By Assumption 1(ii), it is true that E[W(z)|Z, X] = E[W(z)|X], z = 0, 1. That is, if we treat Z as the treatment assignment and W(z) as the potential outcomes, W(z) and Z are unconfounded given X. Also, it is straightforward to check that Assumptions 1-5 of Theorem 1 of HIR are satisfied. The result for the numerator follows directly from it; a similar argument applies to the denominator.

APPENDIX B. THE PROOF OF THEOREM 3
Put X = X_1 = X_2. Under the conditions of Theorem 3, including one-sided noncompliance, the following hold, where the fourth equality is valid because the IV assumption and the unconfoundedness assumption hold jointly as in (12). Also, the second equality in the next display holds since D = 0 when Z = 0. Define ρ_1(X) and rewrite ψ_t(Y, D, Z, X) accordingly. Note that the first equality in the second line holds since ZD = 1 with probability one, and the second equality holds since E[D(Y(1) − ρ_1(X))|X] = 0.
This shows Theorem 3.