Correcting for Endogeneity in Models with Bunching

Abstract We develop a novel control function approach in models where the treatment variable has bunching at one corner of its support. This situation typically arises when the treatment variable is a constrained choice and some observations choose the corner solution. The method exploits distributional shape restrictions but makes no exclusion restrictions. We provide estimators and establish their asymptotic behavior, prove the convergence of the bootstrap, and develop tests of the identification assumptions. An application reveals that watching television has no effect on cognitive skills and a negative effect on noncognitive skills in children.


Introduction
When the treatment variable is constrained to be above or below a certain threshold, bunching of observations is often seen at the threshold itself. Caetano (2015) develops a test of exogeneity in these situations based on the idea that unobservables vary discontinuously at the threshold. In this article, we show that the same idea can be leveraged further to build a correction for endogeneity, provided that further structure is imposed. Specifically, we impose a restriction on the shape of the distribution of the confounders conditional on the controls, but we allow the parameters of that distribution to be nonparametric functions of the controls. In particular, all of the controls may be endogenous. The approach does not require exclusion restrictions or specific data structures (e.g., a panel), so it can be useful when none of the well-established selection-on-unobservables identification strategies are applicable, either because they are infeasible or because they do not identify the parameter of interest.
In a linear model, the method translates to adding a generated control to the original regression. The entire approach in this case can be implemented with packaged software. We derive the asymptotic behavior of the estimator of the treatment effect coefficient, provide a consistent estimator of the standard errors, and prove the consistency of the bootstrap. We also develop tests for all the assumptions in the model. Finally, we extend the approach beyond the linear model, showing that it can be applied in many widely used nonlinear models. Examples include correlated random effects models, partially linear models, some types of nonparametric nonseparable models, and probit models with endogeneity.
We apply the approach to estimate the effect of time spent watching television on children's skills using time diary data from the Panel Study of Income Dynamics Child Development Supplement (PSID-CDS). We find strong evidence of selection on unobservables in this application. Our correction approach reveals that television time has insignificant, positive effects on children's cognitive skills and significant, negative effects on their noncognitive skills. We test all our identifying assumptions using the tests we propose.
Our method can be understood as a control function approach (e.g., Heckman and Robb 1985; Navarro 2010). However, unlike typical control function approaches, our method does not require exclusion restrictions, exploiting shape restrictions instead. There are alternative approaches in the literature that identify treatment effects without exclusion restrictions, but the assumptions in those methods apply to different contexts. For instance, Klein and Vella (2010) and Lewbel (2012) show that heteroscedasticity can be used to achieve identification in linear models. D'Haultfoeuille, Hoderlein, and Sasaki (2021) show that if the instrument satisfies a local irrelevance condition, then it is possible to identify the causal effect of interest in nonseparable models without an exclusion restriction. For models with binary treatment, see Millimet and Tchernis (2013) and the references therein for approaches that achieve identification without exclusion restrictions.
Our model is also related to the sample selection correction literature following Heckman (1979). That literature typically exploits distributional assumptions, as in our case. However, the key distinction is that our setup is not a sample selection model: no variable in the structural equation is censored in our context. In particular, both the outcome and the treatment variable are observed for the whole sample.
There is a large applied literature using bunching in the outcome variable (rather than in the treatment variable) to infer treatment effect parameters. Recently, there have been advancements in the theoretical treatment of identification in that context, notably in Blomquist et al. (2019) and Bertanha, McCallum, and Seegert (2020), on the identification of elasticities with respect to changes in a schedule of incentives (e.g., taxable income elasticity), and in Goff's (2020) generalization to models where the outcome is a choice subject to a budget set with a kink. Blomquist et al. (2019) and Bertanha, McCallum, and Seegert (2020) establish that identification in that context is impossible without restrictions on the conditional distribution of the unobservables similar to those imposed in our approach. However, unlike in this literature, it is not necessary in our setting to specify how the bunched variable is chosen: we specify neither the optimization function nor the budget constraint.
Bunching at one of the extremes of the distribution of the treatment variable is often observed when the treatment is a choice constrained to be nonnegative, as in the case of demand or inputs to production. Examples include behavioral variables like the consumption of vitamin supplements, cigarettes, alcohol, and coffee; financial variables such as credit card debt, credit access, expenditure on advertisements, and bequests; variables quantifying different uses of time such as exercising, working, doing homework, volunteering, and using social media; and count data such as the number of children, the frequency of doctor visits, and the crime rate. Moreover, our approach is a natural solution whenever exogeneity is rejected by Caetano's (2015) test, which has been applied in a variety of settings in economics, political science, and finance.

We develop a general inference framework that makes extensive use of Chen, Linton, and Van Keilegom's (2003) work for extremum estimators with possibly infinite dimensional nuisance parameters. However, the primitive results for stochastic equicontinuity in that article, which would greatly simplify the analysis, cannot be applied in our context. Instead we must prove the stochastic equicontinuity directly, and for this we use results in Pakes and Pollard (1989) and Andrews (1994). We also relax Chen, Linton, and Van Keilegom's (2003) almost sure stochastic equicontinuity requirement for the bootstrap, which has no applicable primitive conditions in the literature, making it hard to defend in practice. We replace it with stochastic equicontinuity in probability requirements, which are standard and for which primitives are well known (generally Donsker classes, e.g., smooth functions, limited variation functions, etc.).
To establish the convergence of the estimator in the specific tail symmetry case we use in our application, we develop new results for the estimation of quantiles at random (estimated) probabilities, and trimmed means at random (estimated) trimming points. Since these results may be of interest beyond the matters studied in this article, we present them in general notation which can stand alone (see Lemmas G.1, G.2, and G.3 in Appendix G.6). The trimmed mean results apply Fang and Santos's (2019) recent findings on Hadamard directional differentiability.
The article is laid out as follows. Sections 2 and 3 present the identification approach, and Sections 4 and 5 present estimators and asymptotic results. Section 6 develops tests aimed at detecting violations of all the identifying assumptions. In Section 7, we estimate the effect of time spent watching television (TV) on children's cognitive and noncognitive skills using our method. Other applications of this method can be seen in Caetano, Caetano, and Nielsen (2022) and Caetano et al. (2022). Section 8 concludes. The Appendix develops extensions to nonlinear models, presents further material on estimation and on our application, an empirical Monte Carlo study, and proofs. The code for implementing the method is available at https://github.com/GregorioCaetano/Bunching.

Correction Strategy
The main idea behind our approach can be contextualized using the example of our application, discussed in more detail in Section 7. We are interested in estimating the effect of the treatment X, hours per week the child watches TV, on the outcome Y, the child's cognitive or noncognitive skills. Figure 1 shows the unconditional cdf (left panel) and the empirical distribution (right panel) of X. About 5% of the sample is bunched at X = 0.
Why would this bunching occur? One explanation may be given if X is chosen in an optimization problem. Suppose that the amount of TV watching is chosen taking the family and child's characteristics, preferences and constraints into account. One of the constraints of this problem is that the amount of time spent watching TV cannot be negative. We denote the solution of the optimization problem where the nonnegativity constraint is removed as X*. In this setting, an explanation for the bunching at X = 0 is that some individuals find the nonnegativity constraint binding (i.e., X* < 0), so they choose a "corner solution." [6] The variable X* indexes all factors that affect the demand for watching TV, such as preferences for TV watching, preferences for alternative activities, and any additional constraints (e.g., all activities must add up to 24 hr a day). Note that this conceptualization of X* motivates our approach, but it is not necessary as long as the model below holds.
[Footnote 6] It may be difficult to conceptualize the idea that X* can be negative, as it would mean that someone would want to choose negative amounts of TV watching. It may be easier to think of X* at X = 0 as a measure of the distance from exact indifference between watching some TV and alternative activities. For instance, those at X* = −0.1 have characteristics, preferences and constraints that led them to be nearly indifferent between watching TV and another activity at X = 0, while those at X* = −3 have characteristics, preferences and constraints that led them to be farther from indifference at X = 0 (e.g., this family could be equal in every way to a family of type X* = 0, except for having a higher relative preference for playing sports versus watching TV).

The treatment variable X is related to X* according to [7]

$$X = \max\{0, X^*\}, \quad \text{with } 0 < P(X^* < 0) < 1. \qquad (1)$$

The model's key requirement, 0 < P(X* < 0) < 1 in (1), is that the nonnegativity constraint is binding for part of the population. For those observations, X* is different from the actual choice X.
We decompose X* into a part that is determined by the characteristics we observe in the data, Z, and a remainder η, that is,

$$X^* = Z'\pi + \eta. \qquad (2)$$

We now consider the outcome equation, which is where we impose structure:

$$Y = \beta X + Z'\gamma + \delta\eta + \varepsilon, \quad \text{where } E[\varepsilon \mid X, Z] = 0. \qquad (3)$$

This equation specifies the unobservable determinant of the outcome as δη + ε. Here, η is a sufficient index of all confounders, in the sense that, if we were able to observe and control for it, there would be no endogeneity problem. If δ = 0, then X is exogenous (i.e., E[δη + ε|X, Z] = 0), and thus β is identifiable as in the standard selection-on-observables model. If δ ≠ 0, then X is endogenous. We keep the linear specification here for concreteness, as linearity is often assumed by researchers in practice, but we show in Appendix A that (2) and (3) can be generalized substantially, including to some nonparametric nonseparable models.
By (1) and (2), η = X* − Z'π = X + X*1(X = 0) − Z'π. Substituting this expression into (3) and taking expectations conditional on X and Z gives

$$E[Y \mid X, Z] = \beta X + Z'(\gamma - \pi\delta) + \delta\,\big(X + E[X^* \mid X^* \le 0, Z]\,1(X = 0)\big). \qquad (4)$$

If E[X*|X* ≤ 0, Z = z] can be identified for all z ∈ Z|X = 0, then Z and X + E[X*|X* ≤ 0, Z]1(X = 0) are a proxy for the omitted variable η. Since Z is already a control in the structural equation, we would only need one additional control, the correction term X + E[X*|X* ≤ 0, Z]1(X = 0).

[Footnote 7] All equations and results involving random variables should be read as holding almost surely. P denotes the probability, and details about the implied probability spaces and conditional sigma-algebras are self-evident and thus omitted. Z denotes the support of the distribution of Z, and Z|A denotes the support of the conditional distribution of Z given A. For brevity, we often mention the support of the variable V when we mean the support of the distribution of V. Finally, the expectation E is assumed to exist wherever written.
Correcting for endogeneity thus depends on the identification of E[X * |X * ≤ 0, Z].However, since X * is not observed when it is negative, E[X * |X * ≤ 0, Z] is not identifiable without additional assumptions, such as shape or parametric restrictions on the distribution of X * |Z.We examine some of these assumptions and derive the corresponding expressions for E[X * |X * ≤ 0, Z] in Section 3.
Our approach therefore exploits bunching and two assumptions: the linearity assumption in (3) (i.e., that it is sufficient to control for Z and η linearly), and whichever shape or distributional assumption we make for the identification of E[X*|X* ≤ 0, Z]. With these assumptions, we can identify β, (γ − πδ) and δ in equation (4). In Section 6, we present several tests for both of these assumptions. In Appendix A, we relax the linearity assumption and show how the correction may be implemented in some nonlinear and nonparametric models.
In Section 4, we prove the asymptotic normality of the corrected regression coefficients, that is, the estimators of (β, (γ − πδ), δ), when the correction uses any consistent estimator Ê[X*|X* ≤ 0, Z] satisfying some general conditions. We also derive a consistent estimator of the standard errors and prove the consistency of the bootstrap. In Section 5, we discuss how to construct the estimator Ê[X*|X* ≤ 0, Z].
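To make the correction concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a scalar control Z, Gaussian η with unit variance (so that E[X*|X* ≤ 0, Z] has the standard truncated-normal form and is treated as known), and illustrative parameter values chosen here; the helper names `corner_mean`, `ols`, and `simulate` are hypothetical. It compares the uncorrected regression of Y on X and Z with the regression that adds the correction term X + E[X*|X* ≤ 0, Z]1(X = 0).

```python
import random
from statistics import NormalDist

STD = NormalDist()

def corner_mean(m, s):
    """E[X* | X* <= 0] when X* ~ N(m, s^2): m - s * phi(c) / Phi(c), with c = -m / s."""
    c = -m / s
    return m - s * STD.pdf(c) / STD.cdf(c)

def ols(rows, y):
    """Least squares coefficients via the normal equations and Gauss-Jordan elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def simulate(n, beta=1.0, delta=0.5, seed=0):
    """Simulate the bunching model and return (uncorrected, corrected) OLS coefficients."""
    rng = random.Random(seed)
    naive_rows, corrected_rows, y = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        eta = rng.gauss(0.0, 1.0)
        m = 0.5 + 0.8 * z                 # illustrative Z'pi (assumed values)
        x = max(0.0, m + eta)             # X = max{0, X*}
        # Correction term: X + E[X* | X* <= 0, Z] * 1(X = 0)
        corr = x + (corner_mean(m, 1.0) if x == 0.0 else 0.0)
        naive_rows.append([1.0, x, z])
        corrected_rows.append([1.0, x, z, corr])
        y.append(beta * x + 0.3 + 0.2 * z + delta * eta + rng.gauss(0.0, 1.0))
    return ols(naive_rows, y), ols(corrected_rows, y)

naive, corrected = simulate(20000)
print("uncorrected coefficient on X:", round(naive[1], 3))
print("corrected coefficient on X:", round(corrected[1], 3))
print("coefficient on correction term:", round(corrected[3], 3))
```

In this design the coefficient on X in the corrected regression should be close to the assumed β = 1 and the coefficient on the correction term close to the assumed δ = 0.5, while the uncorrected coefficient on X absorbs part of δ.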

Remark 2.1 (On the Rank Condition and the Probability of Bunching).
The rank condition for the identification of β in (4) holds if X, Z and X + E[X * |X * ≤ 0, Z]1(X = 0) are linearly independent.The linear independence of X and X +E[X * |X * ≤ 0, Z]1(X = 0) is guaranteed if, and only if, P(X * < 0) > 0 and P(X > 0) > 0. This means that the approach cannot be applied in standard models without bunching, as bunching is a necessary condition for identification.The linear independence of Z, X and X + E[X * |X * ≤ 0, Z]1(X = 0) imposes the constraint that no linear combination of the latter two terms be included in Z.
In Appendix B.1, we derive the expression of the rank condition as a function of the bunching probability for the nonextreme cases (i.e., 0 < P(X = 0) < 1), and we show that it is not generally possible to separate the effect of the probability of bunching from other functions of the distribution of X * which also affect the rank condition.Thus, any study of how the rank condition changes with P(X = 0) will depend on how the distribution of X * is modified in order to achieve a shift in P(X = 0).
In Appendix F.3, we simulate changes in P(X = 0) inside our Monte Carlo experiments by changing E[X * |Z], allowing all quantities in the rank condition to vary as a consequence.We find that the variance of our estimators is stable for bunching rates away from extremes, and increases sharply near extreme bunching probabilities.

Remark 2.2 (On the Role of Controls).
The approach can be implemented without controls, provided the assumptions are valid (i.e., (3) holds with Z ≡ 1, and E[X * |X * ≤ 0] is identified, see Section 3).In this case, there are two constants π 0 and γ 0 such that the correction term is linearly independent of X, and so β is identifiable.
If (3) holds for a given Z, what is the consequence of using a different vector of controls Z̃ ≠ Z? In Appendix B.2, we study this situation in detail. In general, using a different control vector impacts the identification of β through the parts of ε and of the omitted controls (the elements of Z not included in Z̃) that are not linearly predicted by X* and the included controls Z̃.

Remark 2.3 (On the Effect of Misidentification of E[X * |X * ≤ 0, Z]).
What is the consequence of using a different function ẽ(Z), instead of the true expectation E[X*|X* ≤ 0, Z]? In Appendix B.3, we show that this introduces the omitted variable (ẽ(Z) − E[X*|X* ≤ 0, Z])1(X = 0) into the model, and study its consequences in detail. In general, using a mistaken expectation in the correction impacts the identification of β through δ and the part of the omitted variable which is not predicted by Z and the correction term X + ẽ(Z)1(X = 0). We also show that, if the rank condition holds when using the true correction term, then the identification of β is generally robust to small mistakes in the identification of the expectation.

Identification of E[X * |X * ≤ 0, Z]
Since we only observe X* when it is positive, E[X*|X* ≤ 0, Z] is an out-of-sample moment, and thus it cannot be nonparametrically identified. Nevertheless, if X*|Z has a stable distribution that maintains certain properties on its entire support, then the fact that we do observe the distribution of X*|Z when X* > 0 may be sufficient for the identification of this out-of-sample expectation.
More specifically, identification of the expectation can be achieved by relying on properties of the distribution of X*|Z = z that hold both when X* is negative and for at least some positive values of X*. The following theorem (Theorem 3.1), proved in Appendix G.1, states formally a sufficient condition for identification of E[X*|X* ≤ 0, Z = z]. Define F_{X|Z=z} and F_{X*|Z=z} as the cumulative distribution functions (cdfs) of X|Z = z and X*|Z = z, respectively, and ℱ as the space of cdfs on R.
Parametric families usually satisfy the condition in Theorem 3.1, provided all the parameters that determine E[X*|X* ≤ 0, Z = z] also affect the distribution of X|Z = z. [8] Next, we consider two large classes of distributions that satisfy Theorem 3.1, and for which the identification of E[X*|X* ≤ 0, Z = z] can be obtained in closed form. Note that the identification of E[X*|X* ≤ 0, Z = z] for a specific z is not necessarily tied to the identification of the expectation for a different value z′. Therefore, different identification assumptions may be made for different values of Z. Below, we denote the generalized inverse of a cdf F ∈ ℱ as

$$F^{-1}(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}.$$

Tail Symmetric Distributions
The class of distributions that are symmetric in the tail is formally characterized by Assumption 1.

Figure 2 shows an example of a distribution that is symmetric in the tails. Following Powell (1986), conditional symmetry assumptions have been used extensively in the censoring literature. For a large list of citations in the applied and theoretical literature which use conditional symmetry assumptions, see Chen and Tripathi (2017). This class includes many well-known distribution families, such as Gaussian, Student's-t, logistic, and uniform. Tail symmetry is a weaker assumption than symmetry, since it also allows distributions that are not symmetric everywhere, as depicted in Figure 2.
Under Assumption 1, the conditional expectation can be identified using the formula

$$E[X^* \mid X^* \le 0, Z = z] = F^{-1}_{X|Z=z}\big(1 - F_{X|Z=z}(0)\big) - E\big[X \mid X \ge F^{-1}_{X|Z=z}(1 - F_{X|Z=z}(0)),\, Z = z\big]. \qquad (5)$$

[Footnote 8] This is not satisfied if elements of the parameter vector affect F_{X*|Z=z}(x) only when x ≤ 0. One example is the piecewise-linear distribution family with a kink at zero, in which the expectation varies with a parameter ω_1. However, F_{X*|Z=z}(x) varies with ω_1 only when x ≤ 0, so the expectation is not identifiable in this family.
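As a sanity check on the tail symmetry idea, the sketch below works through a Gaussian latent variable X* ~ N(μ, σ²) with μ ≥ 0, a case the text lists as tail symmetric. It takes the identification formula to be the quantile of X at probability 1 − F_{X|Z=z}(0) minus the mean of X above that quantile, and compares it with the standard truncated-normal mean of X* below zero. The function name and parameter values are illustrative assumptions.

```python
from statistics import NormalDist

def tail_symmetry_identity(mu, sigma):
    """Return (value from the tail symmetry formula, direct truncated-normal mean)."""
    std = NormalDist()
    latent = NormalDist(mu, sigma)
    p0 = latent.cdf(0.0)                  # bunching probability F_X(0) = P(X* <= 0)
    if p0 > 0.5:
        raise ValueError("the mirror quantile is only observed when P(X = 0) <= 0.5")
    q = latent.inv_cdf(1.0 - p0)          # quantile of X at 1 - F_X(0); equals 2 * mu here
    a = (q - mu) / sigma
    upper_mean = mu + sigma * std.pdf(a) / (1.0 - std.cdf(a))   # E[X | X >= q]
    from_formula = q - upper_mean
    c = -mu / sigma
    direct = mu - sigma * std.pdf(c) / std.cdf(c)               # E[X* | X* <= 0]
    return from_formula, direct

from_formula, direct = tail_symmetry_identity(mu=1.0, sigma=2.0)
print(round(from_formula, 4), round(direct, 4))
```

Both numbers agree because the lower tail below zero is the mirror image of the upper tail above 2μ, and the upper tail is fully observed in X when 2μ ≥ 0.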

Location-Scale Distributions
The conditional location-scale class generated by the function H is characterized by Assumption 2.
The location-scale family is a well-known class of distributions studied at length in statistics, finance and decision theory (see e.g., Meyer 1987; Wong and Ma 2008; Hazra et al. 2017). All parametric symmetric families, such as Gaussian, Student's-t, logistic, and uniform are location-scale, as well as some parametric asymmetric families, such as the two-parameter exponential. Some nonparametric families also satisfy this condition, such as the class of symmetric distributions with nonnegative median. Because −μ_z/σ_z = H^{-1}(F_{X|Z=z}(0)), and because σ_z can then be recovered from E[X|Z = z], the conditional expectation can be identified in closed form, as given in equation (6). Equation (6) is a deterministic function of H, except for the conditional expected treatment E[X|Z = z], and the conditional probability of bunching F_{X|Z=z}(0). Note that this equation is well defined because we only need to identify the expectation for z ∈ Z|X = 0, which implies that F_{X*|Z=z}(0) > 0 (this follows from a more general claim proved in Appendix G.1).
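The following derivation sketches how a closed form of this kind arises. It assumes that the location-scale condition reads F_{X*|Z=z}(x) = H((x − μ_z)/σ_z) with H continuous and strictly increasing; the symbols U and c_z are introduced here for illustration only, and the final expression may be arranged differently from the article's equation (6).

```latex
\begin{aligned}
X^* &= \mu_z + \sigma_z U, \quad U \sim H, \qquad
c_z := -\mu_z/\sigma_z = H^{-1}\big(F_{X|Z=z}(0)\big), \\
X &= \max\{0, X^*\} = \sigma_z \max\{0,\, U - c_z\}
\;\Longrightarrow\;
E[X \mid Z=z] = \sigma_z \int_{c_z}^{\infty} (u - c_z)\, dH(u), \\
E[X^* \mid X^* \le 0, Z=z] &= \sigma_z\, E[U - c_z \mid U \le c_z]
= \frac{\sigma_z}{F_{X|Z=z}(0)} \int_{-\infty}^{c_z} (u - c_z)\, dH(u) \\
&= E[X \mid Z=z]\;
\frac{\int_{-\infty}^{c_z} (u - c_z)\, dH(u)}
     {F_{X|Z=z}(0)\, \int_{c_z}^{\infty} (u - c_z)\, dH(u)} .
\end{aligned}
```

Under these assumptions, everything on the right-hand side other than E[X|Z = z] and F_{X|Z=z}(0) is a known functional of H, and the division by F_{X|Z=z}(0) is where the requirement F_{X*|Z=z}(0) > 0 enters.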
Example 3.1 (The Gaussian Family). Let Φ be the standard normal cdf, φ its derivative, and λ = φ/Φ. The Gaussian family is the set of distribution functions which satisfy

$$F_{X^*|Z=z}(x) = \Phi\!\left(\frac{x - \mu_z}{\sigma_z}\right)$$

for some pair (μ_z, σ_z), where σ_z > 0.
The Gaussian family satisfies Assumption 2 with H = Φ, and is therefore location-scale. The expectation may thus be identified using (6), which in this case yields a closed-form expression in terms of Φ, φ, and λ. If P(X = 0|Z = z) ≤ 0.5, the Gaussian family also satisfies Assumption 1, and is therefore tail symmetric. In this case, the expectation may also be identified using (5).

General Estimation and Asymptotic Results
For a given estimator of the expectation Ê[X*|X* ≤ 0, Z], estimation of β follows (4) and consists of an OLS regression of Y onto X, Z and the estimated correction term, X + Ê[X*|X* ≤ 0, Z]1(X = 0). We prove in this section the asymptotic normality of β̂ and δ̂, propose an estimator of the asymptotic variance, and prove its consistency. We also prove the consistency of the bootstrap. Supplementary details and further results are in Appendix C.1. Our theorems allow for broad classes of estimators of E[X*|X* ≤ 0, Z], including estimators not discussed in this paper. In Section 5, we examine the estimation of E[X*|X* ≤ 0, Z], and in Appendix C.2 we prove that the specific estimators used in the empirical application satisfy the conditions of the general theorems in this section.
Let the vector of regressors be W = (X, Z', X + ψ_0(Z)1(X = 0))', with W_i an observation of this vector. The corresponding parameter vector is denoted θ_0 = (β, (γ − πδ)', δ)'. Suppose that Assumption 4 in Appendix C.1 holds, together with Assumption 3. Assumption 4 in Appendix C.1 includes the technical requirements that several moments are bounded (equivalent to White 1980), that θ_0 belongs to a compact set, and that the expectation and its estimator are well behaved functions. [10]

Assumption 3(i) allows the data to not be identically distributed. Assumption 3(ii) allows the expectation to be estimated at nonparametric rates, provided the estimator is uniformly consistent at a minimum rate of n^{1/4}. Given Assumptions 3(ii) and 4(iii), Assumption 3(iii) may be substituted by an alternative condition (see Van der Vaart 1998). Note that the first element of (ψ̂(Z_i) − ψ_0(Z_i))1(X_i = 0)W_i is equal to zero, and therefore both the first row and the first column of the limiting covariance matrix in Assumption 3(iii) are equal to zero. Other rows and columns of this matrix may also be equal to zero, for example if Z includes X^2. The normal limit in Assumption 3(iii) should thus be understood as a random variable with a degenerate normal distribution with a singular covariance matrix.

[Footnote 10] The use of E in this expression gives the impression that the left hand side is constant. However, the expectation here is taken with respect to X_i and Z_i, but conditional on the data that generated ψ̂. Thus, the left hand side term is a random variable. This notation can be confusing, but it is standard in the empirical process literature (e.g., Chen, Linton, and Van Keilegom 2003). To clarify the notation further, we give an example for two variables V and Q, where V is discrete and assumes values {v_1, ..., v_M}.
Theorem 4.1 establishes that the coefficients estimated with the correction approach are asymptotically normal at parametric rates.
Theorem 4.1. If equations (1), (2), and (3), and Assumption 3 hold, then √n(θ̂ − θ_0) converges in distribution to a mean-zero normal vector. Its covariance matrix combines the asymptotic covariance matrix of the coefficients of a regression of Y onto X, Z, and X + ψ_0(Z)1(X = 0) with a term that accounts for the estimation of ψ_0.

The proof (Appendix G.2) applies Theorem 2 in Chen, Linton, and Van Keilegom (2003). Establishing the stochastic equicontinuity condition 2.5' in Chen, Linton, and Van Keilegom (2003) is challenging. For this, we show that Lemma 2.17 in Pakes and Pollard (1989) can be directly applied in our context.
Next, we present an estimator of the asymptotic variance. Let Ŵ be the matrix of regressors, with rows equal to Ŵ_i := (X_i, Z_i', X_i + ψ̂(Z_i)1(X_i = 0))', and let the squared residuals of the corrected regression, for i = 1, ..., n, be collected in a diagonal matrix. Finally, let V̂ be a matrix with row i equal to (Ĉ_{i1}1(X_1 = 0, X_i = 0), ..., Ĉ_{in}1(X_n = 0, X_i = 0)), where the Ĉ_{ij} are estimators of the asymptotic covariance of Ê[X*|X* ≤ 0, Z = Z_i] and Ê[X*|X* ≤ 0, Z = Z_j], precisely defined in Assumption 5 in Appendix C.1. Then, an estimator of the asymptotic variance of θ̂ is the sum of two terms. The first term is simply the Eicker-White covariance estimator in a regression of Y onto X, Z and X + ψ̂(Z)1(X = 0). The second term is the penalty resulting from the fact that we are using an estimate of the expectation, instead of the true value ψ_0(Z).

Theorem 4.2. If equations (1) and (2), and Assumptions 3 and 5 (in Appendix C.1) hold, then this variance estimator is consistent.

The proof of this theorem is in Appendix G.3. It uses a special strong law of large numbers for U-statistics. The limit is well defined by Assumption 3(i) and Assumption 4(i), and is established in White (1980).
Supposing G admits a density g(x, z; κ), the parameter κ can be estimated by maximizing the conditional log-likelihood function of the censored data, and the expectation estimator evaluates the expectation at the estimated κ. The asymptotic properties of this estimator may be derived by applying the Delta Method and the Continuous Mapping Theorem to standard results on the asymptotic behavior of the censored maximum likelihood estimator (e.g., van der Vaart 1994).
Example 5.1 (Homoscedastic Tobit Estimator). Suppose that F_{X*|Z} belongs to the Gaussian family (Example 3.1) with μ_Z = Z'μ and σ_Z^2 = σ^2. This condition and (1) specify a standard homoscedastic Tobit model (Tobin 1958). Then,

$$E[X^* \mid X^* \le 0, Z] = Z'\mu - \sigma\,\lambda\!\left(-\frac{Z'\mu}{\sigma}\right). \qquad (8)$$

The parameters μ and σ^2 can be estimated by a Tobit regression of X onto Z with censoring below zero. This can be implemented with any packaged Tobit software. Letting μ̂ and σ̂ be the resulting estimators of the coefficients of Z and standard deviation, respectively, then Ê[X*|X* ≤ 0, Z] is obtained by substituting these into (8). Details regarding the asymptotic normality of this estimator and consistency of the standard errors can be found in Footnote 27 in Appendix C.2.1.
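The plug-in step of this example can be sketched as follows. The sketch assumes the Tobit estimates have already been obtained from some packaged routine (the fitting itself is not shown) and uses the standard truncated-normal mean; the function names and the example values of `mu_hat` and `sigma_hat` are hypothetical.

```python
from statistics import NormalDist

def tobit_corner_expectation(z, mu_hat, sigma_hat):
    """Plug-in E[X* | X* <= 0, Z = z] for the homoscedastic Gaussian (Tobit) model."""
    std = NormalDist()
    index = sum(zj * mj for zj, mj in zip(z, mu_hat))    # z' mu_hat
    c = -index / sigma_hat
    # lambda(c) = phi(c) / Phi(c); Phi(c) underflows to 0 if index / sigma_hat is huge.
    return index - sigma_hat * std.pdf(c) / std.cdf(c)

def correction_term(x, z, mu_hat, sigma_hat):
    """Generated regressor X + E_hat[X* | X* <= 0, Z] * 1(X = 0)."""
    return x + (tobit_corner_expectation(z, mu_hat, sigma_hat) if x == 0 else 0.0)

# Hypothetical estimates: intercept and one slope, with sigma_hat = 1.5.
mu_hat, sigma_hat = [0.4, 0.9], 1.5
print(correction_term(0.0, [1.0, -0.2], mu_hat, sigma_hat))   # bunched observation
print(correction_term(2.5, [1.0, 0.7], mu_hat, sigma_hat))    # interior observation
```

For interior observations the correction term is just X, so the generated regressor differs from X only at the bunching point.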

Nonparametric Estimation with Discrete
The asymptotic properties of this estimator can be derived using empirical process theory, observing that F X|Z=z l is a function of the joint distribution of X and Z (see Step 2 in Appendix G.6), and therefore E[X * |X * ≤ 0, Z = z] can be written as an empirical process.
Example 5.2 (Semiparametric Tobit Estimator). Suppose that F_{X*|Z} belongs to the Gaussian family (Example 3.1). For a given z_l, this condition and (1) imply that X*|Z = z_l satisfies a standard Tobit regression model with constant mean μ_{z_l} and constant variance σ^2_{z_l} (Tobin 1958). Then,

$$E[X^* \mid X^* \le 0, Z = z_l] = \mu_{z_l} - \sigma_{z_l}\,\lambda\!\left(-\frac{\mu_{z_l}}{\sigma_{z_l}}\right). \qquad (10)$$

Because P(Z = z_l) > 0, the parameters μ_{z_l} and σ^2_{z_l} can be estimated by a Tobit regression of X onto a constant with censoring below zero, using only observations such that Z = z_l. This can be implemented with any packaged Tobit software. Letting μ̂_{z_l} and σ̂_{z_l} be the resulting estimators of the mean and standard deviation, Ê[X*|X* ≤ 0, Z = z_l] is then obtained by substituting these estimates into (10). Details regarding the asymptotic normality of this estimator and consistency of the standard errors can be found in Appendix C.2.1.

Example 5.3 (Tail Symmetry Estimator). Supposing that F_{X*|Z=z_l} satisfies Assumption 1, the expectation identification formula in (5) holds. That formula has three components which have to be estimated and then substituted into (5). First, we estimate the bunching probability using sample frequencies:

$$\hat F_{X|Z=z_l}(0) = \frac{\sum_{i=1}^{n} 1(X_i = 0,\, Z_i = z_l)}{\sum_{i=1}^{n} 1(Z_i = z_l)}.$$

Second, we estimate the quantile F^{-1}_{X|Z=z_l}(1 − F_{X|Z=z_l}(0)) by the empirical quantile, at probability 1 − F̂_{X|Z=z_l}(0), of the distribution of X among the observations such that Z = z_l; denote this estimate by q̂_l. Finally, we estimate the conditional trimmed mean using

$$\hat E\big[X \mid X \ge \hat q_l,\, Z = z_l\big] = \frac{\sum_{i=1}^{n} X_i\, 1(X_i \ge \hat q_l,\, Z_i = z_l)}{\sum_{i=1}^{n} 1(X_i \ge \hat q_l,\, Z_i = z_l)}. \qquad (11)$$

Theorems establishing the asymptotic normality of this estimator and consistency of the standard errors can be found in Appendix C.2.2. Although the estimator itself is very simple, the asymptotic result is very challenging to prove. The proof can be found in Appendix G.6 which, although long, can be well understood by reading the titles of the several steps. We would like to call the reader's attention, in particular, to the intermediate lemmas (Lemmas G.1 and G.2) and surrounding discussion, as these are of general interest for the estimation of quantiles at random (e.g., estimated) probabilities and trimmed means at random (e.g., estimated) trimming locations. These results are stated and proven in self-contained notation, and thus they should be understandable (and useful) independently of this paper's concerns. Lemma G.3 may also be useful, as it derives the Hadamard derivative of the conditional distribution, which allows one to transform an estimator that is a functional of the estimated conditional distribution into an empirical process estimator.

Nonparametric Estimation (General Case)
Estimation of the expectation in the general case requires the use of some multivariate nonparametric estimation technique. If G in Theorem 3.1 is known, a general framework for estimation can follow (9), substituting z_l with z. In this case, F_{X|Z=z}(·) may be estimated with a specialized conditional probability estimator (e.g., Durrieu et al. 2015 and the citations therein). Another option, since F_{X|Z=z}(x) = E[1(X ≤ x)|Z = z], is to estimate this quantity as a nonparametric regression of 1(X ≤ ·) onto Z at z. This opens up a number of possibilities, starting from the basic Nadaraya-Watson kernel regression, [14] or any of the classic nonparametric regression techniques (see, e.g., Masry 1996 for local polynomials and Song 2008 for series estimators), all the way to modern statistical learning prediction estimators (see Hastie, Tibshirani, and Friedman 2009 for several examples).
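As an illustration of the regression view of the conditional cdf, the sketch below computes a Nadaraya-Watson estimate of F_{X|Z=z}(x) = E[1(X ≤ x)|Z = z] for a scalar control. The Gaussian kernel, the default bandwidth, and the function name are assumptions made for this example; bandwidth selection and multivariate controls are not addressed.

```python
import math

def nw_conditional_cdf(x_data, z_data, z0, x0=0.0, bandwidth=0.5):
    """Nadaraya-Watson estimate of F_{X|Z=z0}(x0) = E[1(X <= x0) | Z = z0], scalar Z."""
    num = den = 0.0
    for x, z in zip(x_data, z_data):
        w = math.exp(-0.5 * ((z - z0) / bandwidth) ** 2)   # Gaussian kernel weight
        num += w * (1.0 if x <= x0 else 0.0)               # regress 1(X <= x0) on Z
        den += w
    if den == 0.0:
        raise ValueError("no kernel mass near z0; widen the bandwidth")
    return num / den

# With x0 = 0 this estimates the conditional bunching probability P(X = 0 | Z = z0).
x_data = [0.0, 0.0, 1.2, 2.0, 0.0, 3.1]
z_data = [0.1, -0.2, 0.0, 0.3, 1.5, 1.4]
print(nw_conditional_cdf(x_data, z_data, z0=0.0))
```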
In the tail symmetric case (Assumption 1), E[X*|X* ≤ 0, Z] may be estimated following (5). The conditional quantile F^{-1}_{X|Z=z}(1 − F_{X|Z=z}(0)) can be estimated with any of several existing techniques for nonparametric estimation of conditional quantiles (e.g., Samanta 1989 and Chaudhuri 1991) evaluated at the estimated probability 1 − F̂_{X|Z=z}(0). The conditional trimmed mean E[X|X ≥ F^{-1}_{X|Z=z}(1 − F_{X|Z=z}(0)), Z = z] can be estimated as a nonparametric regression of X onto Z at Z = z, using only observations such that X ≥ F̂^{-1}_{X|Z=z}(1 − F̂_{X|Z=z}(0)). This can be implemented with any of the regression methods discussed in the estimation of F_{X|Z=z}(·). Note that the asymptotic results for quantile and trimmed mean estimation in the literature are developed for known probabilities and trimming points, respectively. Here, these quantities are evaluated at estimated points instead, which must be taken into account. See the discussion in Example 5.3.
In the location-scale case (Assumption 2), E[X*|X* ≤ 0, Z = z] may be estimated following equation (6). This is a deterministic function of F_{X|Z=z}(0) and E[X|Z = z]. F_{X|Z=z}(0) may be estimated as discussed above, and E[X|Z = z] may be estimated similarly, using any of the nonparametric regression techniques discussed above. Alternatively, E[X|Z = z] may also be estimated via a varying coefficient maximum likelihood method (e.g., Cai, Fan, and Li 2000).

Remark 5.1 (Large-Dimension/Mixed Discrete-Continuous/Sparse Support Controls).
There are well-known practical issues in implementing the approaches discussed above when Z has many dimensions, or when it is composed of multiple continuous and discrete variables, or when it is sparse in regions. Thus, in many applications, including ours, it may be necessary to implement a strategy that is adaptive for complex Z. Techniques aimed at nonparametric multivariate regression with complex regressors are a main concern of the statistical learning field. For a summary of this large literature, we refer the reader to Hastie, Tibshirani, and Friedman (2009), which examines many of the modern dimensionality reduction techniques for regressions, and thus covers the estimation of regression quantities such as F_{X|Z=z}(0) and E[X|Z = z]. Analogous methods for conditional quantiles have also been studied extensively (e.g., White 1992; Chaudhuri and Loh 2002; Li and Racine 2008; Chen et al. 2019), which cover the estimation of F^{-1}_{X|Z=z}(1 − F_{X|Z=z}(0)).

In our application, we use a clustering strategy to reduce the dimension of Z. The idea is to group the values of Z into K clusters such that each cluster has the least amount of variation possible in Z. We then substitute Z with Ĉ_K, the vector of cluster indicators, in the estimation of the expectation. That is, if Z belongs to the kth cluster, then Ĉ_K = e_k(K), where e_k(K) is the kth canonical vector with K coordinates.

[Footnote 14] Conditions for uniform convergence of such estimators can be verified in the existing literature. See, for example, Andrews (1995) and Hansen (2008).
Clustering is a convenient strategy for dimensionality reduction because it discretizes the support of Z for the purposes of estimation: no matter the dimension of Z, Ĉ_K can assume only K values: e_1(K), ..., e_K(K). This allows us to use the estimators developed for the discrete support case; see Examples 5.2 and 5.3, substituting Z with Ĉ_K and z_l with e_l(K). These estimators may be implemented conveniently using optimized, off-the-shelf packages in commonly used software. Details about the clustering strategy are found in Appendix D.
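The discretization can be sketched as follows. The code takes cluster labels as given (the clustering algorithm itself is described in Appendix D and is not reproduced here), builds the canonical vector e_k(K), and computes the two cell-level ingredients used above, the bunching share and the mean of X. The function names and the dictionary layout are hypothetical.

```python
def canonical_vector(k, K):
    """Return e_k(K): K coordinates with a 1 in position k (1-indexed)."""
    if not 1 <= k <= K:
        raise ValueError("k must satisfy 1 <= k <= K")
    return [1 if j == k else 0 for j in range(1, K + 1)]


def cell_statistics(x, labels, K):
    """Per-cluster estimates of F_X(0) and E[X], given cluster labels in 1..K."""
    stats = {}
    for k in range(1, K + 1):
        cell = [xi for xi, lab in zip(x, labels) if lab == k]
        if not cell:
            continue  # an empty cluster has no estimate
        stats[k] = {
            "n": len(cell),
            "share_zero": sum(1 for xi in cell if xi == 0) / len(cell),
            "mean_x": sum(cell) / len(cell),
        }
    return stats


# Hypothetical data: six observations assigned to K = 2 clusters.
x = [0, 2, 4, 0, 0, 6]
labels = [1, 1, 1, 2, 2, 2]
print(canonical_vector(2, 3))          # [0, 1, 0]
print(cell_statistics(x, labels, 2))
```

Because Ĉ_K takes only K values, every conditional quantity reduces to a statistic computed within a cell, whatever the dimension of Z.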

Testing the Identification Assumptions
In this section, we discuss how to test the two identification conditions, equation (3) and the distributional assumption, both separately and jointly. All tests we propose are simple modifications of existing, well-known tests. We implement the tests in our application, and the findings are discussed in Section 7 and Appendix E. A comprehensive study of the properties of these tests is beyond the scope of this article.

Testing Equation (3)
Let the null hypothesis be "H_0: (3) holds." If H_0 is true, then (4) holds, and therefore, for X > 0, E[Y|X, Z] = (β + δ)X + Z'(γ − πδ). That is, the conditional expectation for X > 0 must be linear in X and Z. Note that no assumption other than equation (3) was used to establish this fact. Therefore, we can test H_0 by applying any specification test to the regression of Y on X and Z for X > 0 (e.g., Ramsey's 1969 RESET test). In Appendix E.2, we show plots that illustrate in our application how this testing idea may be visually displayed.

Testing the Distributional Assumption
As discussed in Section 3, identification of E[X*|X* ≤ 0, Z] is often achieved by assuming that X*|Z = z belongs to some specific distribution family. Because the distribution of X*|Z is observed when X* > 0, this assumption is often directly testable.
For example, the null hypothesis "H_0: X*|Z = z ∼ N(μ_z, σ_z²)" can be tested by comparing the empirical distribution F̂_{X|Z=z}(x) and the assumed normal distribution Φ((x − μ_z)/σ_z) for x > 0, where μ_z and σ_z can be estimated as described in Example 5.2. This test can be implemented with any distribution comparison test, such as the well-known two-sample Kolmogorov-Smirnov test, as well as various more powerful alternatives such as the test developed in Goldman and Kaplan (2018).15 Analogous tests can be built for most well-known parametric families.
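A minimal sketch of the comparison for one cell of Z is given below. It computes the largest gap between the empirical distribution of X and the fitted normal distribution over x > 0, assuming X = max(0, X*) so that the two distributions should coincide there under the null. The function name is hypothetical. The sketch returns only the distance: because μ_z and σ_z are estimated, standard Kolmogorov-Smirnov critical values need not apply, and no p-value is attempted here.

```python
from statistics import NormalDist


def ks_distance_above_zero(xs, mu, sigma):
    """Sup distance between the ecdf of X and the N(mu, sigma^2) cdf over x > 0."""
    model = NormalDist(mu, sigma)
    n = len(xs)
    share_zero = sum(1 for x in xs if x == 0) / n
    # As x decreases to 0, the ecdf tends to the bunching share and the
    # model cdf tends to Phi(-mu / sigma).
    distance = abs(share_zero - model.cdf(0.0))
    for i, x in enumerate(sorted(xs)):
        if x <= 0:
            continue
        fitted = model.cdf(x)
        # The ecdf jumps at x, so check the gap on both sides of the jump.
        distance = max(distance, abs(i / n - fitted), abs((i + 1) / n - fitted))
    return distance
```

In an application, `mu` and `sigma` would be the cell-level estimates, and critical values could be obtained by resampling.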
Tail symmetry (Assumption 1) is not directly testable. However, full symmetry around the mean implies tail symmetry and is testable if F_{X|Z=z}(0) < 0.5.16 This follows because, if X*|Z = z is symmetrically distributed around its mean, then X|X ∈ I, Z = z, where I = (F_{X|Z=z}(0), 1 − F_{X|Z=z}(0)), is also symmetrically distributed around its mean. Therefore, if F_{X|Z=z}(0) < 0.5, the null hypothesis "H_0: X*|Z = z is symmetrically distributed around E[X*|Z = z]" can be tested by implementing any of a large number of existing tests of symmetry on F_{X|X∈I,Z=z}.17 In Appendix E.3, we show plots that illustrate in our application how this testing idea may be visually displayed.

Testing All Identifying Assumptions: Specification Tests
Suppose that the identified expectation is ẽ(Z), which may be different from the true expectation, E[X*|X* ≤ 0, Z]. Let the best linear predictor of Y given X, Z and X + ẽ(Z)1(X = 0) be defined as L(Y|X, Z, X + ẽ(Z)1(X = 0)) = β̃X + Z'θ̃_Z + δ̃(X + ẽ(Z)1(X = 0)). If (3) holds and the expectation is correctly identified (i.e., ẽ(Z) = E[X*|X* ≤ 0, Z]), then E[Y|X, Z] = L(Y|X, Z, X + ẽ(Z)1(X = 0)), and β̃ = β, θ̃_Z = (γ − πδ), and δ̃ = δ. This implies that it is possible to test the hypothesis "H_0: equation (3) holds, and E[X*|X* ≤ 0, Z] is correctly identified" by testing whether E[Y|X, Z] = L(Y|X, Z, X + ẽ(Z)1(X = 0)) using a specification test.18 Since the regression employs a generated regressor, X + ê(Z)1(X = 0), standard specification tests will have the wrong size. A correctly sized specification test can be performed by simply adding functions of X and Z to the set of controls used in the regression and testing whether the coefficients of these terms are equal to zero, using the correct covariance matrix estimator or the bootstrapped critical values discussed in Section 4 (see Remark C.3 in Appendix C.1).
We also propose a specification test of H_0 that is particularly suited to our setting, where the support of the treatment variable is bounded at one extreme with bunching at the boundary point. The test leverages sample truncation to detect nonlinearities caused by failures of H_0. Specifically, we propose restricting the sample to X ∈ [0, α], for a given value of α, and testing whether the coefficient of X in the truncated regression is different from the coefficient of X in the full sample regression. This test can be implemented using off-the-shelf tests that compare coefficients of non-nested regressions. The rationale of this test and further details can be seen in Appendix E.4. There, we also present plots that illustrate in our application how this testing idea may be visually displayed.
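The mechanics of the truncation idea can be sketched in a deliberately stripped-down form. The code below compares the slope of Y on X in the full sample with the slope in the sample restricted to X ∈ [0, α], using a bivariate regression with no controls and no control function. Those omissions are simplifications made for this illustration, the function names are hypothetical, and no test statistic or standard error is computed.

```python
def ols_slope(x, y):
    """Slope from a bivariate least-squares regression of y on x with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def truncation_slope_gap(x, y, alpha):
    """Full-sample slope minus the slope in the subsample with 0 <= x <= alpha."""
    kept = [(xi, yi) for xi, yi in zip(x, y) if 0 <= xi <= alpha]
    if len(kept) < 2:
        raise ValueError("truncated sample is too small")
    x_trunc = [xi for xi, _ in kept]
    y_trunc = [yi for _, yi in kept]
    return ols_slope(x, y) - ols_slope(x_trunc, y_trunc)


x = [0, 1, 2, 3, 4]
print(truncation_slope_gap(x, [1, 3, 5, 7, 9], alpha=2))    # linear: 0.0
print(truncation_slope_gap(x, [0, 1, 4, 9, 16], alpha=2))   # convex: 2.0
```

When the relationship is linear the two slopes coincide, while curvature produces a gap. A formal test would apply the same comparison to the coefficient of X in the corrected regression and would account for the generated regressor.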

Testing All Identifying Assumptions: Additional Bunching Points
When there are multiple bunching points, including possibly some in the interior of the support of the treatment, one bunching point on the boundary of the support can be used to build the correction, while the remaining bunching points can be used to test the underlying assumptions of the correction method in the same regression. Specifically, if a second bunching point exists at X = x, we propose testing the hypothesis "H_0: equation (3) holds and E[X*|X* ≤ 0, Z] is correctly identified" by estimating the corrected regression augmented with the indicator 1(X = x), whose coefficient we denote α_x, and performing a t-test of whether α_x = 0. This is equivalent to performing the dummy test from Caetano et al. (2021) in this model. This test detects whether linearity holds and the correction has solved the endogeneity problem.19 The t-test must be implemented using the correct standard errors. The standard error estimator and the consistency of the bootstrap critical values are established in Appendix C.1 (see Remark C.3).
To illustrate the type of situation where this testing approach can be employed, Figure 3 shows the empirical cdf of maternal labor supply, measured as the average number of weeks per year the mother worked during the three years following the child's birth. The figure shows two bunching points, one at X = 0 and the other at X = 52. Maternal labor supply is the treatment variable in a large literature (see, e.g., Caetano et al. 2022 and the references therein).

Application: The Effect of TV on Children's Skills
In this section, we apply our method to estimate the effect of time spent watching TV on children's skills using the 1997, 2002, and 2007 Waves of the Child Development Supplement from the Panel Study of Income Dynamics (CDS-PSID).20 More detailed applications of the method proposed in this article can be found in Caetano, Caetano, and Nielsen (2022), which studies the effect of enrichment activities on skills, and Caetano et al. (2022), which studies the effect of maternal labor supply on skills. As the application here uses the same sample as Caetano, Caetano, and Nielsen (2022), we refer the reader to that paper for details about the data and definition of skills. We would like to estimate β in (3), where Y is either the cognitive or noncognitive skills of the child, X is the number of hours the child spent watching TV in a typical week, and Z is a vector of controls which includes a constant as well as characteristics of the child, family, and environment.21
Table 1 presents the estimates of β in (3) both with and without the endogeneity correction. Column (i) shows the simple regressions of skills on TV time without controls. Time spent watching TV is negatively correlated with both cognitive and noncognitive skills. Column (ii) adds observed controls, but does not include the correction control function. After controlling for observables, the estimates of β are closer to zero, but the cognitive estimate is still negative and significant.
In columns (iii)-(v), we show estimates of the corrected regression. Specifically, we follow (4) and estimate a regression of Y on X, Z, and the control function X + Ê[X*|X* ≤ 0, Z]1(X = 0). In column (iii), Ê[X*|X* ≤ 0, Z] uses the Homoscedastic Tobit estimator described in Example 5.1. The estimates of E[X*|X* ≤ 0, Z] in columns (iv) and (v) are obtained in two steps. First, we cluster the controls according to the method described in Remark 5.1. We obtain K = 10 clusters and the cluster indicator vector Ĉ_K. Second, for observation Z_i in cluster l, we estimate Ê[X*|X* ≤ 0, Z_i] as Ê[X*|X* ≤ 0, Ĉ_K = e_l(K)]. In column (iv) we use the Semiparametric Tobit estimator described in Example 5.2, substituting Z with Ĉ_K and z_l with e_l(K) in the formulas. In column (v) we use the tail symmetry estimator described in Example 5.3 with analogous substitutions.
20 See Zavodny (2006), Gentzkow and Shapiro (2008), and Munasib and Bhattacharya (2010) for recent empirical papers in this literature. These papers are well aware that watching television may be endogenous. Zavodny (2006) tackles endogeneity using fixed effects, Munasib and Bhattacharya (2010) uses IV, while Gentzkow and Shapiro (2008) uses the timing of the roll-out of children's programming to different local markets to obtain causal estimates.
21 The variables included as controls are the child's age and squared age (in months), and indicators for: CDS wave (1997, 2002, and 2007), grade (thirteen variables, from kindergarten through grade 12), gender, ethnicity (black, Hispanic and other nonwhite ethnicity), whether the child has siblings, family income tercile, whether the mother is alive, and whether the father is alive.
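The second stage can be sketched in a few lines once the first-stage estimates are available. The code below builds the generated control X + ê·1(X = 0) and regresses Y on a constant, X, and the control by ordinary least squares. To stay self-contained it uses a small hand-written solver and takes Z to be a constant only. Both are simplifications for illustration, and the function names are hypothetical. It returns point estimates only: valid inference requires the corrected standard errors or the bootstrap.

```python
def solve_linear_system(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def corrected_regression(y, x, e_hat):
    """Regress y on a constant, x, and the control x + e_hat * 1(x == 0).

    e_hat[i] is the first-stage estimate of E[X* | X* <= 0, Z_i].
    """
    control = [xi + ei * (1 if xi == 0 else 0) for xi, ei in zip(x, e_hat)]
    rows = [[1.0, xi, ci] for xi, ci in zip(x, control)]
    intercept, beta, delta = ols(rows, y)
    return {"intercept": intercept, "beta": beta, "delta": delta}
```

With a richer Z, each row would simply carry the additional control columns, and the first-stage values `e_hat` would come from whichever expectation estimator is used.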
The corrected β estimates are positive and insignificant for cognitive skills, but negative and significant for noncognitive skills. The Homoscedastic and Semiparametric Tobit estimates (columns (iii) and (iv)) are very similar. The tail symmetry estimates (column (v)) are closer to zero, yet the noncognitive estimate remains significant at 5%. Table 1 also displays the estimates of δ. All are significant at 5% and are negative for cognitive skills and positive for noncognitive skills, suggesting that selection on observables is strongly rejected in this context.
Note that the bootstrapped standard errors of β in the corrected models (columns (iii)-(v)) are much larger than the corresponding standard errors in the uncorrected model (column (ii)). The Eicker-White standard errors of the corrected model (see Theorem 4.1) are around 95% of the bootstrapped standard errors for all specifications, so the penalty due to the estimation of the expectation turns out not to be important in explaining the larger standard errors. Rather, the standard errors are larger because much of the raw variation in X is contaminated by variation in the confounder. The uncorrected models attribute to X the confounding variation in Y, thus yielding artificially smaller standard errors. In this context, inference is affected twice by not appropriately controlling for unobservables: not only is there bias in the uncorrected estimator, but its standard error leads to the belief that effects might be estimated with a higher degree of precision than is achievable by a correct estimator.

Testing the Identification Assumptions
We mention here the key findings from the implementation of the tests discussed in Section 6 in our application. The precise null hypotheses, the methods used, and further details can be found in Appendix E. In Appendix E.1, we show that there is strong evidence of endogeneity in the uncorrected model, which justifies our correction approach. In Appendix E.2, we test the hypothesis that (3) holds, and we do not reject it. In Appendix E.3, we test the distributional assumption of X*|Z alone. The Homoscedastic Tobit estimator (column (iii)) is consistent under the homoscedastic Gaussian assumption, as discussed in Example 5.1, and the Semiparametric Tobit estimator (column (iv)) is consistent under the semiparametric Gaussian assumption, as discussed in Example 5.2. Our tests reject both assumptions. The tail symmetry estimator (column (v)) is consistent under the tail symmetry assumption, as discussed in Example 5.3. Our tests do not reject the stronger condition of symmetry around the mean (see Footnote 16). Therefore, our preferred estimates are those in column (v) of Table 1. It is worth noting that the estimates in columns (iii) and (iv) are similar to the results in column (v). This suggests that our tests are able to detect violations of the distributional assumptions even when the bias generated by these violations is likely small. This is consistent with what we find in our Monte Carlo study (Appendix F). Finally, in Appendix E.4, we perform a joint test of whether (3) and the tail symmetry assumption hold, and we do not reject this joint hypothesis.

Concluding Remarks
This article shows how to leverage bunching at the lower (or upper) extremum of the distribution of the treatment variable to build a control function to correct for endogeneity. In linear models, the method consists of adding a generated regressor to the original regression. We also show that the method can be applied in many models used by practitioners, including some nonparametric and nonseparable models. The approach does not make exclusion restrictions. Instead, it makes testable functional and distributional shape assumptions. We study the asymptotic behavior of the estimated coefficients, prove the consistency of the bootstrap, and show that the method performs well in an empirical Monte Carlo study. Finally, we apply the method to estimate the effect of watching television on the skills of the child. A common finding in both the application and the Monte Carlo study is that, when violations of the identifying assumptions are large enough to generate a noticeable bias, these violations are easy to detect with the tests of the identification assumptions we develop.
The method proposed in this article opens up many paths for new research. Here we highlight a few: (a) Discretizing the controls Z before estimating the expectation is convenient for the implementation of the method proposed in this article, as evidenced in our application, Caetano, Caetano, and Nielsen (2022), and Caetano et al. (2022). The advantages and drawbacks of the use of clusters in this context need to be investigated further. (b) The interaction of the correction with existing methods is promising. We mentioned the combination of this method with Caetano et al.'s (2021) test in Section 6.4. The interaction of this approach with instrumental variables methods may also prove valuable.

Figure 1. Unconditional distribution of X. NOTE: The left panel shows the cumulative distribution function of X (TV time). The right panel shows the kernel density estimate along with the histogram for X > 0 (bandwidth equal to 2). The darker bar is the proportion of observations with X = 0. See Section 7 for more details about the application.
Figure 2 shows an example of a distribution that is symmetric in the tails. Following Powell (1986), conditional symmetry assumptions have been used extensively in the censoring literature. For a large list of citations in the applied and theoretical literature which use conditional symmetry assumptions, see Chen and Tripathi (2017). This class includes many well-known distribution families, such as Gaussian, Student's-t, logistic, and uniform. Tail symmetry is a weaker assumption than symmetry, since it also allows distributions that are not symmetric everywhere, as depicted in Figure 2. Since for x < 0, F^{-1}_{X*|Z=z}(1 − F_{X*|Z=z}(0)) − x > 0, it follows that F_{X*|Z=z}(F^{-1}_{X*|Z=z}(1 − F_{X*|Z=z}(0)) − x) = F_{X|Z=z}(F^{-1}_{X|Z=z}(1 − F_{X|Z=z}(0)) − x), and thus Tail Symmetric distributions satisfy the condition of Theorem 3.1 with functional G(x, F) = 1 − F(F^{-1}(1 − F(0)) − x) for any distribution function F. The conditional expectation can be identified using the formula in equation (5).
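For intuition, the following LaTeX sketch shows how a conditional expectation formula can follow from the functional G above. It is our own derivation under stated assumptions (X = max{0, X*}, a continuous conditional distribution, and F_{X*|Z=z}(0) < 0.5 so that the quantile q_z is positive), and it is not a transcription of equation (5). The symbol q_z is introduced here for convenience.

```latex
% q_z: the (1 - F(0))-quantile of X^* given Z = z (notation introduced for this sketch).
\begin{align*}
q_z &:= F^{-1}_{X^*|Z=z}\bigl(1 - F_{X^*|Z=z}(0)\bigr)
      = F^{-1}_{X|Z=z}\bigl(1 - F_{X|Z=z}(0)\bigr), \\
F_{X^*|Z=z}(x) &= 1 - F_{X^*|Z=z}(q_z - x) \quad \text{for } x < 0
      && \text{(tail symmetry, via } G\text{)}, \\
E[X^* \mid X^* \le 0,\, Z = z]
   &= E[\,q_z - X^* \mid X^* \ge q_z,\, Z = z\,]
      && \text{(the lower tail mirrors the upper tail)} \\
   &= q_z - E[X \mid X \ge q_z,\, Z = z]
      && \text{(} X = X^* \text{ whenever } X^* > 0\text{)}.
\end{align*}
```

Every term on the right-hand side depends only on the distribution of the observed X given Z = z, which is what makes the expectation identifiable from the data.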

Figure 3. Evidence of bunching, maternal labor supply. NOTE: This figure shows the empirical cdf for the average number of weeks per year in which the mother worked in the three years following her child's birth. Source: National Longitudinal Survey of Youth, 1979 cohort, sample of mothers whose children were born from 1979 to 2002.

Table 1. The effect of TV time on children's skills. Results are reported in terms of percentage points of the standard deviation of the outcome variable. For example, the results in the last column suggest that an increase of one hour per week watching TV leads to a reduction in noncognitive skills of 2.11 percentage points of one standard deviation. The list of observed controls in columns (ii)-(v) is described in Footnote 21. Bootstrapped standard errors in parentheses (1000 bootstrap samples). Columns (iii), (iv), and (v) show results using the Homoscedastic Tobit expectation estimator (Example 5.1), the Semiparametric Tobit expectation estimator (Example 5.2), and the Tail Symmetry expectation estimator (Example 5.3), respectively. Specifications in columns (iv) and (v) use 10 clusters (see Remark 5.1). See Figure 4 in Appendix D for a reproduction of the results in column (v) for different numbers of clusters. **p < 0.05, *p < 0.1.