Inference for Misspecified Models With Fixed Regressors

Following the work by Eicker, Huber, and White, it is common in empirical work to report standard errors that are robust against general misspecification. In a regression setting, these standard errors are valid for the parameter that minimizes the squared difference between the conditional expectation and a linear approximation, averaged over the population distribution of the covariates. Here, we discuss an alternative parameter that corresponds to the approximation to the conditional expectation based on minimization of the squared difference averaged over the sample, rather than the population, distribution of the covariates. We argue that in some cases this may be a more interesting parameter. We derive the asymptotic variance for this parameter, which is generally smaller than the Eicker–Huber–White robust variance, and propose a consistent estimator for this asymptotic variance. Supplementary materials for this article are available online.


INTRODUCTION
Following the seminal work by Eicker (1967), Huber (1967), and White (1980a, 1980b, 1982), researchers estimating regression functions routinely report standard errors that are robust to misspecification of the models that are being estimated. Müller (2013) gave the corresponding confidence intervals a Bayesian interpretation. A key feature of the approach developed by Eicker, Huber, and White (EHW from hereon) is that in regression settings it focuses on the best linear predictor that minimizes the distance between a linear function and the true conditional expectation, averaged over the joint distribution of all variables, with extensions to nonlinear settings. We argue that in some regression settings it may be more appropriate to focus on the conditional best linear predictor, defined by minimizing this distance averaged over the empirical instead of the population distribution of the covariates.

The first contribution of this article is to extend the EHW results to such settings. For a large class of estimators, including maximum likelihood and method of moments estimators, we formally characterize the generalization to nonlinear models of the conditional best linear predictor. We then derive a large sample approximation to the variance of the least squares and method of moments estimators relative to this conditional estimand. In general, in misspecified models, the robust variance for the conditional estimand is smaller than or equal to the EHW robust variance. Second, we propose a consistent estimator for this variance so that asymptotically valid confidence intervals can be constructed. The proposed estimator generalizes the variance estimator proposed by Abadie and Imbens (2006) for matching estimators and is related to the differencing methods used in Yatchew (1997, 1999). In correctly specified models, the new variance estimator is simply an alternative to the standard EHW robust variance estimator. In misspecified models, it is the only consistent estimator available for the asymptotic variance of the estimand conditional on covariates.

Whether the conditional or the unconditional estimand should be the primary focus is context specific, and we do not take the position that either one is always the appropriate choice. We discuss some examples, first to clarify the distinctions between the two estimands and, second, to make an argument for our view that in some settings the conditional estimand, corresponding to the fixed regressor notion, is of interest. For example, we argue that in cases where the sample is the population there is a strong case for using the estimand conditional on at least some covariates; see also Abadie et al. (2014). Such cases are common in economic analyses, for example, when analyzing data where the units are all states of the United States, or all countries of the world. Most importantly, we argue that there is a choice to be made by the researcher that has direct implications for inference. In making this choice the researcher should bear in mind that the variance for the conditional estimand is generally smaller than that for the population or unconditional estimand, and thus tests for the former will generally have better power than tests for the latter.
Note that although we focus on estimands defined in terms of the finite sample distribution of the covariates, our inference relies on large sample approximations. To focus on the conceptual contribution of the current article and maintain comparability with the preceding literature, we focus on unconditional inference.
The rest of this article is organized as follows. Section 2 contains a heuristic discussion of the conceptual issues raised by this article in a linear regression model setting. In Section 3 we discuss the motivation for the conditional estimand. Next, in Section 4 we present formal results covering least squares, maximum likelihood, and method of moments estimators. In Section 5 we apply the methods developed in this article to a dataset previously analyzed by Sachs and Warner (1997) to study the relation between country-level growth rates and government fiscal policies. In Section 6, we present two simulation studies, one in a linear and one in a nonlinear setting. Section 7 concludes. The Appendix contains proofs.

THE CONDITIONAL BEST LINEAR PREDICTOR
In this section, we lay out some of the conceptual issues in this article informally in the setting of a linear regression model. In Section 4, we provide formal results, covering both this linear model setting and more general cases including maximum likelihood and method of moments.
Consider the standard linear model

$$Y_i = X_i'\theta + \varepsilon_i,$$

with Y_i the outcome of interest, X_i a K-vector of observed covariates, possibly including an intercept, and ε_i an unobserved error. Let X, Y, and ε be the N × K matrix with ith row equal to X_i', the N-vector with ith element equal to Y_i, and the N-vector with ith element equal to ε_i, respectively. In this setting, researchers have often assumed homoscedasticity, independence, and Normality of the error terms,

$$\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_N),$$

where I_N is the N × N identity matrix. Under those assumptions the exact (conditional) distribution of the least-squares estimator θ̂ = (X'X)^{-1}(X'Y) is Normal:

$$\hat\theta \mid X \sim \mathcal{N}\big(\theta,\ \sigma^2 (X'X)^{-1}\big).$$

However, the assumptions of linearity of the regression function, independence, homoscedasticity, and Normality of the error terms are often unrealistic. Eicker (1967), Huber (1967), and White (1980a, 1980b) considered the properties of the least-squares estimator θ̂ under substantially weaker assumptions. For the most general case one needs to define the estimand if the regression function is not linear. Suppose the sample (Y_i, X_i), i = 1, ..., N, is a random sample from a large population satisfying some moment restrictions. Let μ(x) = E[Y_i | X_i = x] be the conditional expectation of Y_i given X_i = x, and let σ²(x) be the conditional variance. Even if this conditional expectation μ(x) is not linear, one might still wish to approximate it by a linear function x'θ, and be interested in the value of the slope coefficients of this linear approximation. Traditionally, the optimal approximation is defined as the value of θ that minimizes the expectation of the squared difference between the outcomes and the linear approximation to the regression function. This is generally referred to as the best linear predictor, formally defined as

$$\theta_{pop} = \arg\min_\theta\, E\big[(Y_i - X_i'\theta)^2\big]. \quad (2.2)$$

Writing

$$E\big[(Y_i - X_i'\theta)^2\big] = E\big[(\mu(X_i) - X_i'\theta)^2\big] + E\big[(Y_i - \mu(X_i))^2\big],$$

with the last term free of dependence on θ, it follows that we can characterize θ_pop as

$$\theta_{pop} = \arg\min_\theta\, E\big[(\mu(X_i) - X_i'\theta)^2\big],$$

which in turn shows that θ_pop can be interpreted as the value of θ that minimizes the discrepancy between the true regression function μ(x) and the linear approximation, weighted by the population distribution of the covariates. The results in EHW imply that, under some regularity conditions,

$$\sqrt N\,(\hat\theta - \theta_{pop}) \stackrel{d}{\longrightarrow} \mathcal N(0, V_{pop}), \qquad V_{pop} = \Delta^{-1}\, E\big[(Y_i - X_i'\theta_{pop})^2 X_i X_i'\big]\, \Delta^{-1}, \quad (2.3)$$

where Δ = E[X_i X_i']. White also proposed a consistent estimator for V_pop,

$$\hat V_{pop} = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^N \hat\varepsilon_i^2\, X_i X_i'\right)\left(\frac{X'X}{N}\right)^{-1}, \qquad \hat\varepsilon_i = Y_i - X_i'\hat\theta. \quad (2.4)$$

Using the EHW variance estimator V̂_pop is currently the standard practice in empirical work in economics; see, for example, Angrist and Pischke (2009). See Imbens and Kolesár (2012) for a discussion of finite sample improvements. Resampling methods such as the jackknife and the bootstrap (Efron 1982; Efron and Tibshirani 1993) can also be used to construct confidence intervals for θ_pop.
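For reference, here is a minimal numpy sketch of the least-squares estimator and the EHW variance estimator in (2.4); the function name is ours, not from the paper.

import numpy as np

def ols_ehw(X, Y):
    """Least squares with the EHW robust variance estimator (2.4).

    X : (N, K) covariate matrix (include a column of ones for an intercept).
    Y : (N,) outcome vector.
    Returns (theta_hat, V_pop_hat); V_pop_hat / N estimates V(theta_hat).
    """
    N = X.shape[0]
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    eps = Y - X @ theta_hat                       # residuals
    meat = (X * eps[:, None] ** 2).T @ X / N      # (1/N) sum eps_i^2 x_i x_i'
    bread = np.linalg.inv(X.T @ X / N)
    return theta_hat, bread @ meat @ bread

# Robust standard errors are then np.sqrt(np.diag(V_pop_hat / N)).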
In this article, we explore an alternative linear approximation to the possibly nonlinear regression function μ(x). Instead of minimizing the marginal expectation of the squared difference between the outcomes and the regression function, we minimize this expectation conditional on the observed covariates. Define the conditional best linear predictor θ_cond(X) as

$$\theta_{cond}(X) = \arg\min_\theta\, \frac{1}{N}\sum_{i=1}^N E\big[(Y_i - X_i'\theta)^2 \mid X_i\big]. \quad (2.5)$$

The difference with the best linear predictor defined in (2.2) is that in (2.5) the expectation is taken over the empirical distribution of the covariates, whereas in (2.2) it is taken over the population distribution of the covariates. To be explicit about the dependence of the conditional best linear predictor on the sample values of the covariates, we write θ_cond(X) as a function of the matrix of covariate values X. Denoting by μ(X) the N-vector with ith element equal to μ(X_i), we can write θ_cond(X) as

$$\theta_{cond}(X) = \arg\min_\theta\, \frac{1}{N}\sum_{i=1}^N \big(\mu(X_i) - X_i'\theta\big)^2 = (X'X)^{-1}X'\mu(X), \quad (2.6)$$

to stress the interpretation of θ_cond(X) as the best approximation to the true regression function, now with the weights based on the empirical distribution of the covariates. Both θ_pop and θ_cond(X) base the linear approximation to μ(x) on minimizing the squared difference between the true regression function μ(x) and the linear approximation x'θ. The difference between the two approximations lies solely in how they weight, as a function of the covariates, the squared difference between the regression function and the linear approximation for each x. The first approximation, leading to θ_pop, uses the population distribution of the covariates. The second, leading to θ_cond(X), uses the empirical distribution. We defer to Section 3 the important question whether, and why, in a specific application θ_cond(X) rather than θ_pop might be the object of interest. In some applications we argue that θ_pop is the estimand of interest. However, as discussed in detail in Section 3, we also think that in other applications θ_cond(X) is of more interest than θ_pop. Given that the main focus of the previous literature is on population parameters like θ_pop, we view the question of inference for θ_cond(X) as of general interest.
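To make the distinction concrete, the following sketch (our own illustrative design: a quadratic μ(x) and standard normal covariates, not taken from the paper) computes θ_cond(X) = (X'X)^{-1}X'μ(X) for one realized X and approximates θ_pop with a large Monte Carlo draw from the covariate distribution.

import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    # Illustrative nonlinear regression function (an assumption of ours).
    return x + 0.5 * x**2

def theta_blp(x1, m):
    # Best linear predictor coefficients of m on (1, x1):
    # argmin_theta sum_i (m_i - [1, x1_i]' theta)^2.
    X = np.column_stack([np.ones_like(x1), x1])
    return np.linalg.solve(X.T @ X, X.T @ m)

# Conditional estimand: weights from the empirical covariate distribution.
x_sample = rng.normal(size=50)
theta_cond = theta_blp(x_sample, mu(x_sample))

# Population estimand: approximate the population weighting with a huge draw.
x_pop = rng.normal(size=1_000_000)
theta_pop = theta_blp(x_pop, mu(x_pop))

print("theta_cond(X):", theta_cond)   # varies with the realized X
print("theta_pop    :", theta_pop)    # fixed by the covariate distribution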
Next, we point out the implications of the difference between θ_pop and θ_cond(X). The first issue to note is that for point estimation it is irrelevant whether we are interested in θ_pop or θ_cond(X): in both cases, the least-squares estimator θ̂ is the natural estimator. However, for inference it does matter whether we are interested in estimating θ_pop or θ_cond(X), unless E[ε|X] = 0 and the conditional expectation is truly linear. Consider the variance of the least-squares estimator θ̂, viewed as an estimator of θ_cond(X). The exact (conditional) variance of θ̂ is

$$\mathbb V(\hat\theta \mid X) = (X'X)^{-1}\left(\sum_{i=1}^N \sigma^2(X_i)\, X_i X_i'\right)(X'X)^{-1}.$$

Directly comparing the normalized variance N · V(θ̂|X) to the EHW variance V_pop is complicated by the fact that N · V(θ̂|X) is a conditional variance, rather than an asymptotic variance like V_pop. We therefore look at the unconditional variance of the OLS estimator θ̂ as an estimator of θ_cond(X). Because θ̂ is unbiased for θ_cond(X), it follows that the marginal variance is the expected value of the conditional variance. Under random sampling the asymptotic variance is

$$V_{cond} = \Delta^{-1}\, E\big[\sigma^2(X_i)\, X_i X_i'\big]\, \Delta^{-1}, \quad (2.7)$$

and we have, under regularity conditions, a large sample approximation to the distribution of √N(θ̂ − θ_cond(X)):

$$\sqrt N\,\big(\hat\theta - \theta_{cond}(X)\big) \stackrel{d}{\longrightarrow} \mathcal N(0, V_{cond}).$$

The key difference between the robust variance V_pop proposed by White and the robust variance V_cond arises from the difference between the conditional variance σ²(X_i) in (2.7) and the expectation of the squared residual ε_{pop,i} = Y_i − X_i'θ_pop. The latter is in general larger:

$$E\big[\varepsilon_{pop,i}^2 \mid X_i\big] = \sigma^2(X_i) + \big(\mu(X_i) - X_i'\theta_{pop}\big)^2,$$

where μ(X_i) − X_i'θ_pop captures the difference between the linear approximation and the conditional expectation. For the asymptotic variances of θ̂ we have

$$V_{pop} = V_{cond} + \lim_{N\to\infty} N \cdot E\big[(\theta_{cond}(X) - \theta_{pop})(\theta_{cond}(X) - \theta_{pop})'\big].$$

The last expectation is over the distribution of θ_cond(X) as a function of X. Thus in general V_pop exceeds V_cond, and as a result inference based on V_pop is conservative for θ_cond(X). The difference between the two variances is the result of the misspecification in the regression function, that is, the difference between the conditional expectation and the best linear predictor, μ(x) − x'θ_pop.

The final question we address in this section is how to estimate V_cond. Simple bootstrapping methods do not work; see Tibshirani (1986) and Wu (1986). The challenge is that the conditional variance function σ²(x) is generally unknown. Estimation is straightforward in the case with discrete covariates: one can consistently estimate the conditional variance σ²(X_i) at each distinct value of the covariates, plug that into (2.7), and replace the expectations by averages over the sample. If the covariates are continuous, however, this is not feasible. In the remainder of this discussion, we focus on the continuous covariate case. Dealing with the setting where some of the covariates are discrete is conceptually straightforward, but would require carrying along additional notation and come at the expense of clarity. In the continuous covariate case, estimating σ²(x) consistently for all x would require nonparametric estimation involving bandwidth choices. Such an estimator would be more complicated than the EHW robust variance estimator, which simply uses squared residuals to estimate the expectation of the squared errors.
Here we build on work by Yatchew (1997, 1999) and Abadie and Imbens (2006, 2008, 2010) to develop a general estimator for V_cond that does not require consistent estimation of σ²(x), much like the EHW variance estimator does not require consistent estimation of the conditional expectation of the squared errors. First define ℓ_X(i) to be the index of the unit closest to i in terms of X:

$$\ell_X(i) = \arg\min_{j \neq i}\, \|X_j - X_i\|, \quad (2.10)$$

where the norm we use is the Mahalanobis distance, ‖x‖ = (x'V̂_X^{-1}x)^{1/2} with V̂_X the sample covariance matrix of the covariates, although others could be used. Then, our proposed variance estimator is

$$\hat V_{cond} = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{1}{2N}\sum_{i=1}^N \big(\hat\varepsilon_i X_i - \hat\varepsilon_{\ell_X(i)} X_{\ell_X(i)}\big)\big(\hat\varepsilon_i X_i - \hat\varepsilon_{\ell_X(i)} X_{\ell_X(i)}\big)'\right)\left(\frac{X'X}{N}\right)^{-1}. \quad (2.11)$$

In Section 4, we show in a more general setting that this variance estimator is consistent for V_cond. An alternative estimator for V_cond exploits the fact that the conditional variance of ε_i X_i given X_i is equal to X_i X_i' times the conditional variance of ε_i given X_i:

$$\tilde V_{cond} = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{1}{2N}\sum_{i=1}^N \big(\hat\varepsilon_i - \hat\varepsilon_{\ell_X(i)}\big)^2 X_i X_i'\right)\left(\frac{X'X}{N}\right)^{-1}.$$

Although in this linear regression case with conditioning on all covariates both V̂_cond and Ṽ_cond are consistent for V_cond, for nonlinear settings, or with conditioning on a subset of the covariates, only the first estimator V̂_cond generalizes. To be specific, suppose that the covariate vector can be partitioned as X_i = (X_{1i}', X_{2i}')', and correspondingly X = (X_1, X_2), and suppose we wish to estimate the variance conditional on X_1 only. In this case, the probability limit of the normalized variance of the least-squares estimator is

$$V_{cond} = \Delta^{-1}\, E\big[\mathbb V(\varepsilon_{pop,i} X_i \mid X_{1i})\big]\, \Delta^{-1}. \quad (2.12)$$

Our proposed estimator for this conditional variance replaces ℓ_X(i) in (2.11) with ℓ_{X_1}(i), the index of the unit closest to i in terms of X_1 only. This estimator is consistent for the conditional variance V_cond. In contrast, replacing ε̂_{ℓ_X(i)} by ε̂_{ℓ_{X_1}(i)} in the expression for Ṽ_cond would not lead to a consistent estimator for the variance. Although the asymptotic variance V_cond is less than or equal to the EHW variance V_pop, this need not hold for the estimators.
In finite samples, it may well be the case that V̂_cond is larger than V̂_pop. We study the finite sample behavior of the variance estimators in a simulation study in Section 6.
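A minimal sketch of (2.11) follows, with hypothetical helper names of our own: each unit is matched to its nearest neighbor in the Mahalanobis metric, and the moment values ε̂_i X_i are differenced. Passing only a subset of the covariates as the matching variables gives the version conditional on X_1.

import numpy as np

def v_cond_hat(X, Y, X_match):
    """Sketch of the matching estimator (2.11).

    X       : (N, K) regressor matrix (may include an intercept column).
    Y       : (N,) outcome vector.
    X_match : (N, Kx) non-constant covariates used for matching; a subset of
              the covariates gives the version conditional on X_1 only.
    """
    N = X.shape[0]
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    eps = Y - X @ theta_hat
    # Nearest other unit in the Mahalanobis metric (O(N^2), for clarity).
    Vinv = np.linalg.inv(np.cov(X_match, rowvar=False).reshape(X_match.shape[1], -1))
    diff = X_match[:, None, :] - X_match[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, Vinv, diff)
    np.fill_diagonal(d2, np.inf)                  # exclude self-matches
    ell = d2.argmin(axis=1)                       # ell_X(i) as in (2.10)
    psi = eps[:, None] * X                        # moment values eps_i * x_i
    dpsi = psi - psi[ell]                         # matched differences
    bread = np.linalg.inv(X.T @ X / N)
    return bread @ (dpsi.T @ dpsi / (2 * N)) @ bread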
In the remainder of this article, we will generalize the results in this section to maximum likelihood and method of moments settings, and state formal results concerning the large sample properties of the variance estimators.

MOTIVATION FOR CONDITIONAL ESTIMANDS
In this section, we address the question whether, when, and why the estimand conditional on the covariates may be of interest. We emphatically do not wish to argue that the conditional estimand is the appropriate object of interest in all cases. Rather, we wish to make the case, through two examples, that the appropriate object depends on the context, and that in some settings the conditional best linear predictor is more appropriate than the standard unconditional estimand.
One way to frame the question is in terms of different repeated sampling perspectives one can take. We can consider the distribution of the least-squares estimator over repeated samples where we redraw the pairs X i and Y i (the random regressor case), or we can consider the distribution over repeated samples where we keep the values of X i fixed and only redraw the Y i (the fixed regressor case). Under general misspecification, both the mean and variance of these two distributions of the estimator will differ. The population estimand θ pop is the approximate (in a large sample sense) average over the repeated samples when we redraw both X i and Y i , and θ cond (X) is the approximate average over the repeated samples where X i is held fixed. Many introductory treatments of regression analysis briefly introduce the fixed and random regressor concepts, with a variety of opinions on what the most relevant perspective is. Wooldridge writes that "reliance on fixed regressors . . .
can have unintended consequences. . . . Because our focus is on asymptotic analysis, we have the luxury of allowing for random explanatory variables throughout the book" (Wooldridge 2002, pp. 10-11). Goldberger (1991) takes a different position, assuming "X nonstochastic, which says that the elements of X are constants, that is, degenerate random variables. Their values are fixed in repeated samples . . ." (Goldberger, p. 164). Van der Vaart (2000) wrote "We assume that the independent variables are a random sample to fit the example in our iid notation, but the analysis could be carried out conditionally as well" (van der Vaart, p. 57), and Gelman and Hill (2007) focus on the fixed regressor perspective, writing "This book follows the usual approach of setting up regression models in the measurement-error framework (y = a + bx + ε), with the sampling interpretation implicit in that the errors ε_1, . . . , ε_n can be considered as a random sample from a distribution" (Gelman and Hill, p. 17). These discussions are in the context of correctly specified regression models, however, where the averages of the distributions under the two repeated sampling perspectives coincide, and their variances agree in large samples. A point that has not received attention in the literature is that under general misspecification, the random versus fixed regressor distinction has implications for inference that do not vanish with the sample size.
Another point is that the sole difference between the population and conditional estimands is the weight function used to measure the difference between the model and the true data generating process. For the population estimand the weight function depends on the population distribution of the potential conditioning variables; for the conditional estimand it depends on their sample distribution. Because the population distribution of these variables, unlike the sample distribution, is unknown, in general there is more uncertainty about the population estimand. Thus, focusing on the conditional estimand θ_cond(X) generally leads to smaller standard errors than focusing on the population estimand θ_pop.

Example I (Convenience Sample)
In the first example, we want to make the case that sometimes there is intrinsically no more interest in θ pop than θ cond because neither the weighting scheme corresponding to the population distribution, nor the weighting scheme corresponding to the empirical distribution function, is obviously of primary interest.
Consider the study of lottery winners by Imbens, Rubin, and Sacerdote (2001), who surveyed individuals who won large prizes in the lottery. Using a standard life-cycle model of labor supply, they focused on linear regressions of subsequent labor earnings on the annual prize and some additional covariates, including prior earnings. The coefficient on the prize in this linear regression can be interpreted as the marginal propensity to earn out of unearned income, an economically meaningful parameter (e.g., Pencavel 1986). Even if the conditional expectation as a function of the prize is nonlinear, it may still be interesting to focus on the coefficient in the linear regression, partly because it facilitates comparison across studies. The question is whether the linear approximation should be based on weighting the squared difference between the true regression function and the linear predictor by the population or the empirical distribution of lottery prizes. There does not appear to be a strong substantive argument for preferring one weighting function (and thus the corresponding estimand) over the other.

Example II (Experimental Design)
Karlan and List (2007) carried out an experimental evaluation of incentives for charitable giving. Among the results Karlan and List report are probit regression estimates where the object of interest is the regression coefficient on the indicator for being offered a matching incentive for charitable giving. The specification of the probit regression function also includes characteristics of the matching incentives.
In this case, the difference between V pop and V cond is that V pop takes into account sampling variation inθ due to variation in the sample values of the matching incentives over the repeated samples, whereas V cond conditions on these values. Given that the distribution of these incentives in this experiment is fixed by the researchers there appears to be no reason to take this uncertainty into account, and we submit that the appropriate measure of uncertainty is V cond rather than V pop .

INFERENCE FOR CONDITIONAL ESTIMANDS
In this section, we present the main formal results of the article, covering linear regression, maximum likelihood, and method of moments estimators. We cover settings where we condition on the full set of regressors as well as cases where we condition on a subset of the regressors. We focus on the just-identified case, although the results can be extended to overidentified generalized method of moments (GMM) settings, for example, using empirical likelihood approaches (e.g., Qin and Lawless 1994; Imbens 1997; Imbens, Johnson, and Spady 1998; Newey and Smith 2004).
Suppose we have a random sample of size N of a pair of random vectors (X_i, Y_i), i = 1, . . . , N. Let X and Y be the N × K_X and N × K_Y matrices with ith rows equal to X_i' and Y_i', respectively. The distinction between X and Y is that we may wish to condition on the X_i in defining the estimand. We are interested in a finite-dimensional parameter θ, defined in general as some function of the joint distribution of (X_i, Y_i). Under some statistical model, it follows that

$$E\big[\psi(Y_i, X_i, \theta)\big] = 0, \quad (4.1)$$

with the dimension of θ equal to that of ψ. The model may have additional implications beyond this moment restriction, but these are not used for estimation. For example, it may be the case that the conditional moment has expectation zero,

$$E\big[\psi(Y_i, X_i, \theta) \mid X_i\big] = 0.$$

Alternatively, we may have specified the joint distribution of Y_i and X_i, in which case ψ(y, x, θ) could equal the score function. In that case, the model has the additional implication that minus the expected value of the derivative of ψ(y, x, θ) with respect to θ is equal to the expected value of the second moments of ψ(y, x, θ). Based only on (4.1), and not on any other implications of the motivating model, we may wish to estimate θ by θ̂, which satisfies

$$\frac{1}{N}\sum_{i=1}^N \psi(Y_i, X_i, \hat\theta) = 0. \quad (4.2)$$

We are interested in the properties of the estimator θ̂ under general misspecification of the model that motivated the moment restriction.
The standard approach to GMM and empirical likelihood estimation (Hansen 1982; Qin and Lawless 1994; Newey and McFadden 1994; Wooldridge 2002; Imbens, Johnson, and Spady 1998) focuses on the value θ_pop that solves

$$E\big[\psi(Y_i, X_i, \theta_{pop})\big] = 0. \quad (4.3)$$

If the pairs (X_i, Y_i), for i = 1, . . . , N, are independent and identically distributed, then under regularity conditions,

$$\sqrt N\,(\hat\theta - \theta_{pop}) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{pop}(\Gamma^{-1})'\big),$$

where Γ = E[∂ψ(Y_i, X_i, θ_pop)/∂θ'] and Λ_pop = E[ψ(Y_i, X_i, θ_pop)ψ(Y_i, X_i, θ_pop)']. Now we focus on the conditional estimand, where we condition on X. Define θ_cond(X) as the solution to

$$\frac{1}{N}\sum_{i=1}^N E\big[\psi(Y_i, X_i, \theta_{cond}(X)) \mid X_i\big] = 0. \quad (4.4)$$

If the original model implied that the conditional expectation of ψ(Y_i, X_i, θ) given X_i is equal to zero, then θ_cond(X) = θ_pop for all X, but this need not hold in general. The motivation for the estimand is the same as in the best-linear-predictor case. In cases where the model implies a conditional moment restriction, but we are concerned about misspecification, we may wish to focus on the value of θ that minimizes the discrepancy between E[ψ(Y_i, X_i, θ)|X_i] and zero. We can weight the discrepancy by the population distribution of the X_i, or by the empirical distribution. The conditional estimand corresponds to the case where the weights are based on the empirical distribution function. We make the following assumptions. These are closely related to standard assumptions used for establishing asymptotic properties of moment-based estimators. See, for example, Newey and McFadden (1994).
Theorem 1. If Assumptions 1 and 2 hold, then (i) θ̂ − θ_pop →p 0, and (ii) θ̂ − θ_cond(X) →p 0.

All proofs are given in the Appendix.
Theorem 2. If Assumptions 1-3 hold, then

$$\sqrt N\,(\hat\theta - \theta_{pop}) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{pop}(\Gamma^{-1})'\big) \quad\text{and}\quad \sqrt N\,(\hat\theta - \theta_{cond}(X)) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{cond}(\Gamma^{-1})'\big),$$

where Λ_cond = E[V(ψ(Y_i, X_i, θ_pop) | X_i)].

Corollary 1. If, in addition, E[ψ(Y_i, X_i, θ_pop) | X_i = x] = 0 for almost all x in the support of X_i, then θ_cond(X) = θ_pop for all X, and √N(θ̂ − θ_pop) and √N(θ̂ − θ_cond(X)) have the same asymptotic distribution.
Assumption 3(ii) requires differentiability of ψ(y, x, θ). This assumption can, however, be replaced by asymptotic equicontinuity conditions as in Huber (1967), Pakes and Pollard (1989), Andrews (1994), or Newey and McFadden (1994). In a supplementary Web Appendix, we show that the results of Theorem 2 and Corollary 1 hold under an asymptotic equicontinuity condition, with the only change that for the nondifferentiable case we have Γ = ∂E[ψ(Y_i, X_i, θ)]/∂θ' evaluated at θ_pop. Example IV below discusses the case of L_1 (quantile) regression. Notice that the consistency result in Theorem 1 does not require differentiability of ψ(y, x, θ) everywhere.
We now discuss two additional examples that illustrate the differences between the large sample variances of √N(θ̂ − θ_pop) and √N(θ̂ − θ_cond(X)). The first example is related to the discussion in Chow (1984).

Example III (Maximum Likelihood Estimation)
Suppose we specify the conditional distribution of Y_i given X_i as f(y|x; θ). We estimate the model by maximum likelihood:

$$\hat\theta = \arg\max_\theta\, \sum_{i=1}^N \ln f(Y_i \mid X_i; \theta).$$

The normalized asymptotic variance under correct specification, and under some regularity conditions, is equal to the inverse of the information matrix I_θ^{-1}, where

$$\mathcal I_\theta = -E\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_i \mid X_i; \theta)\right].$$

Huber (1967) and White (1982) analyzed the properties of the maximum likelihood estimator under general misspecification of the conditional density. Let

$$\theta_{pop} = \arg\max_\theta\, E\big[\ln f(Y_i \mid X_i; \theta)\big].$$

They showed that under general misspecification,

$$\sqrt N\,(\hat\theta - \theta_{pop}) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{pop}(\Gamma^{-1})'\big),$$

where Γ = E[∂² ln f(Y_i|X_i; θ_pop)/∂θ∂θ'] and Λ_pop = E[s_i s_i'], with score s_i = ∂ln f(Y_i|X_i; θ_pop)/∂θ. The conditional version of the estimand under general misspecification is

$$\theta_{cond}(X) = \arg\max_\theta\, \frac{1}{N}\sum_{i=1}^N E\big[\ln f(Y_i \mid X_i; \theta) \mid X_i\big],$$

where the expectation is taken only over the conditional distribution of Y_i given X_i. Theorem 2 implies that

$$\sqrt N\,(\hat\theta - \theta_{cond}(X)) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{cond}(\Gamma^{-1})'\big), \qquad \Lambda_{cond} = E\big[\mathbb V(s_i \mid X_i)\big].$$

If the model is correctly specified, then Λ_pop = Λ_cond. If the model is misspecified, then E[s_i | X_i = x] ≠ 0 for x in a set of positive probability. For such x,

$$E[s_i s_i' \mid X_i = x] = \mathbb V(s_i \mid X_i = x) + E[s_i \mid X_i = x]\,E[s_i \mid X_i = x]',$$

implying that in general Λ_pop − Λ_cond is positive semidefinite.
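The decomposition Λ_pop = Λ_cond + E[E[s_i|X_i]E[s_i|X_i]'] can be checked numerically. The sketch below uses an illustrative design and helper names of our own (not from the paper): it finds θ_pop for a logit with a linear index when the true index is quadratic, and compares the two matrices using the known conditional probabilities.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)

# Hypothetical design: the true P(Y=1|x) has a quadratic index, while the
# fitted logit uses a linear index, so the model is misspecified.
q = lambda x: expit(x + 0.5 * x**2)          # true conditional probability

x = rng.normal(size=200_000)                 # large draw stands in for the population
X = np.column_stack([np.ones_like(x), x])

def neg_exp_loglik(theta):
    # Population (Monte Carlo) expected logit log-likelihood, using E[Y|X] = q(x).
    p = np.clip(expit(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(q(x) * np.log(p) + (1 - q(x)) * np.log(1 - p))

theta_pop = minimize(neg_exp_loglik, np.zeros(2)).x
p = expit(X @ theta_pop)

# The logit score is s_i = (Y_i - p_i) x_i, so conditionally on X:
#   V(s|X) = q(1-q) x x'   and   E[s|X] = (q - p) x.
w_cond = q(x) * (1 - q(x))
w_pop = w_cond + (q(x) - p) ** 2
Lambda_cond = (X * w_cond[:, None]).T @ X / x.size
Lambda_pop = (X * w_pop[:, None]).T @ X / x.size
print(Lambda_pop - Lambda_cond)              # positive semidefinite difference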

Example IV (Quantile Regression)
Suppose that the τth conditional quantile of Y_i given X_i is a linear function, so that the model implies E[ψ(Y_i, X_i, θ)|X_i] = 0 with

$$\psi(y, x, \theta) = \big(\tau - 1\{y \le x'\theta\}\big)\,x.$$

The quantile regression estimator θ̂ (Koenker and Bassett 1978) solves the analogous sample moment restrictions:

$$\frac{1}{N}\sum_{i=1}^N \big(\tau - 1\{Y_i \le X_i'\hat\theta\}\big)\,X_i \approx 0$$

(see Powell 1984). If the quantile regression model is misspecified, so that the τth conditional quantile is nonlinear for some x in a set of positive probability, there will generally still be a value θ_pop that solves (4.3). Under regularity conditions the quantile regression estimator estimates that parameter, and its distribution satisfies

$$\sqrt N\,(\hat\theta - \theta_{pop}) \stackrel{d}{\longrightarrow} \mathcal N\big(0,\ \Gamma^{-1}\Lambda_{pop}(\Gamma^{-1})'\big)$$

(see, e.g., Angrist, Chernozhukov, and Fernández-Val (2006), or the online supplementary materials). Angrist, Chernozhukov, and Fernández-Val (2006) provided an interpretation of quantile regression under misspecification. In the online supplementary materials, we show that, in addition, the corresponding result holds for the conditional estimand, with Λ_cond replacing Λ_pop.
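For a concrete (hypothetical) instance, the following sketch approximates θ_pop for a misspecified median regression (τ = 0.5) by minimizing the sample pinball loss over a large draw; the quadratic conditional median and t-distributed errors are our own illustrative choices, not the paper's.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
tau = 0.5

# Assumed design: the true median of Y given x is nonlinear (x + 0.5 x^2),
# but we fit a linear quantile regression; theta_pop solves (4.3).
x = rng.normal(size=100_000)
y = x + 0.5 * x**2 + rng.standard_t(df=3, size=x.size)   # error has median 0
X = np.column_stack([np.ones_like(x), x])

def pinball(theta):
    # Mean check-function (pinball) loss, whose minimizer is theta_pop.
    u = y - X @ theta
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

theta_pop = minimize(pinball, np.zeros(2), method="Nelder-Mead").x
print(theta_pop)   # the linear approximation to the nonlinear median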

Variance Estimation
Next, we consider estimation of the variance in the general case. Estimation of Γ is the same as for the population estimand,

$$\hat\Gamma = \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial\theta'}\psi(Y_i, X_i, \hat\theta).$$

The key question concerns estimation of Λ_cond. Our proposed estimator matches each unit to the closest unit in terms of X_i, and then differences the values of the moment function:

$$\hat\Lambda_{cond} = \frac{1}{2N}\sum_{i=1}^N \Big(\psi(Y_i, X_i, \hat\theta) - \psi\big(Y_{\ell_X(i)}, X_{\ell_X(i)}, \hat\theta\big)\Big)\Big(\psi(Y_i, X_i, \hat\theta) - \psi\big(Y_{\ell_X(i)}, X_{\ell_X(i)}, \hat\theta\big)\Big)',$$

where ℓ_X(i) is as defined in (2.10). We then combine these estimates to get an estimator for the variance of the conditional estimand:

$$\hat V_{cond} = \hat\Gamma^{-1}\hat\Lambda_{cond}(\hat\Gamma^{-1})'.$$

Assumption 4. The support of X_i is compact. The conditional expectation E[ψ^k(Y_i, X_i, θ) | X_i = x] is Lipschitz in x with constant C_k for k ≤ 4, for all θ in an open neighborhood of θ_pop, where C_k does not depend on θ.
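The construction above is straightforward to implement. The following sketch (function and argument names are ours) assumes the moment values and their derivatives have already been evaluated at θ̂; the nearest-neighbor step uses a KD-tree on whitened covariates, which reproduces the Mahalanobis matching of (2.10).

import numpy as np
from scipy.spatial import cKDTree

def v_cond_gmm(psi, dpsi_dtheta, X_match):
    """Sketch: V_cond estimate Gamma_hat^{-1} Lambda_hat_cond (Gamma_hat^{-1})'.

    psi         : (N, K) moment values psi(Y_i, X_i, theta_hat).
    dpsi_dtheta : (N, K, K) derivatives of psi w.r.t. theta at theta_hat.
    X_match     : (N, Kx) non-constant conditioning variables (no duplicate rows).
    """
    N = psi.shape[0]
    # Whiten X_match so Euclidean nearest neighbors equal Mahalanobis ones.
    cov = np.cov(X_match, rowvar=False).reshape(X_match.shape[1], -1)
    L = np.linalg.cholesky(np.linalg.inv(cov))
    _, idx = cKDTree(X_match @ L).query(X_match @ L, k=2)
    dpsi = psi - psi[idx[:, 1]]                  # difference with nearest match
    Lam_cond = dpsi.T @ dpsi / (2 * N)
    Gam_inv = np.linalg.inv(dpsi_dtheta.mean(axis=0))
    return Gam_inv @ Lam_cond @ Gam_inv.T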

AN APPLICATION TO CROSS-COUNTRY GROWTH REGRESSIONS
For an illustration of the methods discussed in this article, we turn to an analysis in Sachs and Warner (1997) of the determinants of country-level growth rates. Sachs and Warner have data for 83 countries on the country's per capita growth rate between 1965 and 1990, and wish to relate this outcome to country-level fiscal policies. These policies include the degree of openness of the country ("open") and the central government budget balance ("cgb"). Sachs and Warner estimated a linear regression of the per capita growth rate on these variables, also including a number of characteristics of the country, such as its location relative to the tropics and the sea (landlocked or not), and some measures of the economic conditions at the beginning of this period, including gross domestic product in 1965 ("gdp65"). The variables are defined as follows:

open: Fraction of years in which the country is rated as an open economy according to the criteria in Sachs and Warner (1995).
open65: open × gdp65.
dpop: Difference between the growth rate of the economically active population (between ages 15 and 65) and growth of total population.
cgb: Current revenues minus current expenditures of the central government, expressed as a fraction of GDP.
inst: Institutional quality index.
tropics: Approximate proportion of land area subject to a tropical climate.
land: Dummy variable that equals one if a country is landlocked.
sxp: Share of exports of primary products in GNP in 1970.
life: Life expectancy at birth, ca. 1965-1970.
life2: life squared.
The estimates are reported in Table 1, with the variables described at the bottom of the table. We calculate the EHW standard errors, as well as our proposed conditional standard errors, where the variables we condition on include all characteristics of the countries other than the economic policy variables open, open×gdp65, and cgb, which are directly under the control of the government. It would appear reasonable that at least some of these variables should be conditioned on, including whether a country is landlocked and what share of its landmass is in the tropics.
We find that the standard errors for the key variables, the indicator for being open and its interaction with gdp in 1965 go down by about 7%.

TWO SIMULATION STUDIES
In this section, we assess the small sample properties of the variance estimators. We focus on two models, first a linear regression and second a logistic regression model.

A Simulation Study of a Linear Model
We consider estimating a regression function with K regressors. The first regressor, X_1i, has a mixture distribution: with probability 1 − p a normal distribution with mean zero and unit variance, and with probability p a log normal distribution with parameters μ = 0 and σ² = 0.5. We use two values for p in the simulations, p = 0 and p = 0.1, with the latter corresponding to a design with high leverage covariates. The remaining K − 1 covariates have normal distributions with mean zero and unit variance. All covariates are independent. We use two values for the number of covariates: K = 1, where only X_1i is present in the regression function, and K = 5, where there are four additional regressors. We use two sample sizes, N = 50 and N = 200. The conditional distribution of Y_i given (X_1i, . . . , X_Ki) is Normal, with a conditional mean that is linear in the covariates plus a quadratic term δ · X_1i², and a conditional standard deviation that depends on X_1i through the parameter γ. A nonzero value for δ makes the model nonlinear and implies that the linear regression model is misspecified. We use two values for δ. In the first design, we fix δ = 0 (correct specification), and in the second design we use a larger value, δ = 1 (misspecification). A nonzero value for γ implies heteroscedasticity. We use two values for γ, γ = 0 (homoscedasticity) and γ = 0.5 (heteroscedasticity).

Table 2 presents the results, based on 50,000 replications for each design. We focus on the coefficient on X_1i, denoted by θ (dropping the subscript 1 for ease of notation). For all designs, we report four coverage rates. First, the coverage frequency of the conventional (EHW standard error based) 95% confidence interval for θ_pop. This coverage frequency is calculated as the frequency with which |θ̂ − θ_pop|/√(V̂_pop/N) is less than 1.96. Note that both θ_pop and θ_cond(X) need to be numerically evaluated for these data-generating processes. The nominal coverage rate of the confidence intervals is 0.95. Next, we report the frequency with which the same confidence interval covers θ_cond(X), that is, the frequency with which |θ̂ − θ_cond(X)|/√(V̂_pop/N) is less than 1.96. This should in large samples be at least 0.95, and more than 0.95 in misspecified models, according to our formal results. We also report the corresponding coverage rates for confidence intervals based on the conditional standard errors. Now the coverage for θ_pop could be less than 0.95, but the coverage for θ_cond(X) should be 0.95.

In the first design (Design I), with a single covariate, 50 observations, a linear conditional expectation, a normal regressor, and homoscedasticity, both variance estimators lead to coverage rates around 92%-93%, with the EHW variance doing slightly better. With five covariates (Design II), the difference between the two variance estimators (in favor of the EHW variance estimator) becomes more pronounced. Having a skewed distribution for the covariate with some high leverage values does not change the coverage rates very much in Design III. With 200 observations (Design V), the coverage rates become closer to the nominal rates for both variance estimators. Given heteroscedasticity (Design IX), the EHW variance estimator does substantially better, with a coverage rate of 91%, whereas the conditional variance estimator leads to confidence intervals with a coverage rate of 88%. Allowing for misspecification of the regression function (Design XVII) changes the coverage rates substantially. The coverage rate for θ_pop, based on the EHW estimator, is 90%. The coverage rate for θ_cond(X), based on the conditional variance estimator, is much closer to the nominal level, at 0.94.
Over the 32 designs, the worst performance of the EHW variance estimator is in Design XX, with misspecification and high leverage covariates, 50 observations, and 5 covariates, where the coverage rate is 79% instead of 95%. The worst performance of the conditional variance estimator is in Design XII, with a linear model, heteroscedasticity, five covariates, with high leverage, and 50 observations, with an actual coverage rate of 88%. It appears that the conditional variance estimator is more sensitive to heteroscedasticity, but less sensitive to the distribution of the covariates. Overall the worst case for the conditional variance estimator is substantially better than for the EHW variance estimator.
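As a rough illustration of how such coverage rates can be computed, the following sketch uses a simplified design of our own (K = 1, a quadratic conditional mean with δ = 1, homoscedastic Normal errors, and far fewer replications; not the paper's exact specification), together with the estimators V̂_pop and V̂_cond from Section 2.

import numpy as np

rng = np.random.default_rng(2)
N, REPS, DELTA = 50, 2000, 1.0            # modest REPS; the paper uses 50,000

def mu(x):                                 # true conditional mean: quadratic,
    return x + DELTA * x**2                # so the linear model is misspecified

# theta_pop: slope of the best linear predictor under the covariate distribution.
xg = rng.normal(size=1_000_000)
Xg = np.column_stack([np.ones_like(xg), xg])
theta_pop = np.linalg.solve(Xg.T @ Xg, Xg.T @ mu(xg))[1]

cover_pop = cover_cond = 0
for _ in range(REPS):
    x = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    y = mu(x) + rng.normal(size=N)         # homoscedastic Normal errors
    th = np.linalg.solve(X.T @ X, X.T @ y)
    theta_cond = np.linalg.solve(X.T @ X, X.T @ mu(x))[1]  # feasible: mu is known
    e = y - X @ th
    bread = np.linalg.inv(X.T @ X / N)
    v_pop = (bread @ ((X * e[:, None]**2).T @ X / N) @ bread)[1, 1]
    d = np.abs(x[:, None] - x[None, :])    # nearest neighbor on the scalar x
    np.fill_diagonal(d, np.inf)
    m = d.argmin(axis=1)
    dpsi = e[:, None] * X - e[m][:, None] * X[m]
    v_cond = (bread @ (dpsi.T @ dpsi / (2 * N)) @ bread)[1, 1]
    cover_pop += abs(th[1] - theta_pop) <= 1.96 * np.sqrt(v_pop / N)
    cover_cond += abs(th[1] - theta_cond) <= 1.96 * np.sqrt(v_cond / N)

print("coverage of theta_pop  with V_pop_hat :", cover_pop / REPS)
print("coverage of theta_cond with V_cond_hat:", cover_cond / REPS)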

A Simulation Study of a Logistic Regression Model
Next, we do a similar simulation study in a nonlinear setting. Here, the outcome is a binary indicator. We estimate a logistic regression model specified as

$$\Pr(Y_i = 1 \mid X_{1i}, \ldots, X_{Ki}) = \frac{\exp\big(\theta_0 + \sum_{k=1}^K \theta_k X_{ki}\big)}{1 + \exp\big(\theta_0 + \sum_{k=1}^K \theta_k X_{ki}\big)}.$$

The data are generated through a model where a latent index

$$Y_i^* = \theta_0 + \theta_1 X_{1i} + \cdots + \theta_K X_{Ki} + \varepsilon_i$$

determines the observed outcome as the indicator that Y_i* is nonnegative:

$$Y_i = 1\{Y_i^* \ge 0\}.$$

In the base case, there are 50 observations, and ε_i has a logistic distribution so that the logistic regression model is correctly specified. In this case there is a single covariate (K = 1), θ_1 = 1, θ_0 = 0, and the covariate has a standard Normal distribution. We consider combinations of five modifications, similar to those in the linear model. First, we allow for the presence of four additional covariates (K = 5), with the additional covariates all having independent normal distributions and zero coefficients. Second, we change the distribution of the first covariate to include high leverage points by making it a mixture of a standard Normal distribution and a log normal distribution with parameters 0 and 0.5, with the probability of the log normal component equal to 0.1. Third, we change the sample size to 200. Fourth, we multiply the ε_i for all units by exp(1 − 0.5 · X_1i). In the linear case, this corresponds to introducing heteroscedasticity, but here it also implies misspecification of the logistic regression model. Finally, we directly misspecify the regression function by adding a quadratic term in X_1i to the specification of Y_i*.

Table 3 presents the results for the 32 designs generated as combinations of these changes to the base design, based on 50,000 replications. There are some qualitative differences with the simulations for the linear case. There are generally bigger differences between the two variance estimators, V̂_cond and V̂_pop. The coverage rates for confidence intervals for θ_pop based on V̂_pop, and for θ_cond(X) based on V̂_cond, are closer to nominal levels. In contrast, inference for θ_cond(X) based on V̂_pop leads to confidence intervals with substantially higher coverage, and inference for θ_pop based on V̂_cond leads to substantial undercoverage.
In general, inference for θ cond (X) is less affected by the changes in the design than inference for θ pop . For example, the worst design for θ pop is still Design XX, with both misspecification and high leverage covariates, where the coverage rate is 0.930. For the conditional estimand, the worst designs are those with misspecification, with coverage rates around 0.924, still close to the nominal 0.95 level.

CONCLUSION
In this article, we discuss inference for conditional estimands in misspecified models. Following the work by Eicker (1967), Huber (1967), and White (1980a, 1980b, 1982), it is common in empirical work to report robust standard errors. These robust standard errors are valid for the population estimand under random sampling. We show that if one is interested in the conditional estimand, conditional on all or a subset of the covariates, the appropriate standard errors are generally smaller than the EHW robust standard errors. We derive a general characterization of the variance for the conditional estimand and propose a consistent estimator for this variance. We argue that in some settings the conditional estimand may be of more interest than the unconditional one.

APPENDIX A: PROOFS OF THEOREMS
A.1 Proof of Theorem 1

Given Assumptions 1 and 2, Theorem 2.6 in Newey and McFadden (1994) implies the first result. To prove the second result, define ρ(x, θ) = E[ψ(Y_i, X_i, θ) | X_i = x]. Therefore, θ_cond(X) can be thought of as an extremum estimator that minimizes

$$\left\|\frac{1}{N}\sum_{i=1}^N \rho(X_i, \theta)\right\|.$$

We will prove θ_cond(X) − θ_pop →p 0 by showing that Assumption 2 also holds with ρ(X_i, θ) replacing ψ(Y_i, X_i, θ). Because E[ρ(X_i, θ)] = E[ψ(Y_i, X_i, θ)], it follows that part (i) in Assumption 2 holds also with ρ(X_i, θ) replacing ψ(Y_i, X_i, θ). Part (ii) of Assumption 2 follows from dominated convergence because, by Assumption 2(iii), E[sup_{θ∈Θ} ‖ψ(Y_i, X_i, θ)‖ | X_i] < ∞ with probability one. To prove that part (iii) holds also after replacing ψ(Y_i, X_i, θ) with ρ(X_i, θ), notice that, because the norm is a convex function (by the triangle inequality),

$$\sup_{\theta\in\Theta}\|\rho(X_i, \theta)\| \le E\Big[\sup_{\theta\in\Theta}\|\psi(Y_i, X_i, \theta)\| \,\Big|\, X_i\Big].$$

Taking expectations on both sides of the previous inequality and using Assumption 2(iii), we obtain E[sup_{θ∈Θ} ‖ρ(X_i, θ)‖] < ∞. Now, Theorem 2.6 in Newey and McFadden (1994) implies θ_cond(X) − θ_pop →p 0 and, therefore, the second result of the theorem.

A.2 Proof of Theorem 2
The first result follows from Theorem 3.4 in Newey and McFadden (1994).
To prove the second result, we will first establish the joint asymptotic distribution of √N(θ̂ − θ_pop) and √N(θ_cond(X) − θ_pop), and then use this result to derive the asymptotic distribution of √N(θ̂ − θ_cond(X)). By Assumptions 3(ii) and (iv) and Lemma 3.6 in Newey and McFadden (1994) we obtain that, for x in a set of probability one, ρ(x, θ) is continuously differentiable with respect to θ in an open neighborhood N of θ_pop, with

$$\frac{\partial\rho(x, \theta)}{\partial\theta'} = E\left[\frac{\partial\psi(Y_i, X_i, \theta)}{\partial\theta'} \,\Big|\, X_i = x\right].$$

Because ρ(X_i, θ_pop) = E[ψ(Y_i, X_i, θ_pop) | X_i], taking expectations eliminates the cross-product term in the decomposition of Λ_pop, which implies

$$E\big[\rho(X_i, \theta_{pop})\,\rho(X_i, \theta_{pop})'\big] = \Lambda_{pop} - \Lambda_{cond}.$$

Using convexity of the norm, we obtain

$$\sup_{\theta\in\mathcal N}\left\|\frac{\partial\rho(x, \theta)}{\partial\theta'}\right\| \le E\left[\sup_{\theta\in\mathcal N}\left\|\frac{\partial\psi(Y_i, X_i, \theta)}{\partial\theta'}\right\| \,\Big|\, X_i = x\right].$$

Taking averages on both sides of the last inequality and using Assumption 3(iv), we obtain

$$\frac{1}{N}\sum_{i=1}^N \frac{\partial\rho(X_i, \theta_{pop})}{\partial\theta'} \stackrel{p}{\longrightarrow} \Gamma,$$

which is nonsingular by Assumption 3(v). As a result, Theorem 3.4 in Newey and McFadden (1994) holds for the estimator that minimizes

$$\left\|\frac{1}{N}\sum_{i=1}^N \psi(Y_i, X_i, \theta_1)\right\| + \left\|\frac{1}{N}\sum_{i=1}^N \rho(X_i, \theta_2)\right\|$$

with respect to θ_1 and θ_2. Applying Theorem 3.4 of Newey and McFadden (1994), we obtain

$$\sqrt N \begin{pmatrix} \hat\theta - \theta_{pop} \\ \theta_{cond}(X) - \theta_{pop} \end{pmatrix} \stackrel{d}{\longrightarrow} \mathcal N(0, V_{joint}),$$

where V_joint is equal to

$$\begin{pmatrix} \Gamma^{-1}\Lambda_{pop}(\Gamma^{-1})' & \Gamma^{-1}(\Lambda_{pop}-\Lambda_{cond})(\Gamma^{-1})' \\ \Gamma^{-1}(\Lambda_{pop}-\Lambda_{cond})(\Gamma^{-1})' & \Gamma^{-1}(\Lambda_{pop}-\Lambda_{cond})(\Gamma^{-1})' \end{pmatrix}.$$

Because √N(θ̂ − θ_cond(X)) = √N(θ̂ − θ_pop) − √N(θ_cond(X) − θ_pop), it follows that √N(θ̂ − θ_cond(X)) →d N(0, V_gmm,cond), where V_gmm,cond = Γ^{-1}Λ_cond(Γ^{-1})', and Λ_cond = E[V(ψ(Y_i, X_i, θ_pop) | X_i)].

A.3 Proof of Corollary 1
The result follows directly from Theorem 2.

We next state a lemma from Abadie and Imbens (2010).

Lemma A.1. Let W_i, i = 1, . . . , N, be independent and identically distributed with compact support. Then E[‖W_{ℓ_W(i)} − W_i‖] → 0 as N → ∞.

Lemma A.2. Let (W_i, V_i), i = 1, . . . , N, be a sequence of independent, identically distributed random variables, with V_i scalar, and with compact support for W_i. For some positive integer n, and for p = 1, 2, . . . , n, let μ_p(w) = E[V_i^p | W_i = w] be Lipschitz in w with constant C_p. Then, for all nonnegative k, m such that max(k, m) ≤ n/2,

$$\frac{1}{N}\sum_{i=1}^N V_i^k V_{\ell_W(i)}^m \stackrel{p}{\longrightarrow} E\big[\mu_k(W_i)\,\mu_m(W_i)\big].$$

A.4 Proof of Lemma A.2

Because E[V_i^k V_{ℓ_W(i)}^m | W] = μ_k(W_i)μ_m(W_{ℓ_W(i)}), with μ_m Lipschitz and the W_i bounded, it follows that

$$E\big[V_i^k V_{\ell_W(i)}^m\big] - E\big[\mu_k(W_i)\,\mu_m(W_i)\big] \longrightarrow 0 \quad (A.2)$$

by Lemma A.1 and dominated convergence. This finishes the proof of (A.2). Next, we will show that the variance of the average vanishes, which, together with (A.2), proves the claim in the lemma. First, we expand the square of the centered average; by (A.2), the centering terms converge, so it remains to control the second moment term (A.4). Consider the first term in (A.4). Using the independence of V_i and V_{ℓ_W(i)} conditional on W, the relevant moments are bounded because of the Lipschitz condition on μ_p(w) for all p at least equal to 2k and 2m. Therefore, the first term in (A.4) is o(1). We write the expectation of the first term conditional on W as a sum over matched pairs. The number of terms in the second sum is limited by the "kissing number," the number of units a given unit can be the closest match for (Miller et al. 1997; see also Abadie and Imbens 2010), which depends on the dimension of W_i. Let the kissing number be denoted by L.

Then, for given i there is only one j such that ℓ_W(i) = j, and at most L values of j such that ℓ_W(j) = i.

Because of the Lipschitz condition on μ_p(w) = E[V_i^p | W_i = w], the remainder goes to zero by Lemma A.1. Hence (A.6) is asymptotically equivalent to (A.7), and the difference between (A.7) and (A.8) vanishes, which completes the proof.

Lemma A.3. Suppose (W_i, V_i), i = 1, . . . , N, satisfy the conditions of Lemma A.2 with n = 4, and let

$$\hat V_{cond} = \frac{1}{2N}\sum_{i=1}^N \big(V_i - V_{\ell_W(i)}\big)^2.$$

Then:

$$\hat V_{cond} \stackrel{p}{\longrightarrow} E\big[\mathbb V(V_i \mid W_i)\big]. \quad (A.10)$$

A.5 Proof of Lemma A.3
To prove V̂_cond →p E[V(V_i | W_i)], without loss of generality we focus on the case with V_i scalar. Write

$$\hat V_{cond} = \frac{1}{2N}\sum_{i=1}^N \big(V_i - V_{\ell_W(i)}\big)^2 = \frac{1}{2N}\sum_{i=1}^N V_i^2 + \frac{1}{2N}\sum_{i=1}^N V_{\ell_W(i)}^2 - \frac{1}{N}\sum_{i=1}^N V_i V_{\ell_W(i)}.$$

Because (1/N)Σ_i V_i² →p E[V_i²] = E[μ_2(W_i)] by the law of large numbers, it is sufficient to show

$$\frac{1}{N}\sum_{i=1}^N V_{\ell_W(i)}^2 \stackrel{p}{\longrightarrow} E[\mu_2(W_i)] \quad\text{and}\quad \frac{1}{N}\sum_{i=1}^N V_i V_{\ell_W(i)} \stackrel{p}{\longrightarrow} E\big[\mu_1(W_i)^2\big]. \quad (A.11)$$

The first part of (A.11) follows from applying Lemma A.2 with k = 0 and m = 2, and the second part follows from applying Lemma A.2 with k = m = 1.

A.6 Proof of Theorem 3
Since θ̂ →p θ_pop and ψ(Y_i, X_i, θ) is differentiable in θ, Γ̂ →p Γ by the law of large numbers. Then, it is sufficient to show Λ̂_cond →p Λ_cond. Define Λ̃_cond as the analog of Λ̂_cond with θ̂ replaced by θ_pop; the result follows from applying Lemma A.3 to Λ̃_cond and showing that the difference between Λ̂_cond and Λ̃_cond vanishes in probability.

SUPPLEMENTARY MATERIALS
The supplementary materials contain the proofs of Theorem 2 and Corollary 1 under an asymptotic equicontinuity condition, and an application to quantile regression. [Received October 2012. Revised May 2014.]