Weak Identification in Fuzzy Regression Discontinuity Designs

In fuzzy regression discontinuity (FRD) designs, the treatment effect is identified through a discontinuity in the conditional probability of treatment assignment. We show that when identification is weak (i.e., when the discontinuity is of small magnitude), the usual t-test based on the FRD estimator and its standard error suffers from asymptotic size distortions, as in a standard instrumental variables setting. This problem can be especially severe in the FRD setting, since only observations close to the discontinuity are useful for estimating the treatment effect. To eliminate those size distortions, we propose a modified t-statistic that uses a null-restricted version of the standard error of the FRD estimator. Simple and asymptotically valid confidence sets for the treatment effect can also be constructed using this null-restricted standard error. An extension to testing for constancy of the regression discontinuity effect across covariates is also discussed. Supplementary materials for this article are available online.


INTRODUCTION
Since the late 1990s, regression discontinuity (RD) and fuzzy regression discontinuity (FRD) designs have been of growing importance in applied economics. There is extensive theoretical work on RD and FRD designs. A few examples include Hahn, Todd, and Van der Klaauw (1999, 2001); Porter (2003); Buddelmeyer and Skoufias (2004); McCrary (2008); Frölich (2007); Frölich and Melly (2008); Otsu, Xu, and Matsushita (2015); Imbens and Kalyanaraman (2012); Calonico, Cattaneo, and Titiunik (2014); Arai and Ichimura (2013); Papay, Willett, and Murnane (2011); Imbens and Zajonc (2011); Dong and Lewbel (2010); and Fe (2012). See Van der Klaauw (2008) and Lee and Lemieux (2010) for a review of much of this literature. Hundreds of recent applied articles have used RD, and in many cases FRD, designs. (For example, as of July 18, 2013, Imbens and Lemieux's (2008) review of RD and FRD best practices was cited in 990 articles according to Google Scholar, with 372 of these articles explicitly considering FRD.) Around the same time, the seminal works of Bound, Jaeger, and Baker (1995) and Staiger and Stock (1997) made weak identification in an instrumental variables (IV) context an important consideration in applied work (see Stock, Wright, and Yogo 2002, and Andrews and Stock 2007, for surveys of the literature). However, despite the close parallel between an IV setting and the FRD design (see Hahn, Todd, and Van der Klaauw 2001), there has been no theoretical or practical attempt to deal with weak identification in the FRD design more broadly.
To get a sense of the practical importance of weak identification in the FRD design, we have examined a sample of influential applied articles that use the design. We then apply the F-statistic standards discussed below to see how many of these articles may suffer from a weak identification problem. We find that in about half of the articles where enough information is reported to compute the F-statistic, weak identification appears to be a problem in at least one of the empirical specifications. (For the procedure followed to obtain the sample of articles, see the online supplement, Section 1.) We take this as evidence that weak identification is a serious concern in the applied FRD design literature. Since it is a matter of practical importance, we examine weak identification in the context of the FRD design, demonstrate the problems that arise, and propose uniformly valid testing procedures for treatment (RD) effects.
In this article, we show that the local-to-zero analytical framework common in the weak instruments literature can be adapted to FRD, and that when identification is weak, the usual t-test based on the FRD estimator and its standard error suffers from asymptotic size distortions. The usual confidence intervals constructed as estimate ± constant × standard error are also invalid, because their asymptotic coverage probability can fall below the assumed nominal coverage when identification is weak. We rely on novel techniques recently developed in the literature on uniform size properties of tests and confidence sets (Andrews, Cheng, and Guggenberger 2011) to formally justify our local-to-zero framework. Unlike the framework used in the weak IV literature, ours depends not only on the sample size but also on a smoothing parameter (the bandwidth).
We suggest a simple modification to the t-test that eliminates the asymptotic size distortions caused by weak identification. Unlike the usual t-statistic, the modified t-statistic uses a null-restricted version of the standard error of the FRD estimator. The modified statistic can be used with standard normal critical values for two-sided testing. For two-sided testing, the proposed test is equivalent to the Anderson-Rubin test (Anderson and Rubin 1949) adopted in the weak IV literature (Staiger and Stock 1997). For one-sided testing, the modified t-statistic has to be used with nonstandard critical values that must be simulated on a case-by-case basis following the approach of Moreira (2001, 2003). We discuss how to evaluate the magnitude of potential size distortions in practice following the approach of Stock and Yogo (2005). The strength of identification is measured by the concentration parameter, which in the case of FRD depends on the magnitude of the discontinuity in the treatment variable and on the density of the assignment variable (the variable that determines treatment assignment). The magnitude of potential size distortions can be assessed by testing hypotheses about the concentration parameter with noncentral χ²_1 critical values using the F-statistic, which is an analog of the first-stage F-statistic in IV regression. Surprisingly, we find critical values that are much higher than would be required in a simple IV setting. When the F-statistic is only around 10, which is often used as a threshold value for weak/strong identification in the IV literature, a two-sided test with nominal size of 5% is in fact a 13.6% test, and a 5% one-sided test is in fact a 16.9% test. Nearly zero (under 0.5%) size distortions of a 5% two-sided test correspond to values of the F-statistic above 93.
Asymptotically valid confidence sets for the treatment effect can be obtained by inverting tests based on the modified t-statistic. Since the FRD is an exactly identified model, these confidence sets are easy to compute, as their construction only involves solving a quadratic equation. Most of the literature on weak instruments deals with the case of overidentified models (see, e.g., Andrews and Stock 2007). In exactly identified models, the approach suggested by Anderson and Rubin (1949) results in efficient inference if instruments turn out to be strong and remains valid if instruments are weak. However, in overidentified models, Anderson and Rubin's tests are no longer efficient even when instruments are strong. Several articles (Kleibergen 2002; Moreira 2003; Andrews, Moreira, and Stock 2006) proposed modifications to Anderson and Rubin's basic procedure to regain efficiency in overidentified models. Since the FRD design is an exactly identified model, we can adapt Anderson and Rubin's approach without any loss of power. These confidence sets are expected to be as informative as the standard ones when identification is strong. However, unlike the usual confidence intervals, the confidence sets we propose can be unbounded with positive probability. This property is expected from valid confidence sets in situations with local identification failure and an unbounded parameter space (see Dufour 1997). In a recent article, Otsu, Xu, and Matsushita (2015) proposed empirical likelihood-based inference for the RD effect. Using the profile empirical likelihood function, they proposed confidence sets for the RD effect, which are expected to be robust against weak identification. However, they did not provide a formal analysis of weak identification.
While their method does not involve variance estimation and for that reason can enjoy better higher-order properties than our approach, it requires numerical computation of the empirical likelihood function and is computationally more demanding.
We also discuss testing whether the RD effect is homogeneous across values of some covariates. The proposed testing approach is designed to remain asymptotically valid when identification is weak. This is achieved by building a robust confidence set for a common RD effect across covariates. The null hypothesis of a common RD effect is rejected when that confidence set is empty.
To illustrate how our proposed confidence sets may differ from the standard ones in practice, we compare the results of applying the standard confidence sets and the proposed confidence sets in two separate applications that use the FRD design to estimate the effect of class size on student achievement. Our main finding is that, as weak identification becomes more likely, the standard confidence sets and the weak identification robust confidence sets become increasingly divergent. Interestingly, in a number of cases the robust confidence sets provide more informative answers than the standard ones. More generally, the empirical applications, along with a Monte Carlo study reported in an online supplement, suggest that our simple and robust procedure for computing confidence sets performs well when identification is either strong or weak.
The rest of the article proceeds as follows. In Section 2, we describe the FRD model, derive the uniform asymptotic size of usual t-tests for FRD, discuss size distortions and testing for potential size distortions, and describe weak-identification-robust inference for FRD. Section 3 discusses robust testing for constancy of the RD effect across covariates. We present our empirical applications in Section 4. The online supplement contains additional materials including the proofs and the Monte Carlo results.

The Model, Estimation, and Standard Inference Approach
In RD designs, the observed outcome variable y_i is modeled as y_i = y_{0i} + x_i β_i, where x_i is the treatment indicator variable, y_{0i} is the outcome without treatment, and β_i is the random treatment effect for observation i. If x_i is binary, it takes on value one if the treatment is received and zero otherwise. When there are treatments of different intensity, x_i may be nonbinary. The treatment assignment depends on another observable variable, the assignment variable z_i, through E(x_i | z_i = z). The main feature of this framework is that E(x_i | z_i = z) is discontinuous at some known cutoff point z_0, while E(y_{0i} | z_i) is assumed to be continuous at z_0.
For binary x_i, when |lim_{z↓z_0} E(x_i | z_i = z) − lim_{z↑z_0} E(x_i | z_i = z)| = 1, we have a sharp RD design, and a fuzzy design otherwise. When x_i is a continuous treatment variable, the design is sharp if x_i is a deterministic function of z_i, and fuzzy otherwise.
The focus of this article is fuzzy designs, and the main object of interest is the RD effect

β = (y⁺ − y⁻)/(x⁺ − x⁻), (1)

where y⁺ = lim_{z↓z_0} E(y_i | z_i = z), y⁻ = lim_{z↑z_0} E(y_i | z_i = z), and x⁺ and x⁻ are defined similarly with y_i replaced by x_i. The exact interpretation of β depends on the assumptions that the econometrician is willing to make in addition to Assumption 1. As discussed by Hahn, Todd, and Van der Klaauw (2001), if β_i and x_i are assumed to be independent conditional on z_i, then β captures the average treatment effect (ATE) at z_i = z_0: β = E(β_i | z_i = z_0). When x_i is binary and under an alternative set of conditions, which allow for dependence between x_i and β_i, Hahn, Todd, and Van der Klaauw (2001) showed that the RD effect captures the local ATE (LATE), or ATE for compliers, at z_0, where compliers are observations for which x_i switches its value from zero to one when z_i changes from z_0 − e to z_0 + e for all small e > 0. (See the discussion on page 204 of their article.) Regardless of its interpretation, the RD effect is estimated by replacing the unknown population objects in (1) with their estimates. Following Hahn, Todd, and Van der Klaauw (2001), it is now a standard approach to estimate y⁺, y⁻, x⁺, and x⁻ using local linear kernel regression. Let K(·) and h_n denote the kernel function and bandwidth, respectively. For estimation of y⁺, the local linear regression is

(â_n, b̂_n) = argmin_{a,b} Σ_{i=1}^n 1{z_i ≥ z_0} K((z_i − z_0)/h_n) (y_i − a − b(z_i − z_0))², (2)

and the local linear estimator of y⁺ is given by ŷ⁺_n = â_n. The local linear estimator for y⁻ can be constructed analogously by replacing 1{z_i ≥ z_0} with 1{z_i < z_0} in (2). Similarly, one can estimate x⁺ and x⁻ by replacing y_i with x_i. Let ŷ⁻_n, x̂⁺_n, and x̂⁻_n denote the local linear estimators of y⁻, x⁺, and x⁻, respectively. The corresponding estimator of β is given by β̂_n = (ŷ⁺_n − ŷ⁻_n)/(x̂⁺_n − x̂⁻_n). The asymptotic properties of the local linear estimators and β̂_n are discussed in Hahn, Todd, and Van der Klaauw (1999) and Imbens and Lemieux (2008). We assume that the following conditions are satisfied.
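The estimation steps above can be sketched in a few lines. The following is an illustrative implementation, not the authors' code: it estimates each one-sided limit by weighted least squares with a triangular kernel (any kernel satisfying Assumption 2(a) would do) and forms the ratio estimator; all function names are our own.

```python
import numpy as np

def local_linear_limit(z, v, z0, h, side):
    """Local linear estimate of the limit of E(v | z) as z -> z0 from one side.

    side=+1 uses observations with z >= z0, side=-1 those with z < z0.
    The triangular kernel is an illustrative choice."""
    u = (z - z0) / h
    w = np.maximum(1.0 - np.abs(u), 0.0)                 # triangular kernel weights
    w = w * ((z >= z0) if side > 0 else (z < z0))        # keep one side of the cutoff
    X = np.column_stack([np.ones_like(z), z - z0])
    A = X.T @ (w[:, None] * X)                           # weighted normal equations
    b = X.T @ (w * v)
    return np.linalg.solve(A, b)[0]                      # intercept = boundary limit

def frd_estimate(z, y, x, z0, h):
    """FRD estimate (1): ratio of the outcome jump to the treatment jump at z0."""
    dy = local_linear_limit(z, y, z0, h, +1) - local_linear_limit(z, y, z0, h, -1)
    dx = local_linear_limit(z, x, z0, h, +1) - local_linear_limit(z, x, z0, h, -1)
    return dy / dx
```

In a simulated fuzzy design where the treatment probability jumps by 0.6 at the cutoff and the true effect is 2, `frd_estimate` recovers the effect up to sampling noise.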
Assumption 2.
a. K(·) is a continuous, symmetric around zero, nonnegative, and compactly supported second-order kernel.
b. {(y_i, x_i, z_i)}_{i=1}^n are iid; y_i, x_i, z_i have a joint distribution F such that:
i. f_z(·) (the marginal PDF of z_i) exists and is bounded from above, bounded away from zero, and twice continuously differentiable with bounded derivatives on N_{z_0} (a small neighborhood of z_0).
ii. E(y_i | z_i) and E(x_i | z_i) are bounded on N_{z_0} and twice continuously differentiable with bounded derivatives on N_{z_0}\{z_0}; lim_{e↓0} (d^p/dz^p) E(y_i | z_i = z_0 ± e) and lim_{e↓0} (d^p/dz^p) E(x_i | z_i = z_0 ± e) exist for p = 0, 1, 2.
iii. σ_y²(z) = var(y_i | z_i = z) and σ_x²(z) = var(x_i | z_i = z) are bounded from above and bounded away from zero on N_{z_0}; lim_{e↓0} σ_y²(z_0 ± e), lim_{e↓0} σ_x²(z_0 ± e), and lim_{e↓0} σ_xy(z_0 ± e) exist, where σ_xy(z_i) = cov(x_i, y_i | z_i); |ρ_xy| ≤ ρ̄ for some ρ̄ < 1, where ρ_xy = σ_xy/(σ_x σ_y), σ_xy = lim_{e↓0} (σ_xy(z_0 + e) + σ_xy(z_0 − e)), and σ_x² and σ_y² are defined similarly with the conditional covariance replaced by the conditional variances of x_i and y_i, respectively.
iv. For some δ > 0, E(|y_i|^{2+δ} | z_i) and E(|x_i|^{2+δ} | z_i) are bounded on N_{z_0}.
c. As n → ∞, h_n → 0, nh_n³ → ∞, and (nh_n)^{1/2} h_n² → 0.
Remark.
(1) The smoothness conditions imposed in Assumption 2(b) are standard for kernel estimation except for the left/right limit conditions in parts (ii) and (iii), which are due to the discontinuity design and have been used in Hahn, Todd, and Van der Klaauw (1999). (2) Asymptotic normality of the local linear estimators is established using Lyapounov's central limit theorem (CLT), and part (iv) of Assumption 2(b) can be used to verify Lyapounov's condition (see Davidson 1994, Theorem 23.12, p. 373). (3) With twice differentiable functions, the bias of the local linear estimators is of order h_n² even near the boundaries. The condition (nh_n)^{1/2} h_n² → 0 in Assumption 2(c) is an undersmoothing condition, which makes the contribution of the bias term to the asymptotic distribution negligible. The condition nh_n³ → ∞ ensures that the variance of the local linear estimator tends to zero. Assumption 2(c) is satisfied if the bandwidth is chosen according to the rule h_n = constant × n^{−r} with 1/5 < r < 1/3.

It is convenient for our purposes to present the asymptotic properties of the local linear estimators and the FRD estimator as follows. Define

k = ∫_0^∞ ((S_2 − S_1 u) K(u))² du / (S_0 S_2 − S_1²)², where S_j = ∫_0^∞ u^j K(u) du, j = 0, 1, 2.

The constant k is known, as it depends only on the kernel function. In the case of asymmetric kernels, we would have two different constants for the left and right estimators, with the bounds of integration replaced by (−∞, 0] for the left estimators. For Δy = y⁺ − y⁻, Δŷ_n = ŷ⁺_n − ŷ⁻_n, and similarly defined Δx and Δx̂_n, by Assumption 2 and Lyapounov's CLT we have

(nh_n f_z(z_0)/k)^{1/2} (Δŷ_n − Δy, Δx̂_n − Δx) →_d (σ_y Y, σ_x X), (4)

where Y and X are two bivariate normal variables with zero means, unit variances, and correlation coefficient ρ_xy (the latter is defined in Assumption 2(b)iii together with σ_x and σ_y). This in turn implies that, under standard asymptotics,

(nh_n)^{1/2} (β̂_n − β) →_d N(0, kσ²(β)/(f_z(z_0) Δx²)), where σ²(b) = σ_y² + b²σ_x² − 2bσ_xy. (5)

The last result holds due to Assumption 1(a), that is, only when Δx ≠ 0 and is fixed.
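The bandwidth rates in Remark (3) can be verified numerically. The snippet below (our own illustration) checks that for h_n = n^{−r} with 1/5 < r < 1/3, the variance term n·h_n³ diverges while the undersmoothing term (n·h_n)^{1/2}·h_n² vanishes as n grows.

```python
# For h_n = n**(-r) with 1/5 < r < 1/3:
#   n * h_n**3 = n**(1 - 3r)        -> infinity (variance control),
#   (n*h_n)**0.5 * h_n**2 = n**((1 - 5r)/2) -> 0 (undersmoothing).
def rate_terms(n, r):
    h = float(n) ** (-r)
    return n * h ** 3, (n * h) ** 0.5 * h ** 2

var_lo, bias_lo = rate_terms(10 ** 4, 0.25)   # r = 1/4 lies in (1/5, 1/3)
var_hi, bias_hi = rate_terms(10 ** 8, 0.25)
assert var_hi > var_lo    # n*h^3 grows
assert bias_hi < bias_lo  # sqrt(n*h)*h^2 shrinks
```

With r = 1/4, the exponents are n^{1/4} and n^{−1/8}, so both conditions in Assumption 2(c) hold.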
The asymptotic variances and covariance σ_y², σ_x², and σ_xy can be consistently estimated by kernel-based sample analogs σ̂²_{y,n}, σ̂²_{x,n}, and σ̂_{xy,n} constructed from the local linear residuals on the two sides of the cutoff, with σ̂_{xy,n} obtained by replacing squared residuals with cross-products of the residuals for x_i and y_i. Hence, a consistent estimator of σ²(b) can be constructed as σ̂²_n(b) = σ̂²_{y,n} + b²σ̂²_{x,n} − 2bσ̂_{xy,n}. A common inference approach for the FRD effect is based on the usual t-statistic. Thus, when testing H_0: β = β_0 one typically computes

T_n(β_0) = (β̂_n − β_0)/se(β̂_n), where se(β̂_n) = (kσ̂²_n(β̂_n)/(nh_n f̂_{z,n}(z_0)))^{1/2}/|Δx̂_n|,

and compares it with standard normal critical values, as T_n(β) →_d N(0, 1) when Δx ≠ 0 and is fixed. Confidence intervals for β are constructed by collecting all values β_0 for which H_0: β = β_0 cannot be rejected using a test based on T_n(β_0).
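The usual t-statistic can be sketched as follows. This is an illustration with our own function names; `scale` stands for the known factor (n·h_n·f̂_z(z_0)/k)^{1/2}, and all variance inputs are assumed already estimated.

```python
import numpy as np

def sigma2_hat(b, sy2, sx2, sxy):
    """Plug-in sigma^2(b) = sigma_y^2 + b^2*sigma_x^2 - 2*b*sigma_xy."""
    return sy2 + b ** 2 * sx2 - 2.0 * b * sxy

def usual_t(dy, dx, sy2, sx2, sxy, scale, b0):
    """Usual FRD t-statistic T_n(b0) (sketch).

    dy, dx are the estimated jumps in y and x; the standard error
    evaluates sigma^2(.) at bhat, not at the hypothesized b0."""
    bhat = dy / dx
    se = np.sqrt(sigma2_hat(bhat, sy2, sx2, sxy)) / (scale * abs(dx))
    return bhat, (bhat - b0) / se
```

For example, with jumps dy = 2, dx = 1, unit variances, zero covariance, and scale 10, the statistic for H_0: β = 0 equals 20/√5.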

Weak Identification in FRD
Weak identification is a finite-sample problem that occurs when the noise due to sampling error is of the same order of magnitude as, or even dominates, the signal in estimation of a model's parameters. In such cases, the asymptotic normality result T_n(β) →_d N(0, 1) provides a poor approximation to the actual distribution of the t-statistic, and as a result inference may be distorted.
Assuming that H_0: β = β_0 holds, we can rewrite the t-statistic as

T_n(β_0) = [(nh_n f̂_{z,n}(z_0)/k)^{1/2} (Δŷ_n − β_0 Δx̂_n)/σ̂_n(β̂_n)] × sign(Δx̂_n).

When testing H_0 against two-sided alternatives, one uses the absolute value of T_n(β), which eliminates the sign term. Since under standard (fixed distribution) asymptotics (nh_n)^{1/2}(Δŷ_n − βΔx̂_n) →_d N(0, kσ²(β)/f_z(z_0)), the usual t-test has no size distortions as long as β̂_n is consistent and σ̂²_n(β̂_n) approximates σ²(β) well. Define Y_n = (f_z(z_0)/k)^{1/2}(nh_n)^{1/2}(Δŷ_n − Δy) and X_n = (f_z(z_0)/k)^{1/2}(nh_n)^{1/2}(Δx̂_n − Δx). We can now write

β̂_n = (Y_n + (f_z(z_0)/k)^{1/2}(nh_n)^{1/2}Δy) / (X_n + (f_z(z_0)/k)^{1/2}(nh_n)^{1/2}Δx).

Note that in the above expression, the estimation errors Y_n and X_n represent the noise components, while the signal component is given by (nh_n)^{1/2}Δx. Since the noise terms have bounded variances, the signal dominates the noise as long as (nh_n)^{1/2}|Δx| → ∞. In this case, β̂_n →_p β. If, however, lim_{n→∞} |(nh_n)^{1/2}Δx| < ∞, the signal and noise are of the same magnitude, which results in inconsistency of the FRD estimator and weak identification. Thus, similarly to the weak IV literature (Staiger and Stock 1997), it is appropriate to model weak identification by assuming that Δx is inversely related to the square root of the sample size. However, the kernel estimation framework and the presence of the bandwidth, which is chosen by the econometrician, require some adjustments. Suppose one models weak identification as Δx ∼ 1/(ng_n)^{1/2} for some sequence g_n → 0 as n → ∞. In this case, the econometrician can obtain consistency of β̂_n and resolve weak identification simply by choosing h_n so that h_n/g_n → ∞. This situation resembles so-called nearly weak or semistrong identification; see Hahn and Kuersteiner (2002), Caner (2009), Antoine and Renault (2009, 2012), and Antoine and Lavergne (2014). Hence, the worst-case scenario, in which the econometrician cannot resolve weak identification by tweaking the bandwidth, occurs when g_n = h_n, that is, Δx ∼ 1/(nh_n)^{1/2}.
This idea can be formalized using the results obtained in the recent literature on uniform size properties of tests and confidence sets: Andrews and Guggenberger (2010), Andrews and Cheng (2012), and Andrews, Cheng, and Guggenberger (2011). The latter article provides a general framework for establishing uniform size properties of tests and confidence sets. To describe this framework, let S_n be a test statistic with exact finite-sample distribution (in a sample of size n) determined by λ ∈ Λ. Note that λ may include infinite-dimensional components such as distribution functions. Let cr_n(α) denote a possibly data-dependent critical region for nominal significance level α. The test rejects a null hypothesis when S_n ∈ cr_n(α), and the rejection probability is given by RP_n(λ) = P_λ(S_n ∈ cr_n(α)), where the subscript λ in P_λ indicates that the probability is computed for a given value of λ ∈ Λ. The exact size is defined as ExSz_n = sup_{λ∈Λ} RP_n(λ). Note that ExSz_n captures the maximum rejection probability over all combinations of parameters λ (the worst-case scenario). In large samples, the exact size is approximated by the asymptotic size AsySz = lim sup_{n→∞} sup_{λ∈Λ} RP_n(λ). Contrary to the usual point-wise asymptotic approach, AsySz is determined by taking the supremum over the parameter space before taking the limit with respect to n. It has been argued in many articles that controlling AsySz is crucial for ensuring reliable inference when test statistics have a discontinuous asymptotic distribution, that is, when the point-wise asymptotic distribution is discontinuous in a parameter. On the importance of uniform size, see, for example, Imbens and Manski (2004, p. 1848), Mikusheva (2007), and references in Andrews, Cheng, and Guggenberger (2011). In what follows, we rely on the following result of Andrews, Cheng, and Guggenberger (2011). (Lemma 1 combines Assumption B and Theorems 2.1 and 2.2 in Andrews, Cheng, and Guggenberger (2011).)
Lemma 1 (Andrews, Cheng, and Guggenberger 2011). Let {d_n(λ) : n ≥ 1} be a sequence of functions, where d_n : Λ → D. Suppose that for any subsequence {p_n} of {n} and any sequence {λ_{p_n} ∈ Λ} for which d_{p_n}(λ_{p_n}) → d ∈ D, we have that RP_{p_n}(λ_{p_n}) → RP(d) for some function RP(d) ∈ [0, 1]. Then, AsySz = sup_{d∈D} RP(d).
To apply Lemma 1, we define

λ_1 = (f_z(z_0)/k)^{1/2} |Δx|/σ_x, λ_2 = ρ_xy, λ_3 = βσ_x/σ_y. (6)

We define λ_4 = F, where F is the joint distribution of x_i, y_i, z_i and is such that, given λ_1 ∈ R_+, λ_2 ∈ [−ρ̄, ρ̄], and λ_3 ∈ R, the three equations in (6) hold. Note that λ_4 is an infinite-dimensional parameter that depends on λ_1, λ_2, and λ_3. As explained by Andrews, Cheng, and Guggenberger (2011, pp. 8-9), d_n(λ) is chosen so that when d_n(λ_n) converges to d ∈ D for some sequence of parameters {λ_n ∈ Λ : n ≥ 1}, the test statistic converges to some limiting distribution, which might depend on d.
In view of (4) and (5), we therefore define

d_n(λ) = (d_{n,1}, d_{n,2}, d_{n,3}) = ((nh_n)^{1/2}λ_1, λ_2, λ_3). (7)

While λ_4 = F affects the finite-sample distribution of the test statistic, it does not enter its asymptotic distribution, and therefore can be dropped from d_n(λ), as discussed by Andrews, Cheng, and Guggenberger (2011, p. 8).
Next, we describe the asymptotic size of tests for FRD based on the usual t-statistic and standard normal critical value. Let z ν denote the νth quantile of the standard normal distribution.
Theorem 1. Suppose that Assumption 2 holds. Let X, Y be two bivariate normal variables with zero means, unit variances, and correlation coefficient d_2, and let T_{d_1,d_2,d_3} denote the limiting distribution of T_n(β_0) along sequences {λ_n} with d_n(λ_n) → (d_1, d_2, d_3); T_{d_1,d_2,d_3} is a function of X, Y, and the d's.
a. For tests that reject H_0: β = β_0 in favor of H_1: β ≠ β_0 when |T_n(β_0)| > z_{1−α/2}, AsySz = sup_{d_1,d_2,d_3} P(|T_{d_1,d_2,d_3}| > z_{1−α/2}).
b. For tests that reject H_0: β = β_0 in favor of H_1: β > β_0 when T_n(β_0) > z_{1−α}, AsySz = sup_{d_1,d_2,d_3} P(T_{d_1,d_2,d_3} > z_{1−α}).
Remark. A commonly used measure of identification strength is the so-called concentration parameter. On the importance of the concentration parameter in IV estimation, see, for example, Stock and Yogo (2005). In our framework, the concentration parameter is given by d²_{n,1}, where d²_{n,1} → ∞ corresponds to strong (or semistrong) identification, and identification is weak when the limit of d²_{n,1} is finite. As is apparent from the expressions for λ_1 and d_{n,1} in (6) and (7), the concentration parameter and, therefore, the strength of identification depend not only on the size of the discontinuity in treatment assignment, Δx, but also on f_z(z_0), the PDF of the assignment variable at z_0. Hence, smaller values of f_z(z_0) correspond to a more severe weak identification problem.
For any permitted values of d_2 and d_3, when d_1 = ∞, we have T_{∞,d_2,d_3} ∼ N(0, 1). Thus, the asymptotic size of tests based on T_n(β_0) is equal to the nominal size α under strong or semistrong identification. When d_1 < ∞, it is straightforward to compute AsySz numerically. To compute asymptotic rejection probabilities given d_1, d_2, d_3, one first integrates 1{|T_{d_1,d_2,d_3}| > z_{1−α/2}} or 1{T_{d_1,d_2,d_3} > z_{1−α}} numerically against the joint (bivariate normal) PDF of Y, X. Rejection probabilities can then be numerically maximized over the d's. Table 1 reports maximal rejection probabilities of one- and two-sided tests based on the usual t-statistic. The rejection probabilities reported in Table 1 were computed by numerical integration using the quad2d function in Matlab. Integration bounds for normal variables were set to [−7, 7], and the rejection probabilities were maximized over the following grids of values: from −0.99 to 0.99 at 0.01 intervals for d_2, and from −1000 to 1000 at 0.5 intervals for d_3. The table shows that AsySz approaches one as the concentration parameter approaches zero. Size distortions decrease monotonically as the concentration parameter increases. In the case of two-sided testing, nearly zero size distortions (under 0.5%) correspond to a concentration parameter of order d_1² ≥ 64 for asymptotic 5% tests, and d_1² ≥ 50² for asymptotic 1% tests. The table also shows that one-sided tests suffer from more substantial size distortions than two-sided tests, which is due to asymmetries in the distribution of T_{d_1,d_2,d_3}.
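The mechanism behind these size distortions can also be explored by simulation. The sketch below is our own simplified rendering of the weak-identification limit experiment: it normalizes σ_x = σ_y = 1, replaces all variance estimators by their probability limits, and parameterizes the limit by the concentration root d_1, the correlation ρ (the role of d_2), and the hypothesized effect b_0. It illustrates the mechanism rather than reproducing the exact limit in Theorem 1.

```python
import numpy as np

def t_rejection_rate(d1, rho, b0, z=1.959963984540054, nsim=200_000, seed=0):
    """Monte Carlo rejection rate of the usual two-sided 5% t-test in a
    stylized weak-identification limit experiment (sketch)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(nsim)
    Y = rho * X + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(nsim)
    num = Y - b0 * X                 # scaled numerator; mean zero under H0
    den = X + d1                     # scaled estimated discontinuity
    bhat = (Y + b0 * d1) / den       # limit of the FRD estimator
    sig = np.sqrt(1.0 + bhat ** 2 - 2.0 * rho * bhat)   # sigma(bhat)
    t = num * np.sign(den) / sig
    return float(np.mean(np.abs(t) > z))
```

With a large d_1 the rejection rate is close to the nominal 5%, while with a small d_1 and a large hypothesized effect it can be far above the nominal level, consistent with the pattern in Table 1.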

Testing for Potential Size Distortions
Following the approach of Stock and Yogo (2005), Table 1 can be used for testing a null hypothesis about the largest potential size distortion against an alternative hypothesis under which the largest potential size distortion does not exceed a certain prespecified level. Suppose that the econometrician decides that identification is strong enough if, in the case of 1% two-sided testing, the maximal rejection probability does not exceed 5%. Thus, the econometrician effectively adopts tests with a 5% significance level, but uses the 1% standard normal critical value. According to the results in Table 1, the corresponding null hypothesis and its alternative in this case can be stated in terms of the concentration parameter d_1² as H_0^W: d_1² ≤ 9 and H_1^S: d_1² > 9, respectively. A test of H_0^W can be based on the estimator of the discontinuity Δx. Define

F_n = nh_n f̂_{z,n}(z_0) (Δx̂_n)² / (kσ̂²_{x,n}). (8)

As long as the concentration parameter is finite, F_n →_d χ²_1(d_1²), a noncentral χ²_1 distribution with noncentrality parameter d_1². Let χ²_{1,1−τ}(d_1²) denote the (1 − τ)th quantile of the χ²_1(d_1²) distribution. Since size distortions are monotonically decreasing as the concentration parameter increases, an asymptotic size τ test of H_0^W should reject it when F_n > χ²_{1,1−τ}(d_1²), with the noncentrality parameter evaluated at the boundary value of d_1² under H_0^W. Noncentral χ²_1 critical values are reported in the last two columns of Table 1 for selected values of the concentration parameter and τ = 0.05, 0.01. For example, H_0^W: d_1² ≤ 9 should be rejected in favor of H_1^S: d_1² > 9 by a 5% test when F_n > 21.57. In the case of 5% two-sided testing of β, one needs a concentration parameter of at least 64 to ensure nearly zero size distortions. The critical values in Table 1 substantially exceed the rule-of-thumb value of 10, which is often used in the literature as a threshold value for weak IVs. According to our calculations, with an F-statistic of only 10, one cannot reject H_0^W: d_1² ≤ 1.51² at the 5% significance level.
However, a concentration parameter of 1.51² corresponds to maximal rejection probabilities of 16.9% and 13.6% for 5% one-sided and two-sided tests, respectively.
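Because a noncentral χ²_1 variate with noncentrality nc is simply (Z + √nc)² for standard normal Z, the critical values used above can be computed with elementary functions. The helper names below are our own.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ncx2_1_cdf(c, nc):
    """CDF of the noncentral chi-square with 1 df and noncentrality nc:
    P((Z + sqrt(nc))**2 <= c) for standard normal Z."""
    if c <= 0.0:
        return 0.0
    s, m = sqrt(c), sqrt(nc)
    return norm_cdf(s - m) - norm_cdf(-s - m)

def ncx2_1_quantile(p, nc):
    """p-quantile by bisection: chi^2_{1,p}(nc) in the notation of the text."""
    lo, hi = 0.0, 1000.0 + 10.0 * nc
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ncx2_1_cdf(mid, nc) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `ncx2_1_quantile(0.95, 9.0)` reproduces, up to rounding, the critical value 21.57 quoted above for testing H_0^W: d_1² ≤ 9, and `ncx2_1_quantile(0.95, 0.0)` recovers the central χ²_1 value 3.84.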
The results from Table 1 can also be used for designing valid tests (for the FRD effect β) based on the usual t-statistic in combination with somewhat larger than usual critical values. For example, suppose one is interested in a 5% two-sided test about β, and rejects the null hypothesis when F_n > 21.57 and |T_n(β_0)| exceeds the 1% standard normal critical value. According to Table 1, if the concentration parameter d_1² ≥ 9, the asymptotic size does not exceed 5%. On the other hand, if d_1² ≤ 9, lim_{n→∞} P(F_n > 21.57) ≤ 0.05. Hence, overall this test has an asymptotic 5% significance level. Intuitively, such a test is valid because the null hypothesis for the F-pretest assumes size distortions, and one proceeds using the t-statistic only if it is rejected, that is, if the concentration parameter is found to be large enough. Note, however, that the procedure is conservative. Furthermore, passing the F-test does not completely safeguard against size distortions, and the usual t-statistic must still be used with somewhat larger critical values.
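The combined procedure just described reduces to a simple decision rule. The sketch below uses the constants from the text (F_n > 21.57 and the 1% standard normal critical value z_{0.995} ≈ 2.5758); the function name is ours.

```python
def two_step_test_5pct(F_n, t_n, f_crit=21.57, z_995=2.5758293035489004):
    """Asymptotic 5% two-sided test of beta = beta0 via pretesting:
    proceed only if the F-pretest rejects H0^W: d1^2 <= 9, then compare
    |T_n(beta0)| with the 1% standard normal critical value."""
    return (F_n > f_crit) and (abs(t_n) > z_995)
```

The rule rejects only when both the identification pretest and the (conservatively sized) t-test reject.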
Although the F-test provides useful guidance on the potential magnitude of size distortions, practitioners should not solely rely on this test to decide whether it is worth proceeding with the estimation. With this in mind, we present a robust inference approach in the next section that always yields valid confidence intervals regardless of the strength of identification and does not rely on any pretests.

Weak-Identification-Robust Inference for FRD
A common approach adopted in the weak IV literature is to use weak-identification-robust statistics to test hypotheses about structural parameters directly, instead of using their estimates and standard errors. The Anderson-Rubin (AR) statistic (Anderson and Rubin 1949; Staiger and Stock 1997) is often used for that purpose. In the context of IV regression, the AR statistic can be used to test H 0 : β = β 0 against H 1 : β = β 0 by testing whether the null-restricted residuals computed for β = β 0 are uncorrelated with the instruments.
In our case, the structural parameter is defined by (1). Hence, to test H_0: β = β_0 against H_1: β ≠ β_0, following the AR approach, we can instead test H_0: Δy − β_0Δx = 0 against H_1: Δy − β_0Δx ≠ 0. A test, therefore, can be based on

(T_n^R(β_0))² = nh_n (Δŷ_n − β_0Δx̂_n)² / (kσ̂²_n(β_0)/f̂_{z,n}(z_0)), (9)

where T_n^R(β_0) denotes a modified, or null-restricted, version of the usual t-statistic:

T_n^R(β_0) = (nh_n f̂_{z,n}(z_0)/k)^{1/2} (Δŷ_n − β_0Δx̂_n)/σ̂_n(β_0).

Unlike the usual t-statistic, T_n^R(β_0) uses the null-restricted value β_0 instead of β̂_n when computing the standard error. In view of the discussion at the beginning of Section 2.2, and since the asymptotic distribution of |T_n^R(β_0)| does not depend on the concentration parameter, replacing σ̂²_n(β̂_n) by σ̂²_n(β_0) eliminates size distortions.
Theorem 2. Suppose that Assumption 2 holds. Tests that reject H_0: β = β_0 in favor of H_1: β ≠ β_0 when |T_n^R(β_0)| > z_{1−α/2} have AsySz equal to α.
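The null-restricted statistic differs from the usual one only in where the variance is evaluated. A sketch with our own function names, where `scale` stands for (n·h_n·f̂_z(z_0)/k)^{1/2}:

```python
import numpy as np

def t_restricted(dy, dx, b0, sy2, sx2, sxy, scale):
    """Null-restricted t-statistic T^R_n(b0) (sketch).

    dy, dx are the estimated jumps; unlike the usual t-statistic,
    sigma(.) is evaluated at the hypothesized b0, not at bhat."""
    sig0 = np.sqrt(sy2 + b0 ** 2 * sx2 - 2.0 * b0 * sxy)
    return scale * (dy - b0 * dx) / sig0
```

For example, with dy = 2, dx = 1, b_0 = 1, σ_y² = 2, σ_x² = 1, σ_xy = 0, and scale 10, the statistic equals 10/√3.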
Consider now a one-sided testing problem H_0: β ≤ β_0 versus H_1: β > β_0. Again, one can base a test on the null-restricted statistic. In this case, under H_0 when β = β_0, we have T_n^R(β) = ((Y_n − βX_n)/σ(β)) × sign(X_n ± d_{n,1}) + o_p(1). When identification is strong or semistrong, d_{n,1} → ∞, and the sign term is constant with probability one. Since the first term is asymptotically N(0, 1), T_n^R(β) is also asymptotically N(0, 1), and one could use standard normal critical values. On the other hand, when identification is weak and the concentration parameter is small, the sign term is random, and therefore the null asymptotic distribution of the product differs from the standard normal. To obtain an asymptotically uniformly valid test, one can use data-dependent critical values that automatically adjust to the strength of identification. Such critical values can be generated using the approach of Moreira (2001, 2003) by conditioning on a statistic that is (i) asymptotically independent of Y_n − βX_n, and (ii) summarizes the information on the strength of identification (see also Andrews, Moreira, and Stock 2006; Mills, Moreira, and Vilela 2014).
Weak-identification-robust confidence sets for β can be constructed by inverting the robust tests. For example, a confidence set for β with asymptotic coverage probability 1 − α can be constructed by collecting all values β_0 that cannot be rejected by the two-sided robust test:

CS_{1−α,n} = {β_0 : |T_n^R(β_0)| ≤ z_{1−α/2}}.

This confidence set can be easily computed analytically by solving for the values of β_0 that satisfy the inequality

(β̂_n − β_0)² σ̂²_{x,n} F_n − z²_{1−α/2}(σ̂²_{y,n} + β_0²σ̂²_{x,n} − 2σ̂_{xy,n}β_0) ≤ 0, (10)

where F_n is defined in (8).
Depending on the coefficients of the second-order polynomial (in β_0) in Equation (10), CS_{1−α,n} can take one of the following forms: (i) an interval, (ii) a union of two disconnected half-lines (−∞, a_1] ∪ [a_2, ∞), where a_1 < a_2, or (iii) the entire real line. One will see cases (ii) or (iii) if the coefficient on β_0² in (10) is negative, which occurs when F_n < z²_{1−α/2}. Thus, in practice one will see nonstandard confidence sets if the null hypothesis Δx = 0 cannot be rejected using the F-statistic and central χ²_{1,1−α} critical values. Case (iii) arises when the discriminant of the quadratic polynomial in (10) is negative, which occurs if F_n σ̂²_n(β̂_n) − z²_{1−α/2}(σ̂²_{y,n} − σ̂²_{xy,n}/σ̂²_{x,n}) < 0.
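Solving (10) and classifying the resulting set takes only a few lines. The sketch below expands (10) into A·β_0² + B·β_0 + C ≤ 0 and reports which of cases (i)-(iii) obtains; the function name and return convention are our own.

```python
import numpy as np

def robust_confidence_set(bhat, F_n, sy2, sx2, sxy, z=1.959963984540054):
    """Compute the robust confidence set by solving inequality (10) in b0."""
    z2 = z ** 2
    A = sx2 * (F_n - z2)                       # coefficient on b0**2
    B = -2.0 * (bhat * sx2 * F_n - z2 * sxy)   # coefficient on b0
    C = sx2 * F_n * bhat ** 2 - z2 * sy2       # constant term
    disc = B ** 2 - 4.0 * A * C
    if A > 0 and disc >= 0:                    # case (i): an interval
        r = np.sqrt(disc)
        return "interval", ((-B - r) / (2 * A), (-B + r) / (2 * A))
    if A < 0 and disc >= 0:                    # case (ii): two half-lines
        r = np.sqrt(disc)
        a1, a2 = sorted([(-B - r) / (2 * A), (-B + r) / (2 * A)])
        return "half-lines", (a1, a2)          # (-inf, a1] U [a2, inf)
    if A < 0:                                  # case (iii): the real line
        return "real line", None
    return "other", None  # boundary cases; note the set is never empty,
                          # since b0 = bhat always satisfies (10)
```

With a large F_n the set is an interval containing β̂_n; with F_n below z²_{1−α/2} it becomes a union of half-lines or the whole real line, exactly as described above.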
When identification is strong or semistrong, the concentration parameter and, therefore, $F_n$ diverge to infinity. In such cases, both the discriminant and the coefficient on $\beta_0^2$ tend to be positive, and consequently, $CS_{1-\alpha,n}$ will be an interval with probability approaching one.
Furthermore, one can show that when identification is strong and under local alternatives of the form $\beta = \beta_0 + \mu/(nh_n)^{1/2}$, tests based on $T_n(\beta_0)$ and $T_n^R(\beta_0)$ have the same asymptotic power. Thus, in practice there is no loss of asymptotic power from adopting the robust inference approach if identification is strong.

TESTING FOR CONSTANCY OF THE RD EFFECT ACROSS COVARIATES
In this section, we develop a test of constancy of the RD effect across covariates that is robust to weak identification. Such a test is useful in practice when the econometrician wants to argue that the treatment effect differs across population subgroups. For example, in Section 4, we use this test to argue that the effect of class size on educational achievement differs between secular and religious schools, and therefore it might be optimal to implement different rules concerning class sizes in those two categories of schools. The problem is related to the classical analysis of variance (ANOVA) hypothesis of homogeneous populations (see, e.g., Casella and Berger 2002, chap. 11).
Similarly to Otsu, Xu, and Matsushita (2015), we consider the RD effect conditional on some covariate $w_i$. (See also Frölich 2007.) Let $\mathcal{W}$ denote the support of the distribution of $w_i$. Next, for $w \in \mathcal{W}$ we define $y^+(w)$ using the conditional expectation given $z_i$ and $w_i = w$: $y^+(w) = \lim_{z \downarrow z_0} E(y_i \mid z_i = z, w_i = w)$. Let $y^-(w)$, $x^+(w)$, and $x^-(w)$ be defined similarly. The conditional RD effect given $w_i = w$ is defined as $\beta(w) = (y^+(w) - y^-(w))/(x^+(w) - x^-(w))$. Similarly to the case without covariates, under an appropriate set of assumptions, $\beta(w)$ captures the (local) ATE at $z_0$ conditional on $w_i = w$. We are interested in testing the null hypothesis of constancy of the RD effect, $H_0: \beta(w) = \beta$ for some $\beta \in \mathbb{R}$ and all $w \in \mathcal{W}$, against the general alternative $H_1: \beta(w_1) \neq \beta(w_2)$ for some $w_1, w_2 \in \mathcal{W}$. When identification is strong, the econometrician can estimate the conditional RD effect function consistently and then use it for testing $H_0$. (Such a test can be constructed similarly to the ANOVA F-test as in Casella and Berger (2002, chap. 11) and is discussed in the supplement.) However, this approach can be unreliable if identification is weak. We therefore take an alternative approach. Suppose that $\mathcal{W} = \{w_1, \ldots, w_J\}$, that is, the covariate is categorical and divides the population into $J$ groups. The assumption of a categorical covariate is plausible in many practical applications where the econometrician may be interested in the effect of gender, school type, etc. However, even when the covariate is continuous, in a nonparametric framework it might be sensible to categorize it to have sufficient power (as is often done in practice). For $j = 1, \ldots, J$, let $\hat y^+_n(w_j)$, $\hat y^-_n(w_j)$, $\hat x^+_n(w_j)$, and $\hat x^-_n(w_j)$ denote the local linear estimators of the corresponding population terms computed using only the observations with $w_i = w_j$. Let $n_j$ be the number of such observations.
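As a concrete sketch of how the conditional estimators can be computed (our own illustrative code with a uniform kernel, not the authors' implementation; function names are hypothetical), one runs one-sided least squares within the bandwidth, separately on each side of the cutoff and within each covariate category:

```python
import numpy as np

def local_linear_side(y, z, z0, h, side):
    """Local linear estimate of the limit of E[y | z] at z0 from one side of
    the cutoff, using a uniform kernel with bandwidth h (illustrative sketch)."""
    if side == "+":
        mask = (z >= z0) & (z <= z0 + h)
    else:
        mask = (z < z0) & (z >= z0 - h)
    # regress y on an intercept and (z - z0) within the one-sided window
    X = np.column_stack([np.ones(mask.sum()), z[mask] - z0])
    coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return coef[0]  # intercept = estimated boundary limit

def conditional_frd_estimate(y, x, z, w, z0, h, wj):
    """FRD estimate beta_hat(w_j) using only observations with w_i = w_j."""
    m = (w == wj)
    dy = local_linear_side(y[m], z[m], z0, h, "+") - local_linear_side(y[m], z[m], z0, h, "-")
    dx = local_linear_side(x[m], z[m], z0, h, "+") - local_linear_side(x[m], z[m], z0, h, "-")
    return dy / dx

# Illustrative check on simulated data with a true conditional effect of 2
rng = np.random.default_rng(1)
n = 20_000
z = rng.uniform(-1, 1, n)
w = np.repeat("a", n)
x = 0.5 * (z >= 0) + 0.1 * rng.standard_normal(n)   # treatment jump of 0.5
y = 2.0 * x + 0.2 * z + 0.1 * rng.standard_normal(n)  # outcome jump of 1.0
beta_a = conditional_frd_estimate(y, x, z, w, 0.0, 0.2, "a")
```

In the simulated example the discontinuity in the outcome is 1 and in the treatment is 0.5, so the estimate should be close to 2.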
Define $\sigma^2_y(w_j)$, $\sigma^2_x(w_j)$, and $\sigma_{xy}(w_j)$ as the conditional versions of the corresponding population terms, and let $\hat\sigma^2_{y,n}(w_j)$, $\hat\sigma^2_{x,n}(w_j)$, and $\hat\sigma_{xy,n}(w_j)$ denote the corresponding estimators.
Suppose that Assumption 2 holds for each of the $J$ categories, and that none of the categories is redundant asymptotically: $n_j h_{n_j}/(n h_n) \to p_j > 0$ for $j = 1, \ldots, J$, where $n = \sum_{j=1}^J n_j$. If $H_0$ is true and the FRD effect is independent of $w$, one can construct a robust confidence set for the common effect: $CS^J_{1-\alpha,n} = \{\beta_0 \in \mathbb{R} : G_n(\beta_0) \le \chi^2_{J,1-\alpha}\}$, where $G_n(\beta_0)$ is the sum across the $J$ categories of the squared null-restricted statistics; $\hat\beta_n(w_j) = \Delta\hat y_n(w_j)/\Delta\hat x_n(w_j)$, $\Delta\hat x_n(w_j) = \hat x^+_n(w_j) - \hat x^-_n(w_j)$; $\hat\sigma^2_n(\beta_0, w_j)$ is defined similarly to $\hat\sigma^2_n(\beta_0)$ in (3) using the estimators conditional on $w_i = w_j$; and $\hat f_{z,n}(z_0|w_j) = (n_j h_{n_j})^{-1}\sum_{i=1}^n K((z_i - z_0)/h_{n_j})\,1\{w_i = w_j\}$ is the estimator for $f_z(z_0|w_j)$, which denotes the conditional density of $z_i$ at $z_0$ conditional on $w_i = w_j$.
Under $H_0: \beta(w) = \beta$ for some $\beta \in \mathbb{R}$, $CS^J_{1-\alpha,n}$ is an asymptotically valid confidence set since $G_n(\beta) \to_d \chi^2_J$ under weak or strong identification. We consider the following size-$\alpha$ asymptotic test: reject $H_0$ if $CS^J_{1-\alpha,n}$ is empty. The test is asymptotically valid because under $H_0$, $P(CS^J_{1-\alpha,n} = \emptyset) \le P(\beta \notin CS^J_{1-\alpha,n}) = P(G_n(\beta) > \chi^2_{J,1-\alpha}) \to \alpha$, which again holds under weak or strong identification. Under the alternative, there is no common value $\beta$ that will provide a proper recentering for all $J$ categories, and therefore, one can expect deviations from the asymptotic $\chi^2_J$ distribution. We show below that the test is consistent if there is strong (or semistrong) identification for at least two values $w_{j_1}$ and $w_{j_2}$ that satisfy $\beta(w_{j_1}) \neq \beta(w_{j_2})$. Let $d^2_{n,1}(w_j) = n_j h_{n_j} |x^+(w_j) - x^-(w_j)|^2 f_z(z_0|w_j)/(k \sigma^2_x(w_j))$ be the conditional version of the concentration parameter.
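The empty-set test can be sketched as follows (an illustrative implementation under simplifying assumptions: each category supplies its estimated discontinuities and variance components, the per-category statistic takes the same quadratic form in $\beta_0$ as in the no-covariate case, and the common-effect set is scanned over a finite grid rather than solved analytically):

```python
import numpy as np
from scipy.stats import chi2

def constancy_test(groups, alpha=0.05, grid=np.linspace(-10, 10, 2001)):
    """Robust constancy test (sketch). Each group j supplies
    (dy_j, dx_j, s2y_j, s2x_j, sxy_j), scaled so that
      T_j(b) = (dy_j - b*dx_j) / sqrt(s2y_j + b**2*s2x_j - 2*b*sxy_j)
    is asymptotically N(0,1) under the null. H0 (constant effect) is
    rejected when no beta_0 on the grid satisfies
    G_n(beta_0) <= chi2(J) critical value, i.e., the joint set is empty."""
    J = len(groups)
    crit = chi2.ppf(1 - alpha, df=J)

    def G(b):
        # sum of squared null-restricted statistics across the J categories
        return sum((dy - b * dx) ** 2 / (s2y + b**2 * s2x - 2 * b * sxy)
                   for dy, dx, s2y, s2x, sxy in groups)

    g_min = min(G(b) for b in grid)
    return g_min > crit  # True = reject constancy of the RD effect
```

In practice the grid must cover the plausible range of the common effect; two categories with identical effects should not lead to rejection, while two precisely estimated but different effects should.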

EMPIRICAL APPLICATIONS
In this section, we compare the results of standard and weak identification robust inference in two separate, but related, applications. We show that the standard method and our proposed method yield significantly different conclusions when weak identification is a problem, but similar results when it is not. We also show that the robust confidence sets can provide more informative answers than the standard confidence intervals in cases when the usual assumptions are violated. We also apply our weak identification robust constancy test.
We begin with a case where weak identification is not a serious issue. In an influential article, Angrist and Lavy (1999) studied the effect of class size on academic success in Israel, using the fact that class size in Israeli public schools was capped at 40 students during their sample period. As demonstrated in Figure 1, this cap results in discontinuities in the relationship between class size and total school enrollment for a given grade. In practice, school enrollment does not perfectly predict class size, and thus the appropriate design is fuzzy rather than sharp. We use the same sample selection rules as Angrist and Lavy (1999) and focus on language scores among 4th graders. The data can be found at http://econ-www.mit.edu/faculty/angrist/data1/data/anglavy99. There is a total of 2049 classes in 1013 schools with valid test results. Here, we only look at the first discontinuity at the 40-student cutoff. The number of observations used in the estimation depends on the bandwidth. It ranges from 471 classes in 118 schools for the smallest bandwidth (6) to 722 observations in 484 schools for the widest bandwidth (20). We use the uniform kernel in all cases.

[Table 2: Angrist and Lavy (1999): estimated discontinuity in the treatment variable for the first cutoff and their standard errors, estimated effect of class size on class average verbal score, and standard and robust 95% confidence sets (CSs).]

Table 2 shows that the estimated discontinuity in the treatment variable ranges from 8 to 14 students depending on the bandwidth chosen. The table also shows that, as expected, the F-statistic becomes smaller as the bandwidth gets smaller. Silverman's normal rule-of-thumb and the optimal bandwidth procedure of Imbens and Kalyanaraman (2012) both suggest a bandwidth value of approximately 8, which corresponds to a relatively large value of the F-statistic (approximately 62).
Applying the standards of Table 1, we conclude that weak identification is not a serious concern in this application. Using the 5% noncentral $\chi^2$ critical value, we reject the null hypothesis that the concentration parameter is below 36, and therefore, the maximal size distortions of the 5% two-sided tests are expected to be under 1%. Note that even at the smallest bandwidth, the F-statistic is relatively large. This is consistent with Figure 2, which shows that the 95% standard and robust confidence sets for the class size effect are essentially indistinguishable for larger bandwidths and differ only slightly for smaller bandwidths.
In this application, we also compare the results of the standard constancy test of the treatment effect across subgroups to the results of our robust constancy test. The first set of results reported in Section 5 of the online supplement compares the treatment effect for secular and religious schools. The null hypothesis (the treatment effect is the same across subgroups) can never be rejected using a standard test. By contrast, the robust constancy test rejects the null hypothesis for the largest values of the bandwidth (18 and 20). We reach similar conclusions when comparing the treatment effect for schools with above and below median proportions of disadvantaged students. The null hypothesis is rejected by the robust test under the largest bandwidth (20). This suggests that our proposed test may have greater power against alternatives than the standard test in some contexts.
The second application considers a similar policy in Chile, originally studied by Urquiola and Verhoogen (2009). It should be noted that Urquiola and Verhoogen (2009) were not attempting to provide causal estimates of the effect of class size on test scores. They instead showed how the RD design can be invalid when there is manipulation around the cutoff, which results in a violation of Assumption 1(b) (exogeneity of $z_i$). So while this particular application is useful for illustrating some pitfalls linked to weak identification in an FRD design, the results should be interpreted with caution. In this application, class sizes are capped at 45 students. Figure 3 shows the fuzzy discontinuity in the empirical relationship between class size and enrollment at the various multiples of 45. The figure also shows that the discontinuity becomes smaller as enrollment increases. In this example, the outcome variable is the average class score on state standardized math exams, and we restrict attention to 4th graders. We also strictly adhere to the sample selection rules used by Urquiola and Verhoogen (2009). The total number of observations is 1636. The effective number of observations varies with the bandwidth and the enrollment cutoff: it ranges from 201 to 402 at the 90-student enrollment cutoff, from 45 to 95 at the 135-student enrollment cutoff, and from 17 to 34 at the 180-student enrollment cutoff. The uniform kernel is used to compute all the results below.

[NOTES: Silverman's rule-of-thumb bandwidth is 8.59. The optimal bandwidth suggested by Imbens and Kalyanaraman (2012) for the cutoff of 45 is 9.67, and for the cutoff of 90 it is 11.60. For the cutoff of 135 it is 14.12, and for the cutoff of 180 it is 17.81. The scores are given in terms of standard deviations from the mean.]
Table 3 reports the FRD estimates and the confidence sets for the different values of the bandwidth and cutoff points. As before, we set the size of the test at 5%. Starting with the first cutoff point, Table 3 shows that the robust and conventional confidence sets diverge dramatically as the bandwidth gets smaller. Interestingly, while the robust confidence set is much wider than the conventional one, it nevertheless rejects the null hypothesis that the effect of class size is equal to zero, while the conventional set fails to reject the null. To help interpret the results, we also graphically illustrate the difference between standard and robust confidence sets in Figure 4. The first panel plots the standard confidence sets as a function of the bandwidth. The second panel does the same for the weak identification robust method. The shaded area is the region covered by the confidence sets. As the bandwidth increases, the robust confidence sets evolve from two disjoint sections of the real line to a well-defined interval. Note that class size is a discrete rather than a strictly continuous variable, hence the break between bandwidths 11 and 12, when the robust confidence set switches from two disjoint half-lines to a single interval. This is consistent with the size of the discontinuity in class size as a function of enrollment estimated at different bandwidths and the corresponding F-statistic. At bandwidths below 10, the estimated discontinuity is small and the F-statistic is below 7. At bandwidths higher than 12, however, the estimated discontinuity is progressively closer to 10 students and the F-statistic ranges from just over 40 to just over 188. This is important since the bandwidth suggested by Silverman's normal rule-of-thumb is only 8.59 and the optimal bandwidth suggested by Imbens and Kalyanaraman (2012) is 9.67. See Section 5 in the online supplement for a complete listing of the F-statistic and discontinuity estimates at different bandwidths.
Identification is considerably weaker for the second cutoff point. At all bandwidths, the standard confidence intervals fail to reject the null that the effect of class size is zero. However, for most bandwidths, the robust confidence sets do not include a zero effect. For example, for a bandwidth of 8, we cannot reject the null that class size is not related to grades when using the standard method, while the robust method suggests rejecting the null.
Identification is even weaker at the third cutoff and, for most bandwidths, the robust confidence sets consist of two disjoint intervals. Finally, results become very imprecise at the fourth cutoff, and the robust confidence sets now cover the entire real line. This suggests that identification is very weak at these levels and that the standard confidence intervals are overly narrow.
In summary, our results suggest that when weak identification is not a problem, the robust and standard confidence sets are similar. But when the discontinuity in the treatment variable is not large enough, the robust confidence sets are very different from those obtained using the standard method. We also demonstrate that our robust inference method provides more informative results than the standard method.

SUPPLEMENTARY MATERIALS
The supplementary materials contain: (i) a description of the procedure for selection and evaluation of the influential empirical RD papers; (ii) the proofs of Theorems 1, 3, and 4; (iii) the Monte Carlo results for standard and weak-identification-robust confidence sets; and (iv) the additional tables from the empirical application.