Optimal Data-Driven Regression Discontinuity Plots

Exploratory data analysis plays a central role in applied statistics and econometrics. In the popular regression-discontinuity (RD) design, the use of graphical analysis has been strongly advocated because it provides both easy presentation and transparent validation of the design. RD plots are nowadays widely used in applications, despite its formal properties being unknown: these plots are typically presented employing ad hoc choices of tuning parameters, which makes these procedures less automatic and more subjective. In this article, we formally study the most common RD plot based on an evenly spaced binning of the data, and propose several (optimal) data-driven choices for the number of bins depending on the goal of the researcher. These RD plots are constructed either to approximate the underlying unknown regression functions without imposing smoothness in the estimator, or to approximate the underlying variability of the raw data while smoothing out the otherwise uninformative scatterplot of the data. In addition, we introduce an alternative RD plot based on quantile spaced binning, study its formal properties, and propose similar (optimal) data-driven choices for the number of bins. The main proposed data-driven selectors employ spacings estimators, which are simple and easy to implement in applications because they do not require additional choices of tuning parameters. Altogether, our results offer an array of alternative RD plots that are objective and automatic when implemented, providing a reliable benchmark for graphical analysis in RD designs. We illustrate the performance of our automatic RD plots using several empirical examples and a Monte Carlo study. All results are readily available in R and STATA using the software packages described in Calonico, Cattaneo, and Titiunik. Supplementary materials for this article are available online.


INTRODUCTION
The regression discontinuity (RD) design, originally introduced by Thistlethwaite and Campbell (1960), is among the most popular quasi-experimental empirical strategies to estimate (local) causal treatment effects in economics, political science, and many other social, behavioral, and natural sciences. In this research design, for each unit i = 1, 2, . . . , n, researchers observe an outcome variable Y i and a continuous covariate X i , and units are assigned to treatment or control depending on whether their observed covariate exceeds a known cutoff. Provided the units of analysis cannot systematically sort around the cutoff, the RD design employs observations just below and just above the cutoff as control and treatment groups to conduct inference on the (local) causal effect of the treatment. The underlying idea, and crucial assumption, is that units around the cutoff do not systematically differ in their unobservable characteristics, thereby offering valid counterfactual comparisons between control and treatment groups. For recent reviews on the RD design, including references to a large number of empirical applications employing RD designs, see, for example, Cook (2008), Imbens and Lemieux (2008), and Lee and Lemieux (2010).
A key feature of the RD design is its simplicity and transparency. The empirical analysis relies on simple and easy-tointerpret identifying assumptions to study the effect of a policy or intervention for units near the threshold, involving only a univariate outcome Y i and a univariate continuous covariate X i (which determines treatment assignment). Estimation and inference of RD treatment effects is usually conducted using local polynomial estimators, and great attention has been devoted to these estimators in the recent methodological RD literature (see Hahn, Todd, and van der Klaauw 2001;Porter 2003;Imbens and Kalyanaraman 2012;Calonico, Cattaneo, and Titiunik 2014b, and references therein). Other approaches are also possible, such as those employing randomization inference methods (Cattaneo, Frandsen, and Titiunik 2015).
No matter the inference approach employed, graphical exploratory analysis and graphical falsification tests are essential when employing RD designs. These methods have been strongly advocated in the literature because they play an important role in both the presentation and validation of RD research designs-see, for example, Imbens and Lemieux (2008, sec. 3) and Lee and Lemieux (2010, sec. 4.1). The most common graphical representation of RD designs is a plot that contains two main ingredients. The first shows two smooth polynomial approximations of the underlying conditional expectations of the outcome variable Y i given the observed covariate X i , for control and treatment units separately. The second ingredient is a collection of local sample means of the outcome variable constructed by partitioning the support of the covariate X i into disjoint bins for control and treatment units separately, and computing sample averages of the outcome variable Y i for each bin using only observations whose value of the covariate X i falls within that bin. Figure 1 gives three examples of these RD plots using the data of Lee (2008), who studied the vote share advantage enjoyed by the incumbent party in U.S. House of Representatives electoral races. This figure also includes the scatterplot of the raw data for comparison. In this empirical example, the identification strategy is based on the discontinuity generated by the rule that assigns electoral victory to the party that obtains the most votes. The forcing variable (X i ) is the Democratic margin of victory in a given election-the difference in vote share between the Democratic candidate and her strongest opponent-and the normalized threshold isx = 0, since the party wins the election when its margin of victory is positive and loses otherwise. The outcome variable (Y i ) is the Democratic vote share in the following U.S. House election. (We further discuss this empirical application in Section 6.) Each plot in Figure 1 includes fourthorder polynomial fits for control and treatment units separately, and the gray dots in Figure 1(a) represent a raw observation while the black dots in Figure 1(b)-1(d) represent the sample average for each disjoint bin.
The two ingredients of the RD plots serve different goals. The polynomial fits seek to represent the behavior of the underlying conditional expectations in a smooth fashion and from a global perspective. On the other hand, the local sample means have the general goal of providing a visual representation of the design without relying on parametric assumptions regarding the underlying regression functions, while also capturing the local behavior of the data. In particular, local means may serve two purposes: 1. Detection of discontinuities. Local means can provide important information regarding the validity of the key identifying assumption of the design-the continuity of the conditional expectations at the cutoffx. By providing a plot of the underlying regression function that is by construction discontinuous, the local means can highlight the presence of potential discontinuities in the conditional expectations away from the cutoff, which would cast doubt on the key identifying assumption of the design. From this perspective, the binning structure of the RD plot is fundamental: while the global polynomial fits will typically hide the presence of such discontinuities, the local sample means will not. In other words, constructing two distinct estimators of the underlying regression functions, one smooth (the global polynomial fit) and the other discontinuous (the local means), is particularly useful when the goal is to identify the presence of potential discontinuities. 2. Representation of variability. A second, equally important, goal of the local sample means is to provide a disciplined representation of the overall variability of the data. As shown in Figure 1, a scatterplot of the raw data is uninformative and not particularly revealing of the features of the RD design. In this case, the local sample means play a different role: they are used to construct a somewhat smoothed, yet variable scatterplot of the data by averaging the observations within each of the disjoint bins. From this perspective, the goal is not to "trace out" the underlying regression functions but rather to construct an undersmoothed estimate that highlights visually the overall variability of the data.
Despite general agreement around the purpose of RD plots and their widespread use, their formal properties remain unknown. In particular, these plots are constructed using an ad hoc choice of the partitions' size (i.e., the number of bins used to construct the local sample means), making the procedure less automatic and more subjective than is ideal for a tool whose main role is to provide objective evidence about the plausibility of the research design's main assumptions. Given the absence of concrete guidance on these choices, practitioners typically experiment and select an arbitrary number of bins, which may misrepresent the actual behavior of the data. Figure 1 illustrates some of the potential problems underlying ad hoc RD plots. First, Figure 1(a) shows that the scatterplot of the raw data is highly uninformative: despite the highly significant nonzero RD treatment effect at the cutoff, no discontinuities are noticeable when looking only at the cloud of points-a problem that is exacerbated for binary outcomes. Second, Figures 1(b) through 1(d) show that by choosing the number of bins in an ad hoc way, the type of information conveyed by RD plots can vary widely, which implies that different ad hoc RD plots may give very different representations of the underlying data and, by implication, the validity of the design.
We address these concerns in the construction of RD plots by proposing automatic, data-driven procedures to select the number of bins in RD plots specifically tailored to each of the two goals mentioned above: detection of discontinuities and representation of variability. To provide a plot that is well suited to detecting potential discontinuities in the underlying regression functions, we optimize the number of bins used to compute the local sample means so that the integrated mean square error (IMSE) of the resulting (discontinuous, binned) estimator of the underlying regression functions is minimized. To provide a plot that is appropriate to represent the underlying variability in the data, we develop a bin selector that employs more bins than the optimal number selected by the IMSE-minimization strategy. We formalize this second approach in two distinct ways. First, we propose a different optimal choice of the number of bins based on a weighted integrated mean square error (WIMSE), which gives a formal justification for undersmoothing (i.e., selecting a larger number of bins than the IMSE-optimal choice). Second, we propose another choice of the number of bins, which generates local sample means with an asymptotic variability mimicking the overall variability of the data. Both of these choices lead to undersmoothed tuning parameter selectors relative to the IMSE-optimal choices, thereby generating more variability in the local sample means depicted in the RD plots.
We derive all these optimal choices of the number of bins for two distinct types of RD plots. First, we study the properties of the most common RD plot used in the literature, one that employs an evenly spaced (ES) binning of the data. Second, we introduce an alternative RD plot based on quantile spaced (QS) binning. The latter approach forces each bin to have approximately the same number of observations, a feature that may be appealing when the data are sparse: this partitioning scheme may be interpreted as covariate design adaptive. For each type of RD plot, we derive formally the optimal number of bins selectors mentioned above, and develop data-driven nonparametric consistent implementations thereof. Our main implementations employ spacings estimation techniques to construct the datadriven optimal partition size choices because these estimators do not require additional tuning parameter choices, and thus may be seen as more robust in applications. However, this technique requires continuity of the outcome variable, and hence is not applicable in all possible empirical settings (e.g., binary outcomes). To handle noncontinuous outcomes, we also propose and formally analyze partition size data-driven selectors employing nonparametric polynomial estimators. For this case, the underlying tuning parameter for implementation (i.e., the polynomial power) may be chosen using cross-validation or related methods; see, for example, Ruppert, Wand, and Carroll (2009) for further discussion.
Finally, we also analyze the performance of our automatic RD plots visually and numerically. First, we apply our results to the incumbency advantage example already introduced, and find that our optimal data-driven RD plots perform well when using real data. We also offer a similar analysis of three other empirical applications in the supplemental appendix. Second, we study the finite-sample properties of our results in a Monte Carlo experiment employing several data-generating processes, and find that our RD plots tuning parameter selectors perform extremely well. Third, we compare numerically the two RD plotting alternatives analyzed in this article: ES versus QS. Our results highlight the fact that neither approach dominates the other in general, because features of the underlying (unknown) data-generating process (i.e., distribution of X i and shapes of the conditional expectation and conditional heteroscedasticity) ultimately determine which RD plot may be preferred.
The rest of the article is organized as follows. Section 2 introduces the RD design, reviews basic results and concepts, and presents a formal description of RD plots. Section 3 introduces the popular ES RD plot, derives formal asymptotic expansions for the variance and bias of the underlying estimator, and employs these results to develop several number of bins selectors depending on the researcher's goal. Section 4 proceeds analogously but for the alternative RD plot based on QS bins. Section 5 presents data-driven, fully automatic implementations of our tuning parameter selectors for RD plots and establishes their consistency properties. Section 6 showcases how our datadriven RD plots perform visually and numerically using both real and simulated data, and briefly compares the quantile and ES approaches. Section 7 discusses two simple extensions, and Section 8 concludes. The supplemental appendix contains the proofs of our main theorems, additional methodological and technical results, detailed simulation evidence, and further empirical illustrations not included here to conserve space. Companion R and STATA software packages are described in Calonico, Cattaneo, andTitiunik (2014a, 2015).

SETUP
In the regression discontinuity design, the observed data are a random sample (Y i , X i ) , i = 1, 2, . . . , n, from a large population, with X i a continuous random variable with (possibly restricted) support [x l , x u ] and continuous density f (x). All units with a value of the observed "score" or "forcing" variable X i greater than a known thresholdx are assigned to the treatment group (T i = 1), while all units with X i <x are assigned to the control group (T i = 0). Thus, under perfect compliance, treatment received is defined as T i = 1(X i ≥x) with 1(·) denoting the indicator function. As is common in the program evaluation literature (e.g., Imbens and Wooldridge 2009), we employ potential outcomes notation to characterize the two underlying counterfactual states (control or treatment). Letting Y i (1) and Y i (0) denote the potential outcome with and without treatment, respectively, the observed outcome is The most popular parameter of interest is the average treatment effect at the threshold, given by τ SRD This parameter is nonparametrically identifiable under a mild continuity condition (Hahn, Todd, and van der Klaauw 2001), and RD estimators employing local polynomial techniques have become the default choice in the literature (see Porter 2003; Imbens and Kalyanaraman 2012; Calonico, Cattaneo, and Titiunik 2014b, and references therein). In the so-called sharp RD design, T i is a deterministic function of treatment assignment (perfect compliance), while in the so-called fuzzy RD design treatment take-up and treatment assignment may differ. This distinction, however, is mostly irrelevant for our purposes because we do not focus on estimation and inference for RD treatment effects, but rather on the RD plots commonly encountered in empirical work. These plots may be used for presentation and falsification of both sharp and fuzzy RD research designs. See Section 7 for a brief discussion of how our results may be applied to fuzzy RD designs or extended to allow for other covariates entering the analysis.
We set and impose the following assumption through the article.
Assumption 1. For x l , x u ∈ R with x l <x < x u , and all x ∈ [x l , x u ]: is bounded, and f (x) is continuous and bounded away from zero. b. μ − (x) and μ + (x) are S times continuously differentiable (S ≥ 1).
Part (a) in Assumption 1 imposes existence of moments and requires that the running variable X i be continuously distributed. Part (b) imposes smoothness on the underlying regression functions, while part (c) requires that the conditional variance be continuous; all these functions may be different at either side of the threshold. Notice that for all x ≥x, enabling (consistent) estimation of these conditional expectations for control and treatment units, respectively.

RD Plots
The main features of an RD design are easily summarized employing RD plots. As mentioned previously, these plots include two main ingredients: (i) smooth polynomial estimation, and (ii) disjoint local sample means estimation. We now formalize the underlying estimation approaches used to construct the RD plots, which provides the basis for our analysis. Our main focus is on tuning parameter selection for the construction of the collection of local sample means under two distinct partitioning schemes: ES and QS partitions of [x l ,x) and [x, x u ], that is, of the observations to the left and right of the cutoff.

Global Polynomial Estimation. In the RD plots, the unknown functions
are estimated using global polynomials for control and treatment observations separately. To formalize this approach, let k ∈ Z + and r k (x) = (1, x, x 2 , . . . , x k ) , and definê In words,μ −,k (x) andμ +,k (x) are kth-order polynomial fits of Y i on X i employing only control and treatment units, respectively.
These polynomial regressions may be viewed as a nonparametric approach, usually called series or (linear) sieve estimation, for the approximation of the underlying population conditional expectations when k = k n → ∞ as n → ∞ (see, e.g., Newey 1997;Chen 2007;Ruppert, Wand, and Carroll 2009;Belloni et al. 2015 for reviews). Below we will exploit this interpretation explicitly to construct consistent plug-in rules for the optimal tuning parameter choices. Employing results from the nonparametrics literature, it is possible to select k n using some data-driven approach such as (plug-in) IMSE minimization or cross-validation. In practice, however, k = 4 or k = 5 are almost always the preferred choices. Either way, we do not discuss further the choice of k for RD plots because this is a well-understood problem. Instead, our main focus is on choosing the partition size for the local means, a result that is not currently available in the literature.
Global polynomial approximations may not perform well in RD applications and, more generally, in approximating regression functions locally. These polynomial approximations for regression functions tend to (i) generate counterintuitive weighting schemes (Gelman and Imbens 2014), (ii) have erratic behavior near the boundaries of the support (usually known as the Runge's phenomenon in approximation theory), and (iii) oversmooth (by construction) potential discontinuities in the interior of the support. Thus, the other crucial ingredient of RD plots is a collection of disjoint local sample means, which we describe formally next.

Local Sample Mean Estimation.
The second ingredient in the RD plot is a collection of local sample means of the outcome variable computed over a disjoint partition of the support of the running variable, for control and treatment units separately. To describe this construction formally, we employ ideas from the nonparametric literature on partitioning estimators (for further details, see Cattaneo and Farrell 2013, and references therein).
We define P −,n = {P −,j : j = 1, 2, . . . , J −,n } and P +,n = {P +,j : j = 1, 2, . . . , J +,n }, two generic disjoint partitions of the support of the running variable X i to the left and right of the cutoff, which vary with the sample size n. More precisely, with J −,n , J +,n ∈ Z ++ denoting the partition sizes for control and treatment groups, respectively. As an example, in the incumbency advantage illustration we introduced above x u = 100,x = 0, and x l = −100, so a partition to the right of the cutoff in 20-percentage-point increments would be [x, We set 1 A (x) = 1(x ∈ A) to save notation. The partitioning estimators (of order 1), sometimes called binning estimators or local constant regression estimators, are formally described as follows:μ The estimatorsμ − (x; J −,n ) andμ + (x; J +,n ) collect the sample means of the outcomes Y i for observations with covariate X i taking values within each bin in the partitions P −,n and P +,n , and may be interpreted as nonparametric estimators of μ − (x) and μ + (x), respectively. Like other nonparametric procedures, these binning-type estimators involve a choice of tuning and smoothing parameters. In this case, (J −,n , J +,n ) may be regarded as the tuning parameters (e.g., similar to a bandwidth for conventional kernel estimators) and (P −,n , P +,n ) may be viewed as the smoothing parameters (e.g., similar to the shape of kernel function for conventional kernel estimators). Under Assumption 1, and provided a well-behaved partitioning scheme is used, it is not difficult to show that , provided that J −,n → ∞ and J +,n → ∞ as n → ∞ and some regularity conditions hold. Throughout the article all limits are taken as n → ∞ unless otherwise stated.
The behavior of these estimators is dependent on how the partitions are constructed and, as mentioned above, this article considers two approaches for choosing the partitions: ES partitions and QS partitions. Given a chosen partitioning scheme, the parameters J −,n and J +,n control the rate of approximation of the partitioning estimators, capturing the usual variance and bias trade-off: larger (J −,n , J +,n ) imply more variance but less bias (more, smaller bins), while smaller (J −,n , J +,n ) imply less variance but more bias (fewer, larger bins). The main contribution of this article is to formalize these ideas for each of the two partitioning schemes, to derive several (optimal) choices of (J −,n , J +,n ) explicitly capturing the specific objective of the RD plot (i.e., tracing out the regression function or capturing the underlying variability of the data), and to develop consistent data-driven implementations thereof.

EVENLY SPACED RD PLOTS
In this section, we consider ES bins for the construction of the partitioning scheme underlying the RD plots. Thus, we set leading to the ES partitioning estimators denoted bŷ μ ES,− (x; J −,n ) andμ ES,+ (x; J +,n ), with nonrandom partitioning schemes denoted by P ES,−,n and P ES,+,n , respectively. The use of ES bins is the most common strategy in the ad hoc construction of RD plots. For example, the original incumbency advantage plots in Lee (2008) present local means in fixed-length bins that are 0.5 percentage points wide. Using the notation just introduced, this translates into J −,n = J +,n = 200, since there are 200 bins of length 0.5 on either side of the cutoff between x l = −100 and x u = 100, and the two bins closest to the cutoff on either side of it are p −,199 = −100 + 199 · 100 200 = −0.5 and p +,1 = 0 + 1 · 100 200 = 0.5.

Variance and Bias Properties
To study formally the properties of the ES RD plots, we begin by developing formal asymptotic expansions for the integrated variance and bias of the underlying partitioning estimators. Let w(x) denote a weighting function, formally introduced in Theorem 1, and set X n = (X 1 , X 2 , . . . , X n ) to save notation. The integrated variance of the estimators, for control and treatment groups,μ ES,− (x; J −,n ) andμ ES,+ (x; J +,n ), are Similarly, the integrated squared bias for these estimators is Since variability plays a crucial role in the construction of the RD plots, all of our selectors will use the variance quantities var ES,− (J −,n ) and var ES,+ (J +,n ). In some cases, we will also employ the bias quantities Bias ES,− (J −,n ) and Bias ES,+ (J +,n ) to construct choices of number of bins J −,n and J +,n . The next result gives a formal first-order nonparametric approximation to the integrated variance and squared bias of the estimators.
Theorem 1. Suppose Assumption 1 holds with S ≥ 2, and w : [x l , (+) If J +,n log(J +,n )/n → 0 and J +,n → ∞, then All the results presented in this article remain valid if w(x) = w + (x)1(x ≥x) + w − (x)1(x <x), thus allowing for w(x) to be discontinuous atx. Theorem 1 captures formally the natural trade-off between variability and bias in approximating the underlying regression function using local sample means computed using disjoint ES partitions of size J ∈ {J −,n , J +,n }: the larger the J, the smaller the bias because each bin is smaller and hence the sample mean approximates the underlying function better, while the larger the J, the larger the variance because each bin is small and hence has only a few observations. In what follows, we use this intuition explicitly to develop different tuning parameter selectors depending on the explicit goal in mind.

Approximating the Underlying Regression Functions
As a first goal, we consider optimal choices of the number of bins J −,n and J +,n with the explicit goal of approximating the underlying regression function in an IMSE sense. As discussed previously, the resulting selectors are important and useful for empirical work because they validate the otherwise possibly oversmoothed polynomial approximations to the underlying regression functions. Thus, we recommend these selectors to construct RD plots to visually check for potential discontinuities in the regression functions. Once potential discontinuities have been identified, formal hypothesis tests ("placebo" tests) may be conducted using robust inference procedures from the RD literature (e.g., Calonico, Cattaneo, and Titiunik 2014b).
Under the conditions of Theorem 1, the IMSE loss function of the estimators underlying the ES RD plots satisfies These results give an approximation to a family of IMSE loss functions, depending on the choice of weight function w(x). In general, assuming that B ES,− = 0 and B ES,+ = 0, the expansions of IMSE ES,− (J n,− ) and IMSE ES,+ (J n,+ ) give the optimal choices of number of bins: with y = x ∈ N, x ∈ R ++ , denoting the smallest integer y such that x ≤ y.

Approximating the Underlying Variability of the Data
In addition to developing optimal choices of tuning parameters for approximating the underlying regression functions using local sample means, we propose two distinct approaches specifically developed to represent the overall variability of the data. As discussed in Section 5, the resulting implementations are objective, fully automatic, and easy-to-implement, yet disciplined, thus providing useful benchmarks for smoothing out the scatterplot of the raw data in empirical applications.
The first approach employs a natural loss function and leads to a formal justification for "manual" increases in the variability of the ES RD plots (i.e., undersmoothing by increasing the number of bins employed), which is commonly done in practice. The second approach is fully automatic but is not loss function based-it is rather specifically tailored to mimic the underlying variability of the data while employing binned sample means. Naturally, as we discuss in more detail below, given a fixed sample size, the resulting choices from the second approach (or any other approach) can always be rationalized as emerging from the first approach under a particular weighting scheme, and hence could be regarded as "optimal." In this sense, the first approach is useful at the very least insofar as it gives a formal, intuitive justification for any specific choice of number of bins in terms of trading off variance and bias of the underlying local means estimators entering the construction of the ES RD plot.

Weighted IMSE.
This approach is based on a family of loss functions constructed by trading off variance and bias of the partitioning estimators. Specifically, to capture the variability of the underlying raw data, a natural approach is to undersmooth the binned sample means estimators (i.e., select a larger number of bins J −,n and J +,n ). This can be accomplished by trading off variance and bias differently: let ω V ,− , ω B,− , ω V ,+ , ω B,+ > 0 be fixed weights satisfying ω V ,− + ω B,− = 1 and ω V ,+ + ω B,+ = 1, then consider the family of weighted IMSE loss functions given by with ω − = (ω B,− /ω V ,− ) 1/3 and ω + = (ω B,+ /ω V ,+ ) 1/3 . The WIMSE objective functions are meant to offer more flexibility on the relative importance of variance and bias when searching for an optimal number bins, and hence the associated weights could be interpreted as capturing researchers' prior beliefs on the relative importance of variance and bias. The result in (2) is a generalization of the choices given in (1) because J ES-ω,−,n = ω − J ES-μ,−,n and J ES-ω,+,n = ω + J ES-μ,+,n , and when variance and bias are weighted equally (i.e., ω V ,− = ω B,− and ω V ,+ = ω B,+ ), then J ES-μ,−,n = J ES-ω,−,n and J ES-μ,+,n = J ES-ω,+,n . More generally, the larger the ω B,− , ω B,+ ∈ (0, 1), the larger the choice of number of bins J ES-ω,−,n and J ES-ω,+,n because the loss function puts more weight on bias and less on variance, allowing for more variability in the underlying local sample mean estimates. While it is not obvious how to choose a particular weighting scheme in empirical practice, this approach is very useful in justifying "manual" undersmoothing after selecting the number of bins using the IMSE-optimal choices. Specifically, for each choice of rescaling constants ω − , ω + > 0, there exists a unique compatible weighting scheme: which rationalizes the resulting choices of number of bins as optimal in the sense of minimizing the WIMSE loss function. This result may be of interest for practitioners because it helps explain how variability and bias are traded off when choosing a scaling factor to modify the IMSE-optimal choices for the number of bins, which can always be used as a starting point in the empirical investigation. In the supplemental appendix, we provide a table with the weights implied by different scaling factors ω − and ω + . Furthermore, for any initial ad hoc choice of number of bins used to construct the ES RD plot, the above logic can be used to find an IMSE-optimal choice and a scaling factor that is consistent with the ad hoc choice, thereby offering an objective interpretation of the ad hoc choice in terms of variance and bias trade-off.

Mimicking
Variability. The weighted IMSE approach is useful to give a natural interpretation to "manual" undersmoothing and, more generally, to other ad hoc choices of number of bins used to construct ES RD plots approximating the underlying variability of the data. However, this approach is not fully automatic in general: while clearly objective and interpretable, its main drawback is that it requires the choice of a weighting scheme, and it is difficult to justify a scheme that works generally for all applications. For this reason, we also propose a second approach specifically targeted to capture the variability of the data while employing local sample means, which is fully automatic and can be easily implemented.
In this second approach, we choose the number of bins so that the binned sample means have an asymptotic (integrated) variability approximately equal to the amount of variability of the raw data. To describe the approach formally, let V − and V + denote, respectively, the sample variance of the outcome variables for control and treatment units, that is, the sample variance of the subsamples {Y i : X i <x} and {Y i : X i ≥x}. Then, we select J −,n and J +,n so that var ES,− (J −,n ) = V − and var ES,+ (J +,n ) = V + leading, respectively, to the "optimal" choices V − V ES,− n and V + V ES,+ n. The main intuition behind these choices is that we set the number of bins used so that the overall variability of the sample means, as measured by the asymptotic approximation obtained in Theorem 1, mimics the overall variability of the unrestricted scatterplot of the data.
This idea, while very intuitive, has a minor technical drawback: it leads to tuning parameter choices that do not satisfy the rate conditions of the results in Theorem 1. Thus, to make the end result theoretically coherent, we modify it slightly as follows: To summarize, the choice emerging for the number of bins in (3) mimics the overall variability of the data, up to a log(n) factor, and is fully consistent with the theoretical results given in Theorem 1. Importantly, the resulting number of bins will be in general larger than the one obtained in (1), which is consistent with the underlying distinct goals justifying these rules: (J ES-μ,−,n , J ES-μ,+,n ) are developed explicitly to approximate the underlying regression function and hence they optimally tradeoff variance and bias, while (J ES-ϑ,−,n , J ES-ϑ,+,n ) are developed explicitly to approximate the variability of the data and hence the resulting underlying estimators lead to undersmoothing relative to the IMSE-optimal choices.

QUANTILE SPACED RD PLOTS
In addition to the popular ES RD plot, we also introduce and study an alternative plotting approach based on QS bins. This approach takes into account the sparsity of the data, forcing each bin to have approximately the same number of observations. This feature may be appealing because with QS bins the variability of the local sample means will change across bins only due to nonconstant conditional variances (i.e., due to the presence of heteroscedasticity), but not due to different sample sizes in each bin (as it occurs with an ES partition).
This section parallels the previous discussion for ES RD plots in Section 3, but now focusing on QS RD plots. In this case, we construct the partitioning scheme as follows: In words, the QS RD plot sets p −,j and p +,j to be the approximately 100(j/J −,n )th quantiles of the subsample {X i : X i < x} and the approximately 100(j/J +,n )th quantile of the subsample {X i : X i ≥x}, respectively. This construction leads to the QS partitioning estimators denoted byμ QS,− (x; J −,n ) and μ QS,+ (x; J +,n ), which are estimators now employing the random partitioning schemes denoted by P QS,−,n and P QS,+,n , respectively.

Variance and Bias Properties
We study first the integrated variance and squared bias of the estimatorsμ QS,− (x; J −,n ) andμ QS,+ (x; J +,n ), which are given by As in the case of ES RD plots, we propose several (optimal, datadriven) choices of the number of bins J −,n and J +,n by either trading off variance and bias of the underlying estimators, or by mimicking the overall variability of the raw data. The following result gives the formal expansions for the variance and bias of the underlying partitioning estimators.
(+) If J +,n log(J +,n )/n → 0 and J +,n / log(n) → ∞, then var QS,+ (J +,n ) = J +,n n V QS,+ {1 + o P (1)}, The conclusion in this theorem is similar to that of Theorem 1, but its proof is different because the estimators are constructed using a random partitioning scheme. The partitioning scheme used in the ES RD plots (P ES,−,n and P ES,+,n ) requires J −,n → ∞ and J +,n → ∞ but could lead to empty bins in finite samples (this possibility disappears asymptotically; see Lemma SA1 in the supplemental appendix). In contrast, the partitioning scheme underlying the QS RD plots (P QS,−,n and P QS,+,n ) guarantees roughly the same number of observations (≈ N − /J −,n and ≈ N + /J +,n ) in each bin. The slightly stronger rate conditions J −,n / log(n) → ∞ and J +,n / log(n) → ∞ are imposed to ensure consistency of the sample quantiles functions at the appropriate rate; see Lemma SA2 in the supplemental appendix.
The main difference between the conclusions in Theorems 1 and 2 is that the fixed, leading constants in the variance and bias approximations are different. Importantly, the rates derived are the same in both theorems. The fixed constants are different because the partitioning schemes used are different in each case, but nonetheless all the ideas previously discussed for ES RD plots also apply directly to QS RD plots. Thus, in the remainder of this section, we only briefly summarize the main results for completeness.

Approximating the Underlying Regression Functions
Using the results above, and under the assumptions of Theorem 2, we obtain an asymptotic expansion of the IMSE for QS RD plots given by with ω − = (ω B,− /ω V ,− ) 1/3 and ω + = (ω B,+ /ω V ,+ ) 1/3 . The discussion given above for ES RD plots applies here as well, replacing the subindex ES with QS as appropriate in the different expressions given previously.

Mimicking Variability.
We employ the same logic outlined for the case of ES RD plots, but now using QS binning. That is, letting V − and V + denote a population measure of variability of the outcome variables for control and treatment units, respectively, we select the number of bins for each group so that the asymptotic variability of the QS-based local sample means is approximately equal to the overall variability of the data. Thus, we propose the following "optimal" choice of number of bins: which has the same structure as given in (3) but with the subindex ES replaced by QS.

DATA-DRIVEN IMPLEMENTATIONS
Employing some reference model, we could easily construct rule-of-thumb estimates of the unknown constants (V ES,− , B ES,− , V ES,+ , B ES,+ ) and (V QS,− , B QS,− , V QS,+ , B QS,+ ) featuring in the different optimal choices of number of bins for ES and QS RD plots. Such implementations would require a given choice of weighting function w(x) in practice, but would otherwise be straightforward to derive and easy to implement in practice; for further discussion see, for example, Wand and Jones (1995) for kernel estimation and Ruppert, Wand, and Carroll (2009) for series and penalized estimation.
A potential drawback of rule-of-thumb estimates is that they are inconsistent whenever the reference model used is incorrect. Thus, we propose instead easy-to-implement consistent nonparametric estimators for the unknown constants entering the optimal choices of number of bins in Equations (1) through (6). In the supplemental appendix, we outline a general approach allowing for any user-chosen known weighting function w(x), which needs to be set in advance. Here, we discuss in detail our recommended choice, w(x) = f (x), which removes a density from the denominators of the unknown constants and leads to particularly simple and intuitive data-driven rules. As we discuss further below, all of our approaches are not only theoretically justified, but also simple, easy-to-interpret and often more robust than the usual nonparametric alternatives.
We estimate the unknown constants using ideas related to spacings estimators (see, e.g., Ghosh and Jammalamadaka 2001;Lewbel and Schennach 2007;Baryshnikov, Penrose, and Yurich 2009, and references therein) and series estimators (see, e.g., Newey 1997;Chen 2007;Ruppert, Wand, and Carroll 2009;Belloni et al. 2015 for reviews). Spacings estimators are closely related to nearest neighbor estimators with fixed neighbors (e.g., Abadie andImbens 2006, 2010), and may be more robust than other nonparametric estimators such as kernel-based estimators because they do not require additional tuning parameter choices in their implementation. To describe these estimators, we need to introduce notation for order statistics and concomitants; see David andNagaraja (1998, 2003) for more details. For a collection of continuous random variables {(Z i , W i ) : i = 1, 2, . . . , n}, we let W (i) be the ith-order statistic of W i and Z [i] its corresponding concomitant. That is, W (1) < W (2) < · · · < W (n) and Z [i] denotes the Z-value associated with W (i) for all i = 1, 2, . . . , n.
Spacings estimators are useful because they exploit properties of order statistics and concomitants to approximate the unknown density and moments of the random variables nonparametrically. To see this, heuristically, recall that U i = F W (W i ) with {U i : 1 ≤ i ≤ n} uniform [0, 1] random variables, F W (·) the c.d.f. of W i and f W (·) the p.d.f. of W i . Then, for somȇ u i ∈ [U (i−1) , U (i) ] a Taylor series expansion gives , whereŪ (i) = (U (i−1) + U (i) )/2, and because |U (i) − U (i−1) − 1/n| ≈ 0. Thus, heuristically, an average of a smooth function of the spacings statistic, W (i) − W (i−1) , will converge to the expectation of this function inversely weighted by the unknown density f W (·), up to a scaling factor and some additional (technical) constants that may arise in the derivation. Similar arguments using concomitants and order statistics give Lemma SA3 in the supplemental appendix formalizes these results, which we use in the sequel to construct simple, nonparametric estimators of the unknown constants entering the number of bins selectors proposed in this article. We also employ series (polynomial) nonparametric approximations to estimate μ (1) − (x) and μ (1) + (x), and σ 2 − (x) and σ 2 + (x) in some cases, trying to mimic as closely as possible current empirical practices-these polynomial approximations are already available as part of the RD plots.

ES RD Plots
Taking w(x) = f (x), with f (x) unknown, leads to the following simplified constants in Theorem 1: which feature in the number of bins selectors discussed above for the ES RD plots. Letting {(Y −,i , X −,i ) : i = 1, 2, . . . , N − } and {(Y +,i , X +,i ) : i = 1, 2, . . . , N + } be the subsamples of control (X i <x) and treatment (X i ≥x) units, respectively, we propose the following estimators: and r (1) k (x) = ∂r k (x)/∂x = (0, 1, 2x, 3x 2 , . . . , kx k−1 ) . These estimators are particularly well suited for our purposes because they (i) avoid explicit estimation of the density f (x) appearing in the denominators and (ii) do not require specific choices of tuning parameters (e.g., bandwidths in kernel-based estimation). For these reasons, and given their simple implementation, we recommend employing these spacings-based estimators whenever possible.
Thus, our proposed data-driven selectors for ES RD plots take the formĴ using the estimators in (7)-(10), and whereV − andV + are consistent estimators of their population counterparts V − and V + . The following theorem shows that, when the polynomial fits are viewed as nonparametric approximations with k = k n → ∞, the different number of bins selectors are nonparametric consistent.
Theorem 3. Suppose Assumption 1 holds with S ≥ 5, and Y i (0) and Y i (1) are continuously distributed. If k 7 n /n → 0 and k n → ∞, then Remark 1 (Discontinuous Outcomes). When Y i (0) and Y i (1) are not continuously distributed, the concomitant-based estimation method becomes invalid. In this case, we need to employ other more standard nonparametric techniques. For example, assuming that E[Y i (t) 2 |X i = x], t = 0, 1, are twice continuously differentiable, we can use the following estimators: for k ∈ Z + and p ∈ Z ++ , μ −,k (x) =μ −,k,1 (x) andμ +,k (x) =μ +,k,1 (x) with our notation. We show in the appendix that the resulting partition-size selectors using the above estimators, are also consistent in the sense of Theorem 3, under the conditions imposed in that theorem.

QS RD Plots
Paralleling the discussion above for ES RD plots, we propose consistent estimators for J QS-μ,−,n , J QS-μ,+,n , J QS-ω,−,n , J QS-ω,+,n , J QS-ϑ,−,n , and J QS-ϑ,+,n , when w(x) = f (x) with f (x) unknown. In this case, the target constants take the form Our preferred selectors employ spacings estimators, which are simple and easy-to-implement but require continuous outcomes. See Remark 2 for the case of noncontinuous outcomes. The estimators of the optimal selectors for QS partition size are based on the following estimators: using the notation introduced above. Therefore, in the QS partitions case, our data-driven selectors take the formĴ using the estimators in (17)-(20), and appropriate consistent estimatorsV − andV + . As in the case of Theorem 3 for ES RD plots, the following theorem shows that these automatic partition-size selectors are nonparametric consistent if the polynomial fits are viewed as nonparametric approximations with k = k n → ∞.
In the supplemental appendix, we also propose consistent estimators for a generic, known weighting scheme w(x). These estimators are more flexible but also more complicated in general.
Remark 2 (Noncontinuous Outcomes). As mentioned in Remark 1, the concomitant-based estimation approach cannot be used when Y i (0) and Y i (1) are not continuously distributed. For the latter cases, alternatively, we can use the series polynomial estimation approach already introduced above. Assuming that E[Y i (t) 2 |X i = x], t = 0, 1, are twice continuously differentiable, we may use the following estimators: whereσ 2 −,k (x) andσ 2 +,k (x) are the polynomial approximations discussed in Remark 1. The corresponding data-driven partitionsize selectors in this case arě which, as we show in the appendix, are also consistent in the sense of Theorem 4, provided the conditions in that theorem hold.

NUMERICAL RESULTS
This section reports numerical evidence on the performance of our proposed methods employing real data from several empirical applications, and simulated data from a Monte Carlo experiment. We also compare numerically the two partitioning schemes studied in this article, ES and QS, in terms of their asymptotic IMSE. All the results in this section are obtained using the R and STATA software packages described in Calonico, Cattaneo, andTitiunik (2014a, 2015).

Empirical Applications
We illustrate our methods using data from several RD empirical applications. To conserve space, we report here only results using the data from Lee (2008), already mentioned in the Introduction. The supplemental appendix includes the other empirical applications, which employ data from (i) U.S. Senate elections (see Cattaneo, Frandsen, and Titiunik 2015 for details), (ii) Progresa/Oportunidades anti-poverty conditional cash transfer program (see Calonico, Cattaneo, and Titiunik 2014c, sec. S.4 for details), and (iii) Head Start funding program (see Ludwig and Miller 2007 for details). As mentioned above, Lee (2008) studied the incumbency advantage in U.S. House elections; the forcing variable is the margin of victory of the Democratic party in a given U.S. House election, the threshold isx = 0, and the outcome variable is the Democratic vote share in the following U.S. House election, which occurs 2 years later. The unit of observation is the U.S. House district. All U.S. House elections between 1948 and 2008 are included, with the exception of years when district boundaries change; the dataset we employ has a total of n = 6558 complete district-year observations.
The main goal of this empirical application is to show how our selectors perform when applied to a realistic dataset. It is difficult to compare our results to alternatives or benchmarks because we are not aware of any other selectors available in the literature to construct RD plots. The standard practice appears to be that each researcher explores the data and selects the number of bins in an ad hoc, nonsystematic way. Thus, we focus on discussing the graphical properties of the resulting RD plots when our methods are employed. Figures 2 and 3 collect six graphs in two rows. Each graph depicts the global fourth degree polynomial fitsμ −,4 (x) and μ +,4 (x) as a solid blue line. Graphs (a) and (d) are the scatterplot of the raw data, which we include here for visual comparison. The remaining four graphs in each figure are RD plots constructed using ES bins (Figure 2) or QS bins (Figure 3) with data-driven choices employing either spacings estimators or series estimators. In each of these four graphs, the black dots correspond to the sample mean within each bin when the number of bins is selected to mimic the variability of the data (graphs (b) and (e)), while the black triangles correspond to the sample mean within each bin when the number of bins is selected to minimize the IMSE of the underlying regression function estimator (graphs (c) and (f)).
When analyzing each figure row-wise, we may see graphically how the variability of the data is summarized by each method. In particular, the scatterplots (graphs (a) and (d) in each figure) give a graphical representation of the raw data and are therefore extremely variable and arguably uninformative. Next, the mimicking variance RD plots (graphs (b) and (e) in each figure) reduce variability substantially when plotting the binned sample means of the raw data, but they are still able to provide a disciplined graphical representation of the overall variability of the RD design. Finally, the IMSE-optimal RD plots (graphs (c) and (f) in each figure) deliver "smoother" local (disjoint) sample means essentially trying to trace out the underlying unknown conditional expectation functions.
To summarize, the data-driven selectors introduced in this article seem to perform very well in all the empirical applications we considered. Specifically, the data-driven IMSE-optimal spacings-based selectors (Ĵ ES-μ,−,n ,Ĵ ES-μ,+,n ) and series-based selectors (J ES-μ,−,n ,J ES-μ,+,n ) generate a collection of binned sample means "tracing out" the estimated regression function (we use the polynomial fit as benchmark), which provides visual evidence in favor of continuity of the conditional expectations. Furthermore, the data-driven spacingsbased selectors (Ĵ ES-ϑ,−,n ,Ĵ ES-ϑ,+,n ) and series-based selectors (J ES-ϑ,−,n ,J ES-ϑ,+,n ) mimicking the underlying variability of the data generate a disciplined scatterplot with substantial more variability than the IMSE-optimal binned means case, but yet less variable than the raw data, which in this case provides a nice visual representation of the RD design. As shown in the supplemental appendix, very similar findings emerge from the other empirical applications mentioned previously.

Simulated Data
We briefly report an example of the results from an extensive Monte Carlo experiment we conducted to study the finite-sample behavior of our proposed methods. The full simulation experiment considers 16 distinct data-generating processes, which vary in the distribution of the running variable, the form of the conditional variance, and the distribution of the unobserved error term in the regression function. To conserve space, here we only discuss the simplest case that assumes a uniform distribution for X i and homoscedasticity, but the full set of results is reported in the supplemental appendix. We found the same qualitative results in all cases.
Specifically, we discuss the simulation model {(Y i , X i ) : i = 1, 2, . . . , n} iid with Y i = μ(X i ) + ε i , X i ∼ Uniform(−1, 1), ε i ∼ Normal(0, 1), and where The functional form of μ(x) is obtained using the original data of Lee (2008). All details are given in the supplemental appendix.    when using either spacings estimators (Ĵ ES-μ,−,n ) or polynomial estimators (J ES-μ,−,n ). In this case, the target quantity is J ES-μ,−,n = 25, as mentioned above, and both data-driven implementations are well centered (e.g., sample means are 25.95 and 25.93) and concentrated (e.g., standard deviation are 0.93 and 0.87). Similarly, the third and fourth rows of Panel B present the sampling behavior of the mimicking variance estimatorŝ J ES-ϑ,−,n andJ ES-ϑ,−,n , which perform equally well. In sum, our simulation results briefly discussed here (and available in the supplemental appendix in full) suggest that our proposed optimal data-driven tuning parameter choices for constructing RD plots perform well in finite samples.

Comparison of Partitioning Schemes
We proposed two alternative ways of constructing RD plots, one employing ES partitioning and the other employing QS partitioning. Our proposed selection rules for ES and QS partitioning coincide when the underlying distribution of X i is uniform. In general, however, neither partitioning approach dominates the other in terms of asymptotic variability or IMSE. For example, the IMSE of ES and QS sample means will depend on the unknown density of the running variable, as well as the unknown regression and conditional variance functional forms. In the supplemental appendix, we derive the exact formulas for the optimal IMSE for both ES and QS partitioning, and show formally that neither dominates the other in general. We also employ the 16 simulation models mentioned before to compare the performance of the partition-size selectors for ES and QS RD plots: according to our numerical evidence, QS RD plots seem to perform better than ES RD plots in terms of IMSE when the density f (x) is low in some regions of the support, although this conclusion depends in part on the other unknown features of the data-generating process.

EXTENSIONS
We discuss briefly two extensions that are practically relevant. (We thank a reviewer for suggesting that we address these issues.) The first extension discusses how our results apply to fuzzy RD designs, while the second focuses on how other covariates could be incorporated in the analysis.

Fuzzy RD Designs
In the so-called fuzzy RD design, treatment assignment and treatment status may be different for each unit (i.e., imperfect treatment compliance). The basic RD model introduced in Section 2 can be expanded to account for this possibility. Specifically, similarly to the potential outcomes Y i (0) and Y i (1) already introduced, define T i = T i (0) · 1(X i <x) + T i (1) · 1(X i ≥x) where T i (0) and T i (1) denote the potential actual treatment status for each unit. In this more general setting, the treatment effect of interest is defined as the ratio of the reduced form effect (i.e., the effect of treatment assignment on the outcome) and the firststage effect (i.e., the effect of treatment assignment on actual treatment status).
Our previous results for RD plots continue to apply without change to the reduced form model for Y i and X i : that is, for the conditional expectations E[Y i (0)|X i = x], x <x, and E[Y i (1)|X i = x], x ≥x. Furthermore, by imposing conditions analogous to Assumption 1 but now for the conditional expectations E[T i (0)|X i = x] and E[T i (1)|X i = x], and the conditional variances V [T i (0)|X i = x] and V [T i (1)|X i = x], our results will also apply to the first-stage model. The only distinct aspect of the first-stage model is that T i ∈ {0, 1}, and thus one needs to employ the data-driven selectors discussed in Remarks 1 and 2, for ES and QS RD plots, respectively.
From a practical perspective, in the fuzzy RD design, RD plots can be used to depict both the reduced form effect as well as the take-up effect (i.e., the first-stage effect). Our results apply to both by simply changing the outcome variable used (either Y i or T i ). For a related approach, see also Bertanha and Imbens (2014).

Incorporating Covariates
In some applications, researchers may want to incorporate covariates in the empirical analysis employing RD plots. We briefly discuss two such approaches for sharp RD designs but, as mentioned above, the upcoming discussion also applies to fuzzy RD designs. Suppose Z i ∈ R d , i = 1, 2, . . . , n, is an observed covariate for each unit, and consider constructing an RD plot for the conditional expectations E[Y i (0)|X i = x, Z i = z], x <x, and E[Y i (1)|X i = x, Z i = z], x ≥x. Two simple approaches are (i) conditioning and (ii) employing generalized additive models (GAM).
A first conceptually straightforward approach to incorporate covariates in RD plots is to condition on them. Handling continuous covariates in this case is hard, while incorporating discrete covariates such as gender or age is easy. Specifically, the results in this article can be applied to the subsamples generated by the conditioning set of interest, or an approximation thereof (when the number of conditioning variables is large or Z i is continuously distributed), provided the assumptions given above are extended to hold conditional on the appropriate conditioning set (and other appropriate regularity conditions hold).
A second way of incorporating covariates in the RD plots is to employ ideas from the GAM literature (Hastie and Tibshirani 1990), together with some method to remove the covariates such as backfitting. Specifically, in this approach it is assumed that the underlying conditional expectations take the form (1)|X i = x] + g 1 (z), for some unknown smooth functions g 0 (·) and g 1 (·), and then a flexible (nonparametric) method is used to remove the effect of the covariates. Once the covariates are removed, the RD plot can be constructed as discussed in this article but employing the appropriately adjusted outcome Y i and running variable X i .

CONCLUSION
This article introduced several optimal data-driven partitionsize selectors for RD plots, focusing on the commonly used ES RD plot and also on an alternative QS RD plot. The resulting selectors lead to practical RD plots that are constructed in an automatic and objective way using the available data.
We developed two kinds of selectors, one tailored to approximate the underlying regression function and another to represent the underlying variability of the raw data. These selectors provide a benchmark for graphical analysis in the context of RD designs: the optimal choices of number of bins introduced can be interpreted as balancing variance and bias of a partitioning estimator of the underlying conditional expectations, and hence an empirical researcher may use these selectors to construct undersmoothed (more bins) or oversmoothed (fewer bins) RD plots or to give a formal interpretation to an ad hoc choice.

SUPPLEMENTARY MATERIALS
The supplemental appendix contains the proofs of the main theorems, additional methodological and technical results, more detailed simulation evidence, and other empirical illustrations. [Received October 2014. Revised January 2015