Generalizing the Results from Social Experiments: Theory and Evidence from India

Abstract: How informative are treatment effects estimated in one region or time period for another region or time? In this article, I derive bounds on the average treatment effect in a context of interest using experimental evidence from another context. The bounds are based on (a) the information identified about treatment effect heterogeneity due to unobservables in the experiment and (b) differences in outcome distributions across contexts, which are used to learn about differences in distributions of unobservables. Empirically, using data from a pair of remedial education experiments carried out in India, I show the bounds are able to recover average treatment effects in one location using results from the other, while the benchmark method cannot.


Introduction
What do causal effects measured in one place tell us about causal effects in another place or at another time? It is clear that not every finding applies in every context. Some authors have protested against policy recommendations they see as based on implicit extrapolation from a small number of experiments to a wide variety of dissimilar contexts (Deaton 2010; Pritchett and Sandefur 2013; Deaton and Cartwright 2016). Empirically, a growing body of work finds different effects of identical policies among individuals with the same observed characteristics living in different contexts (e.g., Attanasio, Meghir, and Szekely 2003; Allcott 2015; Andor et al. 2020). Relevant unobserved differences between contexts remain, even when considering individuals with the same observed characteristics.
In this article I derive bounds on the average treatment effect (ATE) in an alternative context of interest (context a) using experimental evidence from elsewhere (context e). To fix ideas, consider a familiar setting from the development economics literature. A randomized evaluation of a pilot program has been run in context e and we wish to know what we can conclude about the effect of the program in context a. The experimental treatment group has access to the program, while the control group does not. I study intention-to-treat effects, inclusive of any compliance issues. Data are available on characteristics X and outcomes Y of individuals participating in the experiment. There are also data available on outcomes and characteristics of individuals in a, possibly coming from a different survey. Since the program is a pilot, individuals in a do not have access to the program, so they are all untreated.
Throughout the article I will use context-specific superscripts to index quantities conditioned on context, for example letting E^a[Y_1] denote the expectation of the treated outcome in a. As a first contribution, I derive bounds on conditional average treatment effects (CATEs) in a by assuming the distribution of the treated outcome for a given untreated outcome and set of characteristics in the context of interest (F^a_{Y_1|Y_0,X}(y_1|y_0, x)) can be generated from one of the possible joint distributions of potential outcomes in the experiment for the same characteristic set. There is more than one possible joint distribution of potential outcomes consistent with the experiment because we never observe any individual in the treated and untreated state at the same time; the set of joint distributions is only partially identified. Unobservables generate differences in realized outcomes for individuals with the same observed characteristics, so we can isomorphically think of joint distributions of potential outcomes as characterizing treatment effect heterogeneity due to unobservables.
Since the untreated outcome distribution in a is identified, we can integrate the expectations of the treated outcome E^a[Y_1|Y_0, X] generated by possible distributions of the treated outcome conditional on the untreated outcome and X over the untreated outcome distribution in a. This partially identifies the CATEs in a. The width of the bounds depends crucially on the difference in untreated outcome distributions between contexts e and a, with greater differences generating wider bounds. Without further restrictions, the bounds on treatment effects in a obtained in this way can be wide because the experiment does not rule out any level of dependence between treated and untreated outcomes. For continuous outcomes, this means the treatment could perfectly preserve individual ranks in outcome distributions or, at the opposite extreme, it could invert ranks. We typically do not believe treatments studied in economics are sufficiently powerful to invert ranks. Except in extreme cases, we expect positive dependence between treated and untreated outcome levels for any individual, to varying degrees depending on the nature of the treatment.
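The logic above can be illustrated with a small numerical sketch (hypothetical numbers, not the paper's data): an experiment pins down the marginal distributions of Y_0 and Y_1 in e, but many joint distributions (couplings) are consistent with them, and reweighting the implied E[Y_1|Y_0] by a different untreated outcome distribution in a gives different answers.

```python
import numpy as np

# Hypothetical experimental marginals over a binary outcome {0, 1}.
p_y0 = np.array([0.6, 0.4])   # P^e(Y0 = 0), P^e(Y0 = 1)
p_y1 = np.array([0.4, 0.6])   # P^e(Y1 = 0), P^e(Y1 = 1)

# Two couplings consistent with the same experimental marginals:
# comonotonic (maximal positive dependence) and independent.
joint_como = np.array([[0.4, 0.2],
                       [0.0, 0.4]])          # rows index Y0, columns index Y1
joint_indep = np.outer(p_y0, p_y1)

# The untreated outcome distribution differs in the context of interest a.
p_y0_a = np.array([0.8, 0.2])

def e_a_y1(joint):
    # E^a[Y1] = sum_{y0} E[Y1 | Y0 = y0] * P^a(Y0 = y0)
    cond = joint / joint.sum(axis=1, keepdims=True)   # P(Y1 = . | Y0 = y0)
    return p_y0_a @ (cond @ np.array([0.0, 1.0]))

# In e, E^e[Y1] = 0.6 under either coupling; in a, the answer depends on it.
print(e_a_y1(joint_como))    # 7/15, roughly 0.467, under comonotonicity
print(e_a_y1(joint_indep))   # 0.6 under independence
```

Both couplings reproduce the experimental marginals exactly, yet they imply different values of E^a[Y_1] once the untreated distribution shifts, which is exactly the source of the partial identification described above.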
As a second contribution, I therefore develop tighter bounds, indexed by the minimum normalized rank correlation between an individual's treated and untreated outcomes the researcher is willing to consider, following Heckman, Smith, and Clements (1997). When treated and untreated outcomes are perfectly positively dependent, there is only a single joint distribution of treated and untreated outcomes consistent with the experimental results. In the continuous outcome case, each untreated outcome is linked to a single treated outcome. As we move away from perfect dependence, different associations between treated and untreated outcomes become possible. These different associations produce uncertainty about CATEs in a that is increasing in the difference between the distributions of untreated outcomes across contexts. The width of the bounds for a given minimum dependence level provides a measure of uncertainty about the ATE. The bounds also allow researchers to assess the assumptions on dependence necessary to draw specific conclusions about the effect of the program in the context of interest, such as its ability to exceed a cost-effectiveness threshold.
Computation of the bounds on the average treatment effect is challenging because it requires solving an infinite-dimensional optimization problem over the space of possible joint distributions of treated and untreated outcomes. I address this difficulty when outcomes and characteristics are discrete by deriving an alternative representation of the problem based on optimal transportation theory (see Villani 2009; Galichon 2016). I show how this representation can be solved quickly using linear programming techniques and yields a moment inequality representation for inference.
I empirically evaluate the results of my bounding procedure compared to the current benchmark method for extrapolating treatment effects using observed characteristics: Hotz, Imbens, and Mortimer (2005) (henceforth HIM). HIM make the stronger assumption that the joint distribution of untreated and treated outcomes for individuals with the same observed characteristics is independent of context. They also suggest using untreated outcomes for individuals with the same characteristics to assess generalizability, testing independence of the untreated outcome distributions conditional on observed characteristics across contexts. As documented in, for example, Gerard, Rokkanen, and Rothe (2020) and Rambachan and Roth (2021), it is typical when such a test rejects the null to abandon the empirical exercise, in this case effectively concluding that the experiment teaches us nothing about causal effects in the context of interest. If the test fails to reject, common practice would be to proceed as if CATEs are transportable across contexts, such that the experiment point-identifies the treatment effect of interest.
To check my predictions against measured ATEs, I use data from randomized evaluations of a remedial education program implemented in two Indian cities and described in Banerjee et al. (2007). The two cities' student populations are sufficiently different that equality of their untreated outcome distributions is rejected, which, following common practice, would lead to believing we cannot learn anything about the causal effect in one city based on experimental results from the other. However, I show that if we assume treated and untreated outcomes are sufficiently dependent and that F_{Y_1|Y_0,X}(y_1|y_0, x) in the new city can be generated from the experimental data, we can exclude a substantial range of values of ATE^a, including a zero effect. The observed causal effects are consistent with predictions based on these assumptions. In an extension, I allow F_{Y_1|Y_0,X}(y_1|y_0, x) in the context of interest to be only close to distributions consistent with the experimental results, with proximity measured in terms of Hellinger distance. Even under this extension, a zero ATE^a remains outside the bounds as long as dependence is sufficiently high, with the caveat that statistical inference for this case is outside the scope of the current project.
This article extends the literature on generalizing treatment effects to new contexts based on invariance assumptions on distributions of untreated and treated outcomes or treatment effects for individuals with the same observed characteristics (HIM; Attanasio, Meghir, and Szekely 2003; Cole and Stuart 2010; Stuart et al. 2011; Angrist and Fernández-Val 2013; Flores and Mitnik 2013; Pearl and Bareinboim 2014; Angrist and Rokkanen 2015; Dehejia, Pop-Eleches, and Samii 2019). Meta-analysis methods have been used to assess the external validity of results within a set of studies (see e.g., Dehejia 2003; Meager 2019; Vivalt 2020 for examples in economics). When used to extrapolate to contexts outside the set of studies, existing meta-analysis methods incorporating individual-level heterogeneity (known as mixed models) generate a point estimate that assumes individuals with the same characteristics will have the same treatment response.
Relative and complementary to the small literature examining the sensitivity of generalization exercises to the role of unobserved treatment effect modifiers (emphasized in the literature on external validity reviewed in Muller 2014), this article is unique in leveraging (a) the information provided by the experiment regarding treatment effect heterogeneity due to unobservables, as discussed in, for example, Heckman, Smith, and Clements (1997), Djebbari and Smith (2008), and Fan and Park (2010), and (b) differences in outcome distributions across contexts to learn about differences in the distributions of unobservables. Andrews and Oster (2019) consider a local approximation around the benchmark of no distributional imbalance in the observed and unobserved characteristics which determine treatment effect heterogeneity. Nie, Imbens, and Wager (2021) and Nguyen et al. (2017) consider sensitivity to the ability of an unobserved covariate to distort measures of the prevalence of different CATEs across contexts and the difference in average treatment effects directly, respectively. In other related work, Athey and Imbens (2006) use outcome distributions for different groups of individuals in the same time period to capture unobserved differences across groups when generalizing the linear difference-in-differences estimator. One of their estimators coincides with this article's, with time playing the role of treatment, under perfect dependence between treated and untreated outcomes. Hoderlein and Stoye (2014) consider another setting where marginal distributions of outcomes (cross-sectional demand) are identified but the joint distribution (of demand under different budget sets) contains the information of interest. As in this article, the authors explore the identifying power of sets of dependence assumptions between the marginal distributions. Finally, this article is also related to a growing body of work moving from testing frameworks to approaches based on quantifying the assumptions required to draw conclusions about causal effects (e.g., Rosenbaum 2002; Imbens 2003; Altonji, Elder, and Taber 2005; Oster 2019; Gerard, Rokkanen, and Rothe 2020; Rambachan and Roth 2021).
The rest of the article is organized as follows. Section 2 sets up the problem and notation and describes the approach based on HIM. In Section 3, I present the derivation of the bounds. Section 4 describes my approach to estimation and inference with discrete outcomes and covariates. Section 5 investigates using the results from one of the two remedial education experiments to predict the results of the other. Section 6 lays out the extension to untreated-outcome-conditioned distributions of treated outcomes in the context of interest beyond those consistent with the experimental results. Section 7 concludes. Some readers may wish to consult Online Appendix A, which describes the intuition behind the proposed methods by means of a simple example. All proofs are given in appendices, and all expectations referenced in the article are assumed to exist.

Setup
We are interested in the causal effect of a binary treatment T ∈ {0, 1} on an observable outcome Y ∈ Y ⊆ R. Each individual is associated with two potential outcomes: Y_1 ∈ Y is her outcome if she receives treatment and Y_0 ∈ Y is her outcome if she does not. The observed outcome Y can be written as

Y = T Y_1 + (1 − T) Y_0. (1)

For the remainder of the article I will use potential outcomes notation, but note that we can equivalently write Y_t = g(t, U, X) for t ∈ {0, 1}, where g(·) is a structural function, X ∈ X denotes the vector of observed covariates, and U is a vector of unobserved variables of unrestricted dimension. Therefore, all statements regarding conditional distributions Y_1, Y_0 | X = x can be equivalently stated in terms of the distribution of unobservables U | X = x. Data come from two contexts, indexed by D ∈ {e, a}: e is the context in which an experimental evaluation of T was conducted and a is the alternative context of interest. In context e I assume an evaluator assigns T independently of all other random variables, with perfect compliance. The probability of assignment to treatment in context e is known and bounded away from zero and one.
Assumption 1 (Simple random assignment in context e). P^e(T = 1) ∈ (0, 1) is a known constant p^e.

I maintain that all members of the alternative population are untreated. I also assume the probability of a unit's belonging to context a in the joint data comprising the two contexts is known and interior.
In this article, the object of interest is the average treatment effect in the alternative context, ATE^a = E^a[Y_1 − Y_0], which depends on our ability to identify the expectation of the counterfactual treated outcome, E^a[Y_1]. If the treatment effect were constant for all individuals, E^a[Y_1 − Y_0] would simply be equal to E^e[Y_1 − Y_0]. However, theory rarely implies a constant treatment effect and we can often reject it empirically (see e.g., Bitler, Gelbach, and Hoynes 2006; Heckman, Smith, and Clements 1997; Djebbari and Smith 2008).

Approach Based on HIM
Within this general setup, I briefly describe HIM's approach to identifying ATE^a. HIM put forth the assumption that the joint distribution of potential outcomes is independent of the context, conditional on observed covariates:

F^e_{Y_1,Y_0|X}(y_1, y_0 | x) = F^a_{Y_1,Y_0|X}(y_1, y_0 | x), (2)

or equivalently, that all unobserved covariates determining the outcome are independent of the context indicator. As long as X^a ⊆ X^e, we can identify the average treatment effect in the population of interest by reweighting the conditional expectation of the treated outcome from e by the distribution of covariates in a and subtracting the expectation of the untreated outcome there:

ATE^a = ∫ E^e[Y_1 | X = x] dF^a_X(x) − E^a[Y_0].

For (2) to hold, the conditional distributions of untreated outcomes must be the same in the two populations. Therefore HIM and papers following them suggest testing equality of these distributions or their moments. Two issues come up if one were to test F^e_{Y_0|X}(y_0|x) = F^a_{Y_0|X}(y_0|x) and use the result to conclude whether or not we can generalize results from the experiment to the context of interest. First, considering the small sample sizes of many social experiments, applications of this test will often be underpowered to reject equality of the conditional outcome distributions (a point raised also in Flores and Mitnik 2013). Second, faced with a test rejecting the null hypothesis, common practice would often be to abandon the empirical exercise and conclude that the experiment tells us nothing about ATE^a. In the following section, I depart from the testing framework and derive bounds on the average treatment effect in the population of interest as a function of the differences in the covariate-conditioned distributions of untreated outcomes between the population of interest and the experimental population.
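The HIM reweighting step can be sketched in a short simulation. All numbers and the data-generating process below are hypothetical, chosen only so the estimator's target is known in closed form.

```python
import numpy as np

# Hypothetical experimental data from context e: covariate x, treatment t, outcome y.
rng = np.random.default_rng(0)
n = 10_000
x_e = rng.integers(0, 2, n)                       # binary covariate
t_e = rng.integers(0, 2, n)                       # random assignment
y_e = x_e + t_e * (1 + x_e) + rng.normal(0, 1, n) # treatment effect 1 + x

# Covariate distribution and untreated outcomes in context a.
p_x_a = np.array([0.2, 0.8])                      # P^a(X = 0), P^a(X = 1)
x_a = rng.binomial(1, p_x_a[1], n)
y0_a = x_a + rng.normal(0, 1, n)                  # everyone untreated in a

# HIM-style extrapolation: reweight E^e[Y1 | X = x] by P^a(X = x),
# then subtract E^a[Y0].
e_y1_given_x = np.array([y_e[(t_e == 1) & (x_e == v)].mean() for v in (0, 1)])
ate_a_hat = p_x_a @ e_y1_given_x - y0_a.mean()
print(ate_a_hat)   # true ATE^a here is 0.2 * 1 + 0.8 * 2 = 1.8
```

In this simulation the invariance assumption (2) holds by construction, so the reweighted estimate recovers ATE^a; the point of the sections that follow is what can still be learned when it fails.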

Identification of Bounds on ATE^a
I begin this section by deriving bounds on the unobserved E^a[Y_1|X], imposing only the assumption that F^a_{Y_1|X}(y_1|x) cannot be ruled out by the experimental results. I then derive bounds on ATE^a that impose additional restrictions on the dependence between treated and untreated outcomes for any individual.

Bounds on E^a[Y_1|X] Under Consistency with Experimental Results
To use the experiment to provide information about E^a[Y_1|X], I assume that the conditional distribution of treated outcomes in a is consistent with some joint distribution of potential outcomes that could have generated the experimental marginals. Such joint distributions can be characterized by copula functions coupling the marginal distributions. Let C denote the set of valid copula functions. I express the assumption as follows.
Assumption 3 (Consistency with experimental results). The conditional distribution of treated outcomes in the population of interest is consistent with the experimental results: F^a_{Y_1|Y_0,X}(y_1|y_0, x) is the conditional distribution implied by the joint distribution C_x(F^e_{Y_1|X}(y_1|x), F^e_{Y_0|X}(y_0|x)), for some copula function C_x ∈ C. (3)
I view this assumption as a benchmark. Identification under Assumption 3 illustrates how uncertainty about the extent of treatment effect heterogeneity due to unobservables in the experiment carries over into uncertainty about ATE^a. Moving from the mapping between treated and untreated outcomes when positive dependence between them is maximized (and treatment effect heterogeneity due to unobservables is minimized, Cambanis, Simons, and Stout 1976) to a nondegenerate distribution of treated outcomes for each value of the untreated outcome does, however, generate a second source of uncertainty about the ATE in a. This conditional distribution of treated outcomes need not be the same in a as in e. In Section 6 I show how to adapt my theoretical bounds and augment the programming problem to allow the distribution of the treated outcome conditional on a value of the untreated outcome to differ across contexts, subject to a constraint on the Hellinger distance from conditional distributions consistent with the experimental results to the corresponding distribution in a.
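For reference, the Hellinger distance between two discrete distributions p and q, the proximity measure used in the Section 6 extension, is H(p, q) = (1/√2)·||√p − √q||₂, a standard formula sketched below (the function name is illustrative):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability vectors p and q.
    # H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), in [0, 1].
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = np.array([0.5, 0.3, 0.2])
print(hellinger(p, p))                                           # 0.0: identical
print(hellinger(np.array([1.0, 0, 0]), np.array([0, 0, 1.0])))   # 1.0: disjoint
```

Because H is bounded between 0 (identical distributions) and 1 (disjoint supports), a Hellinger-distance cap gives an interpretable knob for how far the context-a conditional distributions may drift from those consistent with the experiment.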
A sufficient condition for Assumption 3 is that the distribution of the treated outcomes be the same across populations once we have conditioned on a value of the control outcome and the observed covariates. Formally:

F^e_{Y_1|Y_0,X}(y_1|y_0, x) = F^a_{Y_1|Y_0,X}(y_1|y_0, x), (4)

recalling that in conditioning on Y_0 = y_0 and X = x we are conditioning on a function of U and X, g(0, u, x), and x itself. In practice, (4) may be easier to evaluate economically than the more general Assumption 3. Note that Assumption 3 necessarily rules out differences in the distribution of unobservables across contexts which affect treated outcomes only. These include unobserved differences in factors influencing how the treatment would be delivered in a, which can matter a great deal for treatment effects (see e.g., Bold et al. 2018). One could alternatively view results under Assumption 3 as representing E^a[Y_1] in the counterfactual case where treatment delivery in a were identical to that in e, conditional on X and Y_0.
A distribution F^a_{Y_1|Y_0,X}(y_1|y_0, x) obtained from Assumption 3 is defined only for y_0 on the support of F^e_{Y_0|X}(y_0|x). Therefore, I will also assume that the support of Y_0|X = x in a is a subset of the support in the experiment.
Assumption 4 (Support of Y_0|X = x). The support of Y_0|X = x in the context of interest is a subset of the support in the experiment for all values of x in the support of X in e: Supp^a(Y_0|X = x) ⊆ Supp^e(Y_0|X = x).

It is of course still possible to obtain identification without Assumption 4 if Y is bounded (Manski 1990). Assumptions 1, 2, 3, and 4 allow us to bound the expectation of the treated outcome in the context of interest for a value of x in X^e:

E^a[Y_1|X = x] ∈ [ inf_{C_x ∈ C} ∫ E^{C_x}[Y_1|Y_0 = y_0, X = x] dF^a_{Y_0|X}(y_0|x), sup_{C_x ∈ C} ∫ E^{C_x}[Y_1|Y_0 = y_0, X = x] dF^a_{Y_0|X}(y_0|x) ], (5)

where E^{C_x} denotes the expectation under the conditional distribution of the treated outcome implied by copula C_x. Turning to ATE^a, note that the bounds from (5) are only defined for values x on the support of X in the experiment. Therefore, to produce informative bounds on the unconditional expectation E^a[Y_1] I assume that all values x on the support of X in the context of interest are contained in the support of X in the experiment.
Assumption 5 (Support of X). The support of X in the population of interest is a subset of the support in the experimental population: X^a ⊆ X^e.

Bounds on ATE^a with Restricted Dependence
By considering the full set of possible copulas, we consider copulas that may not be credible. We typically anticipate some positive dependence between outcomes with and without treatment for any one individual, with the degree of dependence (and thus of unobserved treatment effect heterogeneity) depending on the application. I therefore index copulas by the degree of dependence in the joint distributions of untreated and treated outcomes they generate. I use Normalized Spearman's ρ, defined below, to measure dependence.
Definition 1 (Normalized Spearman's ρ). For any two random variables U and V and copula function C, Normalized Spearman's ρ is given by

ρ_N(C) = cov_C(R(U), R(V)) / cov_M(R(U), R(V)) if cov_C(R(U), R(V)) ≥ 0,
ρ_N(C) = cov_C(R(U), R(V)) / |cov_W(R(U), R(V))| otherwise,

where R(U) = (F_U(u) + F_U(u−))/2 when U takes a finite number of values, and equivalently for V. The notation F_U(u−) denotes P(U < u), and equivalently for V. cov_C(R(U), R(V)) refers to the covariance between R(U) and R(V) under copula C; cov_M(R(U), R(V)) is the maximum covariance possible between R(U) and R(V), occurring when U and V are comonotonic; and cov_W(R(U), R(V)) is the minimum possible covariance, occurring when U and V are countermonotonic. Theorem 5 and Section 4.3 in Nešlehová (2007) show that 12 cov_C(R(U), R(V)) is equivalent to the standard definition of Spearman's ρ found in, for example, Nelsen (2006), Equation 5.1.14, computed by comparing (U, V) against independent copies with the same marginals, so that the calculation is completely standard. However, when U and V take a finite number of values, cov_M(R(U), R(V)) may not equal 1/12 and the normalization is needed (Genest and Nešlehová 2007, sec. 4.4). The only difference with the standard calculation is the normalization in the discrete case.
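The positive-dependence branch of Definition 1 can be sketched for discrete marginals as follows. The mid-rank transform R(u) = (F_U(u) + F_U(u−))/2 follows Nešlehová (2007); the function names are illustrative, not the paper's code.

```python
import numpy as np

def mid_ranks(pmf):
    # Mid-rank transform R(u) = (F(u) + F(u-)) / 2 for a discrete marginal pmf.
    cdf = np.cumsum(pmf)
    cdf_minus = np.concatenate(([0.0], cdf[:-1]))
    return (cdf + cdf_minus) / 2

def rank_cov(joint):
    # cov(R(U), R(V)) under a joint pmf (rows index U, columns index V).
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    ru, rv = mid_ranks(pu), mid_ranks(pv)
    return ru @ joint @ rv - (pu @ ru) * (pv @ rv)

def comonotonic(pu, pv):
    # Joint pmf of the Frechet-Hoeffding upper bound H(u, v) = min(F_U(u), F_V(v)).
    H = np.minimum.outer(np.cumsum(pu), np.cumsum(pv))
    return np.diff(np.diff(np.pad(H, ((1, 0), (1, 0))), axis=0), axis=1)

def normalized_spearman(joint):
    # Positive-dependence branch: divide by the comonotonic (maximal) covariance.
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    return rank_cov(joint) / rank_cov(comonotonic(pu, pv))

# Sanity checks on a pair of discrete marginals.
pu = np.array([0.6, 0.4])
pv = np.array([0.4, 0.6])
print(normalized_spearman(comonotonic(pu, pv)))   # 1.0 by construction
print(normalized_spearman(np.outer(pu, pv)))      # 0.0 under independence
```

Dividing by the comonotonic covariance rather than the continuous-case constant 1/12 is exactly the normalization the definition requires for discrete outcomes.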
I produce bounds on ATE^a subject to the restriction that we only consider copula functions generating dependence greater than a specified level; let C(ρ_L) ⊆ C denote the set of copulas with Normalized Spearman's ρ of at least ρ_L. These bounds are represented in the following proposition.
With ρ_L = −1 there is no restriction on dependence. At the opposite extreme, C(1) is a singleton and the bounds shrink to a point. C(1) can be justified by assuming that unobserved heterogeneity is one-dimensional and g(t, u, x) is strictly increasing in u for all t and x. Invoking Assumption 3 is likely to be most credible when also restricting dependence, that is, in applications where unobserved heterogeneity is close to this one-dimensional reference point. In the application to remedial education in India, the highest-scoring student without a remedial education teacher assigned to her school would very likely still be the highest-scoring student in math with a remedial education teacher assigned. Latent math skill is the dominant unobservable in this case, and it naturally affects both treated and untreated outcomes. At the same time, it is natural to believe that remedial education teachers are more effective at working with students of specific (low) skill levels, such that those students may leapfrog their peers in the test score distribution. Hence, ρ_L is plausibly near 1, but we would like to explore sensitivity of conclusions regarding ATE^a to nearby values.
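Under perfect dependence (ρ_L = 1) with a continuous outcome, ranks are preserved, so each untreated outcome maps to the treated quantile at the same rank: Y_1 = Q^e_{Y_1}(F^e_{Y_0}(Y_0)). The sketch below illustrates this point identification with a made-up test-score DGP (nothing here comes from the BCDL data).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical continuous test scores: untreated outcomes in e and a, treated in e.
y0_e = rng.normal(50, 10, 5000)
y1_e = y0_e + 5 + 0.1 * (60 - y0_e)   # monotone in y0: treatment helps weaker students more
y0_a = rng.normal(45, 10, 5000)       # context-a students score lower untreated

# Rank preservation: impute Y1 in a as the e-treated quantile at the e-rank of y0.
ranks_a = np.searchsorted(np.sort(y0_e), y0_a) / len(y0_e)   # empirical F^e_{Y0}(y0)
y1_a_imputed = np.quantile(y1_e, ranks_a)                    # empirical Q^e_{Y1}(rank)
ate_a = (y1_a_imputed - y0_a).mean()
print(ate_a)   # roughly 6.5 for this DGP, since ATE^a = 11 - 0.1 * E^a[Y0]
```

Because the imputation uses only the two experimental marginals plus the rank-preservation assumption, relaxing ρ_L below 1 replaces this single imputed value with an interval, as in the proposition above.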
In addition to measuring the strength of assumptions on ρ_L necessary to draw conclusions about ATE^a, researchers may also wish to gauge plausible values of ρ_L by simulating simple parametric models. The advantage of the framework advanced here, relative to simply using such a model to draw inferences about ATE^a, is that the researcher need not be wedded to a specific set of parametric assumptions; instead, they can draw inferences consistent with a variety of possible models.

Estimation and Inference
While the identification results in the previous section are general, for the purposes of estimation and inference I consider the case when outcomes and covariates are discrete or discretized. When outcomes and covariates are discrete, optimization over the space of copulas C(ρ_L) can be represented as the solution to a linear programming problem. In particular, the bounds on the average treatment effect in context a for individuals with covariates x admit a representation as the solution to a discrete optimal transportation problem (see Villani 2009; Galichon 2016) with a nonstandard cost function and an additional linear constraint on dependence. As is well known, efficient algorithms are available to solve linear programs (e.g., Boyd and Vandenberghe 2004), so the bounds can be computed quickly. The linear programming representation is important for practical estimation because the space of copulas is infinite-dimensional and the minimizing and maximizing arguments do not lie at the comonotonic/countermonotonic boundary points of the space, in contrast to the minimizing and maximizing arguments for related problems such as bounding the variance of treatment effects (Cambanis, Simons, and Stout 1976). The representation also points the way to a moment inequality representation, which can be used to perform projection inference for ATE^a following Andrews and Soares (2010). I assume the following.

Linear Programming Representation
To illustrate the linear programming representation, I begin by considering the case where there are no covariates X (the representation with covariates is given in Online Appendix C.1). For clarity, I refer to the supports of the potential outcomes as Y_0 = {y_01, . . ., y_0j, . . ., y_0J} and Y_1 = {y_11, . . ., y_1k, . . ., y_1J}, but it should be understood that for j = k, y_0j = y_1k = y_j. For ρ_L ∈ [0, 1] (positive dependence, which I have argued is often most plausible), the upper bound is obtained by solving the following linear programming problem with optimal value τ^U(ρ_L). The representation for ρ_L ∈ [−1, 1] is given in Online Appendix B.3.

Proposition 2. Suppose the assumptions for Proposition 1 hold, Assumption 6 is satisfied, and ρ_L ∈ [0, 1]. Then τ^U(ρ_L) is equivalent to the optimal value of the following linear programming problem, expressed in terms of observable quantities.
where Δ^d represents the d-dimensional unit simplex and probabilities involving y_{1,−1} should be understood as evaluating to zero. τ^L(ρ_L) can be obtained by replacing the max operator with the min operator in the statement of the problem above.
Proof. See Online Appendix B.2.
The choice variables of the linear programming representation are the elements of the matrix defining a possible joint distribution of Y_0 and Y_1 in context e, π = {P(y_0j, y_1k)}_{j,k=1,...,J}. Constraints (7) and (8) require that the minimizing/maximizing joint distribution be consistent with the marginal outcome distributions in e. Constraint (9) enforces that Normalized Spearman's ρ applied to the potential outcomes Y_0 and Y_1 in e may not be below ρ_L. Together, constraints (7), (8), and (9) make maximizing over the elements of the joint distribution of Y_0 and Y_1 equivalent to maximizing over the restricted space of copulas, C(ρ_L).
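A minimal sketch of this no-covariate linear program using scipy's `linprog` is given below. The outcome grid and marginals are hypothetical, and the mid-rank construction of the dependence constraint follows the discrete normalization in Definition 1; this is an illustration of the structure of the program, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def mid_ranks(p):
    F = np.cumsum(p)
    return (F + np.concatenate(([0.0], F[:-1]))) / 2

def ate_a_bounds(y, p_y0_e, p_y1_e, p_y0_a, rho_l):
    """Bounds on ATE^a: optimize over joint pmfs pi of (Y0, Y1) in e, subject to
    the experimental marginals and Normalized Spearman's rho >= rho_l."""
    J = len(y)
    # Objective: E^a[Y1](pi) = sum_{j,k} [P^a(y_j)/P^e(y_j)] * y_k * pi_{jk},
    # with pi flattened row-major (j indexes Y0 rows, k indexes Y1 columns).
    c = np.repeat(p_y0_a / p_y0_e, J) * np.tile(y, J)
    e_a_y0 = p_y0_a @ y

    # Marginal constraints, as in (7)-(8): row sums and column sums of pi.
    A_eq = np.vstack([np.kron(np.eye(J), np.ones(J)),
                      np.kron(np.ones(J), np.eye(J))])
    b_eq = np.concatenate([p_y0_e, p_y1_e])

    # Dependence constraint, as in (9): cov_pi(R(Y0), R(Y1)) >= rho_l * cov_max.
    # The mid-ranks depend only on the fixed marginals, so this is linear in pi.
    r0, r1 = mid_ranks(p_y0_e), mid_ranks(p_y1_e)
    H = np.minimum.outer(np.cumsum(p_y0_e), np.cumsum(p_y1_e))
    pi_max = np.diff(np.diff(np.pad(H, ((1, 0), (1, 0))), axis=0), axis=1)
    mean_term = (p_y0_e @ r0) * (p_y1_e @ r1)
    cov_max = r0 @ pi_max @ r1 - mean_term
    A_ub = -np.outer(r0, r1).reshape(1, -1)
    b_ub = np.array([-(rho_l * cov_max + mean_term)])

    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return lo.fun - e_a_y0, -hi.fun - e_a_y0

# Hypothetical four-point outcome distributions.
y = np.array([0.0, 1.0, 2.0, 3.0])
p0e = np.array([0.4, 0.3, 0.2, 0.1])      # untreated marginal in e
p1e = np.array([0.1, 0.2, 0.3, 0.4])      # treated marginal in e
p0a = np.array([0.25, 0.25, 0.25, 0.25])  # untreated distribution in a

print(ate_a_bounds(y, p0e, p1e, p0a, rho_l=0.0))  # informative interval
print(ate_a_bounds(y, p0e, p1e, p0a, rho_l=1.0))  # collapses toward a point
```

With ρ_L = 1 the only feasible joint distribution is (up to solver tolerance) the comonotonic coupling, so the two optima coincide; lowering ρ_L widens the interval, mirroring the role of C(ρ_L) in the proposition.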
Online Appendix D explores the performance of the linear programming bounds under alternative discretizations of a continuous outcome by means of a simulation study. The DGPs considered there permit an analytical characterization of the bounds on ATE^a as a function of the values of Spearman's ρ considered, so we can measure how often the true bounds lie within their counterparts constructed using the discretized outcomes as inputs. The simulation results show that finer discretizations of the outcome generate bounds which include the truth with high probability, although bound width also increases with the fineness of the discretization, so an optimal discretization trading off the two objectives may be possible.

Inference
I now show how to translate the linear programming form of the bounds on ATE^a from Proposition 2 into a moment inequality framework which allows for inference on ATE^a following Andrews and Soares (2010). Consider the parameter vector θ. The first element of the vector is ATE^a, defined by a moment equality analogous to the objective function of the linear program defined in (6), in which ϒ is set equal to the probability mass function of the outcome in the experimental control group through a further moment equality. The constraints in (7) and (8), which require that π be consistent with the experimental untreated and treated marginal distributions, translate to a corresponding set of moment equalities. An additional set of moment equalities and inequalities implements the dependence constraint in (9) of the linear programming representation. The dependence constraint requires extra care because of the normalization to the standard Spearman's ρ calculation required with discrete outcomes, as discussed in Definition 1. The right-hand side of the inequality in (9) computes the maximal value of Spearman's ρ based on the Fréchet-Hoeffding upper bound on the joint distribution of Y_0 and Y_1 in e, while the left-hand side computes Spearman's ρ for the joint distribution π. The moment inequality representation, in turn, must simultaneously identify this upper bound together with the other parameters.
The following moment equalities and inequalities identify the Fréchet-Hoeffding upper bound, parameterized as the discrete CDF elements γ ∈ [0, 1]^{J²−1}. Equations (11) and (12) specify that γ_jk not exceed either of the corresponding untreated and treated marginal CDF values, while (10) enforces that γ_jk be a convex combination of the marginal CDF values. Combined, these restrictions replicate the min operator from (9).
Finally, (13) substitutes elements of γ for the Fréchet-Hoeffding upper bound in (9). This representation allows us to form confidence regions for θ using Generalized Moment Selection (GMS), introduced in Andrews and Soares (2010), and confidence regions specifically for β (ATE^a) by projecting the confidence region for θ. As discussed in Bugni, Canay, and Shi (2017) and Kaido, Molinari, and Stoye (2019), a confidence interval formed in this way does control asymptotic size but can be quite conservative when θ is high-dimensional, which in this case means |Y| is large. Fang et al. (2022) and Bai, Santos, and Shaikh (2022) provide methods for conducting inference on the optimal values of linear programs resembling the one in Proposition 2.
However, their approach assumes parameters to be estimated can be additively separated from known linear functions of the choice variables π. This is true of constraints (7) and (8), but not of the dependence constraint (9), where marginal probabilities to be estimated multiply the elements of π, or of the objective function, where unknown marginal probabilities similarly multiply π.
My specific approach uses the maximum positive violation of the moment inequalities as the test statistic. To cut down on computation time, I use Kaido, Molinari, and Stoye's (2019) Evaluation-Approximation-Maximization (EAM) algorithm, which uses an auxiliary model of the relationship between critical values of the test statistic and the corresponding values of θ to choose, at each step, values of θ likely to provide the largest/smallest values of ATE^a consistent with the moment equalities/inequalities relaxed to account for sampling error. This reduces the number of values of θ for which critical values must be bootstrapped. My implementation extends the code documented in Kaido et al. (2017) to cover my case of moment inequalities which are not additively separable in data and parameters.
In the next section I move on to apply the theoretical results derived so far in an empirical example, contrasting my approach with one based on HIM.

Remedial Education in India
I now consider a setting where I can compare predicted average treatment effects derived using my bounds approach to experimentally estimated average treatment effects. I take advantage of Banerjee et al.'s (2007) (henceforth BCDL) evaluation of a remedial education program implemented by the same NGO (Pratham) in two Indian cities: Mumbai and Vadodara. Under the program, Pratham provides government schools with a teacher to work with 15-20 students in the third and fourth grades who have been identified as falling behind. The teacher works with these students for about half the school day. I use the results from the Vadodara experiment, combined with the control group data from Mumbai, to predict the average treatment effect in Mumbai.
BCDL carried out the experimental evaluations in Mumbai and Vadodara over the course of three years, from 2001 to 2003. The last year was used to investigate the persistence of program effects on learning, so I focus on the first two. In Mumbai, the experiment was carried out only among third graders in the first year of the evaluation, while in the second year there were compliance issues, with only two-thirds of Mumbai schools agreeing to participate. To abstract from the problems with compliance, I work with the sample of third graders surveyed during the first year of the experiment in Mumbai. In Vadodara, in contrast, both grade levels were represented in each of the first two years. I implicitly condition on grade level and thus do not consider the Vadodara fourth graders.
BCDL provide a harmonized measure of learning across cities: student grade level competency. Grade level competency measures whether the student successfully answered questions showing mastery of the subjects taught in each grade. This measure of knowledge is used in the Annual Status of Education Report (ASER), also affiliated with Pratham, to compare achievement across Indian states. Students may not achieve all competencies below their maximum competency, so, for simplicity, I consider the maximum competency as the outcome of interest and focus on math scores. For third graders the measure ranges from 0 to 3, with 0 assigned to students who do not display mastery of subjects taught in even the first grade, while 3 reflects comprehensive knowledge of subjects covered in the third grade.
Other applications, such as the additional application covered in a previous draft of this article (Gechter 2022), may benefit from the common practice of experimental studies basing parts of their questionnaires on national surveys. International comparisons may similarly benefit from efforts to harmonize national surveys across countries. At the same time, practitioners should take care that it is appropriate to consider responses to even identical questions comparable, given possible differences in enumerator quality or question framing.
With the exception of competency at baseline, relatively little data on students is available consistently across the two samples. Table 1 shows summary statistics for the maximum competency at baseline in the two populations, as well as students' class size and gender. The populations are relatively balanced on gender, while Mumbai classes are notably larger than those in Vadodara. BCDL find no evidence of treatment effect heterogeneity along either of these characteristics, so I ignore them and focus on conditioning on the maximum competency level at baseline.
I now move to using the results from Vadodara and the Mumbai control group to predict the average treatment effect in Mumbai, comparing the HIM-based procedure and the bounds developed in this article. Table 2 shows the distributions of grade level competency in math on leaving third grade in the control groups in both cities in the BCDL experiments, conditional on grade level competency in math on entering third grade. The last column of panel B shows the p-value associated with a χ² test of equality for each conditional distribution, treating each classroom-year pair as a cluster. The tests reject equality of the conditional distributions at the 5% level for all values of grade level competency on entering third grade. The common practice would be to abandon the empirical exercise and conclude that we cannot learn anything about the treatment effect in Mumbai from the Vadodara experiment: the students in the two cities are too different. Turning to the bounds developed in this article, Figure 1 plots bounds on the predicted values of the average effect of the remedial education program on maximum math grade level competencies in Mumbai as a function of the minimum rank correlation, ρ_L, between outcomes with and without remedial education for individuals with the same grade level competency on entering third grade. The linear programming form makes computing the set estimates quick: producing all the set estimates in the graph takes 24.7 sec on a 2019 MacBook Pro using R interfaces to the open-source C library lp_solve. The bounds are plotted in black, while the translucent gray region represents a pointwise 80% Andrews and Soares (2010) projection confidence interval, based on 600 block bootstrap replications clustering at the classroom-year level. Online Appendix E replicates the key results from Figure 1 in tabular form and additionally reports 90% confidence intervals.
A notable feature of the bounds in this example is that they widen quickly with only small deviations from the maximum possible rank correlation. This is because the conditional distributions of control outcomes differ substantially between Mumbai and Vadodara, as we saw in Table 2. A zero average treatment effect in Mumbai can only be rejected using the Vadodara results if ρ_L is greater than or equal to about 0.95 and, additionally, Assumption 3 holds exactly (in the next section, I outline a method for relaxing the latter assumption). This reflects a strong, but plausible, level of dependence. BCDL's result that the ATT is about one half of the test score gain a control group child realizes from completing a year of school suggests the remedial education program is unlikely to raise participating students' grade level competency by more than one grade. Furthermore, BCDL did not find evidence of significant effects on nonparticipants. Importantly, in this example I can investigate the assumption of strong positive dependence by comparing the predictions obtained from the bounds to the observed results in Mumbai. The light gray line plots the measured average effect of remedial education on maximum grade level competency in math in Mumbai. We see that the point estimate with maximum rank correlation over-predicts the average treatment effect in Mumbai. Some amount of shuffling in the potential outcome distributions induced by the remedial education treatment is needed to recover the Mumbai point estimate. This is, again, plausible given that not all students work with the remedial education teacher.
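The linear-programming computation behind these set estimates can be sketched in miniature. The snippet below uses scipy's `linprog` in place of the article's R/lp_solve implementation; the probability vectors, the four-point outcome support, and the mid-rank form of the dependence constraint are all hypothetical stand-ins for the program in Proposition 2, and conditioning on baseline competency X is suppressed.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical pmfs over competency levels 0..3 (not the BCDL estimates)
y = np.arange(4.0)
p0 = np.array([0.4, 0.3, 0.2, 0.1])    # untreated pmf in the experiment (context e)
p1 = np.array([0.2, 0.3, 0.3, 0.2])    # treated pmf in the experiment (context e)
q0 = np.array([0.5, 0.25, 0.15, 0.1])  # untreated pmf in the context of interest (a)
rho_L = 0.8                            # lower bound on the rank correlation

J, K = len(p0), len(p1)
n = J * K  # pi_{jk} = P(Y0 = y_j, Y1 = y_k), flattened row-major

# Marginals of pi must match the experimental distributions
A_eq, b_eq = [], []
for j in range(J):                     # row sums = p0
    row = np.zeros(n); row[j * K:(j + 1) * K] = 1.0
    A_eq.append(row); b_eq.append(p0[j])
for k in range(K):                     # column sums = p1
    col = np.zeros(n); col[k::K] = 1.0
    A_eq.append(col); b_eq.append(p1[k])

# A Spearman-type rank correlation is linear in pi: rho(pi) = sum_{jk} c_{jk} pi_{jk},
# with mid-ranks computed from the (known) marginals
def midranks(p):
    return np.cumsum(p) - p / 2.0
r0, r1 = midranks(p0), midranks(p1)
mu0, mu1 = p0 @ r0, p1 @ r1
s0 = np.sqrt(p0 @ (r0 - mu0) ** 2)
s1 = np.sqrt(p1 @ (r1 - mu1) ** 2)
rho_coef = np.outer(r0 - mu0, r1 - mu1).ravel() / (s0 * s1)

# Dependence constraint rho(pi) >= rho_L, written as -rho(pi) <= -rho_L
A_ub, b_ub = [-rho_coef], [-rho_L]

# Objective E^a[Y1] = sum_{jk} (q0_j / p0_j) y_k pi_{jk}: the context-a untreated
# pmf reweights the experimental conditionals P(Y1 = y_k | Y0 = y_j)
c = np.outer(q0 / p0, y).ravel()

lo = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
ate_bounds = (lo.fun - q0 @ y, -hi.fun - q0 @ y)  # subtract the known E^a[Y0]
```

Because E^a[Y0] is identified directly from the untreated outcomes in context a, only E^a[Y1] is optimized; the known ratios q0_j/p0_j in the objective are exactly where the "unknown marginal probabilities multiply π" issue discussed in Section 4.2 arises once those marginals must be estimated.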

Extension: Relaxing Assumption 3 (Consistency with Experimental Results)
Once we depart from the continuous outcome case under the assumption of perfect dependence between treated and untreated outcomes, two sources of uncertainty arise: (a) ambiguity regarding the conditional distribution of treated outcomes for a given untreated outcome value in the experiment and (b) uncertainty over whether the conditional distribution in the alternative context is the same as, or otherwise consistent with, the experimental results. I have mainly focused on (a), since the experimental data can be directly informative here. In order to relax Assumption 3, one can constrain F^a_{Y1|Y0,X}(y1|y0, x) to lie within a given statistical distance of a conditional distribution consistent with the experiment. In this extension, I restrict F^a_{Y1|Y0,X}(y1|y0, x) to be at most κ away from a conditional distribution consistent with the experimental results, in units of Hellinger distance H. I denote this set of CDFs F(y0, x), defined as

F(y0, x) = {F : H(F, F^e_{Y1|Y0,X}(·|y0, x)) ≤ κ for some F^e_{Y1|Y0,X}(·|y0, x) consistent with the experimental results}.  (14)

The bounds on ATE^a under this restriction are

[inf E^a[Y1] − E^a[Y0], sup E^a[Y1] − E^a[Y0]],

where the infimum and supremum are taken over conditional distributions satisfying F^a_{Y1|Y0,X}(·|y0, x) ∈ F(y0, x) for every (y0, x). While the bounds from Proposition 1 have their root in what can be identified about treatment effect heterogeneity due to unobservables from the experiment, the relaxation in the set of possible F^a_{Y1|Y0,X}(y1|y0, x) in (14) is in the spirit of sensitivity analyses that directly specify biases as the sensitivity parameter, such as Nie, Imbens, and Wager (2021) and Nguyen et al. (2017). As in Imbens (2003), researchers could consider specifying one element of X as a pseudo-outcome and computing the maximal Hellinger distance between the distributions of the pseudo-outcome, conditioned on the remaining elements of X and Y0, across contexts.
The linear programming formulation from Proposition 2 can be extended as follows to allow the conditional distributions in a to only be close to distributions consistent with the experimental results according to the Hellinger distance:

max / min over (π^e, π^a) of Σ_j Σ_k P^a(y_0j) y_1k π^a_{k|j} − E^a[Y_0]

subject to Σ_k √(π^a_{k|j} · π^e_{jk} / P^e(y_0j | T = 0)) ≥ 1 − κ² for each j,  (15)

also maintaining the constraints on π^e from (7), (8), and (9) exactly as they appear in the statement of Proposition 1. E^a[Y_1] in the objective function is now defined directly in terms of the elements π^a_{k|j}, the probability of each value y_1k of the treated outcome conditional on a value y_0j of the untreated outcome in context a. However, constraint (15) keeps the conditional distribution in a for each y_0j close to a distribution consistent with the experimental results. The program is no longer linear but remains convex under any f-divergence, of which Hellinger distance is one example. Hellinger distance is particularly well behaved in computational software since it avoids problems with different potential supports of π^a_{k|j} and π^e_{jk}/P^e(y_0j | T = 0), unlike, for example, relative entropy. To ease interpretation, Hellinger distance is also conveniently bounded between zero and one, unlike many f-divergences, which are bounded only from below. As a linear program augmented with a constraint subtracting geometric means of the optimization variables (making the constraint known to be convex), the program obeys the conventions of disciplined convex programming (Grant, Boyd, and Ye 2006). It can therefore be processed by CVX (Grant and Boyd 2020) before being sent to the open-source semidefinite-quadratic-linear programming solver SDPT3.
Figure 2 shows the results of applying the relaxed program to the remedial education application described in Section 5. When κ = 0, we see the same bounds as the black region in Figure 1. Increasing κ increases the width of the bounds. With a relaxation of κ less than or equal to roughly 0.05, the bounds continue to exclude zero so long as ρ_L exceeds about 0.95. Without the relaxation (κ = 0), ρ_L needs to exceed only about 0.82.

Conclusions
The methods derived in this article offer researchers a formal and tractable way of assessing the extent to which experimental results generalize to contexts outside the original study. I avoid the need to make a binary decision, for example based on HIM's test, of whether to proceed with extrapolation. The bounds developed here quantify our uncertainty about effects in the context of interest due to unobserved differences across contexts, and they exhaust the available information. In the remedial education example, the bounds showed that, under assumptions of strong positive dependence between a student's grade-level competency with and without a remedial education teacher assigned to her school and of consistency with the experimental results, we can learn quite a bit about the effect of remedial education in one city using results from the other. The experimental effects are consistent with the assumption of strong dependence.

Figure 1. Bounds on the Change in Average Grade Level Competency in Mumbai Using Experimental Results from Vadodara and Untreated Outcomes from Mumbai. NOTE: For each lower bound, ρ_L, on the dependence between a student's maximum grade level competency with and without a remedial education teacher assigned to her school, the solid black region shows the bounds on the average treatment effect in Mumbai. The translucent gray region shows pointwise projections, for ATE^a, of Andrews and Soares (2010) 80% confidence regions for θ formed as described in Section 4.2, with critical values based on 600 block bootstrap replications clustered at the classroom-year level, following BCDL. The light gray line shows the point estimate of the average treatment effect in Mumbai, using the experimental results.

Figure 2. Bounds on the Change in Average Grade Level Competency in Mumbai Using Experimental Results from Vadodara and Untreated Outcomes from Mumbai, Relaxing Assumption 3 (Consistency with Experimental Results). NOTE: The lines show the upper and lower bounds on the average treatment effect in Mumbai for a given value of the lower bound on the dependence between a student's treated and untreated maximum grade level competency, ρ_L (on the x-axis), and the maximal Hellinger distance, κ, between distributions of treated maximum grade level competency conditioned on untreated grade level competency, indicated by line type.

Table 1. Summary statistics for Mumbai and Vadodara samples. Student-level sample means and standard deviations (in parentheses) for student and classroom characteristics. Students are third graders from years 1 and 2 of the Banerjee et al. (2007) experiment in Vadodara and year 1 in Mumbai. Author's calculations based on data from Banerjee et al. (2007).

Table 2. Controls: P(competency on exiting grade 3 | competency on entering grade 3). NOTE: The final column of panel B reports p-values for χ² tests of the equality of the conditional distributions, accounting for classroom-year level clusters.