Efficient Estimation of Data Combination Models by the Method of Auxiliary-to-Study Tilting (AST)

We propose a locally efficient estimator for a class of semiparametric data combination problems. A leading estimand in this class is the average treatment effect on the treated (ATT). Data combination problems are related to, but distinct from, the class of missing data problems with data missing at random (of which the average treatment effect (ATE) estimand is a special case). Our estimator also possesses a double robustness property. Our procedure may be used to efficiently estimate, among other objects, the ATT, the two-sample instrumental variables model (TSIV), counterfactual distributions, poverty maps, and semiparametric difference-in-differences. In an empirical application, we use our procedure to characterize residual Black–White wage inequality after flexibly controlling for “premarket” differences in measured cognitive achievement. Supplementary materials for this article are available online.


INTRODUCTION
Let Z = (W, X, Y) denote a random vector drawn from some study population of interest with distribution function F_s. For some unique γ_0, and known function ψ(z, γ) of the same dimension, we assume that

E_s[ψ(Z, γ_0)] = 0, (1)

where E_s[·] denotes expectations taken with respect to the study population. If a random sample of Z is available, then consistent estimation of γ_0 (under regularity conditions) is straightforward (e.g., Newey and McFadden 1994). Many statistical models of interest can be represented in terms of moment restrictions like (1); see Wooldridge (2002) for a textbook exposition.
In this article, we consider estimation of γ_0 when a random sample of Z is unavailable. Instead, two separate samples are available. The first is drawn from the study population and contains N_s measurements of (Y, W). The second is drawn from an auxiliary population (with distribution function F_a; E_a[·] denotes expectations taken with respect to this distribution) and contains N_a measurements of (X, W). While the variable W is common to the two samples, X and Y are not. Hahn (1998) and Chen, Hong, and Tarozzi (2008) showed that identification of γ_0 follows if (i) the conditional distributions of X given W in the two populations coincide (although their marginal distributions for W may differ), (ii) the support of W in the auxiliary population is at least as large as that in the study population, and (iii) ψ(z, γ_0) is separable in the components depending on the "noncommon" variables Y and X:

ψ(Z, γ_0) = ψ_s(Y, W, γ_0) − ψ_a(X, W, γ_0). (2)

Examples of statistical problems to which the above setup applies include the two-sample instrumental variables (TSIV) model of Angrist and Krueger (1992) and Ridder and Moffitt (2007), the average treatment effect on the treated (ATT) estimand from the program evaluation literature (e.g., Heckman and Robb 1985; Imbens 2004), counterfactual earnings/wealth decompositions as in Dinardo, Fortin, and Lemieux (1996) and Barsky et al. (2002), poverty mapping as in Elbers, Lanjouw, and Lanjouw (2003) and Tarozzi and Deaton (2009), direct standardization methods used in demography (e.g., Kitagawa 1964), and models with mismeasured regressors and validation samples (e.g., Carroll and Wand 1991).
To help fix ideas, consider the ATT example. Here, Y denotes an individual's potential outcome under active treatment, say earnings given participation in a job training program, X denotes her outcome under control (earnings in the absence of training), and W is a vector of baseline covariates. A random sample of (Y, W) from the population assigned active treatment (i.e., "the treated") is available. A separate sample of measurements of (X, W) is drawn from a population of controls. The ATT, γ_0 = E_s[Y − X], is given by the solution to (1) with ψ_s(Y, W, γ_0) = Y and ψ_a(X, W, γ_0) = X + γ_0. Dehejia and Wahba (1999), revisiting earlier work by LaLonde (1986), combined two distinct samples to estimate the effect of the National Supported Work (NSW) demonstration, a labor training program, on the post-intervention earnings of trainees. Their study sample consists of 185 NSW participants, while their auxiliary sample includes 2490 nonparticipants drawn from the Panel Study of Income Dynamics (PSID). These two samples consist of random draws from distinct, nonoverlapping, populations. The two-sample feature of their analysis distinguishes it from one seeking to estimate a population average treatment effect (ATE). In that case, the researcher generally bases her analysis on a random sample from the population of interest, where some units happen to be treated, and others not (e.g., Rosenbaum and Rubin 1983). There the inferential problem is usefully conceptualized as one of missing data and the general theory of Robins, Rotnitzky, and Zhao (1994) directly applies.
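To make the two-sample structure concrete, the following simulation (a hypothetical sketch, not the estimator developed below) contrasts the naive difference in sample means with a reweighted contrast that uses the, here known, density ratio f_s(w)/f_a(w). The data generating process and all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s = n_a = 200_000

# Study sample records (Y, W); auxiliary sample records (X, W).
# The two marginal distributions of W deliberately differ.
w_s = rng.normal(1.0, 1.0, n_s)              # study: W ~ N(1, 1)
y = 2.0 + w_s + rng.normal(0, 1, n_s)        # E[Y | W] = 2 + W in both populations
w_a = rng.normal(0.0, 1.0, n_a)              # auxiliary: W ~ N(0, 1)
x = w_a + rng.normal(0, 1, n_a)              # E[X | W] = W in both populations

# Naive contrast of raw sample means ignores the covariate shift.
gamma_naive = y.mean() - x.mean()

# Reweighting auxiliary draws by the density ratio f_s(w)/f_a(w)
# recovers E_s[X], so the contrast solves the ATT moment condition (1)-(2).
ratio = np.exp(w_a - 0.5)                    # N(1,1) vs N(0,1) density ratio
gamma_att = y.mean() - np.average(x, weights=ratio)
```

With these parameter choices the true ATT is 2, while the naive contrast converges to 3; the gap is exactly the covariate-shift bias that the data combination machinery is designed to remove.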
Relationship Between Data Combination and Missing Data Problems. One perspective is that data combination problems are nothing more than a particular class of "missing data" problems in which the auxiliary sample is collected independently of, and from a different population than, the study sample. Our use of the term missing data is more technical, referring, in particular, to the family of problems analyzed by Robins, Rotnitzky, and Zhao (1994, sec. 8.1). In this family, both the study and auxiliary samples are random ones from the population of interest. It turns out that this difference has statistical content with, as we emphasize here (and others have before us), implications for estimator formulation and properties. In an important article, Hahn (1998) showed that while prior restrictions on the form of the propensity score do not lower the semiparametric variance bound for the ATE, they do lower the corresponding bound for the ATT. Chen, Hong, and Tarozzi (2008) generalized this result, showing that, unlike in the missing data context (their "verify-in-sample" case), knowledge of the form of the propensity score is asymptotically valuable in data combination problems (their "verify-out-of-sample" case).
Our contribution is to develop a flexible parametric estimator for general data combination problems with good efficiency and robustness properties. Similar to the augmented inverse probability weighting (AIPW) estimator for missing data problems due to Robins, Rotnitzky, and Zhao (1994), our data combination procedure is locally efficient and possesses a double robustness property. This latter property, given the nonancillarity of the propensity score in the data combination problem, is surprising.
To our knowledge, we are the first to propose a locally efficient estimator in the data combination context. Chen, Hong, and Tarozzi (2008) proposed a globally efficient estimator, but their procedure requires nonparametric modeling as opposed to the flexible parametric approach adopted here. Our methods provide a practical alternative to theirs when W is high dimensional (see Firpo and Rothe 2013). Abadie (2005) developed a parametric propensity score reweighting (PSR) estimator of the ATT. Qin and Zhang (2008) showed that Abadie's estimator can have low efficiency in some settings and proposed an alternative that uses empirical likelihood ideas. Qin and Zhang (2008) did not characterize the semiparametric efficiency or robustness properties of their ATT estimator, nor did they show how to extend it to the wider class of problems considered here. Other work also proposed a type of propensity score reweighting estimator for the ATT. Their estimator exhibits a double robustness property, but they did not consider issues of semiparametric efficiency nor general data combination problems as we do. Besides its robustness and efficiency properties, our estimator is simple to compute and is suitable for many applied problems, like the estimation of the ATT, two-sample instrumental variables, and others cited above.
In Section 2, we define the semiparametric data combination model. Modestly extending the work of Chen, Hong, and Tarozzi (2008), we calculate the semiparametric efficiency bound for our model. We relate our efficiency bound analysis to prior work on distribution function estimation based on a random sample from the population of interest and a second, biased, sample from the same population (e.g., Qin 1998;Gilbert, Lele, and Vardi 1999). This discussion motivates the form of our AST estimator, which we introduce in Section 3, where we also formally characterize its large sample properties. Our key results are Theorems 2 to 4. Section 4 provides an illustrative empirical application and reports on the results of several Monte Carlo experiments. Proofs of our main results are contained in the Appendix. The supplemental Web appendix contains additional proof details, extra examples of data combination problems, and additional Monte Carlo results. An algorithm for computing our estimator, that we have found to work well in practice, is also described in the supplemental Web appendix.

SEMIPARAMETRIC DATA COMBINATION MODEL
A formal definition of the data combination model is given by Assumption 1. Let A ⊂ R^P denote a compact subset of R^P.

Assumption 1. Semiparametric Data Combination Model
(i) (Identification). Equation (1) holds for a unique γ_0 ∈ A.
(ii) (Conditional distributional equality). F_s(y|w) = F_a(y|w) and F_s(x|w) = F_a(x|w).
(iii) (Support). Letting S_s and S_a denote the supports of W in the study and auxiliary populations, S_s ⊂ S_a.
(iv) (Multinomial sampling). With probability Q_0 ∈ (ξ, 1 − ξ) for 0 < ξ < 1, we draw a unit at random from F_s and record its realizations of Y and W; otherwise we draw a unit at random from F_a and record its realizations of X and W. Let D_i = 1 if unit i (i = 1, . . . , N) corresponds to a study population unit and D_i = 0 otherwise.
(v) (Propensity score model). There is a unique δ_0 ∈ D ⊂ R^dim(δ), known vector r(W) of linearly independent functions of W with a constant in the first row, and known function G(·) such that (a) G(·) is strictly increasing, differentiable, and maps into the unit interval with lim_{v→−∞} G(v) = 0 and lim_{v→∞} G(v) = 1, (b) p_0(w) = G(r(w)'δ_0) for all w ∈ W, and (c) 0 < G(r(w)'δ) ≤ κ < 1 for all δ ∈ D and w ∈ W.
The first part of Assumption 1 implies global identifiability of the complete data model. The second part implies that the distributions of (Y, W) and (X, W) in the two populations differ only in terms of their marginal distributions for the always measured variable, W. The third part ensures that, in large samples, for each unit in the study sample there will be matching units with similar values of W in the auxiliary sample. The fourth part of Assumption 1 allows us to treat the merged sample "as if" it were a random one from a pseudo merged population with distribution function F (let E[·] denote expectations taken with respect to this distribution). The semiparametric data combination model is typically defined by specifying properties of the merged population (e.g., Hahn 1998; Chen, Hong, and Tarozzi 2008). We prefer the formulation given above because (i) it emphasizes that the problem is fundamentally one of combining two datasets and (ii) in many applications the merged population does not correspond to a real world population. Neither (i) nor (ii) is a feature of standard missing data problems (i.e., Robins, Rotnitzky, and Zhao 1994). We also note that formulating a model by imposing restrictions on a pseudo-population is somewhat awkward (see the discussion in Abadie and Imbens 2006, p. 239).
The sampling distribution induced by the multinomial scheme, F, has a density that mixes the study and auxiliary densities with weights Q_0 and 1 − Q_0. Now consider the conditional probability, given W = w, that a unit in the merged sample corresponds to a draw from the study population. Let E[D|W = w] = p_0(w) denote this "propensity score"; by Bayes' Law we can define a relationship between the study and auxiliary densities of W in terms of p_0(w):

f_s(w)/f_a(w) = ((1 − Q_0)/Q_0) · p_0(w)/(1 − p_0(w)). (3)

Under the merged population formulation of the problem it is clear that part (i) of Assumption 1 corresponds to requiring that E[ψ(Z, γ_0)|D = 1] = 0, part (ii) to the conditional independence restrictions F(y|w, d = 1) = F(y|w, d = 0) and F(x|w, d = 1) = F(x|w, d = 0) on the merged population distribution function, and parts (iii) and (iv) to assuming that p_0(w) is bounded away from one. Part (v) implies that the density ratio f_s(w)/f_a(w) takes a parametric form or, equivalently, that the propensity score is known up to a finite dimensional parameter. Identification of γ_0 follows, using parts (ii) and (iii) of Assumption 1 and Equation (3), from the equality

E_s[ψ_s(Y, W, γ)] − E_a[(f_s(W)/f_a(W)) ψ_a(X, W, γ)] = E_s[ψ(Z, γ)],

which is, by part (i) of Assumption 1, uniquely zero at γ = γ_0. See Lemma 3.1 of Abadie (2005) for a formal proof.
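The Bayes' Law relationship in (3) can be verified directly for a toy discrete W; all numbers below are invented for illustration.

```python
import numpy as np

# Discrete W in {0, 1} with known study and auxiliary marginals (toy values).
f_s = np.array([0.3, 0.7])   # f_s(w)
f_a = np.array([0.6, 0.4])   # f_a(w)
Q0 = 0.4                     # P(unit is a study draw)

# Merged-population density of W and the propensity score via Bayes' Law.
f = Q0 * f_s + (1 - Q0) * f_a
p0 = Q0 * f_s / f            # p0(w) = P(D = 1 | W = w)

# Equation (3): f_s(w)/f_a(w) = ((1 - Q0)/Q0) * p0(w)/(1 - p0(w)).
lhs = f_s / f_a
rhs = ((1 - Q0) / Q0) * p0 / (1 - p0)
```

The identity holds for every support point of W, which is why a parametric model for p_0(w) is equivalent to a parametric model for the density ratio.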

Example: Two-Sample Instrumental Variables (TSIV)
To give some idea of the range of problems to which our methods apply, we elaborate on one common data combination problem in detail: the two-sample instrumental variable model (TSIV). This model is widely used by empirical researchers in economics (see Inoue and Solon 2010). Our observation that TSIV is a special case of the model defined by Assumption 1 is a new one, with empirically relevant implications. In particular, the auxiliary-to-study (AST) estimator we propose below is both (i) more efficient and (ii) consistent under a wider, and empirically relevant, set of assumptions, than, for example, the estimators of Angrist and Krueger (1992) and Ridder and Moffitt (2007).
Additional examples of data combination problems are outlined in the supplemental Web appendix. Chen, Hong, and Tarozzi (2008), Ridder and Moffitt (2007), and Abadie (2005) provided further examples.
Following Ridder and Moffitt (2007), consider two-sample instrumental variables (TSIV) models of the form

E[f(Y; γ_0) − g(X, W_1; γ_0)|W] = 0,

with W = (W_0', W_1')'. The first sample consists of measurements of (Y, W) and the second of (X, W). They assume that both samples are random ones from the study population (i.e., the samples are "compatible"). This corresponds to augmenting Assumption 1 with the additional requirement that F_s(w) = F_a(w). The TSIV model is of the form required by (2) with ψ_s(y, w, γ) = f(Y; γ)e(W) and ψ_a(x, w, γ) = g(X, W_1; γ)e(W), for e(W) a vector of instrument functions of W. When e(W) = W, f(Y; γ) = Y, and g(X, W_1; γ) = X'α + W_1'β with γ_0 = (α_0', β_0')', we have the linear model analyzed by Angrist and Krueger (1992). Ridder and Moffitt (2007) showed how one may estimate the mixed proportional hazard (MPH) model under this setup, while Ichimura and Martinez-Sanchis (2004) discussed binary choice models.
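For the linear Angrist and Krueger (1992) case, the two-sample moment conditions can be solved by matching cross moments estimated from each sample separately. The sketch below assumes compatible samples (F_s(w) = F_a(w)) and an invented data generating process; it is an illustration of the moment logic, not the authors' AST procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha0, beta0 = 1.5, 0.7   # true structural coefficients (invented)
n = 200_000

def draw(n):
    w0 = rng.normal(size=n)                 # instrument W0
    w1 = rng.normal(size=n)                 # exogenous covariate W1
    u = rng.normal(size=n)
    x = w0 + 0.5 * w1 + u                   # endogenous regressor
    e = 0.8 * u + rng.normal(size=n)        # structural error, correlated with u
    y = alpha0 * x + beta0 * w1 + e
    return np.column_stack([w0, w1]), x, y

# Two compatible samples from the same population: the first records (Y, W),
# the second records (X, W); X and Y are never observed together.
W_a, _, y_a = draw(n)     # sample 1: keep Y and W only
W_b, x_b, _ = draw(n)     # sample 2: keep X and W only

# Moments E[W(Y - X*alpha - W1*beta)] = 0, split across the two samples:
# E[W Y] from sample 1; E[W X] and E[W W1] from sample 2.
m_wy = W_a.T @ y_a / n
M = W_b.T @ np.column_stack([x_b, W_b[:, 1]]) / n
alpha_hat, beta_hat = np.linalg.solve(M, m_wy)
```

Because X and Y never appear in the same sample, each expectation in the moment condition must be estimable from one sample alone; separability (2) is exactly what makes this possible.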
A concrete example of a TSIV problem is provided by the work of Currie and Yelowitz (2000), who considered the model where Y is an indicator for whether a school-aged child has repeated a grade, X an indicator for residence in public housing, W_0 equals the number of male siblings in the household, and W_1 equals the overall number of siblings and also contains other household characteristics; W = (W_0, W_1). Their interest centers on the causal effect of residence in public housing on human capital acquisition. The number of male siblings changes the probability of residence in public housing since, conditional on the overall number of siblings, families with a mixture of boys and girls qualify for larger units and hence higher (implicit) housing subsidies. Currie and Yelowitz (2000) additionally argued that, conditional on the total number of one's siblings, their gender mix should not influence schooling independently of any effect mediated by exposure to public housing. Hence, W_0 may serve as an instrumental variable for X. Currie and Yelowitz (2000) observed Y and W for a random subsample of children drawn from the U.S. Census. The Census, however, does not collect information on residence in public housing, X. This information is available in the U.S. Current Population Survey (CPS), which also includes measurements of W (but not Y). They treat both the Census and CPS samples as random ones from their study population (school-aged children living in the United States) and use a variant of Angrist and Krueger's (1992) estimator. In applications of the TSIV model, like Currie and Yelowitz's (2000), it is often found that the sample moments of the common variables W differ significantly across the two datasets being combined (see also Björklund and Jäntti 1997). This suggests that full compatibility may fail in practice (i.e., F_s(w) ≠ F_a(w)).
The estimator presented below does not require full compatibility and is generally more efficient than the one proposed by Angrist and Krueger (1992) (compare Theorems 2 and 3 with Angrist and Krueger (1992, p. 331) or Ridder and Moffitt (2007, p. 5505)).

Efficiency Bound
Hahn (1998, Theorem 1) calculated the semiparametric variance bound for the special case where γ_0 is the ATT and part (v) of Assumption 1 is not part of the prior restriction. Chen, Hong, and Tarozzi (2008, Theorem 3) included part (v) in their prior, but assumed that ψ_s(Y, W, γ) = 0. The following result generalizes that of Chen, Hong, and Tarozzi (2008) to the case where the moment function is of the form given in (2). To present this result, we require some additional notation. Let E*[Y|X] denote the mean squared error minimizing linear predictor of Y given X.

Theorem 1 (Semiparametric Variance Bound). Under Assumption 1 the maximal asymptotic precision with which γ_0 may be regularly estimated is given by the inverse of the corresponding semiparametric information bound.

Proof. The proof, which involves a modest extension of the analysis of Chen, Hong, and Tarozzi (2008, Theorem 3), is in the supplemental Web appendix.
It is easy to show that the information bound for γ_0 is smaller in the model that leaves p_0(W) nonparametric (i.e., where part (v) of Assumption 1 is not part of the prior). Knowledge of the parametric form of the propensity score increases the large sample precision with which γ_0 may be estimated. In contrast, in semiparametric missing data problems, it is well known that parametric restrictions on the propensity score do not shift the efficiency bound (e.g., Robins, Rotnitzky, and Zhao 1994; Hahn 1998). The value of prior restrictions on the propensity score distinguishes the data combination problem from the missing data one.
To understand this difference, we use the well-known result that a biased sample may be combined with a random one to form a more efficient distribution function estimate as long as the biasing function is known or parametrically specified. Part (v) of Assumption 1 implies that we can view the auxiliary sample as a biased sample from the study population of interest, where the biasing function is known up to a finite-dimensional parameter (see Qin 1998; Gilbert, Lele, and Vardi 1999; Ridder and Moffitt 2007).
Here, and in what follows, we assume without loss of generality that the merged sample is arranged such that its first N_s units correspond to study population draws, and its remaining N_a units to auxiliary sample draws. Let G(r(w)'δ_ML) denote the conditional maximum likelihood estimate of the propensity score (based on the merged sample). Then the estimate

F_s^eff(w) = Σ_{i=1}^N π_i^eff · 1(W_i ≤ w), with π_i^eff = (1/N) · G(r(W_i)'δ_ML)/Q_ML, (7)

efficiently uses the information in both the study and auxiliary samples to estimate F_s(w). To understand (7), note that Bayes' Law gives f_s(w) = p_0(w)f(w)/Q_0; replacing p_0(W_i) and Q_0 with their maximum likelihood estimates, and f(W_i) with the empirical measure of the merged sample, 1/N, gives the estimate f_s(W_i) = π_i^eff, for π_i^eff defined in (7). Equation (7) uses both study and auxiliary units, linked via a parametric form for the propensity score, to efficiently estimate F_s(w).
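A minimal sketch of the weights in (7), assuming a logit G and an invented data generating process: fit the propensity score on the merged sample by Newton-Raphson and form π_i^eff. Because r(W) contains a constant, the logit score equations force Σ_i p_hat_i = Σ_i D_i, so the weights sum to exactly one.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5_000

# Merged sample: D = 1 marks study draws, D = 0 auxiliary draws (invented DGP).
d = (rng.uniform(size=N) < 0.4).astype(float)
w = rng.normal(loc=d, scale=1.0)             # study units have larger W on average
R = np.column_stack([np.ones(N), w])         # r(W), constant in the first row

# Conditional MLE of the logit propensity score via Newton-Raphson.
delta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-R @ delta))
    delta += np.linalg.solve((R * (p * (1 - p))[:, None]).T @ R, R.T @ (d - p))
p_hat = 1.0 / (1.0 + np.exp(-R @ delta))
Q_ml = d.mean()

# Weights from (7): every merged-sample unit, study or auxiliary, contributes
# to the estimate of F_s(w); the score equations make them sum to one.
pi_eff = (1.0 / N) * p_hat / Q_ml
```

Note that auxiliary units receive positive weight too: this borrowing of information across the two samples, disciplined by the parametric score, is the source of the efficiency gain relative to using the study sample alone.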
In contrast, in missing data problems the population of interest corresponds to what we have termed the merged population.
The most efficient estimate of the merged population distribution function of W is the merged sample empirical distribution function. This is true irrespective of the form of the propensity score. This provides one intuition for why prior knowledge of the form of the propensity score is not valuable in the missing data context (see Graham 2011).

AUXILIARY-TO-STUDY TILTING
In this section, we present our AST estimator and characterize its large sample properties under different sets of assumptions. Since the parameter of interest, γ 0 , involves integration over the study population distributions of (Y, W ) and (X, W ), these two distribution functions must be (implicitly) estimated to estimate γ 0 . The AST estimator uses distribution function estimates that share a finite number of moments of W in common with F eff s (w). That is, we calibrate our estimates of the study population distributions of (Y, W ) and (X, W ) to features of (7) (which is a semiparametrically efficient estimate of F s (w) when the propensity score takes a parametric form). This, as we explain below, is the source of the efficiency gains associated with our procedure.
The idea of calibrating a distribution function estimate to information garnered from auxiliary sources arises in other contexts. Little and Wu (1991) discussed contingency table calibration to known margins and provided historical references (see also Hellerstein and Imbens 1999). Bickel, Ritov, and Wellner (1991) studied estimation of linear functionals of probability measures with known marginals. Calibration to marginal information from refreshment samples has also been shown to correct for certain types of nonignorable attrition in panel data. In the context of average treatment effect estimation, Tan (2006) calibrated estimates of the two potential outcome distributions to features of the empirical distribution of always observed variables (see also Qin and Zhang 2007; Graham, de Xavier Pinto, and Egel 2012). Recently, Cheng et al. (2009) applied related ideas to an instrumental variables model.

Outline of the AST Estimator
Our estimator for γ_0, which we call the AST estimator, is a sequential method of moments estimator. In the first step, we estimate the propensity score parameter δ by conditional maximum likelihood:

δ_ML = argmax_δ Σ_{i=1}^N {D_i ln G(r(W_i)'δ) + (1 − D_i) ln(1 − G(r(W_i)'δ))}. (8)

In the second step, we compute a reweighting of both the study and auxiliary samples. Let t(W) be a vector of known linearly independent functions of W with a constant 1 in the first row, and let λ_a and λ_s be "tilting" parameters of the same dimension. We allow for r(W) and t(W) to include common elements or even coincide. Fixing δ at δ_ML and Q at Q_ML, we choose λ_a so that the implied auxiliary sample tilt, a set of weights {π_i^a} on the auxiliary units indexed by λ_a, satisfies the calibration condition

Σ_{i: D_i = 0} π_i^a t(W_i) = Σ_{i=1}^N π_i^eff t(W_i). (9)

The term to the right of the equality in (9) is an estimate of E_s[t(W_i)], the study population mean of t(W_i), based on the efficient distribution function estimate (7). The solution to (9), our estimate of λ_a, is consequently chosen to form a reweighting of the auxiliary sample that matches selected moments of this efficient estimate. To better understand this construction, recall that, as shown by Abadie (2005) and others, the propensity score reweighting-type (PSR) estimator F_s^PSR(x, w), which attaches a weight proportional to p(W_i)/(1 − p(W_i)) to each auxiliary unit, is consistent for the study population distribution function of (X, W). Our AST estimator replaces F_s^PSR(x, w) with a tilted counterpart. This tilted distribution estimate, unlike F_s^PSR(x, w), is guaranteed to integrate to one and shares a finite number of moments in common with F_s^eff(w). We also compute an analogous tilt {π_i^s} of the study sample. With the auxiliary and study sample tilts in hand, we then choose γ_AST to solve, holding λ_a and λ_s fixed at their second step values,

Σ_{i: D_i = 1} π_i^s ψ_s(Y_i, W_i, γ) − Σ_{i: D_i = 0} π_i^a ψ_a(X_i, W_i, γ) = 0. (13)

Inspection of (13) indicates that our estimate of γ_0 is based on two separate estimates of the study population distribution function.
The first, corresponding to the study tilt {π_i^s}, is an estimate of the study population distribution of (Y_i, W_i); the second, corresponding to the auxiliary tilt {π_i^a}, is an estimate of the study population distribution of (X_i, W_i). Neither of these two estimates coincides with the efficient estimate of the study population distribution of W_i alone (i.e., with (7)), but they do share important features with it. Specifically, they are constructed so that the means of t(W_i), computed using the two tilts, coincide with the corresponding mean under the efficient estimate.
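The steps above can be sketched end-to-end for the ATT example. The exponential form of the tilt used below is an illustrative stand-in (entropy-balancing style); the paper's exact tilting function is built from G and is not reproduced here. The data generating process is invented, with true ATT equal to 2.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20_000

# Merged two-sample data: study units (D = 1) report Y, auxiliary (D = 0) report X.
d = (rng.uniform(size=N) < 0.5).astype(float)
w = rng.normal(loc=d, scale=1.0)                 # f_s = N(1,1), f_a = N(0,1)
y = 2.0 + w + rng.normal(size=N)                 # used only where d == 1
x = w + rng.normal(size=N)                       # used only where d == 0
T = np.column_stack([np.ones(N), w])             # t(W) = r(W) = (1, W)'

# Step 1: logit conditional MLE of the propensity score (Newton-Raphson).
delta = np.zeros(2)
for _ in range(50):
    p = 1 / (1 + np.exp(-T @ delta))
    delta += np.linalg.solve((T * (p * (1 - p))[:, None]).T @ T, T.T @ (d - p))
p_hat = 1 / (1 + np.exp(-T @ delta))

# Calibration targets: moments of t(W) under the efficient estimate (7).
pi_eff = p_hat / p_hat.sum()
m_star = T.T @ pi_eff

# Step 2: tilt each subsample so its weighted mean of W hits the target.
# Normalized weights match the constant in t(W) automatically.
def tilt(t_sub, target, steps=100):
    lam = np.zeros(t_sub.shape[1])
    for _ in range(steps):
        wts = np.exp(t_sub @ lam)
        wts /= wts.sum()
        g = t_sub.T @ wts - target
        J = (t_sub * wts[:, None]).T @ t_sub - np.outer(t_sub.T @ wts, t_sub.T @ wts)
        lam -= np.linalg.solve(J, g)
    wts = np.exp(t_sub @ lam)
    return wts / wts.sum()

t_nc = w[:, None]                                # non-constant part of t(W)
pi_s = tilt(t_nc[d == 1], m_star[1:])            # study tilt
pi_a = tilt(t_nc[d == 0], m_star[1:])            # auxiliary tilt

# Step 3: evaluate the ATT moment (13) under the two tilted distributions.
gamma_ast = pi_s @ y[d == 1] - pi_a @ x[d == 0]
```

By construction both tilts integrate to one and reproduce the target mean of W, which is the calibration property emphasized in the text.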

Large-Sample Properties
Our next three results provide formal descriptions of the asymptotic sampling properties of γ AST under different combinations of assumptions. We begin with a characterization of the sampling properties of √ N ( γ AST − γ 0 ) under our baseline model (i.e., Assumption 1). We then outline our local semiparametric efficiency and double robustness results.
To state our first result, we require some additional notation. Let Π_s* t(W) and Π_a* t(W) be weighted projections of ψ_s(Y, W, γ_0) and ψ_a(X, W, γ_0) onto the space spanned by t(W), with projection coefficient matrices Π_s* and Π_a*. Also define the term R(D, W), built from these projections (see the Appendix).

Theorem 2 (Asymptotic Distribution). Suppose that Assumption 1 and additional regularity conditions hold. Then (i) γ_AST is consistent for γ_0, (ii) √N(γ_AST − γ_0) is asymptotically normal, and (iii) the asymptotic efficiency of √N a'(γ_AST − γ_0), for any vector of constants a, is bounded below by (17), where ||Π|| = max(||Π_s*||_∞, ||Π_a*||_∞) with ||·||_∞ denoting the maximum absolute row sum norm.
Theorem 2 indicates that under Assumption 1 our AST estimator is consistent and asymptotically normal, but inefficient relative to a semiparametrically efficient estimator (see Theorem 1). Some insight into the degree of AST's inefficiency is provided by the bound (17). First, one term in (17) indicates that the AST estimator performs better when q_s(w) = E[ψ_s(Y, W, γ_0)|W = w] and q_a(w) = E[ψ_a(X, W, γ_0)|W = w] are well approximated by a linear combination of the elements of t(w). We discuss the nature of this approximation further below. Second, the performance of the AST estimator will, in general, be sensitive to the degree of overlap. If the expected value of the propensity score weight, p_0(W)/(1 − p_0(W)), used to reweight auxiliary units is large, as may be true if κ, the upper bound on p_0(W), is close to one (see part (v) of Assumption 1), then the performance of the AST estimator may be poor (see Khan and Tamer 2010).
More generally, the form of R(D, W) indicates that the relative efficiency of γ_AST depends on the quality of the linear approximations q_s(W) ≈ Π_s* t(W) and q_a(W) ≈ Π_a* t(W). This is easiest to see in the special case where r(W) ⊂ t(W). In that case (see (A.13) in the Appendix), defining U_s* = q_s(W) − Π_s* t(W), U_a* = q_a(W) − Π_a* t(W), and U* = (1 − p_0(W))U_s* + p_0(W)U_a*, the degree of inefficiency depends on a (weighted) expectation of the squares and cross products of a linear combination of these approximation errors. This indicates that γ_AST will have high relative efficiency whenever q_s(W) and q_a(W) are well approximated by a linear combination of the elements of t(W). This will be particularly true when overlap is good, so that the weight p_0(W)/(1 − p_0(W)) does not take on extreme values.
Our next result, which characterizes when γ_AST will be efficient, is anticipated by the discussion above. Consider the assumption:

Assumption 2 (Moment CEF). For some unique pair of matrices Π_s, Π_a, and vector of linearly independent functions t(W) with a constant in the first row, we have

E[ψ_s(Y, W, γ_0)|W] = Π_s t(W) and E[ψ_a(X, W, γ_0)|W] = Π_a t(W).

Assumption 2 posits a working model for the conditional expectation functions (CEFs) of ψ_s(Y, W, γ_0) and ψ_a(X, W, γ_0) given W. The substantive content of this assumption is, of course, model and application specific. The ATT example discussed in the introduction provides a simple illustration. In that case, Assumption 2 implies that the CEFs of the potential outcomes under active and control treatment, Y and X, are linear in t(W). Thus, if the object of interest is the ATT, the analyst should pick the elements of t(W) so as to provide a good approximation to these two CEFs. For the two-sample instrumental variables (TSIV) model, it is possible to show that the correct t(W) is an implication of the structure of the first-stage relationship between the endogenous right-hand-side variable, X, and the instrument vector, W.
If both Assumptions 1 and 2 hold, the Appendix shows that γ_AST is asymptotically linear, with an influence function representation from which our next theorem directly follows.
Theorem 3 (Local Semiparametric Efficiency). Suppose that Assumption 1 and additional regularity conditions hold. Then γ_AST, the solution to (13), is locally efficient at Assumption 2.

Proof. See the Appendix.
Our efficiency bound calculation, Theorem 1, gives the information bound for γ 0 without imposing the additional auxiliary Assumption 2. This assumption imposes restrictions on the joint distribution of the data not implied by the baseline model. If these restrictions are added to the prior used to calculate the efficiency bound, then it may be possible to estimate γ 0 more precisely. Our estimator is not efficient with respect to this augmented model. Rather it attains the bound provided by Theorem 1 if Assumption 2 happens to be true in the population being sampled from, but is not part of the prior restriction used to calculate the bound. Newey (1990, p. 114), Robins, Rotnitzky, and Zhao (1994, p. 852-853), and Tsiatis (2006) discussed the concept of local efficiency in detail. In what follows we will, for brevity, say γ AST is locally efficient at Assumption 2. The form of the variance bound when semiparametric, or parametric (as in Assumption 2), restrictions on q s (w) and q a (w) are maintained as part of the prior restriction is unknown. Graham (2011) studied such restrictions in the missing data context.
Next we give our double robustness result. Here, our result is slightly less general than similar results in the missing data literature, but nevertheless may be useful in practice.
Theorem 4 (Double Robustness). Under parts (i) to (iv) of Assumption 1, γ_AST →p γ_0 with a limiting normal distribution if either (a) part (v) of Assumption 1 also holds or (b) the analyst chooses G(v) = exp(v)/(1 + exp(v)) and Assumption 2 holds.

Proof. See the Appendix.
Theorem 4 indicates that the advantage of choosing t(W) with Assumption 2 in mind is twofold. Under the baseline model defined by Assumption 1, Theorem 3 implies that γ_AST will have low sampling variation if q_s(w) = E[ψ_s(Y, W, γ_0)|W = w] and q_a(w) = E[ψ_a(X, W, γ_0)|W = w] are approximately linear in t(w) (see also part (iii) of Theorem 2). This is the case covered by part (a) of the theorem. Now consider the case where the analyst misspecifies the propensity score model, but Assumption 2 holds. Part (b) of Theorem 4 indicates that γ_AST will remain consistent for γ_0 in this case if the analyst chooses G(v) to take the logit form. We emphasize that the true propensity score model may or may not be of the logit form.
The peculiar feature of Theorem 4, relative to analogous results in the missing data literature (e.g., Tsiatis 2006), is the requirement that the assumed propensity score take the logit form. To understand this requirement, note that, in general, (7) will be an inconsistent estimate of the study population distribution of W when the propensity score is misspecified. Calibrating the study and auxiliary tilts to moments of this distribution will therefore typically produce an inconsistent estimate of γ_0. However, when condition (b) of Theorem 4 holds, we have, from the estimating equations for the propensity score parameter, the logit first-order condition

Σ_{i=1}^N (D_i − G(r(W_i)'δ_ML)) r(W_i) = 0. (18)

Now consider the mean of t(W_i) with respect to F_s^eff(w). Using (18), and the fact that t(W_i) contains a constant, this estimate of E_s[t(W)] remains consistent under the conditions of part (b) of Theorem 4, irrespective of whether the propensity score is correctly modeled. This implies that the study and auxiliary tilts will be correctly calibrated, so that, when Assumption 2 holds, γ_AST will remain consistent for γ_0. Note that this estimate of E_s[t(W)] will not be efficient when the propensity score is misspecified.
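The role of the logit first-order conditions can be checked numerically: even with a deliberately misspecified (non-logit) true propensity score, the fitted logit score equations (18) hold exactly in sample, so the fitted-score-weighted moments of r(W) coincide with the study-sample moments. The data generating process below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 4_000

# The TRUE propensity score is a step function of W, i.e., not logit in
# r(W) = (1, W)'; the analyst nevertheless fits a logit model.
w = rng.normal(size=N)
p_true = 0.2 + 0.6 * (w > 0.3)
d = (rng.uniform(size=N) < p_true).astype(float)
R = np.column_stack([np.ones(N), w])

# Logit conditional MLE via Newton-Raphson.
delta = np.zeros(2)
for _ in range(50):
    p = 1 / (1 + np.exp(-R @ delta))
    delta += np.linalg.solve((R * (p * (1 - p))[:, None]).T @ R, R.T @ (d - p))
p_hat = 1 / (1 + np.exp(-R @ delta))

# First-order condition (18): sum_i (D_i - G(r(W_i)'delta_ML)) r(W_i) = 0.
# It holds exactly in sample despite the misspecification, so the fitted
# weights reproduce the study-sample moments of r(W).
score = R.T @ (d - p_hat)
```

The score is an in-sample identity of the logit MLE, not a property of the true model, which is why the calibration survives misspecification only under the logit choice of G.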
Although the propensity score is not ancillary in the data combination problem, our estimator remains consistent in the presence of propensity score misspecification when G(v) takes the logit form. It is an open question whether there exists a locally efficient and doubly robust estimator under nonlogit parametric forms for the propensity score.
An alternative estimator, which replaces the maximum likelihood (ML) propensity score fit computed in the first step of our procedure with a method of moments (MM) one, will be doubly robust but not locally efficient (unless a logit form for G(v) is maintained as part of Assumption 1, in which case the ML and MM fits coincide). More generally, there is a tension between efficiency, which requires using the MLE of the propensity score for reweighting, and robustness to propensity score misspecification.
Implications for Practitioners. Collectively, Theorems 2 to 4 suggest several useful guidelines for empirical researchers. First, when overlap is good, or equivalently the propensity score weights p 0 (W )/(1 − p 0 (W )) do not take very large values, Theorems 2 to 4 provide a very strong theoretical case for using AST in practice. If Assumption 2 happens to be true in the sampled populations, then AST will be more efficient than the propensity score reweighting approach of Abadie (2005). This result is analogous to the enhanced efficiency of the augmented inverse probability weighting (AIPW) estimator of Robins, Rotnitzky, and Zhao (1994) relative to conventional inverse probability weighting (IPW) in the missing data context. In practice, high levels of precision will be observed whenever q s (w) and q a (w) are reasonably well approximated by a linear combination of the elements of t (w) . A further advantage of the AST procedure is that, if the propensity score is inadvertently misspecified, AST will nevertheless remain consistent for γ 0 if Assumption 2 holds (and the analyst works with a logit form for G (v)).
In settings with poor overlap, the AST estimator may be highly variable and, in extreme cases, may not even exist. To understand this last observation, consider the case where G (v) takes the logit form. In that case, the computation of the auxiliary tilt requires that the study sample mean of t (W ) lie within the convex hull of the auxiliary sample. If the study and auxiliary distributions of W are very different from one another, this convex hull condition may fail in practice even if Assumption 1 holds in the population. We do not view this as a weakness of our procedure; rather, such situations alert the researcher to the fragility of identification when overlap is poor (see Khan and Tamer 2010). When overlap is poor, a direct imputation approach may be preferable (e.g., Chen, Hong, and Tarozzi 2008; Kline 2011). However, imputation will be very sensitive to violations of Assumption 2; this limitation is illustrated by our Monte Carlo experiments below.
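As a practical diagnostic, the convex hull condition can be checked before estimation by solving a small linear feasibility program. The sketch below (sample sizes, dimensions, and target values are illustrative; the constant in t (W ) is dropped since the convex-combination constraint handles it) tests whether a candidate study-sample mean of t(W ) can be written as a convex combination of the auxiliary-sample values:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(points, target):
    """Check, via an LP feasibility problem, whether `target` can be written
    as a convex combination of the rows of `points`."""
    n = points.shape[0]
    # Constraints: sum_i q_i * t_i = target and sum_i q_i = 1, with q_i >= 0.
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(target, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0  # status 0: feasible solution found

rng = np.random.default_rng(1)
t_aux = rng.normal(size=(200, 2))                     # auxiliary-sample t(W)
print(in_convex_hull(t_aux, np.array([0.1, 0.0])))    # good overlap
print(in_convex_hull(t_aux, np.array([25.0, 0.0])))   # hull condition fails
```

When the check fails, the auxiliary tilt does not exist in the sample, which is precisely the fragile-identification scenario discussed above.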
The computational algorithm detailed in the supplemental Web appendix is designed to work well in situations where the convex hull condition is "nearly" violated, and we recommend its routine use. For covariance matrix estimation, we recommend using the textbook formulas for the GMM estimator based on the moment vector implied by (8), (9), (11), and (13) above and explicitly defined in the Appendix.

APPLICATION AND MONTE CARLO EXPERIMENTS
Empirical Application. Neal and Johnson (1996) studied the role of "premarket" (i.e., acquired prior to age 18) differences in cognitive achievement in explaining differences in earnings between young Black and White men. Using a sample of employed Black and White males drawn from the National Longitudinal Survey of Youth 1979 (NLSY79), Neal and Johnson (1996) computed the least-squares fit of the logarithm of hourly wages on a constant, a Black dummy, age, and Armed Forces Qualification Test (AFQT) percentile score measured at age 16 to 18. They found that the coefficient on the Black dummy variable drops by two-thirds to three-quarters when AFQT score is included as a covariate. On the basis of this finding, they argued that differences in the rate of cognitive skill acquisition across Blacks and Whites prior to age 18, due to differences in family background, school quality, and neighborhood characteristics, explain a substantial portion of subsequent Black-White wage inequality. We do not provide an assessment of this interpretation here; rather, our goal is to illustrate the use of AST in a familiar setting.
Let Y denote real average wages from 1990 to 1993 for a random draw from the population of Black men aged 16 to 18 in 1979 and residing in the United States. This population corresponds to our study population of interest. Let X denote real wages for a random draw from the population of White men aged 16 to 18 in 1979 and residing in the United States. This corresponds to our auxiliary population. Let W be a vector including year of birth and AFQT score (we transform the percentile scores used by Neal and Johnson (1996) onto the real line using the inverse standard normal CDF). We compare features of the observed distribution of Black wages with those of a hypothetical White population whose age and AFQT distribution coincides with that of the Blacks (i.e., with the study population's). These types of hypothetical comparisons underlie Oaxaca decompositions, as used in labor and health economics, and similar exercises undertaken in demography (e.g., Kitagawa 1964). Barsky et al. (2002) and Fortin, Lemieux, and Firpo (2011) surveyed the application of decomposition methods in economics.
Our sample closely resembles that used in Johnson and Neal (1998). It includes 1371 measurements of real wages, race, age, and AFQT scores drawn from the NLSY79. Throughout, we weight the empirical measure of our sample using the NLSY79 base-year sampling weights (although this adjustment has little effect on our results). The age distributions for Blacks and Whites in the merged sample are, as would be expected, quite similar. The distribution of AFQT scores across the two groups is quite different. The mean Black score is approximately one standard deviation lower than the mean White score. The two distributions also substantially differ in their second, third, and fourth moments (not reported).
Panel A of Table 1 reports estimates of mean log wages for Blacks (Column 1), as well as the Black-White average difference (Column 2). On average, Blacks earn almost 28% less per hour than Whites in our sample. Panel A also reports estimates of the CDF of the Black wage distribution at selected points, and the corresponding Black-White CDF differences. For example, while over 45% of Blacks earn less than $7.50 per hour in our sample, fewer than 30% of Whites do (Table 1, row 3). Inspection of the CDF differences indicates that while the distributions are most different at lower wage levels, differences exist across the entire support of wages. Panel B of Table 1 reports average wage differences between Blacks and a hypothetical population of Whites whose distribution of age and AFQT score coincides with the Black distribution. This allows for a comparison between Black and White wages that flexibly controls for differences between the two populations in age and AFQT score.
In Column 1 of Panel B, we report age- and AFQT-adjusted differences in mean wages and wage CDFs based on the conditional expectation projection (CEP) estimator of Chen, Hong, and Tarozzi (2008). Our implementation of their procedure models the conditional expectation functions (CEFs) of Y and X given W as separable functions of a constant, two year-of-birth dummies, a quadratic polynomial in transformed AFQT score, and 12 dummy variables for the transformed AFQT score lying, respectively, below −2, −1.75, . . . , 0.25, 0.5. Let t(W ) be the vector containing all these functions of W. In principle, if the dimension of the approximating model is allowed to grow with the sample size, the Chen, Hong, and Tarozzi (2008) estimator is consistent for, and efficient under, all data-generating processes satisfying parts (i) to (iv) of Assumption 1. In small samples, the performance of the estimator is heavily dependent on the quality of the two CEF approximations.
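The CEP logic can be sketched in a few lines: fit E[X|W ] on the auxiliary sample, impute it onto the study sample, and difference. The data-generating process, the simplified t(W ), and all parameter values below are illustrative stand-ins, not the specification used in Table 1:

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data standing in for the study (Black) and auxiliary (White)
# samples; the true study-population gap here is 1.0 - 1.2 = -0.2.
n_s, n_a = 400, 600
W_s = rng.normal(0.0, 1.0, n_s)             # study-sample covariate (e.g., AFQT)
W_a = rng.normal(0.5, 1.0, n_a)             # auxiliary-sample covariate
Y = 1.0 + 0.5 * W_s + rng.normal(size=n_s)  # study outcome
X = 1.2 + 0.5 * W_a + rng.normal(size=n_a)  # auxiliary outcome

def t(W):
    # Simplified stand-in for the paper's t(W): constant, linear, quadratic.
    return np.column_stack([np.ones_like(W), W, W**2])

# CEP: fit E[X|W] by least squares on the auxiliary sample, average the
# fitted values over the study-sample W's, and difference.
beta_a, *_ = np.linalg.lstsq(t(W_a), X, rcond=None)
gamma_cep = Y.mean() - (t(W_s) @ beta_a).mean()
print(round(gamma_cep, 2))
```

The estimate recovers the adjusted gap up to sampling noise; its small-sample quality hinges entirely on how well t(W ) approximates the auxiliary CEF, as noted above.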
Column 2 of Panel B implements the propensity score reweighting (PSR) estimator of Abadie (2005). We model the propensity score as a logit function with an index linear in t(W ) as defined above for the CEP estimator. The PSR estimates are very close in magnitude and precision to the CEP estimates.
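A minimal sketch of the PSR estimator, again on illustrative synthetic data (here the two covariate distributions share a common variance, so a logit with linear index is correctly specified; the true adjusted gap is −0.2 by construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a = 500, 800
W_s = rng.normal(0.0, 1.0, n_s)                    # study covariate draws
W_a = rng.normal(0.5, 1.0, n_a)                    # auxiliary covariate draws
Y = 1.0 + 0.5 * W_s + rng.normal(size=n_s)
X = 1.2 + 0.5 * W_a + rng.normal(size=n_a)

W = np.concatenate([W_s, W_a])
D = np.concatenate([np.ones(n_s), np.zeros(n_a)])  # 1 = study sample
T = np.column_stack([np.ones_like(W), W])          # logit index linear in t(W)

# ML logit fit of study membership by Newton-Raphson.
delta = np.zeros(T.shape[1])
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-T @ delta))
    delta += np.linalg.solve((T * (p * (1 - p))[:, None]).T @ T, T.T @ (D - p))

# Reweight auxiliary units by the estimated odds p/(1-p), then difference.
p_a = 1.0 / (1.0 + np.exp(-T[D == 0] @ delta))
gamma_psr = Y.mean() - np.average(X, weights=p_a / (1 - p_a))
print(round(gamma_psr, 2))
```

The odds weights shift the auxiliary W distribution toward the study one, so the weighted auxiliary mean estimates the counterfactual study-population mean of X.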
Column 3 of Panel B implements our AST procedure using the same choice of t(W ) and r(W ) = t(W ). This choice ensures that the study and auxiliary sample tilts share the following features with the efficient distribution function estimate of W: (i) the marginal year of birth distributions coincide, (ii) the means and variances of the transformed AFQT score coincide, (iii) the probability masses assigned to the intervals defined by the −2, −1.75, . . . , 0.25, 0.5 grid of AFQT score intervals coincide. Figure 1 plots undersmoothed kernel density estimates of the actual Black and White AFQT score densities; the two distributions are very different from one another. The figure also plots a density estimate based on the auxiliary sample tilt. This corresponds to the AFQT score density in the hypothetical comparison population of Whites. As is evident from the figure, our choice of t(W ) is rich enough to closely match this density with its target Black one.
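The calibration step underlying the tilts in (i)-(iii) can be illustrated with a stylized exponential tilt; this is a simplified stand-in for the AST estimating equations, with all sample sizes and distributions assumed for illustration. Because t(W ) contains a constant, the first moment equation acts as a normalization of the weights:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(4)
t_s = np.column_stack([np.ones(300), rng.normal(0.0, 1.0, 300)])  # study t(W)
t_a = np.column_stack([np.ones(500), rng.normal(0.5, 1.0, 500)])  # auxiliary t(W)
target = t_s.mean(axis=0)  # study-sample moments of t(W); first entry is 1

def moment_gap(lam):
    # Exponential tilt of the auxiliary sample. The constant in t(W) makes
    # the first equation pin down the normalization (mean weight = 1).
    w = np.exp(t_a @ lam)
    return (w[:, None] * t_a).mean(axis=0) - target

lam = root(moment_gap, np.zeros(2)).x
w = np.exp(t_a @ lam)
print(np.allclose((w[:, None] * t_a).mean(axis=0), target))
```

After tilting, the auxiliary-sample moments of t(W ) match their study-sample targets exactly, which is the sense in which the densities in Figure 1 are matched.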
After adjusting for age and AFQT differences, we find that while a Black-White residual log wage CDF gap remains in the middle of the wage distribution, it disappears at the low and high ends. The average log wage gap falls, after adjusting for age and AFQT differences, from −0.279 to −0.111.
While the AST point estimates are similar to the corresponding CEP and PSR ones, their estimated sampling precision is uniformly superior (as Theorem 3 would suggest). The close correspondence between the CEP, PSR, and AST point estimates in our application likely reflects a combination of two factors. First, while the AFQT distributions across Blacks and Whites differ dramatically, the support of the Black distribution is clearly contained within that of the White distribution. Hence part (iii) of Assumption 1 is well satisfied. Second, the approximating models underlying each of the estimators are quite flexible. In settings where overlap is weaker, and/or the approximating models more parsimonious (as would be required when the dimension of W is large), we would expect the three estimators to more often yield different point estimates depending on the true data-generating process.
Monte Carlo. We now report on a number of Monte Carlo experiments we conducted to verify the theoretical properties described in Theorems 2 to 4. In particular, we wish to assess the relevance of our theoretical robustness and efficiency results. To do this, we consider a stylized program evaluation setting in which the analyst wishes to estimate the average treatment effect on the treated (ATT). In each of our first set of experiments, we assume that W is distributed according to a truncated normal distribution, with support [−c, c], in both the study (treated) and auxiliary (control) populations. The location and scale parameters of these two distributions, respectively (μ s , σ 2 s ) and (μ a , σ 2 a ), may differ. We assume a multinomial sampling scheme: with probability Q 0 = 1/2 a draw of (Y, W ) is taken at random from the study (treated) population; otherwise a draw of (X, W ) is taken from the auxiliary (control) population. Finally, we assume that Y and X, which play the roles of the outcome under treatment and control, are generated according to Y = α 0 + α 1 (W − μ W |D=1 ) + σ Y ε s and X = α 1 (W − μ W |D=1 ) + α 2 {(W − μ W |D=1 ) 2 − σ 2 W |D=1 } + σ X ε a , where ε s and ε a are mean-zero, unit-variance errors and μ W |D=1 and σ 2 W |D=1 are the study population mean and variance of W (which differ from μ s and σ 2 s due to truncation). The target parameter is γ 0 = E s [Y − X] = α 0 . The propensity score induced by these designs is of the logit form with an index quadratic in W, p 0 (w) = G(β 0 + β 1 w + β 2 w 2 ) with G (v) = exp(v)/(1 + exp(v)), where β 0 , β 1 , and β 2 are functions of (μ s , σ 2 s ) and (μ a , σ 2 a ) (see Anderson 1982). When the study and auxiliary population distributions of W have different means, but a common variance, the logit index will be linear in W. When both the means and variances differ, then the index will generally be nontrivially quadratic in W.
Across all designs, we assume a sample size of N = 1000 and set μ s = 0, σ 2 s = 1, μ a = −1/2, α 0 = 0, α 1 = 1/2, σ 2 X = 1, and c = 3. We vary σ 2 a and α 2 across designs to, respectively, induce nonlinearity in the index of the propensity score and in E[ψ a (X, W, γ 0 )|W ] = q a (W ). We vary σ 2 Y across designs to keep the variance bound fixed. Across each of our designs, an efficient estimator (under Assumption 1) will have an asymptotic standard error of √(I(γ 0 ) −1 /1000) = 1/10. Table 2 gives the parameter configurations for each of four Monte Carlo designs. In the first design, both the propensity score p 0 (w) and q a (w) are "linear" in w (for p 0 (w), "linear" means linear in the logit index). In the second design, the propensity score is quadratic in w, while q a (w) remains linear. In design 3 the reverse is true, while in design 4 both objects are "quadratic." Across each design we implement the AST estimator with G (·) the logit function and r (W ) = t (W ) = (1, W ). For the conditional expectation projection (CEP) estimator, we proceed "as if" E[X|W ] were linear in W, while our implementation of propensity score reweighting (PSR) uses a logit propensity score with a linear index.
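The claim that these designs induce a logit propensity score with a quadratic index can be verified directly by comparing the Bayes-rule propensity score, built from the two truncated normal densities, with the closed-form logit index implied by their log density ratio. The σ 2 a value below is an illustrative choice rather than an entry from Table 2:

```python
import numpy as np
from scipy.stats import norm, truncnorm

# One quadratic-index design: sigma2_a differs from sigma2_s.
Q0, c = 0.5, 3.0
mu_s, s2_s = 0.0, 1.0
mu_a, s2_a = -0.5, 2.0  # illustrative value of sigma2_a

def trunc_pdf(w, mu, s2):
    s = np.sqrt(s2)
    return truncnorm.pdf(w, (-c - mu) / s, (c - mu) / s, loc=mu, scale=s)

w = np.linspace(-2.9, 2.9, 201)
# Propensity score from Bayes' rule under multinomial sampling.
f_s, f_a = trunc_pdf(w, mu_s, s2_s), trunc_pdf(w, mu_a, s2_a)
p_direct = Q0 * f_s / (Q0 * f_s + (1 - Q0) * f_a)

# Closed-form logit coefficients from the log density ratio of the two
# truncated normals (cf. Anderson 1982); Z_s, Z_a are truncation normalizers.
Z_s = norm.cdf((c - mu_s) / np.sqrt(s2_s)) - norm.cdf((-c - mu_s) / np.sqrt(s2_s))
Z_a = norm.cdf((c - mu_a) / np.sqrt(s2_a)) - norm.cdf((-c - mu_a) / np.sqrt(s2_a))
b2 = 1 / (2 * s2_a) - 1 / (2 * s2_s)
b1 = mu_s / s2_s - mu_a / s2_a
b0 = (np.log(Q0 / (1 - Q0)) + 0.5 * np.log(s2_a / s2_s) + np.log(Z_a / Z_s)
      + mu_a**2 / (2 * s2_a) - mu_s**2 / (2 * s2_s))
p_logit = 1 / (1 + np.exp(-(b0 + b1 * w + b2 * w**2)))
print(np.allclose(p_direct, p_logit))
```

Setting s2_a equal to s2_s makes b2 vanish, reproducing the linear-index case described in the text.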
Our AST estimator is consistent for γ 0 in designs 1 through 3. CEP is consistent in designs 1 and 2, but inconsistent in design 3. The PSR estimator is consistent in designs 1 and 3, but inconsistent in design 2. All estimators are inconsistent in design 4 due to the nonlinearity of both p 0 (w) and q a (w). Table 3 reports the results of our experiments. Column 1 lists a "pencil and paper" asymptotic bias calculation, while Column 2 gives the median bias across 5000 Monte Carlo replications (in both cases bias is scaled by the "pencil and paper" asymptotic standard error reported in Column 3). As predicted, AST is median unbiased (up to simulation error) in designs 1 through 3. In contrast, PSR is severely biased in design 2 and CEP in design 3. As expected, all estimators perform poorly in design 4. These bias properties are reflected in the coverage of standard, Wald-based, 95% confidence intervals for γ 0 (Column 6). By comparing Columns 1 and 2 and Columns 3 and 5, we see that, for the designs considered here, the finite-sample distributions of all of the estimators are very well approximated by their asymptotic counterparts.

SUMMARY
When the propensity score is parametrically specified, information in both the study and auxiliary samples may be used to form an efficient estimate of the distribution of W, the variable common to both datasets. An intuition for this insight follows from recognizing that, under part (v) of Assumption 1, the auxiliary sample is equivalent to a biased sample from the study population with the biasing function known up to a finite-dimensional parameter. Using this efficient distribution function estimate, we tilt the propensity score reweighting (study population) distribution function estimates of (Y, W ) and (X, W ) so that they share certain moments in common. By choosing these moments carefully (i.e., with reference to Assumption 2), we can produce a locally efficient estimate of γ 0 . Even if the parametric relationship between the study and auxiliary populations, as embodied in the propensity score model, is misspecified, AST remains consistent for γ 0 if Assumption 2 holds.
To our knowledge, we are the first to propose a locally efficient estimator for the class of data combination problems defined by Assumption 1. Our procedure also has a double robustness property. Our results provide a useful complement to the work of Robins, Rotnitzky, and Zhao (1994), Tan (2006), and others for missing data problems. Relative to Chen, Hong, and Tarozzi (2008), who did provide explicit results for data combination problems (their so-called "verify-out-of-sample" case), our approach may be useful when W is high dimensional such that their method, which requires nonparametric estimation of q s (w) and q a (w), is impractical.
In future work, it would be useful to study data-dependent methods for choosing t (W ). Similarly, it would be interesting to construct a locally efficient estimator with minimal variance across all estimators based on the linear approximating models q s (W ) ≈ Π s t (W ) and q a (W ) ≈ Π a t (W ). In the missing data context, such estimators are called "improved locally efficient" (e.g., Tan 2010).
Let M = E[∂m(Z, θ 0 )/∂θ ]; a standard argument (e.g., Newey and McFadden 1994) gives, under regularity conditions, the asymptotically linear representation (A.1). The influence function for γ AST corresponds to the last K elements of (A.1). By tedious, but straightforward, calculation, we can show that this subvector equals (A.2). Evaluating M 21 yields, after some manipulation, an expression involving p* 0 (W ) = G(r (W )′ δ* 0 ) = G(r (W )′ δ 0 + t (W )′ λ a0 ). These results imply that

Similar calculations give
Evaluating M 22 and M 42 yields expressions involving G 1 (r (W )′ δ* 0 ), ψ a (X, W, γ 0 ), and t (W ) (A.7). Using (A.6) and (A.7) and iterated expectations, we get the expression defined in (14) of the main text. Now consider M 33 and M 43 ; we have (A.8) and (A.9). Using (A.8) and (A.9) and iterated expectations, we get the second expression, also as defined in (14) of the main text.
Recalling the definitions q* a (W ) = Π* a t (W ) and q* s (W ) = Π* s t (W ), substituting the expressions derived immediately above into (A.2), and rearranging yield the form of the influence function stated in the theorem. Now recall the definitions of R s (D, W ) and R a (D, W ) given in (15) and (16). Let a be a vector of constants. By linearity of the LP operator, the Cauchy-Schwarz inequality, and recalling that R (D, W ) = R s (D, W ) − R a (D, W ), we obtain the bound. This bound will hold with equality if r (W ) ⊂ t (W ) since, by the definitions of Π* s and Π* a , we will have (in that case) the zero covariance results