Ensemble and calibration multiply robust estimation for quantile treatment effect

ABSTRACT Quantile treatment effects can be important causal estimands in the evaluation of biomedical treatments or interventions for health outcomes such as birthweight and medical cost. However, the existing estimators require either a propensity score model or a conditional density vector model is correctly specified, which is difficult to verify in practice. In this paper, we allow multiple models for propensity score and conditional density vector, then construct a class of calibration estimators based on multiple imputation and inverse probability weighting approaches via empirical likelihood. The resulting estimators multiply robust in the sense that they are consistent if any one of these models is correctly specified. Moreover, we propose another class of ensemble estimators to reduce computational burden while ensuring multiple robustness. Simulations are performed to evaluate the finite sample performance of the proposed estimators. Two applications to the birthweight of infants born in the United States and AIDS Clinical Trials Group Protocol 175 data are also presented.


Introduction
Causal evaluation of treatment or intervention is commonly done by estimating the average treatment effect (ATE). However, in some studies about the low weight in children growth, medical cost and so on, outcomes tend to be highly skewed such that the ATE may not be a proper representative parameter. Our study is motivated by the birthweight data based on June 1997 Detailed Natality Data published by the National Center for Health Statistics. Low birthweight can cause a series of subsequent health problems, long duration of health care and other economic problems according to previous studies. We want to investigate the impact of maternal behavior on the birthweight of babies born in the United States under the conditions of various demographic characteristics. In specific, similar to Xie et al. [31], we consider the association between the birthweight and smoking status of the mother. It can be checked that the distribution of birthweights is skewed such that the ATE may not be appropriate. In this case, it is often important to learn about distributional impacts beyond the ATE, such as the effects on upper (or lower) quantiles of an outcome, while quantile treatment effect (QTE) may be more relevant and informative [2,4,10,11,[14][15][16]26]. For the birthweight data, the QTE is defined as the quantile difference of birthweight distributions between the treatment (mother smokes) and control (mother does not smoke) groups. Another example is based on the AIDS Clinical Trials Group Protocol 175 (ACTG 175 [17]) data. Our purpose is to examine whether the new regimens work or not. Since the outcome distributions are skewed, it is more reasonable to investigate the QTEs between the three new regimens and control arm, respectively.
To define the QTE, we begin with some notations. Let T be a binary treatment indicator, X be a p-dimensional pretreatment covariate, Y 1 and Y 0 be the potential outcomes under treatment T = 1 and T = 0, respectively. For each unit, either Y 1 or Y 0 is observed, but not both, i.e. we only observe Y = TY 1 + (1 − T)Y 0 . Given τ ∈ (0, 1), the τ th QTE is defined as The problem of interest is to estimate q τ based on n independent and identically distributed samples {(X i , Y i , T i ), i = 1, . . . , n} from (X, Y, T). For simplicity, we omit τ from the rest of this paper, but we should remember that they are τ -specific. In order to identify the QTE, we assume that the treatment assignment is independent of the potential outcomes when conditioning on observed covariates, i.e. T ⊥ ⊥ (Y 1 , Y 0 ) | X, which is usually called ignorability of treatment or unconfoundedness assumption. Under this assumption, the propensity score (PS) is defined as π(X) = Pr(T = 1 | X), i.e. the probability of treatment assignment conditional on covariates.
There exist several methods to estimate the QTE, such as the inverse propensity weighting (IPW), instrumental variable approach, matching, imputation, calibration and so on [1,14,18,23,24,28]. One of the most commonly used is the IPW estimatorq ipw =q 1 ipw − q 0 ipw , whereq 1 ipw andq 0 ipw are the solutions to ψ τ (u) = τ − I(u ≤ 0) is the check function, π(X, β) is a parametric model for π(X) with unknown parameter β andβ is a consistent estimator of β [8]. On the other hand, imputation [29] is also a widely adopted approach. For example, Chen and Yu [9] and many others considered the imputation method for quantile regression estimation. Write f (X) = {f 1 (X), f 0 (X)} with f t (X) = f (Y t | X), t = 0, 1. Gaussian or more complex parametric conditional density vector f (X, γ ) = {f 1 (X, γ 1 ), f 0 (X, γ 0 )} with unknown parameters γ = {γ 1 , γ 0 } can be assumed for predicting unobservable potential outcomes. We define the multiple imputation (MI) estimator asq mi =q 1 mi −q 0 mi , whereq 1 mi andq 0 mi are the solutions to respectively, whereγ t is a consistent estimator of γ t , {Ỹ l i (γ t )} L l=1 are independently imputed values given X i from the estimated conditional distribution f t (X i ,γ t ) and L is the number of multiple imputations. To improve the estimation efficiency of the IPW and MI estimators, augmented inverse probability weighting (AIPW [28]) estimator is defined asq aipw =q 1 aipw −q 0 aipw , whereq 1 aipw andq 0 aipw are the solutions to It can be verified that the IPW estimatorq ipw is consistent only if π(X) is correctly modeled and the MI estimatorq mi is consistent only if f (X) is correctly modeled. However, if the candidate model is wrongly selected, the resulting IPW or MI estimator will be biased. While the consistency of the AIPW estimator is provided as long as either π(X) or f (X) is correctly modeled, which is also called the double robustness (DR) property [3]. In conclusion, the existing methods require the specification of either propensity score π(X) or conditional density vector f (X).

Related literature
In practice, however, the data structure may be complicated, which makes it difficult to impose a correct candidate model. Beyond the DR, Han and Wang [22] proposed to maximize an empirical likelihood (EL) function subject to multiple moment calibration equations and introduced the concept of multiple robustness (MR) in missing data problem, which can be viewed as an extension of the DR (see also [5,6,19,20] and so on) Alternatively, Xie et al. [31] proposed multiply robust estimation of the QTE based on the multiple propensity score models and multiple conditional distribution functions (CDFs). However, Xie et al. [31] placed some restrictions on the CDFs and estimated the unknown parameters in the CDFs based on quantile-based equations. Alternatively, different from specifying the conditional distribution functions by Xie et al. [31], Tang and Qin [30] and Han et al. [21] found that it is more easy and convenient to specify the conditional density models, and the unknown parameters can be easily estimated through a complete-case maximum likelihood estimator (MLE). Moreover, when the number of models increases, the computational cost in the EL optimization procedure will increase, the estimation efficiency and stability will decrease. If the number of models is large, the EL may not be properly defined because of the so-called empty set problem.
To address these problems, in this paper, we propose two classes of efficient MR estimators based on the multiple imputation and inverse probability weighting approaches. In specific, (1) we first propose a class of calibration estimators based on multiple propensity score models and multiple conditional density vector models. Motivated by Han et al. [21], the weights are obtained through the empirical likelihood (EL) approach by matching population moments of model-based fitted values from the treated and untreated subsamples to the full sample. Under the ignorability of treatment assumption, we prove that the proposed estimators have multiple protection on estimation consistency when any one of the candidate models is correctly specified. Furthermore, the asymptotic normality is established under some mild conditions. (2) Recognizing the computational burden induced by a large number of models in the EL procedure, we extend the ensemble approach in [12] to the QTE estimation and propose another class of MR estimators by refitting working models using the least-square (LS) approach based on the complete subsamples in treated and untreated groups. Through such refitting, each set of fitted working models is reduced to a scalar value. (3) Simulation results show that the proposed two classes of estimators provide extra protection against model misspecification, while estimation efficiency is not compromised.
This article is organized as follows. The proposed calibration MR estimators and asymptotic theories are studied in Section 2. The ensemble MR estimators are discussed in Section 3. Sections 4 and 5 present the numerical studies and two real data applications, respectively. Some relevant discussions are given in Section 6. The Supplementary Material contains all proofs and additional simulation results.

Multiply robust QTE estimation
In general, true functional forms of propensity score π(X) and conditional densities of Y 1 , Y 0 given X are all unknown. Let be two collections of candidate models for π(X) and f (X), respectively, with unknown parameters β m and γ k = {γ 1 k , γ 0 k }. Since covariate X i is always observed, a consistent MLE of β m , denoted asβ m , can be obtained by maximizing the binomial likelihood function: Ifβ m → β m * as n → ∞, then π m (X, β m * ) = π(X) only if π m (X, β m ) is just the correct model for π(X). Write S t = {i : T i = t, i = 1, . . . , n} as the set of observations in the control (t = 0) and treatment (t = 1) groups, respectively, with sample sizes n 0 = n i=1 (1 − T i ) and n 1 = n i=1 T i . Similarly, the MLE of γ t k , denoted asγ t k , can be obtained, i.e.
by fitting the conditional density model based on subsamples is just the correct model for f t (X). Onceγ k is obtained, as in (2) -(3), we can obtain the MI estimators of q 1 and q 0 based on the kth working model f k (X, γ k ), denoted asq 1 k andq 0 k , respectively. The MI estimators are consistent only if f k (X i , γ k ) is correctly modeled.

Remark 2.1:
When the dimension of X is not low, the ordinary MLE may not work well. To solve this problem, we assume that only a few of these available covariates are actually related to the parametric propensity and conditional density models, respectively, i.e. the parameters are sparse, and then some regularization penalties, such as Lasso, SCAD and MCP [13], can be applied in the maximum likelihood estimation to efficiently select significant variables and estimate parameters simultaneously.

Remark 2.2:
If the outcome variable Y is not normally distributed, we can try to fit the data by the commonly used Gamma, Beta, Chi-square distributions or exponential family of distributions. In addition, Box-Cox transformations can be applied to the outcome such that the transformed outcome is normally distributed.

Calibration MR estimator based on empirical likelihood
The proposed calibration MR estimators for the QTE can be expressed as the following steps.
and L is the number of multiple imputations. Calculateq t k , t = 0, 1 and k = 1, 2, . . . , K, as an imputation estimator of q t by solving respectively. (S3) Obtain the weights w i 's, i ∈ S t , for each group by maximizing i∈S t w i under the following calibration constraints, (S4) The proposed estimator is defined asq mi =q 1 mi −q 0 mi , whereq t mi is the solution to In (S3), the first constraint is imposed for regularization, the second and third calibration constraints balance the weighted average of each candidate model evaluated on the observed samples with the corresponding unweighted sample average. This coincides with calibration method in [6] based on the fact that E[(T/π(X) − 1)f (x)] = 0 with f (X) being π m (X) and L l=1 ψ τ (Ỹ l i (γ t k ) −q t k )/L, respectively. Since w i is a discrete probability measure, the estimation for w i can be naturally considered as a typical form of an EL problem. In specific, let For fixed (β,γ t ,q t ), using the Lagrange multiplier method in [27], it can be seen that whereρ t = (ρ t1 ,ρ t2 , . . . ,ρ tJ ) T is a J = M + K dimensional vector satisfying the equations Because of the nonnegativity ofŵ i , we need 1 +ρ T t g i (X i ,β,q t ,γ t ) > 0 for all i. The modified Newton-Raphson algorithm discussed in [7] can be applied.

Remark 2.3:
The proposed calibration MR estimators, which include all the information of candidate models, have the same structure as the IPW estimator, with T i /π(X i ) replaced by the weightsŵ i∈S 1 and (1 − T i )/(1 −π(X i )) replaced by the weightsŵ i∈S 0 , respectively.

Multiple robustness and asymptotic normality
Under some regularity conditions [21], Theorem 2.1 shows thatq mr is multiply robust, in the sense ofq mr is a consistent estimator as long as one working model is correctly specified, either for π(X) or f (X). This is a significant improvement over the IPW, MI and AIPW estimators. Its proof is given in the Supplementary Material.
(C1) Parameter spaces Q 1 for q 1 and Q 0 for q 0 are compact, true values q 1 0 and q 0 0 are in the interior of Q 1 and Q 0 , respectively.
0 and q 0 0 are the unique τ th quantiles in treatment and control groups, respectively. (C4) E X 4 < ∞.
(C5) π m (X, β m ) has bounded derivatives in X up to the second order, is continuously differentiable in β m and inf X inf β m π m (X, β m ) > 0.
, when W contains a correctly specified model for π(X) or F contains a correctly specified vector model for f (X),q mr is a consistent estimator of q 0 as n → ∞, where q 0 is the true value.
(C6) Without loss of generality, assume that π 1 (X, β 1 ) is the correct model for π(x), β 1,0 is the true value of β 1 and where 1 is the corresponding score function.
where β m * , q t * and γ t * the probability limits ofβ m ,q t andγ t , respectively. The matrices Furthermore, the asymptotic normality ofq mr can be established in the following theorem.
√ n(q mr − q 0 ) has an asymptotic normal distribution with mean 0 and variance var (Z), where

Remark 2.4:
A closed-form expression for the asymptotic variance is difficult to derive, because it depends on not only the number of models but also the particular functional form of those models. As in [18,22,31] and many others, we propose to use the bootstrap approach to provide a reasonable standard error (SE), i.e. the square roots of bootstrap variance estimators based on 200 bootstrap replications, to avoid tedious formula-based implementation.

Compression by refitting
In theory, the number of models postulated in W and F has no effect on the multiple robustness, as long as M and K are fixed. However, when M and K are large, the computational cost of obtainingŵ i will increase. Moreover, the convex hulls of g i (X i ,β,q t ,γ t ) may not contain the vector 0 such that the set forŵ i are empty. To solve this issue, we apply the following ensemble approach [12] to reduce the computational burden. Denotê The key step for defining our ensemble estimators is to perform twice least-squares model refitting to summarize the respective working model's information. Specifically, by regressing T i onÛ πi for i = 1, . . . , n, we havê and define a weighted value asπ In order to increase the weights of important models and reduce the weights of unimportant ones, we square the weights instead of usingα π directly, which makes the candidate models more distinguishable in their final contributions. Since the final weighted valueπ i is refitted based on all working models, the consistency ofπ i guarantees if the correct model is contained in W. On the other hand, by regressing Y i onÛ l ti based on the complete subsamples, we havê Define the corresponding weighted values as The consistency ofỹ l 1i andỹ l 0i is guaranteed if the correct models for conditional densities are contained in F. Definem t i = L l=1 ψ τ (ỹ l ti −q t )/L, whereq 1 andq 0 are imputation estimators calculated by solving the following equation: and then defineθ

Ensemble MR estimator
The ensemble MR estimators are defined aŝ and the weights w es i 's are obtained by maximizing i∈S t w i under the following constraints: Using the Lagrange multipliers method, whereρ es t satisfies two-dimensional estimating equations Under the same regularity conditions in Theorem 2.1, the consistency of ensemble estimatorq es still holds.

Simulation studies
The simulation setting is the combination of [25,31] with some modifications. In specific, covariate X i is generated from a 8-dimensional normal distribution with mean zero and identity covariance matrix. Outcome models for Y t have two choices: where t 's are independently from N(0, 1) for t = 0, 1. We generate T from a Bernoulli distribution with probability π(X) and consider two choices for π(X): The true parameters are set as It can be verified that the true value of QTE is 2.5 in these cases.
We evaluate the finite-sample performance of the following estimators: (a) the naive (1); (c) the MI estimatorq mi defined by (2) and (3); (d) the AIPW estimatorq aipw defined by (4) and (5); (e) the MR estimatorsq mr defined by (6); (f) the ensemble MR estimatorsq es defined by (7). The MR estimators of [31] are not included since it is hard to specify the conditional distribution functions in these cases. To implement our proposed methods, we postulate two models for π(X) as follows: and two functions for f (Y t | X) as follows: Here, T are unknown parameter vectors. Following the notations of [22], a four-digit subscript is used to distinguish estimators constructed using different candidate models. Each digit indicates the use of π 1 (X, β 1 ), π 2 (X, β 2 ), f 1 (X, γ 1 ) and f 2 (X, γ 2 ) in order. For example,q ipw(1000) denotes the IPW estimator constructed using π 1 (X, β 1 ),q mr(1101) denotes the MR estimator constructed using π 1 (X, β 1 ), π 2 (X, β 2 ) and f 2 (X, γ 2 ). To investigate the effect of different sample sizes n and different quantile levels τ , six scenarios are considered, i.e. Turning to the calibration MR and ensemble MR estimators, the multiple robustness is well demonstrated by inspecting their biases, which are consistent under these four data generating scenarios according to our theory. Neither the AIPW estimator nor any existing doubly robust estimator can achieve such robustness. (iii) To assess the efficiency of our proposed estimators, we use the MSEs ofq aipw(opt) as the benchmark, whereq aipw(opt) denotes the corresponding AIPW estimator based on two correctly specified models; that is,q aipw(opt) =q aipw(1011) under the first scenario (both π 1 (X) and f 1 (X, γ 1 ) are true),q aipw(opt) =q aipw(1001) under the second scenario (both π 1 (X) and f 2 (X, γ 2 ) are true),q aipw(opt) =q aipw(0110) under the third scenario (both π 2 (X) and f 1 (X, γ 1 ) are true),q aipw(opt) =q aipw(0101) under the fourth scenario (both π 2 (X) and f 2 (X, γ 2 ) are true  based on n = 500 with 100 replications. The results are shown in Table S2 in the Supplementary Material. As expected, it can be seen the ensemble MR estimators reduce computational burden efficiently with less time than the ordinary MR estimators. (vi) With different sample sizes n and τ , the above conclusions can also be established.
To see the robustness of our proposed methods, the simulation results based on non-normal outcome variables are given in Table S1 in the Supplementary Material, and we

Birthweight data
We apply the proposed methods to the birthweight data discussed in Section 1. There are a total of 50,000 observations. The outcome response is birthweight of infants recorded in grams with five covariates: mother's marriage status (X 1 ), mother's race (X 2 ), gender of the infant (X 3 ), mother's age (X 4 ) and mother's education level (X 5 ). The QTE is defined as the quantile difference of birthweight distributions between the treatment (mother smokes; T = 1) and control (mother does not smoke; T = 0) groups. For the propensity score, we consider two candidate models, (1) π 1 (X, β 1 ) = 1/{1 + exp(β 10 + β 11 X 1 + · · · + β 15 X 5 )}, (2) π 2 (X, β 2 ) = 1/{1 + exp(β 20 + β 21 exp(X 1 /2) + · · · + β 25 exp(X 5 /2))}. For conditional density, we also consider two functions,    Table 8.  It can be shown that (i) the proposed MR estimates and ensemble MR estimates are close, and smaller than naive estimates in all cases. The estimatesq mi(0001) have significantly different values from other estimates and their SEs are much larger than those ofq mi(0010) , which indicates model (A) is more reasonable and the response may follow a normal distribution; (ii) we look into the estimates using model (A) as a candidate model and find that their values are all close; (iii) the negative values of the QTE estimates indicate that the 25th, 50th and 75th percentiles of the infants' birthweights under T = 1 (mother smokes) are smaller than those under T = 0 (mother does not smoke); (iv) the SEs ofq aipw(0110) are larger than these ofq aipw (1010) , which indicates that model (1) may be more suitable for the treatment assignment; (v) compared to the IPW, MI and AIPW estimates, our proposed estimates have smaller SEs, which shows that our estimates are more efficient due to the inclusion of more propensity score models and conditional density models.

ACTG 175 data
We further consider the ACTG 175 data to investigate the treatment effects of different antiretroviral regimens. In specific, 1340 HIV positive patients were divided into 4 arms, i.e. 321 subjects with zidovudine or ZDV, 333 subjects with didanosine or ddi, 336 subjects with ZDV and ddi, 350 subjects with ZDV and zalcitabine. The outcome response Y is CD4 count at 96 ± 5 weeks with six continuous covariates: age (X 1 ), weight (X 2 ), CD4 cell counts at baseline and 20 ± 5 weeks (X 3 and X 4 ), and CD8 cell counts at baseline and 20 ± 5 weeks (X 5 and X 6 ). In this study, the traditional regimen, i.e. ZDC, is the control arm, and the regimens ddi, ZDV and ddi, ZDV and zalcitabine are three new treatments. To see whether the new three regimens work or not, we compute the quantile differences of the CD4 count at 96 ± 5 weeks between the three new treatments and control arm, respectively, which are denoted as 1 , 2 and 3 . It is reasonable to assume that, given the six baseline covariates, the treatment assignment does not depend on the CD4 count at 96 ± 5 weeks. Thus, the unconfoundedness assumption holds.
It can be shown that (i) the proposed MR estimates and ensemble MR estimates are close; (ii) the estimatesq mi(0001) are different from other estimates, which indicates a normal distribution with truncation is more reasonable; (iii) the positive values of the QTE estimates indicate that the 25th, 50th and 75th percentiles of the new regimens have a positive influence than the traditional ZDV; (iv) compared with other estimates, our proposed estimates have slightly smaller SEs by including the multiple models.

Discussion
In this paper, we consider the QTE estimation and propose two classes of MR estimators. The first class of MR estimators allows more than one candidate model for both propensity score and conditional density of potential outcomes. The second class of MR estimators is based on the ensemble approach to reduce computational burden efficiently while ensuring multiple robustness. Both of these two classes of estimators are based on the calibration approaches that have the same structure as the IPW estimator and the weights are calculated by the constrained EL method. Throughout this paper, we assume that the unconfoundedness condition holds. However, in observational studies, there often exist unmeasured confounding variables, which may cause some problems for evaluating the QTE, i.e. the estimates may have large biases or population parameters are not identifiable. In addition, the working models considered are all parametric, and the robustness could be further improved by using nonparametric working models.