High-dimensional inference for the average treatment effect under model misspecification using penalized bias-reduced double-robust estimation

The presence of confounding by high-dimensional variables complicates the estimation of the average effect of a point treatment. On the one hand, it necessitates the use of variable selection strategies or more general high-dimensional statistical methods. On the other hand, the use of such techniques tends to result in biased estimators with non-standard asymptotic behavior. Double-robust estimators offer a resolution because they possess a so-called small bias property. This property has been exploited to achieve valid (uniform) inference for the average causal effect when data-adaptive estimators of the propensity score and conditional outcome mean both converge to their respective truths at a sufficiently fast rate. In this article, we extend this work in order to retain valid (uniform) inference when one of these estimators does not converge to the truth, regardless of which. This is done by generalizing prior work for low-dimensional settings by [Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. J Am Stat Assoc. 2015;110(511):1024–1036.] to incorporate regularization. The proposed penalized bias-reduced double-robust estimation strategy exhibits promising performance in simulation studies and a data analysis, relative to competing proposals.


Introduction
The effects of treatments, policies, or interventions are commonly characterized in terms of contrasts between the means of counterfactual outcomes corresponding to different treatment or exposure levels. For instance, for a dichotomous treatment A (coded 0 for no treatment and 1 for treatment), the average treatment effect (ATE) is defined as E{Y(1)} − E{Y(0)}, where Y(a) denotes the counterfactual outcome of a random individual if exposed to treatment a = 0, 1. Estimation of such an effect from observational data generally requires adjustment for a set of covariates that is sufficient to adjust for confounding of the effect of treatment on the outcome. This is a difficult task when the number of covariates is large, or when one or several continuous covariates can have non-linear effects on exposure or outcome. It is therefore common to start from flexible models and adopt variable selection or more general regularization techniques to handle the high dimensionality of the models.
The use of hypothesis test-based covariate selection approaches or Lasso penalization is not without pitfalls, however. Regularization techniques tend to return biased estimators (e.g. of the dependence of treatment or outcome on covariates). Estimators of the ATE based on these may inherit this bias. Nuisance parameter estimators obtained via regularization techniques also typically have a non-normal asymptotic distribution [1,2], which may render the distribution of ATE estimators based on them rather complicated. Both these concerns make asymptotically unbiased estimators of the ATE with accompanying uniformly valid confidence intervals difficult to attain.
So-called double-robust (DR) estimators of the ATE ([3]; see [4] for a review) are not susceptible to the above problems, under certain conditions that we will specify next. DR estimators of the ATE make use of two working models: one model A for the dependence of exposure on covariates, and one model B for the dependence of outcome on covariates. They have the attractive property of being consistent for the ATE when either one of these working models is correctly specified, but not necessarily both. Moreover, when both nuisance working models A and B are correctly specified and estimated at a faster than n^{-1/4} rate (in a sense to be made precise later), then DR estimators of the ATE are approximately orthogonal (w.r.t. the covariance inner product) to the scores for the infinite-dimensional nuisance parameters that index the observed data distribution (i.e. the probability of treatment given covariates, and the outcome distribution given covariates and fixed treatment levels). This in turn implies that estimation (and in particular, regularization) of these nuisance parameters can be ignored and, hence, that the resulting DR estimator is asymptotically unbiased with a standard, easy-to-calculate confidence interval that is uniformly valid [5][6][7][8]. This surprising result applies to any (sufficiently fast converging) method for estimating nuisance parameters; in particular, it forms the cornerstone of the Targeted Maximum Likelihood method [9].
While promising, a limitation of the above results is that they assume both nuisance working models A and B to be correctly specified (or, more generally, both nuisance parameter estimators to converge to their respective truths); some exceptions are [10,11], who proposed a DR estimator for the ATE in high-dimensional settings where only model B is assumed to be correctly specified. Correct specification of both models is unlikely to be satisfied. Indeed, current practice is often based on simple parametric working models. Moreover, the data analyst is essentially always forced to constrain the model's flexibility in order to ensure that nuisance parameter estimators converge sufficiently fast. In view of this, in this article, we will generalize the above results to allow for misspecification of either (but not both) of the nuisance working models A and B. In particular, we will show that the use of special nuisance parameter estimators yields a DR estimator which is asymptotically unbiased when at least one of the working models is correctly specified, along with an accompanying Wald confidence interval that is easy to calculate and uniformly valid for the estimator's probability limit. In contrast to [5,12], who also achieve valid inference when one of the nuisance parameter estimators does not converge to the truth, our proposal delivers standard errors that are consistent even when both nuisance parameter estimators fail to converge to the truth. We will achieve this goal by extending the bias-reduced DR estimation principle of [13] to incorporate regularization in a way that is inspired by penalized estimating equations [14]. In particular, we will consider ℓ1 or Lasso penalization [15,16] in order to prevent the slow convergence, and therefore potentially severely biased estimators, which may otherwise result when the working models include many (unimportant) covariates.
The rest of the article is organized as follows. In Section 2, we describe our proposed penalized bias-reduced DR estimator and evaluate its asymptotic properties. In Section 3, we numerically evaluate the performance of the proposed estimators in comparison with other DR estimators through simulation studies. We illustrate the proposed estimators in an application on the effect of life expectancy on economic growth in Section 4 and conclude with suggestions for future work in Section 5.

Background
Consider a study design which intends to collect i.i.d. data on an outcome Y_i, a treatment A_i (coded 0 or 1) and a p-dimensional vector of covariates X_i for subjects i = 1, ..., n. Our focus will be on the estimation of the counterfactual mean μ_1 ≡ E{Y(1)} under the nonparametric model M for the observed data (Y, A, X), which is defined by the assumption that X is sufficient to control for confounding of the exposure effect, in the sense that Y(1) ⊥⊥ A | X, and the so-called consistency assumption that the conditional laws of Y and Y(1), given A = 1 and X, are identical. Throughout, we will also make the positivity assumption that P(A = 1|X) ∈ [ρ, 1 − ρ] for some ρ > 0 with probability 1. Note that E{Y(1)} is one component of the ATE; estimation of E{Y(0)} proceeds analogously upon changing the treatment coding.
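Under these three assumptions, μ_1 is identified from the observed data by the standard g-formula argument; the following brief sketch uses only the assumptions just stated:

```latex
\begin{align*}
E\{Y(1)\} &= E\big[E\{Y(1)\mid X\}\big]
  && \text{(tower property)}\\
&= E\big[E\{Y(1)\mid A=1, X\}\big]
  && \text{(by } Y(1)\perp\!\!\!\perp A\mid X\text{)}\\
&= E\big\{E(Y\mid A=1, X)\big\}
  && \text{(consistency)},
\end{align*}
```

where positivity guarantees that E(Y | A = 1, X) is well defined for (almost) all values of X.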
Unless X is limited to few (e.g. one or two) discrete covariates, some form of dimension reduction is typically needed in order to obtain a well-behaved estimator of the marginal treatment effect in small to moderate sample sizes (Robins and Ritov, 1997). For instance, in routine practice, it is common to adjust for confounding under a low-dimensional model for the dependence of the outcome on X. In particular, in this article we will proceed under the assumption that the expected outcome in the exposed obeys a parametric (working) model B, which postulates that E(Y|A = 1, X) = m(X; β*), where m(X; β) is a known function, smooth in β, and β* is unknown; e.g. m(X; β) = β_0 + β_1 X + β_2 X² with β ≡ (β_0, β_1, β_2)ᵀ. Given a consistent estimator β̂ of β*, μ_1 can then be estimated as

μ̂ = (1/n) Σ_{i=1}^n m(X_i; β̂).

In high-dimensional settings where the number of covariates p is large relative to the sample size n (i.e. p is allowed to grow with n), regularization procedures (e.g. stepwise variable selection, Lasso or more general penalization procedures, among others) cannot usually be avoided for estimating the conditional outcome mean. These procedures typically return estimators with bias that shrinks slowly with sample size. The estimator μ̂ may inherit this bias [17] and, moreover, follow a non-standard asymptotic distribution as a result, making uniformly valid confidence intervals for μ_1 difficult to attain (see Section 2.3 for detail). DR estimators of μ_1 form an exception [5,7,18]. In particular, let A be a parametric working model P(A = 1|X) = π(X; γ*) for the probability of being exposed, where π(X; γ) is a known function, smooth in γ, and γ* is unknown; e.g. π(X; γ) = 1/{1 + exp(−γ_0 − γ_1 X)} with γ ≡ (γ_0, γ_1)ᵀ. Consider now the estimator

μ̂ = (1/n) Σ_{i=1}^n U_i(m̂, π̂),  with  U(m, π) ≡ {A/π(X)}{Y − m(X)} + m(X),

where m(X) ≡ E(Y|A = 1, X) and π(X) ≡ P(A = 1|X), and m̂(X) = m(X; β̂) and π̂(X) = π(X; γ̂) are estimators of m(X) under model B and of π(X) under model A, respectively.
This estimator is double-robust in the sense that it converges to μ_1 when either m̂(X) converges to E(Y|A = 1, X) or π̂(X) converges to P(A = 1|X), but not necessarily both. It follows from [7] that μ̂ has the same asymptotic distribution (in a uniform sense) as n^{−1} Σ_{i=1}^n U_i(m, π), regardless of the choice of estimators m̂(X) and π̂(X), provided that both are consistent and converge sufficiently fast. Uniformly valid, normal confidence intervals for μ_1 are therefore straightforwardly obtained based on a variance which can be consistently estimated as n^{−1} times the sample variance of U(m, π), evaluated at m(X) = m(X; β̂) and π(X) = π(X; γ̂) [7].
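To fix ideas, the DR estimator and its standard error can be computed in a few lines. The following is a minimal numerical sketch with simulated data; here the oracle nuisance values are plugged in purely for illustration, whereas in practice m̂ and π̂ would come from fitted working models:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
# simulated nuisances: logistic propensity, linear outcome mean
pi_true = 1 / (1 + np.exp(-(0.3 * X[:, 0] - 0.2 * X[:, 1])))
A = rng.binomial(1, pi_true)
Y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # E{Y(1)} = 1 in this design

def dr_mu1(Y, A, m_hat, pi_hat):
    """DR (AIPW) estimate of E{Y(1)} with its plug-in standard error."""
    U = A * (Y - m_hat) / pi_hat + m_hat   # the terms U_i(m, pi)
    mu = U.mean()                          # point estimate
    se = U.std(ddof=1) / np.sqrt(len(U))   # sample sd of U over sqrt(n)
    return mu, se

mu, se = dr_mu1(Y, A, m_hat=1.0 + X[:, 0] + 0.5 * X[:, 1], pi_hat=pi_true)
```

A 95% Wald interval then follows as `mu ± 1.96 * se`, mirroring the variance estimator described above.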
Consistent estimation and fast convergence of both m̂(X) and π̂(X) can be difficult to achieve in high-dimensional settings (where p may even grow with n). Indeed, the sparsity of the data forces one to make simplifying assumptions, such as the parametric model restrictions A or B, in order to obtain sufficiently fast converging estimators. Such restrictions are unlikely to be entirely correct. In this paper, we therefore aim to obtain uniformly valid standard errors, even under misspecification. We will first explain the procedure and then demonstrate its asymptotic properties in the next section.

Proposal
As in [7,18], we will develop inference for μ_1 under parametric working models with high-dimensional covariates (where p may potentially exceed n). Our proposal is then to estimate μ_1 as μ̂ = (1/n) Σ_{i=1}^n U_i(η̂) for a nuisance parameter estimator η̂ = (γ̂ᵀ, β̂ᵀ)ᵀ obtained by solving the following penalized estimating equations using the Lasso penalty [16]:

0 = (1/n) Σ_{i=1}^n (∂U_i/∂β)(γ, β) + λ_γ d(γ),
0 = (1/n) Σ_{i=1}^n (∂U_i/∂γ)(γ, β) + λ_β d(β),

where λ_γ > 0 and λ_β > 0 are the associated penalty parameters. The function d(g) is the sub-gradient of the Lasso or ℓ1 norm ||g||_1 with respect to g [15], where the ℓ1 norm of a vector a ∈ R^p is defined as ||a||_1 ≡ Σ_{i=1}^p |a_i|. Following [16], we can define d(g) = lim_{δ→1+} (δ|g|^{δ−1} ∘ sign(g)). Therefore, the penalty term d(g) has jth component equal to sign(g_j) if g_j ≠ 0, and belonging to [−1, 1] otherwise. Here, for vectors a ∈ R^p and b ∈ R^p, c = a ∘ b ∈ R^p refers to the so-called elementwise (or Hadamard) product, where c = (c_1, ..., c_p)ᵀ with c_i = a_i b_i for i = 1, ..., p. Further, sign(a) for a vector a ∈ R^p is defined as a vector of elements sign(a_j), for j = 1, ..., p. Finally, we assume that the nuisance parameters β and γ have the same dimensions. Throughout, for pedagogic purposes, we will specialize our proposal to working models of the form

π(X; γ) = 1/{1 + exp(−γᵀX)}  and  m(X; β) = βᵀX.

In that case, the proposal amounts to first solving the set of penalized estimating equations

0 = (1/n) Σ_{i=1}^n {1 − A_i/π(X_i; γ)} X_i + λ_γ d(γ)   (4)

to estimate γ. We recommend solving this equation by minimizing the function [13,19]

(1/n) Σ_{i=1}^n {A_i exp(−γᵀX_i) + (1 − A_i) γᵀX_i} + λ_γ ||γ||_1.

This results in an estimator γ̂ of γ. We next solve the set of penalized estimating equations

0 = −(1/n) Σ_{i=1}^n A_i exp(−γ̂ᵀX_i)(Y_i − βᵀX_i) X_i + λ_β d(β).   (6)

This can best be done by minimizing the function

(1/2n) Σ_{i=1}^n A_i exp(−γ̂ᵀX_i)(Y_i − βᵀX_i)² + λ_β ||β||_1,

which is possible with standard software for (weighted) ℓ1 norm penalized linear regression. This results in an estimator β̂ of β. The above proposal generalizes the bias-reduced DR estimation procedure of [13] to incorporate penalization. In low-dimensional settings with λ_γ = λ_β = 0, it delivers consistent nuisance parameter estimators under correct model specification.
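For the logistic-linear specialization above, both penalized minimization problems can be solved with generic proximal-gradient (ISTA) updates. The following sketch is an illustrative implementation on simulated data; the step sizes, iteration counts, and penalty levels are ad hoc choices for the example, not the tuned values used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
gamma_true = np.array([0.5, -0.5] + [0.0] * (p - 2))
A = rng.binomial(1, 1 / (1 + np.exp(-X @ gamma_true)))
Y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

def soft(z, t):
    # soft-thresholding: proximal operator of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_gamma(X, A, lam, step=0.05, iters=5000):
    # minimize n^-1 sum{A exp(-g'X) + (1-A) g'X} + lam * ||g||_1
    g = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (-A * np.exp(-X @ g) + (1 - A)) / len(A)
        g = soft(g - step * grad, step * lam)
    return g

def fit_beta(X, Y, A, g_hat, lam, step=0.05, iters=5000):
    # weighted lasso: minimize (2n)^-1 sum w_i (Y_i - b'X_i)^2 + lam * ||b||_1
    # with weights w_i = A_i exp(-g_hat'X_i)
    w = A * np.exp(-X @ g_hat)
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (w * (Y - X @ b)) / len(Y)
        b = soft(b - step * grad, step * lam)
    return b

lam_g, lam_b = 0.01, 0.01
gamma_hat = fit_gamma(X, A, lam_g)
beta_hat = fit_beta(X, Y, A, gamma_hat, lam_b)
pi_hat = 1 / (1 + np.exp(-X @ gamma_hat))
mu_hat = np.mean(A * (Y - X @ beta_hat) / pi_hat + X @ beta_hat)
```

At convergence, the subgradient condition of equation (4) holds, i.e. the components of n^{−1} Σ_i {1 − A_i/π̂(X_i)} X_i are bounded by λ_γ in absolute value; in practice one would solve the β-step with standard weighted-lasso software instead of this hand-rolled loop.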
In the next section, we will demonstrate that the above proposal enables uniformly valid inference in high-dimensional settings where either model A or B -but not both -is misspecified.
The performance of our proposal can generally be improved by incorporating a double-selection technique, previously analyzed by several authors [see 7,8,20, among others]. This approach is discussed in detail in Section 3.1.
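The double-selection idea of [8,20] can be sketched as follows: run an ℓ1-penalized regression of the outcome on the covariates, another of the treatment on the covariates, take the union of the two selected supports, and refit without penalty on that union. The snippet below is an illustrative toy version; for simplicity it uses a plain linear lasso for both steps (including the treatment model, which would ordinarily be a penalized logistic fit):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 20
X = rng.normal(size=(n, p))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 1] + 0.8 * X[:, 2]))))
Y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)

def lasso(X, y, lam, step=0.05, iters=3000):
    # ISTA for (2n)^-1 ||y - Xb||^2 + lam * ||b||_1
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - X @ b) / len(y)
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b

# two selection runs: outcome on X, treatment on X
s_y = np.flatnonzero(lasso(X, Y - Y.mean(), lam=0.05))
s_a = np.flatnonzero(lasso(X, A - A.mean(), lam=0.02))
selected = np.union1d(s_y, s_a)                         # double-selection union
beta_refit, *_ = np.linalg.lstsq(X[:, selected], Y, rcond=None)  # unpenalized refit
```

The union guards against omitting a confounder that is weakly predictive of the outcome but strongly predictive of treatment (or vice versa), which is the motivation for double- rather than single-selection.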

Asymptotic properties
As in [7,18], we will study convergence under an arbitrary sequence {P n } of observed data laws that obey, at each n, the positivity assumption. This implies that the parameters η and μ 1 , as well as the models M, A and B should ideally be indexed by n, although we will suppress this notation where it does not raise confusion.
We will furthermore consider settings where the working models A and B may be misspecified. The population value of the nuisance parameter η may thus be ill-defined, and we will therefore study (the rate of) convergence of η̂ to the solution η*_n ≡ (γ*ᵀ_n, β*ᵀ_n)ᵀ to the population equations

0 = E_{P_n}{(∂U/∂β)(η)},  0 = E_{P_n}{(∂U/∂γ)(η)},

where we make explicit that the expectation is taken w.r.t. the law P_n. It follows from [13] that the component γ*_n equals the population value of γ indexing model A (under the law P_n) when that model is correctly specified, and likewise that the component β*_n equals the population value of β indexing model B (under the law P_n) when that model is correctly specified. Our main result in Proposition 2.1 below now states that n^{−1/2} Σ_{i=1}^n U_i(η̂) and n^{−1/2} Σ_{i=1}^n U_i(η*_n) are asymptotically equivalent under model M, even under the 'worst' sequence of laws P_n and even when the working models A and B are misspecified, provided that certain sparsity assumptions hold. Under these assumptions, we thus have that

n^{−1/2} Σ_{i=1}^n U_i(η̂) = n^{−1/2} Σ_{i=1}^n U_i(η*_n) + o_{P_n}(1),

where the term o_{P_n}(1) converges to zero in probability under the measure P_n. It follows from this that the uncertainty in the estimator η̂ can be ignored when doing inference about μ_1, and in particular that a uniformly consistent estimator of the standard error of μ̂ can be obtained as σ̂/√n, where σ̂ is the sample standard deviation of U(η̂), defined via σ̂² = n^{−1} Σ_{i=1}^n {U_i(η̂) − μ̂}². It further follows that, when either model A or model B is correctly specified so that μ̂ converges to μ_1, a uniformly valid (1 − α) confidence interval for μ_1 can be obtained as μ̂ ± z_{1−α/2} σ̂/√n. We now provide our main theoretical results. Define the active sets of the variables as S_γ ≡ supp(γ*_n) and S_β ≡ supp(β*_n), where, for any vector a ∈ R^p, we denote its support as supp(a) = {i ∈ {1, ..., p} | a_i ≠ 0}. Let the sparsity index s_γ equal the cardinality |S_γ|, and likewise s_β = |S_β|; note that s_γ and s_β may depend on n. For positive sequences a_n and b_n, we use the notation a_n ≲ b_n to denote a_n ≤ C b_n for some constant C > 0.
For any matrix A ∈ R^{p×p} and index sets S_1 and S_2, the (S_1, S_2) block-matrix is denoted as A_{S_1 S_2} ∈ R^{|S_1|×|S_2|} and consists of the entries A_{ij} such that i ∈ S_1 and j ∈ S_2. For any set S, we denote by S̄ and |S| the complement and cardinality of the set S, respectively. For any symmetric matrix A, we denote its maximum eigenvalue as Λ_max(A). We define the ℓ∞ norm of a matrix A as its maximum absolute row sum, ||A||_∞ ≡ max_i Σ_j |A_{ij}|. We consider the following assumptions on the data-generating process.
(a) The entries of the data are bounded from above, i.e. there exists a constant K < ∞ such that the covariates satisfy max_j |X_j| ≤ K with probability 1.
Proposition 2.1: Let η̂ be the estimator of η = (γᵀ, βᵀ)ᵀ as obtained via the proposed ℓ1 norm penalized bias-reduced DR method. Assume that, in addition to the restrictions on the data-generating process and the regularity assumptions provided in the supplementary materials, the following assumptions hold for the solutions of (4) and (6): ||γ̂ − γ*_n||_1 = O_{P_n}{c_{1γ}(n)}, ||β̂ − β*_n||_1 = O_{P_n}{c_{1β}(n)} and ||η̂ − η*_n||_2 = O_{P_n}{c_2(n)}. Provided sufficient sparsity, in the sense that (s_γ + s_β) log p/√n converges to zero with increasing sample size, it follows that, for every ε > 0,

lim_{n→∞} sup_{P_n ∈ M} P_n{ |n^{−1/2} Σ_{i=1}^n U_i(η̂) − n^{−1/2} Σ_{i=1}^n U_i(η*_n)| > ε } = 0.

Proof: The proof of Proposition 2.1 follows similar lines as in [21]. A Taylor expansion shows that

n^{−1/2} Σ_{i=1}^n U_i(η*_n) = n^{−1/2} Σ_{i=1}^n U_i(η̂) + √n {n^{−1} Σ_{i=1}^n (∂U_i/∂γ)(η̂)}ᵀ(γ*_n − γ̂) + √n {n^{−1} Σ_{i=1}^n (∂U_i/∂β)(η̂)}ᵀ(β*_n − β̂) + R_n,

where the remainder R_n collects the second-order terms. Let, for any vector a = (a_1, ..., a_p)ᵀ ∈ R^p, ||a||_∞ = max_i |a_i| denote the ℓ∞ or sup norm. Then, from Hölder's inequality and the penalized estimating equations, we have

|√n {n^{−1} Σ_{i=1}^n (∂U_i/∂γ)(η̂)}ᵀ(γ*_n − γ̂)| ≤ √n λ_β ||γ̂ − γ*_n||_1,

since ||d(β̂)||_∞ ≤ 1 by definition, and likewise that

|√n {n^{−1} Σ_{i=1}^n (∂U_i/∂β)(η̂)}ᵀ(β*_n − β̂)| ≤ √n λ_γ ||β̂ − β*_n||_1.

From the assumptions of the proposition it follows that |R_n| ≲ √n ||η̂ − η*_n||_2² = O_{P_n}{√n c_2(n)²}. Therefore,

|n^{−1/2} Σ_{i=1}^n U_i(η̂) − n^{−1/2} Σ_{i=1}^n U_i(η*_n)| = O_{P_n}{λ_β √n c_{1γ}(n) + λ_γ √n c_{1β}(n) + √n c_2(n)²}.

For default penalties satisfying λ_γ ≍ λ_β ≍ √(log p/n), and rates c_{1γ}(n) ≍ s_γ √(log p/n), c_{1β}(n) ≍ s_β √(log p/n) and c_2(n)² ≍ (s_γ + s_β) log p/n, this bound is of order (s_γ + s_β) log p/√n, which converges to zero when n → ∞, provided sufficient sparsity to ensure that (s_γ + s_β) log p/√n → 0.
The proof of the above proposition is instructive about the logic behind the above proposal. Repeating the same reasoning for the non-DR estimator μ̂ with U_i(η) = m(X_i; β) (and η redefined as β), one finds that the first-order term is bounded by √n ||n^{−1} Σ_{i=1}^n (∂U_i/∂β)(η̂)||_∞ ||β̂ − β*_n||_1 with probability tending to 1 under the sequence P_n, in which the first factor generally diverges to infinity. Likewise, repeating the above reasoning for the DR estimator μ̂ with nuisance parameter estimators obtained via standard Lasso, the terms ||n^{−1} Σ_{i=1}^n (∂U_i/∂β)(η̂)||_∞ and ||n^{−1} Σ_{i=1}^n (∂U_i/∂γ)(η̂)||_∞ may not be o_{P_n}(1). In general, these terms are O_{P_n}(1), unless both working models A and B are correctly specified (i.e. the estimators of the nuisance parameters converge to their respective truths), in which case both gradients have expectation zero under the law P_n. Except under correct specification of both working models, the distribution of √n(μ̂ − μ_1) is then generally complex and may not be well approximated by a normal distribution. [8] show that the use of sample splitting may lead to less stringent sparsity conditions. In particular, they find that √(s_γ s_β) log(p)/√n converging to zero is sufficient to guarantee uniformly valid confidence intervals when both models are correctly specified. This is attractive as it enables one model to be dense, so long as the other is known to be sparse, as is typically the case in the context of randomized experiments. In contrast, Proposition 2.1 requires that λ_β √n c_{1γ}(n) + λ_γ √n c_{1β}(n) + √n c_2(n)² converges to zero. In simple randomized experiments, s_γ = 0, so that fast convergence rates of γ̂ (i.e. c_{1γ}(n) converging to zero at a fast rate) are attainable even when λ_γ is very small. This creates potential for making λ_β √n c_{1γ}(n) + λ_γ √n c_{1β}(n) converge to zero in the context of randomized experiments, even when dense outcome models are used. To what extent and under what conditions this is achievable will be investigated in future work.
We furthermore plan to evaluate whether stronger results are achievable with sample splitting.

Further properties
Our proposed estimation procedure was designed to make the empirical expectations n^{−1} Σ_{i=1}^n (∂U_i/∂β)(η̂) and n^{−1} Σ_{i=1}^n (∂U_i/∂γ)(η̂) converge to zero. This has as a by-product that it makes the resulting estimator μ̂ insensitive to local changes in both nuisance parameters, provided that the sample size is sufficiently large. It is hence not entirely surprising that asymptotic inference based on μ̂ can ignore estimation of the nuisance parameters β* and γ*, and that regularization bias affecting these nuisance parameter estimators does not propagate into the estimator μ̂. Farrell [7] also relies on this small bias property [22] and finds it to hold regardless of the choice of nuisance parameter estimators, provided they both converge to their respective truths. This is because he implicitly relies on both models A and B being correctly specified, in which case these empirical expectations converge to zero regardless of the choice of (consistent) estimator of the nuisance parameters. We have shown that this small bias property does not generally extend to contexts with model misspecification, unless, for instance, the nuisance parameters are estimated in accordance with our proposed procedure.
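The small bias property can be made explicit with a one-line expansion around the probability limit η* of the nuisance parameter estimator; the following is a heuristic sketch, suppressing regularity conditions:

```latex
E\{U(\hat\eta)\} - E\{U(\eta^*)\}
  \approx \left.\frac{\partial}{\partial\eta}\, E\{U(\eta)\}\right|_{\eta=\eta^*}^{\top}
          (\hat\eta - \eta^*)
        \;+\; O\!\left(\lVert \hat\eta - \eta^* \rVert^2\right).
```

When the gradient on the right-hand side vanishes (as our procedure is designed to ensure, and as happens automatically under correct specification of both working models), the bias induced by estimating η is of second order only.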

Simulation study
In this section, we perform a simulation analysis to compare the performance of the proposed penalized bias-reduced estimator μ̂_P-BR with that of different estimators of the mean counterfactual outcome μ_1. In particular, in Section 3.1, we detail the considered estimators of μ_1. In Section 3.2, we describe the simulation scenarios for the models. Finally, in Section 3.3, we discuss the results.

Considered estimators and settings
We denote the nuisance parameters estimated through Lasso penalized Maximum Likelihood Estimation and Lasso penalized Least Squares as η̂_LASSO = (γ̂ᵀ_LASSO, β̂ᵀ_LASSO)ᵀ. Further, we denote the nuisance parameters estimated through our proposed approach as η̂_P-BR = (γ̂ᵀ_P-BR, β̂ᵀ_P-BR)ᵀ. We additionally study the performance of the nuisance parameter estimators obtained through post-selection [7] and double-selection techniques [8,20], which we denote as η̂_Post-LASSO and η̂_DS-LASSO, respectively. We next consider the following estimators based on the estimated nuisance parameters: (1) the Regression Estimator μ̂_OR(β̂) = n^{−1} Σ_{i=1}^n m(X_i; β̂); (2) the Sample-Bounded Inverse Probability Weighting Estimator μ̂_SB-IPW(γ̂) = {Σ_{i=1}^n A_i Y_i/π(X_i; γ̂)}/{Σ_{i=1}^n A_i/π(X_i; γ̂)}, which is 'sample-bounded' in the sense that μ̂_SB-IPW(γ̂) lies in the range of the observed outcomes Y_i; and (3) the DR estimator μ̂(η̂) = n^{−1} Σ_{i=1}^n U_i(η̂), evaluated at the various nuisance parameter estimators listed above. Note that the considered methods, including the proposed method, require the selection of the penalty parameter. In our simulation study, we employ a fivefold Cross-Validation (CV) approach to select these parameters. In particular, we estimate λ_β using the R package glmnet through the argument lambda.min. In order to select λ_γ, we divide the dataset into T = 5 subsets and we define J_t as the set of observation indices in each subset t = 1, ..., T, i.e. ∪_t J_t = {1, ..., n}. We denote by J_−t the set which excludes the indices from the set J_t. Based on the indices J_t and J_−t, we divide the dataset X into two disjoint subsamples X_{J_t} and X_{J_−t}. Next, we define the following score function for a given λ_γ:

CV_T(λ_γ) = Σ_{t=1}^T Σ_{i∈J_t} [A_i exp{−γ̂_t(λ_γ)ᵀX_i} + (1 − A_i) γ̂_t(λ_γ)ᵀX_i],

where γ̂_t(λ_γ) is the solution of the optimization problem (4) based on the subsample X_{J_−t}. Finally, we estimate λ_γ by minimizing CV_T through a grid search. As a warm starting value for this search, we choose the deterministic default value for λ_γ previously recommended by [8] [see also 23], which is of order √(log p/n). This choice significantly reduces the time required for selecting the optimal λ_γ and is especially favorable for problem (4), which is computationally demanding.
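The cross-validation scheme for λ_γ described above can be sketched as follows. This is an illustrative toy implementation: the inner solver, grid, and simulated data are stand-ins for the actual ones used in the paper, and the held-out score is the unpenalized γ-loss evaluated on each test fold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 8
X = rng.normal(size=(n, p))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.6 * X[:, 0] - 0.4 * X[:, 1]))))

def loss(g, X, A):
    # unpenalized gamma-loss: mean of A*exp(-X g) + (1-A)*(X g)
    xb = X @ g
    return np.mean(A * np.exp(-xb) + (1 - A) * xb)

def fit_gamma(X, A, lam, step=0.02, iters=4000):
    # ISTA on the penalized gamma-loss (illustrative solver)
    g = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (-A * np.exp(-X @ g) + (1 - A)) / len(A)
        z = g - step * grad
        g = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return g

def cv_lambda(X, A, grid, T=5):
    folds = np.array_split(rng.permutation(len(A)), T)
    scores = []
    for lam in grid:
        s = 0.0
        for t in range(T):
            test = folds[t]
            train = np.setdiff1d(np.arange(len(A)), test)
            g_t = fit_gamma(X[train], A[train], lam)   # fit on J_{-t}
            s += loss(g_t, X[test], A[test])           # score on held-out J_t
        scores.append(s)
    return grid[int(np.argmin(scores))], scores

lam_grid = [0.005, 0.02, 0.08]
best_lam, scores = cv_lambda(X, A, lam_grid)
```

In practice, the grid would be centered on the deterministic warm start described above so that only a few candidate values need to be evaluated.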
The R code used for our simulation analysis is available at https://github.com/avagvahewur/Penalized-BR-DR-codes.
In our additional simulation results (see the supplementary materials), we have further explored the performance of the considered methods with increasing sample size, correlated covariates and a larger variance of the error term.

Discussion of results
Tables 1 and 2 summarize the simulation results. We first consider the case where both models are correctly specified. As predicted by the theory (see the end of Section 2.3), the results for the estimators μ̂_OR(β̂_LASSO) and μ̂_SB-IPW(γ̂_LASSO), which are not double-robust, show estimated standard errors that do not agree well with the empirical standard deviation. When both models are correctly specified, using ℓ1-penalization in combination with a DR estimator, as in μ̂_LASSO, yields better performance because the first-order terms in the Taylor expansion of Proposition 2.1 then have population mean converging to zero. The proposed estimator μ̂_P-BR sets these first-order terms close to zero regardless of correct model specification, and this is observed to further reduce bias and improve mean squared error. Further, we observe that when the propensity model is incorrect, all double-robust approaches provide roughly similar results.
In settings where the outcome model is incorrect, our proposed estimator μ̂_P-BR provides slightly lower bias compared to μ̂_LASSO. In small sample sizes, the proposed estimators (just like other estimators based on penalization) are subject to some residual bias. Farrell [7] and Belloni et al. [8] have proposed to eliminate some of this bias via the use of post- or double-selection, which is indeed seen to improve performance. The results show that the bias of μ̂_LASSO is reduced through these techniques. However, we observe that this leads to significantly higher RMSE. Moreover, we can see that the RMSE inflates more for larger p. In general, the proposed μ̂_P-BR outperforms μ̂_LASSO in terms of bias, and μ̂_Post-LASSO and μ̂_DS-LASSO in terms of RMSE, when the outcome model is incorrect. Finally, we can see that the proposed procedure μ̂_P-BR ensures reasonable agreement between the estimated standard errors and the empirical standard deviation, even in settings with model misspecification.

Illustration
In this section, we provide an empirical illustration of the proposed methodology. We study the effect of life expectancy (pseudo-exposure variable A) on GDP growth (outcome variable Y). As in [24], we make use of World Bank data (http://data.worldbank.org/) on nine basic covariates [25,26]. In our analysis, in addition to the nine basic covariates, we also consider the pairwise interactions between these covariates. Therefore, we consider 45 covariates. We compare the methods considered in Section 3.1. We do not consider nuisance parameter estimation using MLE and OLS because no convergence was attained for these methods. All penalty parameters are selected using a fivefold CV approach. Table 3 summarizes the estimated average treatment effects, sandwich estimators of the standard errors and 95% confidence intervals. It shows that most of the estimators suggest a positive treatment effect of low life expectancy on GDP growth. Previously, the results of [25] indicated that countries with greater declines in mortality (and therefore increases in life expectancy) may have a slight decrease in GDP per capita. Additionally, Table 3 shows that our proposed estimator μ̂_P-BR provides lower standard errors than the other double-robust estimators. Moreover, it can be seen from Figure 1 that our proposed approach provides more stable Inverse Probability Weights than Lasso penalized MLE for both the treated sub-sample (i.e. the low life expectancy group) and the untreated sub-sample (i.e. the high life expectancy group). We observe that the estimator η̂_DS-LASSO, which is based on the double-selection technique, provides relatively higher standard errors. This may be due to the selection of a large number of covariates in the double-selection process (e.g. for η̂_DS-LASSO, 10 variables are selected using the treated sub-sample and 31 variables are selected using the untreated sub-sample), which may increase the uncertainty of the nuisance parameter estimator.

Discussion
Plug-in estimators based on high-dimensional model fits are known to exhibit poor behavior with a non-standard asymptotic distribution [9,27]. Double-robust plug-in estimators have been shown to be much less sensitive to this when all working models on which they are based are correctly specified (or the estimators for them converge to the truth) [7]. In this paper, we have shown that this continues to be true under model misspecification when so-called penalized bias-reduced double-robust estimators are used. These estimators can be viewed as a penalized extension of the recently introduced bias-reduced DR estimators, which use special nuisance parameter estimators that are designed to minimize - or at least stabilize - the squared first-order bias of the DR estimator, while shrinking the non-significant coefficients of the nuisance parameters towards zero. Our results thus generalize those in [7,8,20] to allow for model misspecification. Through simulation studies, we have demonstrated that the proposed approach performs favorably compared to other DR estimators even when one of the models is misspecified. The empirical data analysis further confirmed the stability of our estimator of the average treatment effect in terms of the standard errors.
We have focused our numerical results on Lasso or ℓ1-norm penalization, even though our proposal readily generalizes to other (possibly non-convex) penalization techniques. It remains to be seen, however, how it performs in combination with other choices of penalty. Our theory, like that in [7,8], was also developed for prespecified penalty parameters, although the calibration of penalty parameters is likely to improve results. In further work, we will evaluate whether our theory can be adapted to incorporate data-adaptive choices of the penalty parameters, e.g. based on cross-validation. In addition, we will consider potential extensions of the proposed method using post-selection and double-selection techniques.
As mentioned earlier, our proposal extends the bias-reduced double-robust procedure of [13] for estimating the nuisance parameter η to high-dimensional settings, thereby allowing the number of covariates to be larger than the sample size. References [28][29][30] concurrently developed DR estimators which are closely related to our earlier proposal [see 31], which is summarized in this paper. In particular, [28,30] suggest estimating the nuisance parameter of model A using (4) because it provides a consistent estimator (under certain sparsity assumptions) and additionally leads to approximate balancing conditions for the covariates. Our proposed estimator and the estimator of [28] enjoy similar theoretical properties (under similar sparsity assumptions for the nuisance parameters) when either of the models is misspecified, but not both. On the other hand, the sparsity requirements of [29,30] are weaker, which is achieved by applying a sample splitting technique. Relative to these related works, our numerical study is more extensive and includes comparisons of the proposed estimator with the standard DR estimator (based on Lasso penalized MLE and Lasso penalized LS) as well as with DR estimators based on post-Lasso refitting (i.e. post-selection and double-selection).