Treatment Effect Estimation Under Additive Hazards Models With High-Dimensional Confounding

Abstract Estimating treatment effects for survival outcomes in the high-dimensional setting is critical for many biomedical applications, and indeed for any application with censored observations. This article establishes an "orthogonal" score for learning treatment effects, using observational data with a potentially large number of confounders. The estimator allows for root-n, asymptotically valid confidence intervals, despite the bias induced by regularization. Moreover, we develop a novel hazard difference (HDi) estimator. We establish rate double robustness through a cross-fitting formulation. Numerical experiments illustrate the finite-sample performance, where we observe that the cross-fitted HDi estimator performs best. We study the effect of radical prostatectomy versus conservative management of prostate cancer through the SEER-Medicare linked data. Last, we provide extensions both to machine learning approaches and to heterogeneous treatment effects. Supplementary materials for this article are available online.


Introduction
Treatment effect estimation and inference is an essential topic in causal inference and has drawn immense interest across many different fields. Our work was motivated by the proliferation of "big, observational data" from electronic medical/health records (EMR/EHR), which provide an abundant resource for studying the effects of various treatments and serve as an alternative when randomized trials are implausible or otherwise perceived as uneconomical. The challenge in studying causal effects today, among other things, is to handle a large number of potential confounders, "p ≫ n." Motivated by studies in cancer using the linked Surveillance, Epidemiology, and End Results (SEER)-Medicare database, our primary focus is the causal effect on a survival outcome.
Research methods for high-dimensional analysis of survival outcomes have primarily focused on variable selection properties; in practice, inference is often reported on the findings with only the selected covariates. Meanwhile, it has long been known that direct estimation following variable selection is biased in high dimensions, and limited work has been done to provide corrections. Here, we consider the additive hazards model, where we define the treatment effect as a difference in the hazards between the treated and the control groups. We develop a family of orthogonal scores, introduced as early as Neyman (1959). Such scores allow valid statistical inference and tractable asymptotic theory in high dimensions. We then develop several refinements and extensions, including a new hazard difference (HDi) estimator that uses covariate balancing. The HDi estimator is a particular case of the proposed orthogonal score family, namely the member that is linear in the unknown parameter of interest. When coupled with a cross-fitting procedure, we show that the estimators attain rate double robustness in high dimensions, as defined in Smucler, Rotnitzky, and Robins (2019).

Related Work
Our work contributes to the high-dimensional literature on asymptotic normality in that it enables statistical inference in high-dimensional models for right-censored survival outcomes. Recent results of Bradic, Fan, and Jiang (2011), Yu, Bradic, and Samworth (2020), and Hou, Bradic, and Xu (2019a) provided asymptotic properties under high-dimensional Cox-type proportional hazards models, including competing risks. However, to the best of our knowledge, we provide the first results on high-dimensional inference, including confidence interval construction, under the additive hazards model.
The orthogonal score has been a familiar concept in the semiparametric literature (Newey 1990; Bickel et al. 1998) and is closely related to efficient scores and efficient influence functions. It is also related to the profile likelihood and the least favorable direction, including the nonparametric maximum likelihood approach often used for semiparametric models in survival analysis (Severini and Wong 1992; Murphy and van der Vaart 2000). However, the nonparametric likelihood does not apply to the additive hazards model (Lin and Ying 1994).
The benefits of orthogonal scores have long been known: an estimator obtained from such a score is not affected by slower-than-root-n estimation of the model's nuisance parameters (Newey 1990; Bickel et al. 1998). This property was recently used to estimate treatment effects in high-dimensional models (Belloni, Chernozhukov, and Hansen 2013; Farrell 2015; Chernozhukov et al. 2018). However, models with censored observations present a considerable challenge: approaches based on uncensored data do not automatically generalize, as the nested "risk sets" induce complex dependencies.
There is a connection between an orthogonal score and double robustness. An estimator is doubly robust (DR) if it is consistent (and asymptotically normal), as long as one (but not necessarily both) of the outcome and the treatment assignment models is correct. However, this classical definition does not capture the full complexity of model misspecification in the high-dimensional setting.
There are many related terms used in the recent literature, often with partially overlapping meanings, such as "doubly robust," "locally robust," "small bias," "mixed bias," and "model/sparsity double robust"; see Chernozhukov et al. (2018), Tan (2020a,b), and Bradic, Wager, and Zhu (2019) for discussion of these issues. In particular, "rate double robustness," as defined in Smucler, Rotnitzky, and Robins (2019), is directly relevant to our approach. Rate double robustness states that a root-n asymptotically normal estimator exists as long as one of the two correctly specified models is sparse enough. We will provide more details in Section 3.
For survival outcomes, DR estimators have only been considered in the low-dimensional setting with fixed p < n. Kang, Lu, and Zhang (2018) considered the DR approach under the additive hazards model but relied on kernel density estimators, which are not suitable for high-dimensional covariates. Dukes et al. (2019) studied the additive hazards model through semiparametric efficiency theory; the authors derived a family of DR scores in the low-dimensional setting with p < n.

Notation
We consider right-censored survival outcomes, where T and C denote the event and the censoring time, respectively. Our observations consist of a sample of independent and identically distributed (iid) random variables W_i = (X_i, δ_i, D_i, Z_i), i = 1, ..., n, where X_i = min(T_i, C_i) is the observed time, δ_i = I(T_i ≤ C_i) is the event indicator, D_i ∈ {0, 1} denotes the treatment assignment, and Z_i ∈ R^p is a p × 1 vector of covariates. We use Z̄_i = (1, Z_i) whenever an intercept term is needed. N_i(t) = I(X_i ≤ t, δ_i = 1) and Y_i(t) = I(X_i ≥ t) denote the counting process and the at-risk process, and λ(t; D_i, Z_i) the hazard function, for subject i. To avoid degenerate cases, we consider absolutely continuous random variables. λ_0(·) denotes an unknown baseline hazard function, whereas Λ_0(t) denotes the corresponding baseline cumulative hazard. We consider a finite study duration and denote the upper limit of follow-up time as τ < ∞. Moreover, we use "⊥⊥" to denote statistical independence. Pertaining to the model parameters, a subscript "0" denotes the true value; for example, θ_0, β_0, Λ_0, γ_0.
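As a concrete illustration of this notation, the following minimal sketch (Python, with hypothetical toy values) evaluates the counting and at-risk processes on a time grid:

```python
import numpy as np

# Toy illustration of the notation (hypothetical data): X = min(T, C) and
# delta = I(T <= C) are the observed time and the event indicator.
X = np.array([1.2, 0.7, 2.5, 1.9])   # observed times X_i
delta = np.array([1, 0, 1, 1])       # event indicators delta_i
tau = 2.0                            # upper limit of follow-up
grid = np.linspace(0.0, tau, 201)    # time grid on [0, tau]

# Counting process N_i(t) = I(X_i <= t, delta_i = 1), one row per subject.
N = (X[:, None] <= grid[None, :]) & (delta[:, None] == 1)

# At-risk process Y_i(t) = I(X_i >= t).
Y = X[:, None] >= grid[None, :]

print(N.mean(axis=0)[-1])  # empirical E{N(tau)}
print(Y.mean(axis=0)[0])   # empirical E{Y(0)} = 1
```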

Organization
The organization of the article is as follows. In Section 2, we propose a family of inferential methods based on orthogonal scores; we study the treatment effect under the additive hazards outcome model and the logistic treatment assignment model. Section 3 proposes the HDi estimator and establishes rate double robustness in high dimensions. Section 4 contains extensive simulation studies illustrating favorable finite-sample properties of the newly proposed estimators across several high-dimensional settings. In Section 5, we apply our method to study the effect of radical prostatectomy versus conservative management in prostate cancer patients aged 65 or older, using the SEER-Medicare linked data. Section 6 expands our approach in two ways: to work with many machine learning (ML) estimators and to estimate heterogeneous treatment effects. The supplementary materials contain detailed proofs of all theoretical results.

Inferential Methods and Guarantees
We are interested in the effect of the treatment D on the survival time T (subject to censoring C), conditional on covariates Z. For binary treatments, this conditional treatment effect can be seen as the difference in hazards between the treated and the control groups, conditional on Z. We work under the well-known semiparametric additive hazards model (Lin and Ying 1994),

λ(t; D, Z) = λ_0(t) + θD + βᵀZ,   (1)

where β ∈ R^p, and p can be much larger than the sample size n. More generally, we might write

λ(t; D, Z) = λ_0(t) + θ(t; Z)D + βᵀZ.   (2)

Under model (1) we first focus on homogeneous treatment effects, so that θ(t; Z) ≡ θ does not depend on t or Z. We relegate extensions beyond model (1) to Section 6.1; extensions to heterogeneous treatment effects are considered in Section 6.2. Our goal here is to draw inference, that is, construct confidence intervals, for the treatment effect θ in the presence of the high-dimensional nuisance parameter β as well as the baseline hazard λ_0(t).

The Orthogonal Score
At first glance, the high-dimensional model (1) may inspire one to use many of the now-standard regularized estimators. However, these are known to give a biased estimate of θ, even if θ itself is not penalized in the regularization process (Belloni, Chernozhukov, and Hansen 2013). The effects of regularization propagate, create bias, and prevent root-n inference. In such settings, a model for the treatment selection mechanism can often be leveraged to remove the shrinkage bias. We are primarily interested in binary treatments and therefore assume a logistic regression as a working model for the binary treatment assignment,

P(D = 1 | Z) = expit(γᵀZ̄),   (3)

where expit(x) = e^x/(1 + e^x). Orthogonal scores have been highly effective in the presence of very many nuisance parameters; the estimation of the treatment effect, at least in low-dimensional settings, is not greatly affected by the estimation error of the nuisance parameters (Newey 1994). Orthogonal scores have also proven useful in leveraging the treatment assignment model; see, for example, Chernozhukov et al. (2018). The orthogonality of a score function is defined as the local invariance of the score to a small perturbation in the nuisance parameter space. Under models (1) and (3), we denote the nuisance parameter as η = (β, Λ, γ), where Λ is the cumulative baseline hazard. A score function ψ(θ, η) is an orthogonal score for θ if the Gâteaux derivative with respect to η is zero; that is,

∂/∂r E{ψ(θ_0, η_0 + rΔη)} |_{r=0} = 0,   (4)

where θ_0 and η_0 = (β_0, Λ_0, γ_0) are the true parameter values and Δη = η − η_0. A common approach for obtaining orthogonal scores, at least under linear outcome models, is to consider the product of two residuals: one from the outcome model and the other from the treatment model (Robins and Rotnitzky 1995). Under our models, two natural candidates are the martingale residual

M(t; θ, β, Λ) = N(t) − ∫_0^t Y(s){dΛ(s) + θD ds + βᵀZ ds}

and the logistic regression residual D − expit(γᵀZ̄). Under the additive hazards model (1), the martingale residual evaluated at the true parameter values, M_i(t; θ_0, β_0, Λ_0), is a martingale with respect to the filtration F_{n,t}.
However, it is not difficult to verify that the product of the above two residuals does not give an orthogonal score according to definition (4), ultimately due to the dependence between the at-risk process and the treatment assignment. In the following we show that, if we are willing to assume

C ⊥⊥ (T, D) | Z,   (6)

we can make a simple correction by "decoupling" the at-risk process and the treatment assignment. For a single copy of the data, let

φ(θ; β, Λ, γ) = ∫_0^τ {D − expit(γᵀZ̄)} dM(t; θ, β, Λ).   (7)

Lemma 1. Under models (1) and (3), and assumption (6), the score (7) identifies the true parameters (θ_0; β_0, Λ_0, γ_0); that is, E{φ(θ_0; β_0, Λ_0, γ_0)} = 0. Moreover, φ is an orthogonal score for θ in the sense of Equation (4).
We obtain θ̂ by solving n^{-1} Σ_{i=1}^n φ_i(θ; β̂, Λ̂, γ̂) = 0 for θ. Here, φ_i is the value of the orthogonal score φ, Equation (7), evaluated at the ith observation. Note that, unlike most settings studied to date, the nuisance parameters in ours, and in most survival settings, depend on the parameter of interest θ. Hence, our proposal extends and enriches the recent high-dimensional orthogonality literature to allow nuisances to depend on the unknowns of interest. While Equation (6) is stronger than the usual noninformative censoring, C ⊥⊥ T | (D, Z), we observe that under the two models (1) and (3) the pair (T, D) plays the role of the "response" in relation to Z. From this perspective, assumption (6) is in line with the typical noninformative censoring assumption. Nonetheless, we are also able to decouple Y_i(t) and D under the weaker condition C ⊥⊥ T | (D, Z), and similarly construct an orthogonal score φ^C, as stated in the following lemma.

Lemma 2. Under models (1) and (3), and whenever C ⊥⊥ T | (D, Z), the score φ^C, constructed with the additional nuisance S_C, identifies the true parameters (θ_0; β_0, Λ_0, γ_0, S_{C,0}); that is, E{φ^C(θ_0; β_0, Λ_0, γ_0, S_{C,0})} = 0. Here, S_{C,0} is the true value of the additional nuisance S_C(t; d, z) = P(C ≥ t | D = d, Z = z). Moreover, φ^C is an orthogonal score for θ in the sense of Equation (4).
Note that a practical implication of Lemma 2 is the need to estimate S_C(t | D, Z), at least consistently. Nonparametric estimation of S_C(t | D, Z) in the presence of high-dimensional Z might be attempted using ML methods, as described in Section 6.1 later. From here on, throughout the article, we proceed by assuming Equation (6). Note that Equation (6) is easily satisfied in the case of administrative censoring, that is, censoring caused by the end of a study.
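To make the one-shot recipe concrete, the following is a minimal sketch (Python) of evaluating the score and solving the estimating equation. It relies on the reconstructed form of Equation (7) above; `Lam_hat`, `beta_hat`, and `gamma_hat` are hypothetical inputs produced by a separate nuisance-estimation step.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

def mean_score(theta, X, delta, D, Z, beta_hat, gamma_hat, Lam_hat, tau):
    """Sample average of the orthogonal score: the propensity residual times
    the martingale residual integrated up to tau (reconstructed Equation (7))."""
    Xt = np.minimum(X, tau)                      # follow-up truncated at tau
    e = expit(gamma_hat[0] + Z @ gamma_hat[1:])  # expit(gamma' Zbar), Zbar = (1, Z)
    # M_i(tau) = N_i(tau) - Lam(X_i ^ tau; theta) - (theta D_i + beta' Z_i)(X_i ^ tau)
    M = delta * (X <= tau) - Lam_hat(Xt, theta) - (theta * D + Z @ beta_hat) * Xt
    return np.mean((D - e) * M)

def solve_theta(X, delta, D, Z, beta_hat, gamma_hat, Lam_hat, tau,
                bracket=(-5.0, 5.0)):
    # Root of the estimating equation; the bracket is assumed to contain a
    # sign change (which holds in large samples when the score is linear in theta).
    f = lambda t: mean_score(t, X, delta, D, Z, beta_hat, gamma_hat, Lam_hat, tau)
    return brentq(f, *bracket)
```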

From Orthogonality to Double Robustness
In classical, low-dimensional models, there has often been a close connection between local efficiency and double robustness; this is explicitly discussed in the fundamental work of Robins and Rotnitzky (2001). Therein, local efficiency is achieved using the efficient influence function or efficient score, which is a particular case of the orthogonal score. However, for high-dimensional models, only recently did Chernozhukov et al. (2018) establish such a connection, in the case of linear outcome models. For survival outcomes, such connections have not yet been studied. The following lemma establishes the doubly robust property of the proposed orthogonal score in low dimensions. Our orthogonal score matches that of Dukes et al. (2019), obtained independently through classical semiparametric efficiency.
Lemma 3. (a) Suppose that model (1) holds, whereas the treatment assignment D follows a nonparametric model

P(D = 1 | Z) = m(Z).   (8)

Then, for any given γ*, θ = θ_0 is the root of the equation E{φ(θ; β_0, Λ_0, γ*)} = 0. (b) Suppose that model (3) holds, whereas the outcome T follows a partially linear, additive hazards model

λ(t; D, Z) = θD + g(t, Z),   (9)

with g an unspecified function of both time and covariates. Under assumption (6), for any given β* and Λ* with bounded total variation, θ = θ_0 is the root of the equation E{φ(θ; β*, Λ*, γ_0)} = 0.

While typical double robustness might refer to a fully nonparametric outcome model in part (b) above, the treatment effect θ implies a constant HDi over time, so the partially linear model (9) is needed. In Section 3, we study the doubly robust properties of the score function φ in high-dimensional settings.

One-Shot Inference for Treatment Effect
In this subsection, we present one-shot inference results, where all of the data, rather than subsamples, are used twice: first to estimate the unknown nuisance parameters and second to solve the estimating equation. Define θ̂ as the solution to

n^{-1} Σ_{i=1}^n φ_i(θ; β̂, Λ̂(·; θ), γ̂) = 0.   (10)

In the above, β̂, Λ̂, and γ̂ are only required to satisfy the conditions specified below. Observe that the cumulative baseline estimate Λ̂ might often depend on the outcome model and, therefore, on the unknown θ. One could plug in an initial estimate of θ and then solve Equation (10); instead, we consider a more general setting, where the baseline hazard estimator Λ̂(t; θ) is a function of θ, consequently embedding the unknown θ in the nuisances. By allowing the nuisance parameter to depend on θ, we generalize existing approaches that assume the nuisances can be estimated irrespective of θ, such as Chernozhukov et al. (2018), for example. Before stating the assumptions, we define two measures of estimation error under models (1) and (3), respectively: D²_β(β, β_0), the symmetrized Bregman divergence used in Gaïffas and Guilloux (2012), and D²_γ(γ, γ_0), the excess risk used in van de Geer (2008).
Assumption 1. For constants K_Z, ε, K_θ, K_Λ, and K_v independent of p and n, we assume:
(i) bounded covariates: P(sup_{i=1,...,n} ‖Z_i‖_∞ < K_Z) = 1;
(ii) positivity of the propensity, at-risk, and event rates: var(D | Z) ≥ ε > 0, E{Y(t)} ≥ ε > 0, and E{N(t)} ≥ ε > 0;
(iii) the total variation of the candidate estimate Λ̂(·; θ) is bounded by K_Λ, with probability tending to one as n → ∞, for θ in a neighborhood of the true value θ_0;
(iv) Λ̂(t; θ) is approximately linear in θ in a neighborhood of θ_0, with respect to the total variation norm;
(v) the rates of estimation error of β̂, γ̂, and Λ̂ satisfy the product conditions in Equation (11), and the estimation error of the baseline hazard satisfies Equation (12).

Under Assumption 1, we have the following result on the asymptotic distribution of θ̂ in the presence of both high-dimensional and infinite-dimensional nuisance parameters. The result below does not require exact model recovery (variable selection) for either the outcome or the treatment model, and relies solely on the estimation errors of the nuisance parameters, which are easily controlled for many estimators.
Theorem 1. Under Assumption 1, the estimator θ̂ that solves n^{-1} Σ_{i=1}^n φ_i(θ; β̂, Λ̂, γ̂) = 0, where φ is the orthogonal score defined in Equation (7), converges in distribution to a normal random variable at the √n rate when both n, p → ∞; that is, √n(θ̂ − θ_0)/σ̂ ⇝ N(0, 1), where "⇝" denotes convergence in distribution and the variance estimator σ̂² takes a closed form.

A few comments are in order. Bounded covariates appear commonly in nonlinear high-dimensional models, including the generalized linear model (van de Geer et al. 2014), the Cox proportional hazards model (Huang et al. 2013), and the additive hazards model (Lin and Lv 2013). The positivity, or overlap, assumption, which requires the propensity scores of all subjects to be bounded away from zero, together with the positive at-risk and event rates, are necessary assumptions for causal and survival inference, respectively.
Collectively, the conditions in Assumption 1 play a similar role to the commonly used high-level requirement in the semiparametric literature that each nuisance estimator converge at the n^{-1/4} rate or faster (Robinson 1988; Belloni, Chernozhukov, and Hansen 2013; Farrell 2015). Here, Equation (11) allows one or two (but not all simultaneously) of our estimators to converge arbitrarily slowly in the l_1 sense (which in low dimensions is equivalent to the l_2 sense), as long as the products converge at the n^{-1/2} rate. The product involving the infinite-dimensional baseline hazard and the treatment model coefficients is specific to the semiparametric additive hazards model and is new in the literature. If the distribution of the covariates Z can be learned under suitable model assumptions, then we may further improve the rate assumptions by leveraging the information about Z (Robins et al. 2017). Another new aspect is the rate log(p)‖β̂ − β_0‖_1 in Equation (11), which in turn is a result of high-dimensional dependencies in the risk sets.
The proof of Theorem 1 is given in the supplementary materials. We note that β_0ᵀZ_i is allowed to grow arbitrarily large, as ‖β_0‖_1 may grow with p and n. To the best of our knowledge, Theorem 1 is the first such result with an unbounded conditional hazard function of a survival outcome, which distinguishes it from the existing work of Hou, Bradic, and Xu (2019a) and Yu, Bradic, and Samworth (2020), for example.
Lemma 4. Suppose that the cumulative baseline hazard estimator is linear in θ, that is, Λ̂(t; θ) = Λ̂(t; 0) − θD̃(t) for some differentiable D̃(t) whose derivative is bounded. Then the solution θ̂ to Equation (10) is unique.

Note that the commonly used cumulative baseline hazard estimator under the additive hazards model is linear in θ. This is also the case for the estimators considered in Section 2.4 and later in Section 6. The closed-form HDi estimator of θ in Section 3.1 is also unique by definition.

An Illustrative Example: Lasso Regularizers
Whenever models (1) and (3) are sparse high-dimensional models, many regularization methods can be employed to obtain β̂ and γ̂. We provide an illustration for the simplest case of sparse high-dimensional linear models. Let s_β = ‖β_0‖_0 and s_γ = ‖γ_0‖_0 denote the sparsity of the outcome and the treatment assignment model, respectively. For the additive hazards outcome model (1), a simple and widely used approach is the Lasso regularized estimator (θ̂_l, β̂) of Leng and Ma (2007) (Equation (15)), which adds an l_1 penalty with tuning parameter ρ to a least-squares-type loss under model (1). Since θ is part of the parameter under model (1), we obtain an initial estimator θ̂_l. Due to the bias induced by the regularization, θ̂_l cannot be used to draw inference. As shown later, θ̂_l can still be useful in the construction of the baseline hazard estimator Λ̂.
The estimation of γ under model (3) is similar; we may use the Lasso estimator under the logistic regression model (Tibshirani 1996) (Equation (16)). We use the same notation ρ for the tuning parameter in Equations (15) and (16) for simplicity. Following Gaïffas and Guilloux (2012), among others, under restricted eigenvalue conditions, the above estimators satisfy the rates in Equation (17) whenever ρ > C√(log(p)/n) and log(p)/n = o(1), that is, allowing the dimension p to be much larger than n. Finally, our estimation has one additional nonparametric nuisance parameter, the cumulative baseline hazard function. Let Λ̂ be the well-known Breslow-type estimator (Lin and Ying 1994) (Equation (18)). Under Assumptions 1(i) and 1(ii), and applying rates similar to those in Equation (17), Assumption 1(v) is satisfied with a proper choice of the sparsities below. In addition, direct calculation verifies that Assumptions 1(iii) and 1(iv) are satisfied by Equation (18).
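The following is a minimal sketch (Python) of a Breslow-type estimator in the spirit of Equation (18), assuming fitted values β̂ and a candidate θ; it also makes visible the linearity in θ used in Lemma 4.

```python
import numpy as np

def breslow_additive(X, delta, D, Z, beta_hat, theta, grid):
    """A sketch of a Breslow-type cumulative baseline hazard estimator under
    the additive hazards model, on an increasing grid starting at 0:
    Nelson-Aalen jumps minus the risk-set average of theta*D_i + beta' Z_i,
    integrated over time. The result is linear in theta (cf. Lemma 4)."""
    r = theta * D + Z @ beta_hat                 # non-baseline part of the hazard
    Y = X[:, None] >= grid[None, :]              # at-risk indicators Y_i(t)
    atrisk = np.maximum(Y.sum(axis=0), 1)        # sum_i Y_i(t), guarded from 0
    # Jump part: sum over events with X_i <= t of 1 / (number at risk at X_i).
    ev = np.sort(X[delta == 1])
    nrisk_ev = (X[:, None] >= ev[None, :]).sum(axis=0)
    jump = np.array([(1.0 / nrisk_ev[ev <= t]).sum() for t in grid])
    # Drift part: integral of sum_i Y_i(s) r_i / sum_i Y_i(s) ds (trapezoid rule).
    rate = (Y * r[:, None]).sum(axis=0) / atrisk
    drift = np.concatenate(([0.0],
              np.cumsum(0.5 * (rate[1:] + rate[:-1]) * np.diff(grid))))
    return jump - drift                          # Lam_hat(t; theta) on the grid
```

A callable Λ̂(t; θ) for the sketch in Section 2.1 can then be obtained by interpolating the grid values, for example, `lambda t, th: np.interp(t, grid, breslow_additive(X, delta, D, Z, beta_hat, th, grid))`.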
A sufficient condition for Theorem 1 is that the sparsities of the outcome and treatment models are at most of order √n/log(p) (see also the discussion following Theorem 3). We note that the sparsity conditions are comparable to those of Belloni, Chernozhukov, and Hansen (2013), Farrell (2015), van de Geer et al. (2014), and Avagyan and Vansteelandt (2017). The first two sparsity conditions are needed for l_1 consistency at any rate, and the last is required to satisfy ‖γ̂ − γ_0‖_2 ‖β̂ − β_0‖_2 = o(n^{-1/2}), a condition equivalent to that in the fundamental work of Robins and Rotnitzky (1995), among others.

Double Robustness in High Dimensions
In the following we begin with a special case of our estimator from Section 2, which has a closed-form expression and can be seen as directly estimating the HDi. In Section 3.2, we introduce the cross-fitting scheme, leading to a relaxation of sparsity conditions under which we establish rate double robustness.

HDi Estimator
Here, we introduce an estimator that uses covariate balancing to estimate the HDi directly under models (1) and (3). Covariate balancing weights have been shown to be useful and have gained substantial interest in the recent causal literature; see, for example, Imai and Ratkovic (2014) and Li, Morgan, and Zaslavsky (2018). An advantage of covariate balancing, at least in linear models, is that it eliminates confounding without relying on a correctly specified propensity score model, even when γ_0 cannot be consistently estimated under model (3). Define the covariate balancing weights

ω(D, Z) = D{1 − expit(γᵀZ̄)} + (1 − D) expit(γᵀZ̄),

where γ is the same as in model (3). It is straightforward to verify that the above weights are covariate balancing in the following sense. Consider the weighted empirical cumulative distribution functions of the covariates in the two arms,

F_{d,n}(z) = Σ_{i=1}^n ω(D_i, Z_i) I(D_i = d) I(Z_i ≤ z) / Σ_{i=1}^n ω(D_i, Z_i) I(D_i = d),  d = 0, 1,

where I(Z ≤ z) = Π_{j=1}^p I(Z_j ≤ z_j), with Z_j and z_j the jth components of Z and z, respectively. It is straightforward to show that F_{d,n}(z) converges in probability to E{var(D | Z)I(Z ≤ z)}/E{var(D | Z)} for d = 0, 1; that is, after weighting, the distributions of the covariates in the two treatment arms are approximately the same.
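A minimal sketch (Python) of the weights as reconstructed above, together with a simple balance diagnostic; the helper names are hypothetical:

```python
import numpy as np
from scipy.special import expit

def balancing_weights(D, Z, gamma_hat):
    """Overlap-type balancing weights implied by the logistic working model:
    1 - e(Z) for treated subjects and e(Z) for controls, so that each weighted
    arm approximates the var(D|Z)-tilted covariate distribution."""
    e = expit(gamma_hat[0] + Z @ gamma_hat[1:])
    return np.where(D == 1, 1.0 - e, e)

def weighted_covariate_means(D, Z, w):
    # Simple balance diagnostic: weighted covariate means should roughly
    # agree between the two arms after weighting.
    m1 = (w[D == 1][:, None] * Z[D == 1]).sum(0) / w[D == 1].sum()
    m0 = (w[D == 0][:, None] * Z[D == 0]).sum(0) / w[D == 0].sum()
    return m1, m0
```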

Now define the weighted, arm-specific cumulative hazard estimators Λ̌_1(t) and Λ̌_0(t), obtained by applying the balancing weights ω(D_i, Z_i) to the subjects with D_i = 1 and D_i = 0, respectively. Under the additive hazards model (1), Λ̌_1 and Λ̌_0 can be seen to estimate Λ_0(t) + θt and Λ_0(t), respectively. It is then immediate that the HDi estimator θ̌ can be defined as a weighted difference of Λ̌_1 and Λ̌_0 (Equation (19)). In the following we show that θ̌ is also a special case of our class of orthogonal estimators defined in Section 2.

With the weights defined above, the estimating function following Equation (7) can be written in weighted form (Equation (20)). Setting the baseline estimator to the weighted cumulative hazard from treatment group "1" (Equation (21)), which appears to be the only obvious choice making the resulting estimating function linear in θ, and substituting Equation (21) into Equation (20), the estimating function becomes a linear function of θ in terms of the weighted processes Ň and Y̌. Solving the linear equation yields the closed form θ̌ (Equation (23)), which is readily verified to equal Equation (19).
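A minimal sketch (Python) of the HDi construction just described; it implements one concrete choice of the weighted difference, and the exact weighting in Equation (19) may differ in detail.

```python
import numpy as np
from scipy.special import expit

def hdi_estimate(X, delta, D, Z, beta_hat, gamma_hat, grid):
    """A sketch of the HDi idea: weighted arm-specific cumulative hazard
    estimates, then the least-squares slope (through the origin) of their
    difference against t, since the difference estimates theta * t."""
    e = expit(gamma_hat[0] + Z @ gamma_hat[1:])
    w = np.where(D == 1, 1.0 - e, e)            # balancing weights
    r = Z @ beta_hat                            # covariate part of the hazard
    Lam = {}
    for d in (0, 1):
        sel = D == d
        Xs, ds, ws, rs = X[sel], delta[sel], w[sel], r[sel]
        Yd = Xs[:, None] >= grid[None, :]       # at-risk processes in arm d
        denom = np.maximum((ws[:, None] * Yd).sum(0), 1e-12)
        # Weighted jumps at events, minus the integrated weighted covariate hazard.
        jump = np.array([(ws[(Xs <= t) & (ds == 1)]
                          / np.interp(Xs[(Xs <= t) & (ds == 1)], grid, denom)).sum()
                         for t in grid])
        rate = (ws[:, None] * Yd * rs[:, None]).sum(0) / denom
        drift = np.concatenate(([0.0],
                  np.cumsum(0.5 * (rate[1:] + rate[:-1]) * np.diff(grid))))
        Lam[d] = jump - drift
    diff = Lam[1] - Lam[0]                      # estimates theta * t under (1)
    return float(grid @ diff) / float(grid @ grid)
```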
For the HDi estimator we have an analogous asymptotic normality result (Theorem 2), paralleling Theorem 1.
In simulation studies, we show that the HDi estimator has good empirical properties even in the presence of model misspecifications. Theoretical investigation under model misspecification is beyond the scope of this work, and some preliminary results can be found in Hou, Bradic, and Xu (2019b, sec. 4.1).

Rate Double Robustness
As we have previously commented following Theorem 1 (the comment also applies to the HDi estimator), we allow at least one of the nuisance parameter estimators to converge arbitrarily slowly in the l_1 norm, as long as the relevant products converge at the n^{-1/2} rate. In this section, we develop a cross-fitted estimator, based on either the HDi estimator or the one-shot estimator (10), that achieves rate double robustness as defined in Smucler, Rotnitzky, and Robins (2019): an estimator is rate doubly robust if it is consistent and asymptotically normal whenever the product of the estimation errors of the two nuisance parameters decays faster than n^{-1/2}, while the estimator of one nuisance parameter is allowed to converge arbitrarily slowly in the l_2 norm. In this section, we therefore relax the l_1-norm requirement to an l_2-norm requirement, as formalized below.
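In symbols, a minimal statement of the condition (our paraphrase of Smucler, Rotnitzky, and Robins 2019, with r_{β,n} and r_{γ,n} denoting the respective l_2 rates):

```latex
\|\hat\beta - \beta_0\|_2 = O_P(r_{\beta,n}), \qquad
\|\hat\gamma - \gamma_0\|_2 = O_P(r_{\gamma,n}), \qquad
\sqrt{n}\, r_{\beta,n}\, r_{\gamma,n} \longrightarrow 0.
```

Either rate may vanish arbitrarily slowly, provided the product condition holds.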
We make use of a cross-fitting procedure that uses one part of the sample to estimate all the nuisances and the remaining part to find the zero of the score. Cross-fitting induces independence between the score and the estimated nuisance parameters, further reducing the effect of the nuisance estimation on the treatment effect estimation, in addition to the orthogonality of the score function. Algorithm 1 summarizes the estimation procedure. For the number of cross-fitting folds k in Algorithm 1, any (fixed) choice produces, in theory, asymptotically equivalent results; in practice, we recommend k = 5 or 10, based on our own experience.
To present the theoretical results we need an additional set of notation. Let W* = (X*, δ*, D*, Z*) be an independent copy of the original data. We denote by E* the expectation taken with respect to the independent copy W*, conditionally on the observed data.

Algorithm 1: Estimation of the Treatment Effect via k-fold Cross-fitting.
Data: split the data into k folds of equal size with index sets I_1, I_2, ..., I_k.
for each fold indexed by j do
  1. estimate the nuisance parameters β̂^(j), Λ̂^(j), γ̂^(j) using the out-of-j-fold data indexed by I_{−j} = {1, ..., n} \ I_j;
  2. construct the cross-fitted score using the in-fold samples.
end
Result: obtain the estimated treatment effect θ̂_cf by solving the cross-fitted score Equation (27).

We define the average testing deviances D*_β(β̂^(j), β_0) and D*_γ(γ̂^(j), γ_0) as the out-of-sample analogues of D²_β and D²_γ, with the expectation E* in place of the empirical average. The average testing deviance has a convergence rate that grows more slowly than the l_1 norm considered in Section 2, with K_Z as in Assumption 1(i). Below we state Assumption 2, under which we have the inference result for θ.
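A minimal sketch (Python) of Algorithm 1; `fit_nuisances` and `mean_score` are hypothetical user-supplied callables wrapping the estimators of Section 2.

```python
import numpy as np
from scipy.optimize import brentq

def cross_fit_theta(n, fit_nuisances, mean_score, k=10, bracket=(-5.0, 5.0),
                    seed=0):
    """A sketch of Algorithm 1. `fit_nuisances(train_idx)` fits
    (beta_hat, Lam_hat, gamma_hat) on out-of-fold data, and
    `mean_score(theta, fold_idx, nuisances)` averages the orthogonal score
    over the in-fold observations."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    # Step 1: out-of-fold nuisance estimates for each fold j.
    nuis = [fit_nuisances(np.setdiff1d(np.arange(n), fold)) for fold in folds]
    # Step 2: solve the aggregated cross-fitted estimating equation for theta_cf.
    agg = lambda th: np.mean([mean_score(th, folds[j], nuis[j])
                              for j in range(k)])
    return brentq(agg, *bracket)
```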
Assumption 2. Suppose that Assumptions 1(i), 1(ii), and 1(iv) hold with (β̂, Λ̂, γ̂) = (β̂^(j), Λ̂^(j), γ̂^(j)) for all j = 1, ..., k. Assume additionally that there exist positive sequences of real numbers γ_n, β_n → 0 as n, p → ∞ such that D*_γ(γ̂^(j), γ_0) = o(γ_n) and D*_β(β̂^(j), β_0) = o(β_n), and that √n γ_n β_n, together with the analogous term involving the baseline hazard error, tends to zero.

Theorem 3. Under Assumption 2, θ̂_cf defined in Equation (27) satisfies √n(θ̂_cf − θ_0)/σ̂_cf ⇝ N(0, 1) when both n, p → ∞, with a closed-form variance estimator σ̂²_cf.

A few remarks are in order. The cross-fitted score (27) can handle a larger number of covariates, less sparse models, and less restrictive estimators of the baseline hazard, in that Assumption 1(iii) is not required. The removal of Equation (12) allows various estimation methods for the baseline hazard besides the Breslow-type estimators, for example, splines. In addition, the cross-fitted formulation allows a larger dimension without the extra log(p) factor of Equation (11). Last, the rate conditions of Assumption 2 imply a sufficient condition for Theorem 3,

max{s_β, s_γ} = o(n/log(p)) and s_β s_γ = o(n/(log(p))²),   (28)

which is weaker than the conditions discussed in Section 2. Note that √n/log(p) ≤ s ≤ n/log(p) (Theorem 3) is known as the moderately sparse regime, whereas s ≤ √n/log(p) (Theorems 1 and 2) is known as the exactly or strictly sparse regime. Our condition (28) parallels rate double robustness as defined in Smucler, Rotnitzky, and Robins (2019). Our results confirm that the same rate condition is achievable both with censored data and with the additional nuisance component of the baseline hazard that depends on the unknown θ.

Numerical Experiments
We consider n = p = 300 and n = p = 1500. To ensure that the baseline hazard is nonnegative, the covariates Z_i are generated as iid N(0, 1) conditioned on the event βᵀZ_i ≥ 0.25. The censoring time C is the smaller of τ and a Uniform(0, c_0) variable, where τ and c_0 are chosen so that n/10 treated subjects are expected to be at risk at t = τ and the censoring rate is around 30%. We repeat each simulation 500 times. We consider the four proposed estimators: θ̂ as in Equation (7) with the Breslow estimator (18), the HDi estimator θ̌ in Equation (23), and their cross-fitted counterparts θ̂_cf and θ̌_cf, as described in Algorithm 1. We also present the results for the naive Lasso estimator θ̃ under the additive hazards model, with the penalty for θ set to zero. The Lasso estimators (15) and (16) are tuned by 10-fold cross-validation with the R packages ahaz and glmnet, respectively. Cross-fitting uses 10 folds, and each fold uses 9-fold cross-validation.
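A minimal sketch (Python) of this simulation design; the constant baseline hazard `lam0` and the constants `c0` and `tau` are hypothetical concrete choices, whereas the paper calibrates τ and c_0 to the expected at-risk count and the ~30% censoring rate.

```python
import numpy as np
from scipy.special import expit

def simulate(n, p, beta, gamma, theta, lam0=1.0, c0=3.0, tau=2.0, seed=0):
    """A sketch of the simulation design. With a constant baseline hazard
    lam0 (an assumption for illustration), event times are exponential
    given (D, Z) under the additive hazards model."""
    rng = np.random.default_rng(seed)
    # Rejection sampling: iid N(0,1) covariates conditioned on beta'Z >= 0.25,
    # which keeps the total hazard nonnegative for theta >= -0.25.
    Z = np.empty((0, p))
    while Z.shape[0] < n:
        cand = rng.standard_normal((2 * n, p))
        Z = np.vstack([Z, cand[cand @ beta >= 0.25]])
    Z = Z[:n]
    D = rng.binomial(1, expit(gamma[0] + Z @ gamma[1:]))
    rate = lam0 + theta * D + Z @ beta          # additive hazard, constant in t
    T = rng.exponential(1.0 / rate)
    C = np.minimum(tau, rng.uniform(0.0, c0, size=n))
    return np.minimum(T, C), (T <= C).astype(int), D, Z
```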

Finite Sample Inference
The event time T follows model (1). We consider four pairs of sparsities: both β and γ very sparse (s_β = 2, s_γ = 1); β very sparse and γ moderately sparse (s_β = 2, s_γ = 10); β moderately sparse and γ very sparse (s_β = 15, s_γ = 1); and both β and γ moderately sparse (s_β = 6, s_γ = 3). Table 1 summarizes the inferential properties. The proposed estimators largely remove the bias of the naive Lasso θ̃, which is especially visible for the larger sample size n = p = 1500. All four estimators achieve reasonable coverage of the nominal 95% confidence intervals with the larger sample size n = p = 1500, while the cross-fitted HDi estimator outperformed the rest in the small-sample setting, preserving good coverage with n = p = 300.

Exploring Model Misspecification
We test the robustness of our method when the model or the sparsity assumption is violated in either the propensity or the outcome model. We consider violations of sparsity (above the dashed lines in Table 2) as well as model misspecification (below the dashed lines in Table 2). Two very sparse-dense combinations, (s_β = 2, s_γ = 20) and (s_β = 30, s_γ = 1), are studied. Besides the dense coefficients, we also consider model misspecification across three scenarios, denoted "E," "P," and "D," which stand for "Exponential," "Probit," and "Deterministic," respectively.
• In scenario "E," the event time is generated with an exponential link, while the logistic model (3) is correct. The coefficients are (s_β = 2, s_γ = 1).
• In scenario "P," we consider a misspecified treatment model with the probit link, while the additive hazards model (1) for the event time is correct. The coefficients are (s_β = 2, s_γ = 1).
• In scenario "D," we consider a misspecified treatment model with the deterministic assignment D = I(γᵀZ ≥ μ), where μ is the median of γᵀZ. The additive hazards model (1) is correct, and the coefficients are (s_β = 2, s_γ = 1). As var(D | Z) = 0 for all Z, that is, there is no overlap, Assumption 1 is violated.

Notes to Table 2: true θ = −0.25, 30% censoring. θ̂, θ̌, θ̂_cf, and θ̌_cf are the four proposed estimators, where the subscript "cf" denotes the cross-fitted version; the naive Lasso estimator θ̃ penalizes only the covariate effects β but not θ. Dashed entries are not well defined due to misspecification; entries of magnitude > 100 diverge, as expected, because setup "D" violates Assumption 1.
We present the estimation results in Table 2. All four estimators have smaller bias than the naive Lasso θ̃, most notably under scenarios "D" and "P," where the naive Lasso fails completely. Moreover, the bias decreases significantly at the larger sample size. Additionally, θ̂_cf and θ̌_cf demonstrate the advantages of cross-fitting with even smaller biases. The HDi estimator θ̌_cf often has a smaller standard deviation than θ̂_cf, leading to improved MSE, especially visible at the smaller sample size. In scenario "D" with n = p = 300, θ̌_cf shows an advantage over θ̂_cf, which was not defined in one of the simulation runs; therein the score equation appeared to have no numerical root in the neighborhood of interest.
We also study the estimation error through the average testing deviance defined in Section 3, and the magnitude of the estimates under the possibly misspecified models, with μ(t) = E*(Z*)/E*{Y*(t)}. For the former, we approximate the expectation E* by the sample average n^{-1} Σ_{j=1}^k Σ_{i∈I_j}; under scenario "E," we evaluate the deviance against the misspecified data-generating model. For the latter, we approximate E* by the within-fold sample average (k/n) Σ_{i∈I_j} and report the maximum across all folds.
In Table 3, we present the estimation error, deviance, and magnitude of the nuisance parameters. The estimation error of the Lasso is presented through the l_1 norm, whereas the Breslow estimator is evaluated in the l_∞ norm; these are compared to the true parameters when the models are correctly specified. The deviance columns contain the mean, whereas the magnitude columns contain the median, over the simulation runs. When the true model is dense, that is, either s_β = 30 or s_γ = 20, Table 3 illustrates that the Lasso deviates substantially from the underlying true coefficients in the l_1 norm, suggesting failure of estimation. Regardless, our proposed method achieves consistent estimation of the treatment effect. When Assumption 1 is expected to hold, that is, in all scenarios except "D," the magnitudes are well controlled, with no indication that they might blow up in high dimensions.

Illustration With SEER-Medicare Data
Clinical databases such as the United States National Cancer Institute's SEER typically contain disease-specific variables but only limited information on the subjects' health status, such as comorbidities. In studying causal treatment effects, this can lead to unobserved confounding (Hadley et al. 2010; Ying, Xu, and Murphy 2019). On the other hand, the availability of information from insurance claims databases can make up for some of these otherwise "unobserved" variables; Riviere et al. (2019) showed that they contain much information about comorbidities in particular.
For early-stage prostate cancer, radical prostatectomy is quite effective in reducing cancer-related deaths. With improvements in diagnosis, treatment, and management of the disease, other causes have become the dominant contributors to overall mortality in this patient population. Studying the comparative effect of radical prostatectomy on overall survival using observational data calls for proper control of confounding. Due to the lack of tools for handling high-dimensional claims-code data, existing work on the topic has either not made use of, or used only limited summary statistics from, the rich information on patients' health status. We consider 49,973 prostate cancer patients diagnosed during 2004-2013, as recorded in the SEER-Medicare linked database. The data contain the survival times of the patients, treatment information, demographic information, clinical variables, and the federal Medicare insurance claims codes. We included in our analysis age, race, marital status, tumor stage, tumor grade, prior Charlson comorbidity score, and 6397 claims codes possessed by at least 10 patients during the 12 months before their diagnosis of prostate cancer. The summary statistics of these variables are presented in Table 4. Our main focus is the effect of surgery (radical prostatectomy) on the overall survival of the patients. In our sample, 17,614 (35.25%) patients received surgery, while 32,359 (64.75%) patients received other types of treatment without surgery. As can be seen, many of the variables are not balanced between the two groups; in particular, patients who received surgery are more often younger, married, white, stage T2, with poorly differentiated tumor grade, zero comorbidity, and diagnosed in 2008 or earlier. Among all patients, 5375 (10.76%) deaths were observed, while the remaining patients were still alive at the end of 2013. The Kaplan-Meier curves for the two groups are presented in Figure 1.
The causal diagrams of the analyses are illustrated in Figure 2. In Analysis I, we adjust for potential confounding from the clinical and demographic variables and the high-dimensional claims codes. After excluding claims codes with fewer than 10 occurrences, we are left with 6533 covariates and 5375 observed events, indicating a "p > n" scenario. Figure 3 plots the distributions of the estimated propensity scores in both groups, with ranges of 0.04-0.93 and 0.03-0.90 for the surgery and conservative treatment groups, respectively. Table 5 contains the results. The low-dimensional analyses comprise the naive regression method (crude); the regression adjustment, which directly adjusts for the clinical and demographic variables in the additive hazards model; and inverse probability weighting (IPW) with the propensity score estimated using the R package twang. We also present the naive Lasso estimate θ̃, which does not penalize the treatment effect.
The results are clear: surgery improved overall survival compared to no surgery. The magnitude of the estimated treatment effect, however, varied with the approach used. The crude analysis gives the largest estimated treatment effect, almost a 0.01 reduction in the hazard rate. Regression adjustment gives an estimated reduction of 0.006, while IPW with the low-dimensional propensity score, as well as the four proposed estimators, give reductions of around 0.004.
In addition to the above, heterogeneity in treatment patterns has been observed across geographic regions (Harlan et al. 2001), described by the hospital referral region (HRR). In Analysis II, we include in the treatment propensity score model all interactions between HRR and the covariates of Analysis I. After excluding claims codes and binary interaction terms with fewer than 10 occurrences, we have 43,466 covariates for the PS model. The large number of additional interaction terms puts the PS model in the p ≈ n scenario. Reassuringly, the results are numerically stable and quantitatively similar to Analysis I, despite the dramatic increase in the number of covariates.

Beyond Simple Additive Hazards
In this section, we propose two additional extensions of the proposed estimators. Whereas one extension considers ML generalizations, the other focuses on treatment effect heterogeneity.

Modern Nonparametric Estimates
With the advancement of ML, nonparametric estimation under the general models (8) and (9) has become possible. The use of nonparametric estimation in causal inference has been considered in settings without censoring (Chernozhukov et al. 2018; Farrell, Liang, and Misra 2018). For censored survival data, two approaches are possible: one estimates the hazard and the other estimates the cumulative hazard. The former is an immediate extension of Section 3.2, while the latter has more readily available ML tools.

Learning the Hazard Function
A natural extension of Equations (1) and (3) considers

λ(t; D, Z) = λ_0(t) + θD + g(Z),  P(D = 1 | Z) = m(Z),   (33)

for unknown functions g and m. Suppose we have consistent estimators ĝ and m̂ satisfying suitable rate conditions analogous to Assumption 2. Using ĝ, we may estimate the baseline cumulative hazard by a Breslow-type estimator with ĝ(Z_i) in place of β̂ᵀZ_i.
The orthogonal score (7) then takes the analogous form with ĝ(Z) and m̂(Z) in place of β̂ᵀZ and expit(γ̂ᵀZ̄) (Equation (34)), and we can directly apply Algorithm 1 to the score (34). While nonparametric estimation of the hazard with high-dimensional covariates is largely an open problem, we offer one solution. Posit the existence of an orthonormal basis {f_j}_{j∈N} of L_2([−K, K]^p), together with a κ and a permutation π of the first κ elements of the basis such that g is well approximated in their span. Then, one can estimate g by a series estimator ĝ(Z), using least-squares regression on the first κ elements f_1(Z), ..., f_κ(Z), or a regularized least-squares regression in the spirit of Equation (15) on a large number of basis elements, possibly with group regularization. Observe that in the above, κ plays a role equivalent to that of the sparsity level. Similarly, we can obtain an estimate m̂(Z).
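A minimal sketch (Python) of the series idea for the propensity component m(Z); polynomial features stand in for the first κ basis elements purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegressionCV

def series_features(Z, degree=2):
    """Stand-in for the first kappa elements of a basis of L2([-K, K]^p):
    polynomial features, purely for illustration (the theory posits an
    orthonormal basis, which raw polynomials are not)."""
    return PolynomialFeatures(degree=degree, include_bias=False).fit_transform(Z)

def fit_propensity_series(Z, D, degree=2):
    # Series-type estimate of m(Z) = P(D = 1 | Z): logistic regression on the
    # basis elements, with cross-validated regularization.
    B = series_features(Z, degree)
    return LogisticRegressionCV(cv=5, max_iter=2000).fit(B, D)
```

For ĝ, the same basis matrix B could analogously be fed into the regularized additive hazards fit of Section 2.4.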

Learning the Cumulative Hazard Function
A generalization of model (33) loosens the additive structure between Z and λ_0(t) further, as in Equations (8) and (9). However, most ML approaches for the cumulative hazard cannot effectively isolate the treatment effect θ (or the hazard itself). The HDi-type estimator, as specified in Equation (19), is then no longer generally applicable. Below we propose a different approach that mimics the orthogonal score φ_i.
As the score (36) requires a unified estimator of the cumulative hazard, we provide a simple estimator that combines the arm-specific estimates Ĝ_0 and Ĝ_1 (Equation (37)). We present the cross-fitting strategy in Algorithm 2. We expect that, under conditions mimicking those of Section 3.2, the resulting estimator of θ is asymptotically normal; that is, θ̂_ml of Equation (40) satisfies a central limit theorem with a closed-form variance estimator, as long as the ML estimates satisfy rate conditions analogous to Assumption 2.

Algorithm 2: Estimation of the Treatment Effect with Cross-fitted Machine-Learning Estimates.
Data: split the data into k folds of equal size with index sets I_1, I_2, ..., I_k; stratify the data into two treatment arms with index sets I_trt and I_ctr.
for each fold indexed by j do
  1. estimate the propensity model m̂^(j)(z) using the out-of-fold (out of the jth fold) data indexed by I_{−j} = {1, ..., n} \ I_j;
  2. estimate the conditional cumulative hazards Ĝ^(j)_1(t; z) and Ĝ^(j)_0(t; z) using I_{−j} ∩ I_trt and I_{−j} ∩ I_ctr, respectively;
  3. denote D̂^(j)(t) = Σ_{i∈I_{−j}} D_i Y_i(t) / Σ_{i∈I_{−j}} Y_i(t) and construct the estimator for the partially linear additive hazards model;
  4. construct the cross-fitted score using the in-fold samples.
end

Heterogeneous Effects
So far we have assumed that the treatment effect is homogeneous given the covariates. Model (1) can be immediately extended to include interaction terms,

λ(t; D, Z) = λ_0(t) + (θ + αᵀW)D + βᵀZ,   (41)

where W ∈ R^q contains some known functions of Z. The choice of W is typically informed by the scientific context, and we assume that the dimension of W is fixed, q ≪ n. The goal is now to draw inference jointly on (θ, αᵀ).
Under model (41), the orthogonal score for (θ, αᵀ) becomes the vector-valued analogue φ_h(θ, α; β, Λ, γ) of Equation (7). By orthogonality and under conditions similar to Assumption 1, the solution (θ̂, α̂) to the system of equations n^{-1} Σ_{i=1}^n φ_{h,i}(θ, α; β̂, Λ̂, γ̂) = 0 is asymptotically normal. We emphasize that here the nuisance parameters β̂, Λ̂, and γ̂ need to be estimated on a holdout dataset, whereas the solution to the score equation above is found on the remaining data, as in the cross-fitting scheme. The variance can be estimated by the sandwich form Σ̂^{-1} V̂ Σ̂^{-1}/n, with Σ̂ and V̂ constructed from the score φ_h. The above interaction terms are special cases of heterogeneous treatment effects. More generally, heterogeneous treatment effects may be written as θ(W), or θ(Z). It is not straightforward to extend our orthogonal score method to estimate a nonparametric θ(·), possibly with high-dimensional Z. These are, however, important problems for future research.

Concluding Remarks
This article has developed a novel orthogonal score-based approach under the additive hazards model with high-dimensional covariates. The resulting estimate of the treatment effect is consistent and asymptotically normal at a root-n rate, even with (biased) input from regularized regression. We have presented the results under high-level assumptions, such as Assumption 1, and discussed the sparsity settings in which these high-level assumptions are satisfied for Lasso-type estimators. The discussion holds, equivalently, for any regularized estimators for which rates of estimation have been established: nonconvex penalties, group, hierarchical, or smoothing penalties, and so on. On the other hand, corresponding estimation rates have not been demonstrated for modern ML methods, including boosting, random forests, or neural networks; some of these have yet to be developed for censored data.
We highlight that the notion of double robustness, though clearly defined in low-dimensional problems, is no longer uniquely defined in the high-dimensional setting. Although semiparametric efficiency theory can be used to design doubly robust scores in low dimensions, as in Dukes et al. (2019), the resulting scores do not necessarily satisfy any of the notions of double robustness in high-dimensional settings. The problem here becomes rich with potentially many different aspects. We have shown that our cross-fitted orthogonal score has the rate double robustness property.
The above notion of rate double robustness assumes that the models are, in fact, correctly specified. Tan (2020b) and Smucler, Rotnitzky, and Robins (2019) relaxed this assumption, leading to what is referred to as "model double robustness." As pointed out by a reviewer, their assumptions amount to a sparsity assumption on the population minimizer of the (possibly misspecified) loss function; this in turn implies that the corresponding l_1-regularized estimators of the nuisance functions converge to a well-defined limit, and it is in fact quite a strong assumption. The simulation studies of this article indicate that our procedure can be valid in similar settings. In Hou, Bradic, and Xu (2019b), we show that consistent estimation can be achieved when one (but not both) of the two models is misspecified. However, inference in those settings requires substantial future work on suitable estimation of the nuisance parameters.

Supplementary Material
The supplementary materials contain complete proofs of the theoretical claims made in the main text.