Penalized robust learning for optimal treatment regimes with heterogeneous individualized treatment effects

The growing popularity of personalized medicine motivates the exploration of individualized treatment regimes based on the heterogeneous characteristics of patients. In large-scale data analyses, however, the data are often collected at different times and different locations, i.e. subjects come from a heterogeneous population, so the optimal treatment regimes may also vary for patients across different subgroups. In this paper, we focus on estimating optimal treatment regimes for subjects from a heterogeneous population with high-dimensional data. We first remove the main effects of the covariates for each subgroup to eliminate non-ignorable residual confounding. Based on the centralized outcome, we propose a penalized robust learning method that estimates the coefficient matrix of the interactions between covariates and treatment by penalizing the pairwise differences of the coefficients of any two subgroups for the same covariate, which automatically identifies the latent complex structure of the coefficient matrix with heterogeneous and homogeneous columns. At the same time, the penalized robust learning selects the important variables that truly contribute to the individualized treatment decisions through a commonly used sparsity penalty. Extensive simulation studies show that our proposed method outperforms current popular methods, and it is further illustrated in an analysis of the Tamoxifen breast cancer data.


Introduction
For many chronic diseases, such as cancer, cardiovascular disease and diabetes, patients often respond differently to a particular treatment, both in terms of the primary outcome and side effects. This difference in response has prompted many researchers to shift ideologically from a one-size-fits-all approach to the more rational approach of precision medicine [1,3,9,29]. Personalized medicine is a medical paradigm that focuses on the systematic use of information (covariates) about individual patients to find the most effective treatment decisions. The primary goal of precision medicine is to seek an individualized treatment regime (ITR), which can be defined mathematically as a function mapping from the covariate space into the treatment space.
There have been significant research developments in recent years on estimating optimal treatment regimes from both parametric and nonparametric approaches. The parametric approach usually posits an outcome regression model for the treatment outcome given the assigned treatment and patient covariate information, and its performance depends on the outcome model being correctly specified [2,7,11,16,17,23-25,33]. A representative method is Q-learning, which first models the conditional mean of the clinical outcome given the patient covariates and assigned treatment (the simplest way is to use a linear model); the optimal treatment regime is then easily obtained by choosing the treatment with the best mean outcome under the fitted regression model. Moreover, there are often a large number of pre-treatment variables that may or may not be useful for estimating an optimal individualized treatment regime, so Qian and Murphy [21] used ℓ1-penalized least squares to select the relevant variables. The nonparametric approach first directly estimates the marginal mean outcome for a given treatment rule, then seeks the optimal ITR by maximizing this estimator over a prespecified class of treatment rules via a suitable optimization algorithm. Such methods include outcome weighted learning [31,35-37], D-learning [19,20] and tree-based methods [12,32], which circumvent the need to model the conditional mean of the clinical outcome given the patient covariates and assigned treatment. However, it should be noted that the optimal treatment regimes may also vary for patients from different subgroups, where the heterogeneity comes from unobserved latent factors.
In an era of large-scale data analysis, samples are drawn from a heterogeneous population consisting of underlying subgroups; for example, the datasets of many observational studies and of almost all clinical trials are collected at different times and different locations. The optimal ITR we seek may therefore also vary for patients from different subgroups (units) of the heterogeneous population. It is clearly unreasonable to ignore the heterogeneity and assume that the coefficients of the interactions between covariates and treatment are the same for all units. The aim of this paper is to propose optimal treatment regimes for patients from different subgroups (units) based on observed data. A simple way to account for the heterogeneity is to estimate the optimal ITR for each unit separately, applying the above-mentioned methods to the dataset of that unit only. In the same spirit, Shi et al. [30] proposed a novel maximin projection learning method to aggregate the linear optimal ITRs obtained across different data units. This kind of approach may lose statistical efficiency and accuracy in estimating the optimal ITR because it does not use the entire data simultaneously and ignores the relationships among the datasets of different units. On the other hand, when we account for the heterogeneity by modeling the datasets from all units jointly, the interaction coefficients with the treatment are the same across all units for some covariates (homogeneity), whereas for other covariates the interaction coefficients coincide only for part of the units or differ across all units (heterogeneity). Such complex structures of the interaction coefficients, with both heterogeneity and homogeneity, become more common in high-dimensional datasets, which brings new challenges to the estimation of the optimal ITR in precision medicine. Ma and Huang [14] proposed a concave pairwise fusion penalized least squares approach for automatically identifying subgroup structures as well as estimating subgroup-specific intercepts. To handle complex structures in high-dimensional integrative analysis, Yang et al. [34] proposed penalizing the difference between any two of the coefficients of the same covariate to identify heterogeneous and homogeneous coefficients.
In this article, we propose a penalized robust learning (PR learning) method to estimate optimal ITRs by properly taking into account the complex structure of the coefficient matrix involved in the decision rules for subjects from different units. We first define a unit-specific centralized outcome by removing the main effects of the covariates, which allows us to learn the optimal treatment regime without specifying the unit-specific main effect models. Moreover, to detect the possible true structure of the coefficient matrix in the high-dimensional linear interaction model, we penalize the difference between the coefficients of any two units for the same covariate, which identifies heterogeneous and homogeneous coefficients, and we select the important variables that truly contribute to the individual response through a commonly used sparsity penalty. Compared with existing methods, our PR learning not only eliminates non-ignorable residual confounding arising from misspecification of the main effect model, but also takes advantage of the relationships among the datasets of different units.
The remainder of this paper is organized as follows. In Section 2, we first discuss the Q-learning method for K data units. Then we define the unit-specific centralized outcome to improve robustness and further introduce the penalized robust learning for estimating the optimal ITR. Section 3 describes the implementation of PR learning based on cross-fitting and an ADMM algorithm. The theoretical properties of the resulting estimator are studied in Section 4. We conduct extensive simulation studies to evaluate the performance of the proposed methods in Section 5. The method is then illustrated on the Tamoxifen breast cancer data [13] in Section 6 and on the AIDS clinical trial data [10] in the Supplemental Material.

Framework
Consider an observational study with data collected from K units, where the sample size of the kth unit is n_k ≥ 1. For the kth unit, X_k ∈ X_k ⊂ R^p denotes the p-dimensional vector of pretreatment covariates, A_k ∈ A_k = {0, 1} denotes the treatment assignment, and R_k is the observed clinical outcome after receiving the treatment. Without loss of generality, we assume that R_k is bounded and that larger values of R_k are more desirable. The observed data can then be denoted as {(X_ki, A_ki, R_ki) : i = 1, . . ., n_k, k = 1, . . ., K}, and d_k denotes a treatment regime in the kth unit, which maps from the covariate space X_k into the treatment space A_k. We postulate that the outcome of each unit satisfies the model

R_k = h_k(X_k) + A_k C_k(X_k) + ε_k, k = 1, . . ., K, (1)

where h_k(·) is the main effect of the covariates, C_k(·) is the interaction effect of these variables with the treatment choice, and ε_k is an independent error satisfying E[ε_k | X_k, A_k] = 0 with standard deviation σ > 0. The conditional mean outcome for the kth unit can be expressed as

E[R_k | X_k = x_k, A_k = a_k] = h_k(x_k) + a_k C_k(x_k),

where a_k ∈ {0, 1} is the observed treatment assignment for A_k. An optimal ITR d_k^opt(x_k) is a treatment regime d_k that maximizes the conditional mean outcome by assigning the value 0 or 1 to the treatment A_k based on the covariates x_k: the conditional mean equals h_k(x_k) + C_k(x_k) when a_k = 1, and h_k(x_k) when a_k = 0. Thus, we can write the optimal ITR as d_k^opt(x_k) = I(C_k(x_k) > 0), where only the interaction effect directly determines the corresponding treatment regime.
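As a concrete numerical illustration of model (1) and the resulting optimal rule, the outcome can be simulated as below. The particular forms of h(·) and C(·) here are hypothetical choices for illustration, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single unit: p = 3 covariates, n = 500 subjects.
n, p = 500, 3
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)          # randomized treatment, e(x) = 0.5

def h(X):                               # main effect of the covariates (illustrative)
    return 1.0 + X[:, 0] - 0.5 * X[:, 1]

def C(X):                               # interaction (treatment) effect (illustrative)
    return X[:, 0] - X[:, 2]

eps = rng.normal(scale=0.5, size=n)
R = h(X) + A * C(X) + eps               # outcome model R = h(X) + A*C(X) + eps

# Optimal ITR: treat exactly when the interaction effect is positive.
d_opt = (C(X) > 0).astype(int)
```

Because d_opt maximizes the conditional mean pointwise, the mean outcome under d_opt can never fall below the mean outcome under the observed (random) assignment.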
To learn the optimal ITR from the above data collected from K units, almost all existing methods would estimate the optimal treatment regime for each unit separately, i.e. the unit-specific Q-learning introduced in Section 2.2. However, two problems are inevitably encountered with this practice in precision medicine. First, it ignores the relationships among the datasets of different units by not using the entire data simultaneously, which results in a loss of statistical efficiency. Second, residual confounding cannot be completely excluded if the main effect models are misspecified, and identifying the complex structure of interaction effects with both heterogeneity and homogeneity brings new challenges to the estimation of the optimal ITR, especially in the high-dimensional case. To this end, we propose a novel method, penalized robust learning, to address these issues in Section 2.3.

Unit-specific Q-learning
Q-learning, a regression-based method, is commonly used to estimate decision rules in precision medicine. The method depends on Q-functions, which measure the quality of a treatment assigned to a patient with given covariates, and an optimal decision rule is one that maximizes the Q-function. Since the observed data come from K units with heterogeneity, we define the unit-specific Q-function for the kth unit as

Q_k(x_k, a_k) = E[R_k | X_k = x_k, A_k = a_k],

which is exactly the conditional mean outcome mentioned above; the Q-function measures the quality of treatment a_k ∈ {0, 1} at observation x_k. The simplest way to model the Q-function is linear regression,

Q_k(x_k, a_k) = x_k^T α_k + a_k x_k^T β_k,

where α_k = (α_k1, . . ., α_kp)^T and β_k = (β_k1, . . ., β_kp)^T denote the regression coefficients of the main effect function and the interaction effect function, respectively. We can then obtain the least squares estimates of α_k and β_k for each unit separately, where the estimation can be completed using the Moore-Penrose generalized inverse if the estimators of α_k and β_k are not identifiable or unique in some special cases.
In addition, for high-dimensional data we can obtain the estimators of α_k and β_k via ℓ1-penalized least squares (ℓ1-PLS) [21].
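A minimal sketch of the unit-specific least squares fit for a single unit follows. An intercept and a treatment main-effect column are included for numerical stability; this parameterization, and the synthetic data, are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_unit_q(X, A, R):
    """Least squares fit of Q(x,a) = b0 + x'alpha + a*(c0 + x'beta) for one unit."""
    n, p = X.shape
    D = np.column_stack([np.ones(n), X, A, A[:, None] * X])   # design matrix
    coef, *_ = np.linalg.lstsq(D, R, rcond=None)              # Moore-Penrose solution
    alpha, beta = coef[1:p + 1], coef[p + 2:]                 # main / interaction parts
    return alpha, beta

# One synthetic unit with a known interaction coefficient vector.
n, p = 300, 4
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)
beta_true = np.array([1.0, -1.0, 0.0, 0.0])
R = X @ np.full(p, 0.5) + A * (X @ beta_true) + rng.normal(scale=0.3, size=n)

alpha_hat, beta_hat = fit_unit_q(X, A, R)
d_hat = (X @ beta_hat > 0).astype(int)    # estimated regime: treat iff x'beta_hat > 0
```

The fitted interaction coefficients, not the main effects, drive the estimated regime, mirroring the discussion above.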

Penalized robust learning
As previously noted, the treatment regime depends only on the model C_k(·) for the interactions between treatment and covariates, not on the main effect model h_k(·). Moreover, misspecification of the main effect model h_k(·) may result in residual confounding and efficiency loss. To address this issue, we define the unit-specific centralized outcome

R̃_k = R_k − m_k(X_k), (2)

where m_k(X_k) = E[R_k | X_k] is the expectation of the outcome R_k conditional on the covariates X_k only, not on the treatment assignment A_k. For each unit, by removing the main effects of the covariates, the centralized outcome may better reflect the heterogeneity in the response to treatment. According to model (1), we have

E[R̃_k | X_k, A_k] = {A_k − e_k(X_k)} C_k(X_k), (3)

where e_k(x_k) = Pr(A_k = 1 | X_k = x_k) is the propensity score for the kth unit. The unit-specific main effect model (1) can thus be replaced with model (3), based on the propensity score and the centralized outcome. Model (3) allows e_k(·) to vary across units, which we refer to as treatment assignment heterogeneity. These sources of heterogeneity are not related to treatment decisions, since they do not appear in the interaction effect function C_k(·). For simplicity, we may assume that e_1(·) = · · · = e_K(·) if the propensity scores of different units tend to be the same in practice, e.g. in clinical trials. This kind of decomposition was originally used by [27] to estimate the parametric components in partially linear models. In this case, the optimal treatment regime can be obtained by maximizing the conditional expectation of the centralized outcome under a linear assumption on the interaction term,

C_k(x_k) = x_k^T β_k, (4)

where β_k = (β_k1, . . ., β_kp)^T contains the unit-specific interaction effects for the kth unit. We further let β = (β_1, . . ., β_K)^T be the K × p matrix whose jth column represents the interaction effects of the treatment and the jth covariate across all K units. The proposed model (4) clearly depends on the availability of e_k(·) and m_k(·); the advantage is that we can use flexible approaches to estimate them, such as nonparametric regression or machine learning methods for m_k(·), even for high-dimensional data. In particular, e_k(·) is known and usually specified before the study in a randomized clinical trial.
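The centralization step in (2)-(3) can be checked numerically: once the oracle m(x) = h(x) + e(x)C(x) is removed, regressing the centralized outcome on (A − e)X recovers β even when h(·) is highly nonlinear. The specific h and β below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 1000, 3
X = rng.normal(size=(n, p))
e = 0.5                                  # known propensity (randomized trial)
A = rng.binomial(1, e, size=n)
beta_true = np.array([1.0, 0.0, -1.0])

h = np.sin(X[:, 0]) + X[:, 1] ** 2       # nonlinear main effect (illustrative)
C = X @ beta_true                        # linear interaction effect, model (4)
R = h + A * C + rng.normal(scale=0.3, size=n)

m = h + e * C                            # m(x) = E[R|X=x] = h(x) + e(x) C(x); oracle here
R_tilde = R - m                          # centralized outcome, equation (2)
W = (A - e)[:, None] * X                 # E[R_tilde|X,A] = (A - e(X)) x'beta, equation (3)
beta_hat, *_ = np.linalg.lstsq(W, R_tilde, rcond=None)
```

In practice m(·) and e(·) are unknown and are replaced by the cross-fitted estimates of Section 3; the oracle versions are used here only to isolate the effect of centralization.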
Regarding the coefficient matrix β in the linear interaction model (4), the kth unit-specific interaction coefficient may equal that of the k'th unit for some covariates, i.e. β_kj = β_k'j for some j ∈ {1, . . ., p}. On the other hand, with high-dimensional data the majority of the entries of the true coefficient matrix β are zero or very close to zero, so β is very sparse. These two characteristics lead to complex structures in the coefficient matrix: some columns are heterogeneous (β_kj ≠ β_k'j for some k and k'), other columns are homogeneous (β_1j = · · · = β_Kj ≠ 0), and most columns are zero (sparsity). The unit-specific Q-learning of Section 2.2 ignores the information contained in these complex structures and estimates the optimal treatment regime for each data unit separately, which substantially reduces the statistical efficiency and accuracy of the estimated optimal ITR.
To properly take into account the possible true structure of the coefficient matrix in the high-dimensional linear interaction model (4), we propose a penalized robust learning method based on the penalized least squares estimator

β̂ = argmin_β (1/2n) Σ_{k=1}^K Σ_{i=1}^{n_k} [R̃_ki − {A_ki − e_k(X_ki)} X_ki^T β_k]^2 + Σ_{j=1}^p Σ_{k<k'} p_λ(|β_kj − β_k'j|) + Σ_{k=1}^K Σ_{j=1}^p p_γ(|β_kj|), (5)

where n = Σ_k n_k, p_λ and p_γ are penalty functions inspired by the work of [34], and λ, γ ≥ 0 are tuning parameters that control the amount of penalty on the |β_kj − β_k'j| and the |β_kj|, respectively. The first penalty acts on the pairwise differences between the unit-specific interaction coefficients of each covariate across units, identifying the heterogeneous and homogeneous structures of the coefficient matrix. The second penalty recovers the sparse structure of the coefficient matrix, as is standard in high-dimensional settings.
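The double-penalized criterion can be written down directly. The MCP form of the penalties and the exact weighting below follow common practice in this literature [14,34] and are a sketch rather than the paper's verbatim objective.

```python
import numpy as np
from itertools import combinations

def mcp(t, lam, a=3.0):
    """Minimax concave penalty, elementwise: lam*|t| - t^2/(2a) for |t| <= a*lam,
    and a*lam^2/2 beyond that threshold (the penalty saturates)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), 0.5 * a * lam ** 2)

def pr_objective(beta, W_list, Rt_list, lam, gam):
    """Penalized least squares of the form (5): squared loss over all units plus
    a fusion penalty on pairwise row differences and a sparsity penalty.

    beta    : (K, p) interaction coefficient matrix
    W_list  : per-unit design matrices, W_k = (A_k - e_k) * X_k
    Rt_list : per-unit centralized outcomes
    """
    K, p = beta.shape
    n = sum(len(r) for r in Rt_list)
    loss = sum(np.sum((Rt_list[k] - W_list[k] @ beta[k]) ** 2)
               for k in range(K)) / (2 * n)
    fusion = sum(mcp(beta[k, j] - beta[l, j], lam)
                 for j in range(p) for k, l in combinations(range(K), 2))
    sparsity = mcp(beta, gam).sum()
    return loss + float(fusion) + float(sparsity)
```

When all rows of beta coincide, the fusion term vanishes, so merging units is never penalized; the concavity of MCP means large, genuinely heterogeneous differences incur only a bounded penalty.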

Estimation with cross-fitting
The implementation of the proposed penalized robust learning requires estimates of e_k(·) and m_k(·), which can be obtained using cross-fitting [5,28] with methods tuned for optimal predictive accuracy. Specifically, we first split the data of each unit into S evenly sized parts, denoted by {D_k1, . . ., D_kS} for k = 1, . . ., K, where S is typically set to 5 or 10.
For a subject i ∈ D_ks with covariate X_ki, let m̂_k(X_ki) and ê_k(X_ki) denote the predictions made with {D_k1, . . ., D_kS} − D_ks as the training data. In this article, we adopt a logistic regression model for e_k(·) and a sophisticated nonparametric approach, such as XGBoost [4], for m_k(·).
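A minimal numpy-only sketch of the cross-fitting step for a single unit is given below. Ordinary least squares stands in for the paper's XGBoost choice for m_k(·), and the fold-wise treatment frequency stands in for the logistic model for e_k(·); both substitutions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_fit(X, A, R, S=5, seed=0):
    """Out-of-fold predictions m_hat(X_i), e_hat(X_i) for one unit.

    Each subject's prediction is made by a model trained on the other
    S - 1 folds, so no observation is predicted by a model it helped fit.
    """
    n = len(R)
    folds = np.random.default_rng(seed).permutation(n) % S
    m_hat, e_hat = np.empty(n), np.empty(n)
    for s in range(S):
        test, train = folds == s, folds != s
        D = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(D, R[train], rcond=None)   # OLS stand-in for m_k
        m_hat[test] = np.column_stack([np.ones(test.sum()), X[test]]) @ coef
        e_hat[test] = A[train].mean()                          # stand-in for logistic e_k
    return m_hat, e_hat

# Usage on one synthetic unit.
n, p = 200, 3
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)
R = X @ np.array([1.0, -1.0, 0.0]) + rng.normal(scale=0.3, size=n)
m_hat, e_hat = cross_fit(X, A, R, S=5)
```

In the full procedure this is repeated for each of the K units before plugging m̂_k and ê_k into the objective.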
We then estimate β via a plug-in version of (5), in which m_k(·) and e_k(·) are replaced by their cross-fitted estimates m̂_k(·) and ê_k(·) in the objective function; the estimator β̂ is the minimizer of this plug-in objective. The resulting optimal treatment regimes are then given by d̂_k^opt(x_k) = I(x_k^T β̂_k > 0).
In addition, the parameters ρ_1, ρ_2 of the ADMM algorithm and the parameter a of the MCP penalty are fixed at ρ_1 = ρ_2 = 1 and a = 3, and the optimal λ and γ are selected by minimizing a BIC criterion [14,34]. Letting q_j denote the number of distinct non-zero coefficients of the jth variable, BIC(λ, γ) is defined in terms of the residual sum of squares of the fitted model and the total number of distinct non-zero coefficients Σ_j q_j.
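The tuning-parameter search can be organized as a simple grid search. The paper's exact BIC formula is not reproduced in this excerpt, so the criterion below uses the generic form log(RSS/n) + df·log(n)/n, with df the number of distinct non-zero coefficients; treat it as a placeholder, not the paper's definition.

```python
import numpy as np

def bic_select(fit, lam_grid, gam_grid, n):
    """Grid search over (lambda, gamma) with a BIC-type criterion.

    `fit(lam, gam)` is assumed to return (rss, df), where df counts the
    distinct non-zero coefficient levels across units after fusion.
    """
    best, best_val = None, np.inf
    for lam in lam_grid:
        for gam in gam_grid:
            rss, df = fit(lam, gam)
            val = np.log(rss / n) + df * np.log(n) / n   # generic BIC-type trade-off
            if val < best_val:
                best, best_val = (lam, gam), val
    return best
```

Larger penalties merge and zero out more coefficients (smaller df) at the cost of a larger RSS; the criterion balances the two.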

The initial value β_k^(0) can be obtained by applying the ordinary penalized least squares estimator to each unit separately.

Theoretical results
In this section, we study the oracle property of the resulting estimator of the proposed penalized robust learning. We first introduce some additional notation and regularity conditions. Let φ_t^j be the tth subgroup-specific coefficient of the jth variable, and let vec(β) = (β_11, . . ., β_K1, . . ., β_1p, . . ., β_Kp)^T. Each β ∈ M_S can be written as vec(β) = Γφ for a grouping matrix Γ, where φ = (φ_1, . . ., φ_T)^T collects the distinct coefficients. Dividing φ into its non-zero and zero parts, we write φ = (φ_K1^T, φ_K2^T)^T, where φ_K1 ∈ R^q and φ_K2 ∈ R^{S−q} correspond to the non-zero and zero components of φ, respectively. Let (1/2) min_{1≤j≤q} |φ_{0K1,j}| be half of the minimum signal, and let b_n = min_{j: T_j>1} min_{t≠t'} |φ_{0t}^j − φ_{0t'}^j| be the minimum difference between the coefficients of two different integrative groups for any variable with heterogeneity. Let m be the vector stacking m_k(X_ki) over all subjects, with ith element m_i, and let e be the corresponding stacked vector of propensity scores e_k(X_ki), with ith element e_i. Let X̃ = ((A_1 − e_1)X̃_1, . . ., (A_n − e_n)X̃_n)^T denote the design matrix induced by the grouping structure, where X̃_i is the transformed covariate vector of the ith subject whose (j, t) group component equals X_kj if k ∈ S_t^j and 0 otherwise. Let n_min = min_k n_k, and let C and the C_j denote positive constants.
To establish the theoretical properties of the estimators from penalized robust learning, we define an integration-oracle estimator based on the underlying integrative structure, in which the grouping structure of the heterogeneous covariates, i.e. the grouping matrix, is given in advance and the true m_i and e_i are known. The oracle estimator φ* = (φ*_K1^T, φ*_K2^T)^T is the least squares estimator under this known structure. The oracle property is established under the following assumptions.

Assumption 4.3: The noise vector ε has sub-Gaussian tails, i.e. Pr(|α^T ε| > ‖α‖x) ≤ 2 exp(−c_1 x^2) for any vector α and any x > 0. Here X_k1 denotes the submatrix of X_k corresponding to the non-zero elements β_k1 of β_k.

Assumption 4.7: The support of X_k and the conditional treatment effect C_k(X_k) are uniformly bounded.
Assumptions 4.1 and 4.3 are commonly used in high-dimensional regression. Assumption 4.2 imposes reasonable conditions on the estimators of the nuisance parameters, which are satisfied by many flexible methods. Assumptions 4.4 and 4.5 concern the true structure; for more remarks please refer to [8,34]. Assumption 4.6 provides the condition on λ needed to establish the oracle property of the proposed estimator.
We first show in the following lemma that, with high probability, the plug-in objective function is close to its counterpart L(β) evaluated at the true m_k and e_k.

Lemma 4.1: Under Assumptions 4.2 and 4.7, the approximation holds uniformly over a neighborhood of the true coefficients.

We then show that β* is a local minimizer of the plug-in objective, which yields the following theorem. The theorem demonstrates that under mild conditions, i.e. b_n ≫ 1/√n_min and estimators m̂_k and ê_k satisfying Assumption 4.2, the proposed method recovers the true grouping structure of the heterogeneous covariates and the proposed estimator enjoys the oracle property. The proof of the theorem is provided in the Supplemental Material.

Simulation
In this section, we conduct extensive simulations to evaluate the performance of the proposed penalized robust learning (PRL), where m_k(·) is estimated using penalized least squares regression (PRL-reg) or XGBoost (PRL-Xg) based on fivefold cross-fitting. The propensity scores e_k(·) of PRL-reg and PRL-Xg are estimated through logistic regression models throughout the following simulations. We also compare the proposed PRL-reg and PRL-Xg with five competing methods: Q-learning with ℓ1-penalized least squares (ℓ1-PLS) [21], the unit-specific Q-learning with penalized least squares (US-QL) introduced in Section 2.2, the unit-specific D-learning (US-DL), the unit-specific IPWE (US-IPWE) and its doubly robust version (US-AIPWE), where US-DL, US-IPWE and US-AIPWE estimate the optimal treatment regimes for each unit separately using D-learning [19], the IPW estimator and the AIPW estimator [26,35], respectively.
We first generate the clinical covariates X_k from a multivariate normal distribution with mean 0, and generate the outcomes from model (1) with ε_k ∼ N(0, 0.5²). We consider both linear and nonlinear choices of the main effect h(X_k), so that for the kth unit the optimal treatment regime is determined by the sign of the interaction effect; the full specifications of the scenarios are provided in the Supplemental Material. In addition, US-IPWE and US-AIPWE are only used in cases (1) and (2), since they are inapplicable to the high-dimensional case. All results are based on 100 replications.
We compare the performance of the methods in three respects: whether the estimated optimal treatment regimes are correct, whether the variables are well chosen, and whether the estimated grouping structure is correct. We first define some notation used in the simulations. Let β̂(m) be the estimate of β = (β_1, . . ., β_K)^T in the mth replication. For the kth unit, the index set of the selected variables is V̂(m)_k = {j : β̂(m)_kj ≠ 0}, the true index set of the relevant variables is V_k0 = {j : β_kj ≠ 0}, and v_k0 = |V_k0| is the corresponding cardinality. To evaluate the estimation of the optimal treatment regimes, we calculate the value function under the estimated optimal ITR (Value), given by n^{−1} Σ_{k=1}^K Σ_{i=1}^{n_k} {h(X_ki) + d̂_k(X_ki) C_k(X_ki)}, and the percentage of correct decisions (PCD), i.e. the average proportion of subjects assigned their truly optimal treatment. To assess variable selection, we calculate the Coverage Probability (CP), the proportion of replications in which V̂(m)_k ⊇ V_k0, which measures how well the relevant variables are chosen; the Percentage of Correct Zeros (PCZ), the average of Σ_j I(β̂(m)_kj = 0) I(β_kj = 0)/(p − v_k0), which characterizes the capability of the method to produce sparse solutions; and the Percentage of Incorrect Zeros (PICZ), defined analogously for truly non-zero coefficients estimated as zero. To compare the ability of each method to identify the subgroups, we use the Rand Index (RI, [22]) to measure the similarity between the estimated clustering and the true structure for each covariate separately. The values of PCD, CP, PCZ, PICZ and RI all lie between 0 and 1; higher values of PCD, CP, PCZ and RI indicate better performance, while a lower value of PICZ indicates better performance. The simulation results for p = 15 and n_k = 100 are presented in Table 1. As expected, the ℓ1-PLS method performs particularly poorly in all three scenarios, since it ignores the heterogeneity and assumes that the coefficients of the covariates are the same across all units. For Scenario 1, the performance of PRL-reg is similar to that of PRL-Xg because of the linear form of the main effect model. In contrast, the performance of US-QL is slightly worse; for example, it has smaller CP and PCZ and larger PICZ, which implies that the relevant variables are not well chosen. One reason may be that this approach ignores the relationships among the datasets of different units. US-IPWE performs slightly worse than US-DL, probably because US-IPWE cannot select the important variables. US-AIPWE performs better because of its double robustness property. In Scenarios 2 and 3, the true main effect models are nonlinear, so the main effect models of PRL-reg and US-QL are misspecified and these two methods do not perform very well. The other methods do not require specifying the main effect models and are therefore less affected. In particular, PRL-Xg estimates m_k(·) nonparametrically and performs very well, achieving higher values and smaller variances of PCD and Value, for example, which shows that PRL-Xg can estimate the optimal treatment regimes more accurately. Moreover, our proposed methods (PRL-reg and PRL-Xg) accurately recover the complex structure of the coefficient matrix, since the estimated RI values for both the heterogeneous variables (1st RI) and the homogeneous variables (2nd RI) are close to one in most cases. We also evaluate all methods with a larger sample size (p = 15, n_k = 200), with results presented in Table 3. The conclusions are very similar to those drawn from Table 1, except that the effect of model misspecification on PRL-reg and US-QL is even more visible, while PRL-Xg remains robust to model misspecification across scenarios. The excellent performance of PRL-Xg confirms its power in finding the optimal treatment regimes when the sample size is large. The simulation results for the higher-dimensional case (p = 100) are summarized in Table 2 (n_k = 100) and Table 4 (n_k = 200). Although all methods perform somewhat worse in the high-dimensional case, PRL-Xg is still significantly better than the other methods when the main effect model is nonlinear. One possible source of the observed
heterogeneous treatment effects is model misspecification. Therefore, in the following simulation, we further examine the performance of our proposed method when the heterogeneity arises from model misspecification. We again generate R_k from the outcome model (1), now letting the interaction effect take a different functional form for units k = 5, 6.
In this case, the interaction effects C_k(X_k) follow different models for different units, producing hypothetical heterogeneity caused by model misspecification when we try to fit a common linear model to all three of them. Figure 1 displays the plots of the three models of C_k(·) in order from left to right; there is no dramatic difference between the left and center plots, but the right plot differs significantly from both. The clinical covariates X_k, the treatments A_k, the main effect model h(·) and the error term ε_k are simulated in the same way as in the above simulations. The simulation results for n_k = 200 are presented in Figure 2. Since the coefficients are set to be the same in the models of C_k(·) across all units, different estimates of these coefficients would imply that model misspecification can cause observed heterogeneous treatment effects, and that our method can identify such heterogeneity to some extent. From Figure 2, we observe that the first coefficient of the interaction effects C_k(·) (first column) differs significantly in units k = 5, 6 from the other units; thus our method is able to identify the hypothetical heterogeneity caused by the model misspecification (k = 5, 6). We also observe that the coefficients are indistinguishable across units k = 1, 2, 3, 4, since the second model is very close to the first model, although it is nonlinear, as shown in Figure 1. Similar results for n_k = 100 can be found in Figure S1 in the Supplemental Material. In summary, our PR learning remains worthwhile even when the model is misspecified.
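The Rand Index used above to score the recovered grouping structure admits a compact implementation: it is the fraction of unit pairs on which the two clusterings agree about whether the pair belongs to the same group.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand Index between two clusterings of the same K units.

    For every pair of units, the clusterings agree if both place the pair
    in the same group, or both place it in different groups.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that the index is invariant to relabeling of the groups, so a clustering that matches the truth up to a label permutation still scores 1.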

Applications to breast cancer patient data
In this section, we apply the proposed method to breast cancer patient data from the cohort GSE6532 [13]. The dataset consists of 414 subjects: 277 subjects who received Tamoxifen treatment (A = 1) and 137 subjects who received no systemic adjuvant treatment (A = 0). We classify the subjects into four units according to the grade of morphological features in breast cancer, i.e. the histologic grade category (4 categories) based on the Elston-Ellis grading system [6]. For this dataset, we select the time of relapse-free survival as the outcome variable R. After excluding observations with missing outcomes, the final dataset includes 393 observations, with a total of 44,928 gene expression measurements available. We further compute the marginal variance of each gene expression and select the 200 genes with the largest activity levels. In addition, we include two clinical variables as predictors: age (in years) and tumor size (in cm). In this analysis, the treatment was not randomly assigned and the propensity score is unknown; hence, the treatment propensities are estimated using logistic regression models.
We estimate the coefficient matrix of the optimal treatment regimes via PRL-reg, and the results are presented in Figure 3. Our proposed method identified three important prescriptive gene expressions: IFIT1 (probe 203153_at), PVALB (probe 205336_at) and MSMB (probe 207430_s_at). We observe that for low PVALB expression and low MSMB expression, the treatment is better than the control in the first unit. Moreover, patients from unit 2 should be assigned to treatment if the expressions of IFIT1 and MSMB are low. For units 3 and 4, the treatment regimes are similar to each other: patients with low MSMB expression should be assigned to treatment. Our conclusions are consistent with the medical literature. Specifically, G2SBC (Genes-to-Systems Breast Cancer Database) [15] reports that PVALB is relevant to breast cancer. Moreover, although the current understanding of the IFIT proteins and their role in cellular signaling mechanisms is limited to the antiviral immune response and innate immunity, research suggests that IFIT proteins could be novel therapeutic targets for cancer therapy [18]. We also compare four methods, PRL-reg, PRL-Xg, ℓ1-PLS and US-QL, in the following analysis. For PRL-reg and PRL-Xg, we again use fivefold cross-fitting for ê_k and m̂_k. For a fair comparison, we randomly divide the dataset into half training data and half testing data 500 times, and compute the estimated value function V̂_n(d̂^opt) of the estimated optimal treatment regime in the K units.
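On the held-out test halves, the value of an estimated regime can be scored with a propensity-weighted estimator. The form below is the standard inverse-probability-weighted (IPW) value estimator, which may differ in detail from the paper's exact V̂_n.

```python
import numpy as np

def ipw_value(R, A, d, e_hat):
    """IPW estimate of the mean outcome if everyone followed regime d.

    Only subjects whose observed treatment agrees with the regime contribute,
    each weighted by the inverse probability of receiving that treatment.
    """
    R, A, d, e_hat = map(np.asarray, (R, A, d, e_hat))
    w = (A == d) / np.where(A == 1, e_hat, 1 - e_hat)
    return np.sum(w * R) / np.sum(w)
```

With estimated propensities ê_k from the logistic models, this quantity is computed on each test split and averaged over the 500 random splits.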
The estimated value functions are given in Table 6. The estimated value functions of PRL-reg and PRL-Xg are clearly larger than those of ℓ1-PLS and US-QL, which implies that our PR learning methods are useful for estimating optimal ITRs when the data consist of multiple units with heterogeneity and homogeneity.

Discussion
In this article, we first defined the unit-specific centralized outcome, which better reflects the heterogeneity in the response to treatment. The unit-specific main effect model is replaced with a model involving the propensity score and the mean outcome given covariates, both of which can be estimated using suitable consistent estimators from a wide choice of methods. This avoids non-ignorable residual confounding arising from misspecification of the unit-specific main effect models. Then, we estimated the regression coefficients of the interaction effects by constructing a double penalization procedure, which identifies heterogeneous and homogeneous coefficients and simultaneously recovers the sparsity structure of the coefficient matrix. Our proposed method can be used to estimate optimal treatment regimes for high-dimensional data from different units with heterogeneity and homogeneity, and the oracle property of the resulting estimator was also studied. Finally, the proposed method has better numerical performance than competing methods, as shown in our simulation studies and real data analyses.
In practice, one possible source of observed heterogeneous treatment effects is model misspecification. How to identify heterogeneous treatment effects caused by model misspecification is an interesting and challenging problem. As a preliminary attempt, we carried out simulation studies to validate the performance of our proposed method in this situation, and the results showed that our method can identify the heterogeneity generated by different models to some extent. However, this problem merits further study.
In our proposed model, m_k and e_k can be estimated by flexible methods, but the accuracy of the estimators m̂_k and ê_k is required to guarantee the oracle property. Moreover, double robustness is a desirable property, especially in estimating optimal treatment regimes, and it would be valuable to construct doubly robust estimators for high-dimensional data from different units with heterogeneity and homogeneity.
Several other improvements and extensions can be explored in future studies. In this article, we only considered optimal individualized treatment regimes for a single stage, but our PR learning can be generalized to the multi-stage setting, involving multiple-stage Q-functions, to estimate optimal dynamic treatment regimes by applying backward learning methods.


Figure 2. The level plots of the coefficient matrix estimation for Scenario 1 (left), Scenario 2 (center) and Scenario 3 (right) with n_k = 200. From top to bottom, the first two rows are PRL-reg with p = 15 and 100, and the last two rows are PRL-Xg with p = 15 and 100.

Figure 3. The level plots of the coefficient matrix estimation for the four units.

Table 1. Simulation results for p = 15 and n_k = 100.

Table 2. Simulation results for p = 100 and n_k = 100.

Table 3. Simulation results for p = 15 and n_k = 200.

Table 4. Simulation results for p = 100 and n_k = 200.

Table 5. Simulation results of the Rand Index for PRL-reg and PRL-Xg.

Table 6. Empirical value functions on test data for the Tamoxifen breast cancer data.