From Logic-Respecting Efficacy Estimands to Logic-Ensuring Analysis Principle for Time-to-Event Endpoint in Randomized Clinical Trials with Subgroups

Abstract An important goal of precision medicine is to identify biomarkers that are predictive, and tailor the treatment according to the biomarker levels of individual patients. Differentiating prognostic versus predictive biomarkers impacts important decision makings for patients and treating physicians. Using Hazard Ratio (HR) can mistake a purely prognostic biomarker for a predictive one leading to a disheartening possibility of depriving patients of beneficial treatment as demonstrated in the OAK trial. This stems from the illogical issue of HR at population level where marginal HR in the overall population can be larger than those in both subgroups. Instead of trying to circumvent this issue by discouraging comparisons between marginal and conditional HRs, we propose to directly fix it by using alternative logic-respecting efficacy estimands such as ratio of medians, ratio and difference of restricted mean survival times and milestone probabilities. These measures are straightforward, easy to interpret and clinically meaningful. More importantly, they will guarantee agreement between marginal and conditional efficacy and provide cohesive message around efficacy profile of the drug in the presence of subgroups. A step further is the application of Subgroup Mixable Estimation (SME) principle to ensure logical estimates when analyzing real clinical trial data. Detailed guidance is provided for the aforementioned logic-respecting estimands using either parametric, semiparametric or nonparametric approaches. Simultaneous inference can be provided with proper multiplicity adjustment to facilitate joint decision making with user-friendly apps.


Introduction
Many therapeutic products developed nowadays have its natural target of patient subgroups based on mechanism of action. For example, PD-1 inhibitors such as nivolumab and pembrolizumab are known to work better for patients with higher PD-L1 expression levels in many cancer indications. Gandara et al. (2018) assessed a biomarker called blood-based tumor mutational burden (bTMB) for selecting patients with higher treatment benefit when receiving atezolizumab in nonsmall cell lung cancer (NSCLC). The pivotal trial that reputedly confirmed the predictiveness of the biomarker is the OAK study where the authors showed a monotonic decreasing trend of overall survival (OS) hazard ratio (HR) from 0.70 to 0.50 for patients with bTMB value greater than a cutoff (marker positive patients) as it increases from 4 to 26. However, based on the reanalyses by Liu et al. (2020) such as examining HR for marker negative patients, and modeling OS data with continuous bTMB values, treatment indicator and interaction term, the biomarker turns out to be essentially purely prognostic for OS.
As pointed out in Liu et al. (2020), the issue lies with HR as an efficacy estimand at the population level. When there is prognostic or predictive effect in the subgroups, HR for the overall population is not well defined because it is a function CONTACT Yi Liu yiliu4321@gmail.com Nektar Therapeutics, San Francisco, CA. * Additional affiliation: Seagen Inc, Bothell, WA.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/SBR. of time. Nevertheless, a common marginal HR estimator can be obtained through the Cox model with treatment indicator as the only covariate. It has been shown in Xu and O'Quigley (2000) that this marginal estimator is approximately estimating the average HR over time integrated over the marginal survival functions. Liu et al. (2020) showed that this marginal estimand gets closer to one as the prognostic effect increases for a purely prognostic biomarker. This explained the puzzling behavior of OS HR in the OAK study. When more subgroups of patients with stronger prognostic effect are mixed, this marginal HR estimator moves closer to 1. That is why the lower the cutoff value, the higher the HR creating the illusion that bTMB is predictive while in fact it is purely prognostic.
One additional issue surfaced was that marginal HR for the overall population can be higher than the subgroup HRs as illustrated using a purely prognostic biomarker in Figure 5 of Liu et al. (2020). More generally, when there is prognostic and/or predictive effect in a biomarker, the marginal HR estimate for the overall population can be outside the range of the subgroup HRs. For example, in the Phase 3 KeyNote-426 trial where pembrolizumab plus axitinib is compared to sunitinib for Advanced Renal-Cell Carcinoma, the HR for white and nonwhite race patients are 0.7 and 0.63 while the HR for the overall population is 0.71 (Powles et al. 2020). As another example, Paz-Ares et al. (2021) showed the efficacy results of nivolumab plus ipilimumab with chemotherapy compared to chemotherapy in first line NSCLC where HR for PD-L1< 1% and ≥ 1% are 0.62 and 0.64, both of which are smaller than the overall marginal HR of 0.67. Ding, Lin, and Hsu (2016) first termed this phenomenon as illogical behavior of HR and defined an efficacy estimand as logic-respecting if the marginal estimand is always in between the subgroup estimands. It should be stressed that this definition is at the population level and irrespective of the data collection process and analysis methodologies. It naturally makes sense to only use logic-respecting efficacy estimands to report clinical trial results when there is a subgroup effect. The main goal of this article is to provide a list of logic-respecting efficacy estimands and detailed guidance on how to ensure the same logic holds in the efficacy estimates when analyzing clinical trial data in practice.
The article is organized as follows. We will first discuss a few common misconceptions and provide some clarifications in Section 2. Then we will provide logic respecting efficacy estimands and point out the incorrect analysis methods in current computer software, followed by the introduction and detailed steps in implementing Subgroup Mixable Estimation (SME) principle (Ding, Lin, and Hsu 2016) that will guarantee logical estimates in Section 3. We perform simulations and evaluate the performance of our proposed simultaneous CI using SME under either parametric Weibull model or nonparametric Kaplan Meier (KM) methods in Section 4. Section 5 contains real clinical trial examples where the data are reanalyzed to report logic respecting efficacy estimates following SME. We conclude the article with some discussions in Section 6.

Common Misconceptions
In this section, we would like to clarify a few common misconceptions around the topic of efficacy estimands in randomized clinical trials. Here we use the term "estimand" to represent the "population-level summary" attribute of the broad term "estimand" described in ICH E9 (R1) (European Medicines Agency 2020). We first point out that a specific efficacy estimand should not be linked with a specific model as many different estimands can be estimated from the same model. We then clarify the subtle difference in the definition of logic respecting estimand versus collapsible estimand, both of which are properties of an estimand, not a model. Finally, the distinction of conditional/marginal estimand versus conditional/marginal model is presented.
We would like to detach the population level summary measure, that is, "estimand" from modeling technologies and associated terminologies that is used for devising estimators. Estimators for different estimands can come from the same model. On the other hand, estimators from different models can be used to estimate the same estimand.

Detach Efficacy Estimand from Specific Model
When we model efficacy data with certain models, it is a common practice to define the estimand as the natural parameter of the model. For example, odds ratio (OR) is typically the estimand using a logistic model, while relative risk (RR) is the typical estimand for a log-linear model. This is mostly done out of computational convenience. There is no statistical rationale to link the two concepts. The choice of estimand should depend on how the results can be meaningfully interpreted from a medical perspective, while the choice of model should depend on the assumption of the data structure. One can model binary data using a logistic model and still assess efficacy by the logicrespecting measure RR. Likewise, one can model time-to-event data using a Weibull accelerated failure time model (instead of a Cox model) and choose to assess efficacy by hazard ratio (instead of a ratio of time).

Logic Respecting versus Collapsible Estimands
Let us denote T ∈ {Rx, C} the treatment indicator, Y the outcome variable, G ∈ g + , g − the subgroup indicator, θ g + and θ g − the treatment effect for the subgroups, and θ the marginal treatment effect for the overall population. Then assuming θ g − < θ g + , the efficacy estimand θ is said to be logic respecting if θ ∈ [θ g − , θ g + ] (Ding, Lin, and Hsu 2016). On the other hand, Huitfeldt, Stensrud, and Suzuki (2019) considers a marginal estimand to be collapsible if it can be written as a linear combination of the conditional estimands with weights w g + , w g − ≥ 0, that is, Although introduced in a general setting, the two concepts do appear similar at first glance (Xi and Bretz 2022) when one applies the collapsibility concept to subgroup problems. However, there is a subtle but yet important distinction: while collapsible estimands require pre-specification of the weights through which the conditional estimands are combined, logic respecting estimands do not have that requirement. So long as the marginal estimand is in between the subgroup estimands, it is considered logic respecting. The weights are not important and do not need to be known even during estimation process as will be shown in Section 3.3.
It is important to emphasize that the noncollapsibility and nonlogic-respecting behavior of estimands are different from confounding (Daniel, Zhang, and Farewell 2021) and can occur despite randomization and large sample size as this is an issue with the estimand such as OR and HR at the population level.

Detach Collapsibility-An Estimand Property from Model Property
Collapsibility of estimands have drawn quite a lot of attention, with the latest being discussed in the recent FDA guidance on covariate adjustment (Food and Drug Administration 2021).
The guidance specifically called out the noncollapsibility of OR with binary endpoint and HR with time-to-event endpoint where noncollapsibility is defined as "when all subgroup treatment effects are identical, this subgroup-specific conditional treatment effect can differ from the unconditional treatment effect. " This definition was actually referred to as strict noncollapsibility in Greenland, Pearl, and Robins (1999). More generally, "a measure is not constant across the strata, but that a particular summary of the conditional measures does equal the marginal measure. This summary is then said to be collapsible" according to Greenland, Pearl, and Robins (1999) which is consistent with the definition in Huitfeldt et al. (2019). However, in the same paper, a separate definition of collapsibility was introduced in the regression formulation: "The regression is said to be collapsible for …. " Perhaps this has caused the misconception that collapsibility is a model property and thus has lead to claims such as "Cox models and logistic models are noncollapsible. " In fact, collapsibility is a property of the efficacy estimand, not the model. As mentioned in the previous section, there is no one-toone correspondence between efficacy estimand and the model. We can say HR is not logic respecting or noncollapsible, but not Cox model because many logic-respecting estimands can still be estimated from the Cox model including difference of Restricted Mean Survival Time (RMST). As will be shown in Section 3.1.3, since difference of RMST has the form of difference of expectations, it is logic respecting at the population level. Estimated survival functions from the Cox model can be used to construct an estimator for the difference of RMST which is logic respecting.

Logic Respecting is an Estimand Property
Like collapsibility, logic respecting is also an efficacy estimand property and not a model property. This is not to be confused with the fact that some estimands will only respect logic under certain conditions expressed through modeling assumptions. For example, Ratio of Median (RoM) is proven to be logic respecting if survival data follows the Weibull model (Ding, Lin, and Hsu 2016). This does not mean Weibull model itself is logic respecting, as many other estimands derived from Weibull model such as HR can still be nonlogic-respecting. The Weibull model represents a specific data structure and can be viewed as a condition under which the estimand RoM is logic respecting.

Detach Conditional Estimand from Conditional Model and Marginal Estimand from Marginal Model
It is customary to estimate conditional estimand from conditional models (i.e., adjusted analyses where subgroup is included in the model in addition to treatment indicator) and marginal estimand from marginal models (i.e., unadjusted analysis where only treatment indicator is included in the model). Therefore, sometimes conditional/marginal estimands are automatically linked with conditional/marginal models. However, to estimate the marginal estimand, a more efficient estimator can be derived from the conditional model (Daniel, Zhang, and Farewell 2021).
There is no need for such linkage between estimands and models as there are multiple ways to estimate the same estimand. For inference on the marginal estimand, we advocate the use of conditional model over marginal model not only due to the asymptotic efficiency gained, but also the insurance of logical behavior between the marginal and conditional estimators. Table 1 lists the commonly used efficacy estimands for all endpoint types and shows whether each estimand is logic respecting or not. Detailed proofs are provided in Section 3.1. Then we point out issues with the current computer software in producing marginal HRs and commonly done but incorrect analysis methods that can lead to illogical results even for difference of means or difference of RMSTs. At the end of this section, we describe the Subgroup Mixable Estimation (SME) principle in detailed steps and generalize the original proposal from Ding, Lin, and Hsu (2016) using weibull model to incorporate additional parametric, semiparametric or nonparametric estimators.

Logic Respecting Efficacy Estimands
In this section, we present what are logic-respecting efficacy estimands for continuous, binary, and time-to-event outcomes. We note that results for continuous and binary outcomes were previously derived by Ding, Lin, and Hsu (2016) and so will only be discussed briefly. Also, some notations will be consistently used throughout for simplicity as well as to draw connections.

Difference of Means
For a continuous outcome Y, let μ Rx = E (Y |Rx ) , μ Rx g + = E Y Rx, g + , and μ Rx g − = E Y Rx, g − denote the mean or expected value under the treatment arm in the overall population, g + subgroup, and gsubgroup, respectively. Likewise, let μ C , μ C g + , and μ C g − denote the mean for the control arms in their respective populations. Now suppose efficacy is measured by the difference in mean response between treatment and control, that is, θ := μ Rx −μ C , θ g + := μ Rx g + −μ C g + , and θ g − := μ Rx g − −μ C g − , and let γ + and γ − be the population prevalences of the subgroups with γ + + γ − = 1. Then by the law of iterated expectations, Thus, efficacy in the overall population is a mixture of efficacy in the subgroup populations weighted by their prevalences. We conclude that difference of means is logic-respecting. However, this nice relationship does not usually hold for estimands that are not based on expectations or differences.

Difference of Proportions and Relative Risk
For a binary outcome Y such as a positive response (Y=1) or not (Y=0), let μ Rx denote the probability of response for the treatment arm in the overall population, and likewise for the control arm and/or subgroups. Then by the law of total probability (or iteractive expectations if one views probability of response as the expected value of an indicator function), one can establish that difference of proportions θ , much like difference of means, is logic-respecting with mixture weights being the subgroup prevalences γ + and γ − as in (1). With respect to relative risk (RR), Ding, Lin, and Hsu (2016) showed that RR = Pr g + |C, Y = 1 × RR g + + Pr g − |C, Y = 1 × RR g − . Note that the mixture weights for relative risk in the overall population are different from those for difference of proportions. Whereas the weights were previously population prevalences, the weights are now the population proportion of control responders who are g + or g -.
Odds ratios, despite being immediately obtainable from logistic regression models, are not logic-respecting as illustrated by the example in the FDA covariate adjustment draft guidance (Food and Drug Administration 2021).

Difference in (or Ratio of) Milestone Probabilities or RMST
denote the survival probability at time t (i.e., the milestone) for the treatment arm in the overall population, and likewise for the control arm and/or subgroups. Then again, by the law of total probability, difference in survival θ is logicrespecting with mixture weights being the subgroup prevalences as in (1).
The restricted mean survival time (RMST) (Royston and Parmar 2013) is the area under a survival curve up to time τ . As its name suggests, the RMST is also the mean or expected survival time in patients followed up to time τ , that is, RMST Rx The RMST has gained popularity in recent years since it can lead to efficacy measures that are arguably more interpretable and that require less assumptions compared to the hazard ratio. Considering the difference in RMST, the law of iterative expectations can be applied to establish that it is a logic-respecting estimand with mixture weights being the subgroup prevalences as in (1).
When efficacy is measured by the ratio of milestone probabilities or RMST, that is, can also establish that the estimand is logic-respecting by simple algebraic manipulation. Indeed, so that θ ∈ min θ g + , θ g − , max θ g + , θ g − . This derivation is very similar to the one for RR as both are ratio of expectations. Even though these type of estimands are logic respecting, the mixing coefficient is no longer the population prevalence as shown in (1).

Ratio of Median (RoM) Survival Time
Besides the hazard ratio, it is common practice in time-toevent trials to report the individual median survivals, defined as the time at which 50% of patients have experienced (or not experienced) an event for each treatment arm Rx and C. Median survivals can represent important shifts in survival curves, while being easier to understand and interpret (Chen and Zhang 2016). In fact, the value framework that assesses the value of new cancer therapies by both American Society of Clinical Oncology (ASCO) and European Society for Medical Oncology (ESMO) uses percent improvement of median survival times in Rx comparing to C arm in clinical benefit evaluation (Schnipper et al. 2015(Schnipper et al. , 2016Cherny et al. 2015Cherny et al. , 2017. Median survival times are estimated either nonparametrically via the Kaplan-Meier curve or parametrically. Weibull model is a popular choice among parametric models for censored data as it is the only model with both proportional hazards and accelerated failure time interpretations. The Weibull model is commonly used in cost effectiveness analyses for example, for health technology assessment, because it requires only two parameters (shape and scale), yet provides enough flexibility to closely approximate many real survival curves in practice (Li and Zhang 2015). The fitted model can also be readily used for predictions.
The difference in median survivals is not logic-respecting. One can construct a counter-example in the simplest setting of exponential distributions: Suppose median survivals in the (Rx, g + ), (C, g + ), (Rx, g − ), and (C, g − ) subpopulations are 2, 1, 4, and 3 years, respectively. Then the difference in median survival is 1 year for both the g + and gsubgroups, but 1.13 years for the overall population (provided γ + = 0.5).
Although the difference in median survivals is not logicrespecting, Ding, Lin, and Hsu (2016) showed that the ratio of median (RoM) survivals is logic-respecting under the survival model with following hazard function: where h 0 (t) = κλ κ t κ−1 is a Weibull hazard function with κ and λ being the shape and rate parameters, and T and G are indicators for treatment and subgroup. Following a similar argument (see proof in Appendix A.1, supplementary materials), one can actually show that RoM is logic respecting with the less stringent assumption of separate Weibull models for each biomarker subgroup: This relaxed assumption allows completely different baseline hazard functions between two subgroups instead of the proportional hazard requirement for Model (2). On the other hand, when one extends the distributional assumption of h 0 (t) from Weibull in model (2) to a mixture Weibull model, then RoM may not always be logic respecting. Appendix A.2, supplementary materials shows the proof that when β 2 > 0 and β 2 + β 3 < 0, or β 2 < 0 and β 2 + β 3 > 0, RoM is logic respecting regardless of the shape of h 0 (t). Otherwise, a counter example is constructed under mixture Weibull model to show RoM in the overall population smaller than that of both subgroups.

Incorrect Analysis Methods in Current Computer Software
Logical relationships between the subgroup and overall treatment effect should hold not only at the population level, but also at the sample level. This means we should choose correct analysis methods or estimators that guarantee such logical behavior in analyzing clinical trial data.
In this section, we will first show that the marginal effect estimator for HR implemented in SAS does not correspond to the real marginal HR and gives people the illusion that HR is logical. This helps mask the illogical behavior of HR and adds to the list of issues with HR as an efficacy estimand especially in the presence of subgroup effect. We then show that even for logic respecting efficacy estimands such as difference of means for continuous outcome, marginal model based estimator of marginal effect for the overall population will lead to illogical behavior in randomized clinical trial dataset with moderate sample size. Similarly, for time to event endpoint, even though difference of RMST is logic respecting at the population level, we show that the common KM based RMST difference marginal effect estimator by pooling subgroups within Rx and C arms can lead to illogical results. The common mistake is to obtain marginal effect estimator ignoring subgroups.

Marginal HR Calculated by SAS LSMEANS that Masks
Illogical Behavior of HR As discussed in previous sections, HR is not a logic respecting estimand, which means even with infinite sample size, marginal HR (even if viewed as average HR over time) can be outside the range of the subgroup HRs. On the other hand for data analysis, current computer softwares such as SAS provides marginal HR estimate simply by linearly mixing the subgroup HRs in the log scale. The resulting HR is guaranteed to be in between the subgroup HRs giving people the illusion that HR is logic respecting, while in fact this HR does not correspond to the real marginal effect estimate. This is illustrated in Appendix A.3, supplementary materials using simulated dataset in Liu et al. (2020) with a purely prognostic biomarker.

Marginal Model Based Analysis Can Lead to Illogical Results Even for Difference of Means (DoM)
Estimands that can be written as difference of expectations are generally logic respecting regardless of the endpoint type including difference of means (DoM) for continuous endpoint, difference of proportions for binary endpoint, and difference of RMST or milestone for time to event endpoint. In Appendix A.4, supplementary materials we show that even for the estimand of DoM, illogical behavior of the estimates can still occur if incorrect analysis method (e.g., marginal model based analysis) is applied.

Marginal Analysis for Difference of RMST Can Lead to
Illogical Results Previous section showed that marginal model based analysis can lead to illogical results even for difference of means. Similarly, marginal type of nonparametric KM analysis ignoring subgroups for difference of RMST can lead to illogical behavior as well. This is not surprising as marginal analysis on KM estimator itself doesn't always follow logical relationship as demonstrated in Figure C.1 of Liu et al. (2020).
We simulate a simple dataset with sample size of 160 and two biomarker subgroups to show that when pooling patients from the subgroups together to estimate the RMST in the overall population, the resulting difference of RMST for the overall population is smaller than that of both subgroups. On the other hand, when SME is applied, the estimated overall difference is then between those of the subgroups. We refer readers to Appendix A.5, supplementary materials for more details.

Analysis Principle that Guarantees Logical Estimates
Realizing that some common analysis methods such as marginal model can lead to illogical relationships among the estimates for subgroups and overall population, Ding, Lin, and Hsu (2016) first proposed Subgroup Mixable Estimation (SME) principle because "Estimation should respect the logical relationships among the parameters that represent treatment efficacy in subgroups and their mixtures. " The original principle in Ding, Lin, and Hsu (2016) was described under the modeling framework. In this section, we generalize the SME principle to any estimation procedure including nonparametric approaches and then provide practical guidance on detailed steps to derive estimates for some commonly used estimands: Ratio of Medians (RoM), difference and ratio of Restricted Mean Survival Times (RMST) and milestone probabilities.

Subgroup Mixable Estimation (SME) Principle
The general principle of SME consists of the following three steps: 1. Obtain the following LSmeans adjusted cell means estimates: μ Rx g + ,μ Rx g − ,μ C g + ,μ C g − . These four estimates for each subgroup and treatment arm combination can be derived from the parameter estimates as a result of a modeling technique or after following some nonparametric method. 2. Mix within Rx or C arm on the probability scale using densities of Rx, g + and Rx, g − or C, g + and C, g − to derive the density (or distribution) function of Rx and C arm, respectively. Based on the law of probability, the mixing coefficient is population prevalence γ + and γ − = 1 − γ + . If γ + is unknown, it can be estimated by the pooled sample prevalence (across Rx and C arms) in the trial assuming patients are enrolled according to natural prevalence, that is, subgroups are not enriched. The following estimates for the overall population:μ Rx andμ C can then be estimated from the derived density for each arm. If we denote the vectorμ = μ Rx g + ,μ Rx g − ,μ C g + ,μ C g − ,μ Rx ,μ C , the associated covariance matrix for the vector can also be derived and estimated atμ, denoted asˆ μ . 3. Calculate estimates of relative efficacy (Rx vs. C) in g + and gand overall population as functions of the effect estimates in each arm and subgroup, that is, ,μ C and obtain associated simul-taneous confidence interval (CI) through the corresponding variance co-variance matrix estimates ofθ = θ g + ,θ g − ,θ as a function ofˆ μ .
The key is to estimate Rx or C arm density function by mixing subgroup densities using population prevalence as shown in step 2 rather than pooling g + and gpatients within each arm as commonly done in practice. After obtaining effect estimate based on the density function of each arm, calculating relative efficacy estimates (Rx vs. C) for the subgroups and overall population as a function of the corresponding effect estimates is the last step. It turns out for DoMs, one can skip these steps and directly calculate overall efficacy using subgroup efficacies that is, jumping directly to last step in (1). However, this relationship does not hold for many other estimands such as ratio of medians or RMSTs and only following SME can lead to the correct marginal estimate that holds the logical relationship.

Logical Estimation of Ratio of Median (RoM), Ratio and Difference of Milestone Probabilities and RMSTs
In this section, we show detailed steps of the estimation process for RoM under a Weibull model. Similar steps can be applied to the ratio and difference of milestone probabilities and RMST which are provided in the Appendices A.6 and A.7, supplementary materials. Since SME principle applies to both parametric and nonparametric methods, we then show the detailed steps on how SME can be followed to produce simultaneous CI for difference of RMST based on KM estimators of RMST from the subgroups.
Following the first step of SME principle, we need to compute cell means estimates which are the median estimates for the subgroups. They are readily derived as functions of parameter estimates from the Weibull distribution as follows: The second step following SME is to derive the median estimates for Rx and C arm by mixing the density functions.
Since median is the solution to the equation of survival function at 0.5, we mix the survival functions within each arm and get the median estimates as solution to the following two equations: corresponding estimates asv, then we have g(φ,v) = 0 for i = 1, . . . 6. Becauseφ ∼ N(φ, ) following multivariate delta method for implicitly defined random variables (Benichou and Gail 1989) j=1,...,6 and H = ...,6;j=1,...,5 . The estimatê v of v is derived atφ,v.
In the last step of SME, because the estimand is ratio of two medians, we first obtain the point estimate and associated variance covariance matrix for difference of log transformed medians, and subsequently the corresponding simultaneous CI. We then transform back to the original scale by taking exponential the CI in the log scale. In this way, the CIs are guaranteed to be positive. Denoteη = log(v) and the corresponding covariance matrix by η . Using delta method, we haveη . . , 6. We then take difference in the log transformed medians where we let Forming the vector u = (u 1 , u 2 , u 3 ), we then estimateû ∼ N(u, u = M η M T ) where M = ∂u i ∂η j , i = 1, 2, 3; j = 1, 2, . . . , 6. The covariance matrix is estimated atû, denoted asˆ û where the diagonal elements are se û i 2 . We further denote the corresponding standardized correlation matrix asĈû. Simultaneous confidence interval for u is then derived as I u = I u 1 × I u 2 × I u 3 , where I u i =û i ± q × se û i where the critical value q can be calculated using the following multivariate normal distribution N 0,Ĉû through Numerically, the q value can be obtained using qmvnorm function in mvtnorm package in R. Lastly, the point estimate and is then exp(û) and exp(I u ).

Following SME to produce simultaneous CI for RMST difference using nonparametric KM estimates
Let us denote the RMSTs as We first calculate the RMST by fitting Kaplan-Meier curves to each subgroup for Rx and C separately and then summarize the area under these curves.
According to Zhao et al. (2016), each of the above estimator follows an asymptotic normal distribution. That is,v i ∼ N v i , σ 2 i , i = 1, 2, 3, 4 where σ i can be estimated through perturbation or some closed-form approximation, denoted aŝ σ i . Since they are independent from each other, their variancecovariance matrix is a diagonal matrix.
The second step following SME is to derive the RMST for Rx and C by mixing the density functions. As is shown in Section 3.1.3, we havê Thanks to the simple linear form, we can conclude that v ∼ N (v, v ). As the last step, the RMST difference can be written as linear transformation of the vector v, denoted as Hv. Therefore,û =Ĥv ∼ N Hv, u = H v H . With covariance matrix estimated atû, the correlation can be derived and the critical value q used for the simultaneous CI derivation is calculated using (4). More details can be found in Appendix A.8, supplementary materials.
Similarly, simultaneous CI for RMST ratio, milestone probability difference and ratio can all be obtained following same steps of the SME principle (see details in Appendices A.9 and A.10, supplementary materials). The key step is deriving covariance matrix for the six dimensional effect estimatorv. This can be easily done for RMST and milestone probability because the effect estimator in Rx or C arm can be written as linear combination of subgroup estimators within each arm as shown in (5). It is not the case for median survival time unfortunately.

Simulation
In this section, we evaluate the performance of our proposed simultaneous CI using SME for both parametric weibull model and nonparametric Kaplan Meier (KM) methods through simulations. We assume there are two treatment arms {Rx,C} and two subgroups {g − , g + } with 150 patients in each treatment/subgroup combination and a total sample size of 600. We evaluate the following two scenarios (survival curves in Figure 1): • Scenario 1 (Exponential distribution): We assume survival time follows Exponential distributions with median survival time as 6,8,10,12 months for g − , C , g + , C , g − , Rx and g + , Rx , respectively. The censoring distribution is exponential with rate = 0.03933.
The censoring distributions for both scenarios are chosen so that there are around 400 total events expected at time of analysis for both scenarios.
We run 1000 simulations and calculate the ratio of medians, ratio and difference of RMSTs at 20 months, ratio and difference of 1-year survival rates using Weibull model, separate Weibull model and nonparametric KM method. Within each simulation for each method, we further derive the 95% simultaneous CIs for g − , g + and overall as well as the individual 98.3% CIs. Of note is that the individual 98.3% CI for the overall population is still derived following SME to ensure logical relationships, not based on the marginal analysis. For scenario 1, both Weibull and separate Weibull models are correct, while neither are true for scenario 2. Tables 2 and 3 summarize the simulation outputs using Weibull model and nonparametric KM method. Separate Weibull model gives very similar results as Weibull model and therefore is not presented here. Since estimates of ratio of medians may not respect logic using KM method, we only report the results using Weibull model. Table 2 shows that for Scenario 1, the correct Weibull model produces quite accurate estimators for all efficacy measures. Compared to the nonparametric KM method, Weibull model yields smaller bias and standard error (SE) with much shorter CIs. The simultaneous coverage probability is around 95% for the simultaneous CIs using either Weibull model or KM method with slightly higher coverage of the latter. Without incorporating joint distributions among {g -, g + , Overall}, Bonferroni based 98.3% individual CIs can also be calculated. As expected, the length of the CIs are slightly longer and the coverage probabilities are also higher than the simultaneous CI using joint distributions.
In terms of Scenario 2, Table 3 shows that nonparametric KM based estimators lead to less bias as Weibull model is not the correct model. However, due to the parametric modeling, Weibull model results in smaller SE and hence shorter CIs.
But in terms of coverage probability, Weibull model is slightly lower than 95% except for RoM. KM based simultaneous CI has coverage slightly higher than 95%. Comparing simultaneous CI to the 98.3% individual CIs, similar pattern in Scenario 1 is observed with slightly lower coverage probabilities and shorter CIs for all efficacy estimands and analysis methods.

Case Studies
To illustrate how to obtain logical inference for time to event endpoint following Subgroup Mixable Principle (SME), two Phase III confirmatory studies involving anti-PD1 checkpoint inhibitors in front line nonsmall cell lung cancer (NSCLC) were selected as real data examples. Due to the mechanism of action of anti-PD1 antibodies, it is of clinical interest to investigate if PD-L1 expression level can be correlated with treatment outcome. Therefore, we chose to use the PD-L1 expression defined subgroups, that is, PD-L1+ (≥ 1%) and PD-L1− (< 1%) subgroups to build our examples. The individual patient level time event data was reconstructed from the published Kaplan-Meier curves (Guyot et al. 2012) for the overall population and PD-L1+ and PD-L1− subgroups.
KeyNote-189 is a double-blind, randomized phase 3 trial of pembrolizumab combined with platinum-pemetrexed chemotherapy versus saline placebo combined with platinumpemetrexed chemotherapy in subjects with advanced or metastatic nonsquamous nonsmall cell lung cancer (NSCLC). The primary end points were overall survival (OS) and progression-free survival (PFS), as assessed by blinded,  1.123, 1.118, 1.120), 1-year survival rate difference = (0.1, 0.1, 0.1), 1-year survival rate ratio = (1.20, 1.18, 1.19) for (g -, g + ,overall).  independent central radiologic review. The overall survival data (Gadgeel et al. 2020) was fitted using the Weibull model (2). Figure 2 shows that Weibull model fits the data well except for the PD-L1+ subgroup in chemo arm. Estimates after applying SME as described in Section 3.3 and Appendices A.6-A.7, supplementary materials for logic respecting estimands comparing pembrolizumab plus chemotherapy and chemotherapy arms with associated 95% simultaneous CI for PD-L1+, PD-L1− and overall population are provided in Table 4 where RMST estimates are calculated with τ = 26.6. Overall, PD-L1− patients seem to be doing worse on chemotherapy comparing to PD-L1+ patients. But the relative efficacy (Rx vs. C) in terms of all four measures is slightly higher for PD-L1− patients than PD-L1+ patients. Efficacy estimate for the overall population is in between the subgroup estimates for all four measures and all 95% simultaneous confidence intervals exclude 1 (for ratio) or 0 (for difference) indicating significantly higher efficacy in pembrolizumab plus chemotherapy arm.  (Figure 3). The three arcs around the dots (or square/triangle) represent the simultaneous 95% CIs for the ratio of medians and they all showed significantly higher efficacy as none crossed the dotted grey diagonal line. Adding pembrolizumab to chemotherapy increases median survival by 76% for PD-L1 patients, 66% for PD-L1+ patients, and an overall 70% for both subgroups combined.
Nonparametric methods for time-to-event endpoint analysis are commonly done in practice. Next, we show the results of following SME as laid out in Section 3.3 and Appendices A.8-A.10, supplementary materials for difference and ratio of RMST and difference and ratio of 1-year survival rate. Resulting point estimates as well as 95% simultaneous CI are provided in Table 5. These 95% simultaneous CIs are always narrower than the 98.3% CIs calculated by Bonferroni inequality thanks to the knowledge of the covariance structure. Comparing to the Weibull model based estimates, instead of consistently higher efficacy in PD-L1− patients versus PD-L1+ patients across four measures, the nonparametric KM results suggests lower efficacy in PD-L1patients for ratio and difference of 1-year survival rates. This is mainly driven by the slightly lower 1-year survival rate estimate for PD-L1+ patients and higher estimate for PD-L1− patients based on the KM curves. In addition, KM based estimates lead to slightly wider simultaneous CIs, and specifically the lower bounds of CIs for ratio and difference of 1-year survival rates for PD-L1-patients now do not exceed the no effect bar anymore.
A mixture Weibull distribution was fitted to model the baseline hazard function h 0 (t) for the PFS endpoint in KeyNote-189. The mixture model fitting describes the survival distribution more satisfactorily (Figure 4) comparing to single Weibull model (2) as evidenced by improved AIC, BIC, and BIC2 criteria. Table 6 provides logic respecting measures for comparing two treatment arms applying mixture Weibull model. RMST estimates are calculated with τ = 19.3. We did not include  21.4 (11.5,31.4) RoM because it may not be logic respecting as discussed in Section 3.1.4. Different from OS, there do not seem to be much differences in PFS between PD-L1+ and PD-L1− patients receiving chemotherapy and efficacy is slightly higher in PD-L1+ patients across all four measures. With the exception of ratio of 1-year PFS rate, all other estimates show significant efficacy across PD-L1+, PD-L1− and overall population.
CheckMate-9LA is a randomized, open-label phase 3 trial of nivolumab plus ipilimumab combined with histology-based, platinum doublet chemotherapy, versus chemotherapy alone in patients with NSCLC (Paz-Ares et al. 2021). The primary endpoint was overall survival in all randomly assigned patients. Instead of fitting a single Weibull model (2), we fit two separate Weibull models (3), one for each subgroup for illustration purpose. We plot the fitted curves in Figure 5. It was shown in Section 3.1.4 that ratio of median is still logic respecting under two separate Weibull models, therefore, results for RoM is shown in the M&M plot in Figure 6. Results for difference and ratio of RMST (τ = 24.3) and 1-year, 2-year survival rates are all included in Table 7. Recall the published HR for overall population is 0.67, which is outside the range of (0.62, 0.64) from the subpopulations (Paz-Ares et al. 2021). If efficacy is summarized using any estimate in Table 7 instead, the result is guaranteed to be logical. Across all efficacy measures, there is evidence of significant efficacy between treatment arms in subgroups and overall population and PD-L1 subgroups do not seem to affect efficacy in prognostic or predictive fashion.

Discussion
In this article, we evaluate whether the commonly used efficacy estimands respect logic when there is subgroup effect in randomized clinical trials with time-to-event endpoint. Estimands in the overall population should stay in between the ones in g + and gsubgroups. As pointed out in the literature, hazard ratio is not logic respecting and not collapsible. This issue is not just theoretical. It has dire practical implications as illustrated by the OAK study. Incorrectly interpreting bTMB as a predictive biomarker instead of prognostic can deprive patients with low bTMB value from an effective therapy that may significantly prolong their life. Continued use of HR to summarize efficacy for clinical trials can potentially harm more patients due to such incorrect treatment benefit assessment.
The illogical issue of HR has not gone unnoticed. However, most papers still focus on how to "fix" HR such as educating readers not to compare marginal versus conditional HRs as they are like apple and oranges (Daniel, Zhang, and Farewell   2021). To other key stakeholders in drug development (e.g., clinicians) such message does not make much common sense as both are efficacy summaries, one for a subgroup while the other for the entire population. It is only natural to compare the effectiveness of the drug between these patient groups. We therefore recommend the use of logic respecting efficacy estimands such as ratio of medians, ratio and difference of RMSTs and milestone probabilities for efficacy summary instead of HR to avoid potential incorrect treatment benefit assessment.
Using these estimands will also ensure agreement between the marginal and conditional measures, leading to easier and more intuitive interpretations on efficacy readout of clinical trials. It is important to point out though that the choice of estimand should be carefully evaluated as a purely prognostic biomarker for one estimand may manifest itself as a predictive biomarker based on another estimand (Kent et al. 2020). The logic should not only hold for estimands at the population level but also ensured at sample level when analyzing real clinical trial data. We examine the common analyses practice and tools and point out that the current SAS LSMEANS option in PROC PHREG is merely a linear combination of the subgroup HRs mixed at the log scale which seems to be a result of over generalization of the linear model methodologies for continuous endpoint. However, it does not correspond to the true marginal HR for the overall population. By appearing in the range of subgroup HRs, we believe this has helped masking the illogical issue of HR in the past. On the other hand, for logic respecting efficacy estimands such as difference of means or difference of RMSTs, marginal analyses ignoring subgroup indicator G = {g + , g − } using either parametric models or nonparametric methods can lead to illogical results as shown in Section 3.3. The analysis principle called Subgroup Mixable Estimation (SME) can be used to guarantee logical behavior. We provide step by step illustration of the application of SME for ratio of median, difference and ratio of RMSTs and milestone probabilities under Weibull model as originally introduced in Ding, Lin, and Hsu (2016). There are inherent advantages of using parametric modeling. Even when misspecified, they can possess better operating characteristics in small sample sizes due to smaller standard errors (Wey, Connett, and Rudser 2015;Miltenberger et al. 2021). In nononcology, parametric models with normal errors are standard practice. Sir David Cox indicated his preference to parametric models during an interview with Nancy Reid (Reid 1994): "I think I would normally want to tackle problems parametrically, so I would take the underlying hazard to be a Weibull or something. I'm not keen on nonparametric formulations usually. " "Of course, another issue is the physical or substantive basis for the proportional hazards model. I think that's one of its weaknesses, that accelerated life models are in many ways more appealing because of their quite direct physical interpretation, particularly in an engineering context. " As Sir David Cox alluded to, it is also important to have biological interpretations that support assumptions required by the chosen model for data analysis. Specifically, the extreme value distribution (i.e., basis of Weibull model) arises as the limiting distribution for the minimum of a sample of iid nonnegative random variables. Meanwhile, cancer tends to progress by affecting multiple biological pathways. Therapeutic interventions can help to block some of these pathways, but cancer cells often mutate or find another pathway, eventually leading to recurrence in cancer growth. Time to event data can thus be viewed as a minimum of multiple times leading to cancer regrowth, supporting the plausibility of a Weibull model in oncology. Lastly, under a Weibull model, there is an easy connection between RoM and HR, that is, RoM = (1/HR) 1/κ where κ is the shape parameter.
However, limited by the parametric form, Weibull model can only fit time-to-event data with certain shapes. As observed in some clinical trials such as KeyNote-189 and KeyNote-407, Weibull model may not fit well. We therefore generalize the setting to accommodate other parametric, semiparametric and even nonparametric methods. Specifically, we illustrate the implementation of SME under two separate weibull models and mixture weibull model with the caveat that RoM may not be logic respecting under mixture Weibull. Variance and covariance matrix can be derived in this case because all three efficacy estimators are functions of model parameter estimators. Due to the popularity of nonparametric methods in practice, detailed steps of how to obtain logical estimates following SME for difference and ratio of RMSTs and milestone probabilities based on KM estimates are also provided. The co-variance can be derived when the marginal estimator is a linear combination of the subgroup estimators which is the case for RMST and milestone probability. Similar steps can be followed for semiparametric models such as Cox model to estimate these estimands. The resulting 95% simultaneous confidence intervals from either parametric, semiparametric, or nonparametric approaches provide joint inference on biomarker positive and negative subgroups and overall population.
Instead of focusing on how to combine relative efficacy in subgroups to form the efficacy in the overall population, the SME principle advocates following the probability nature. That is to (a) mix the two subgroups on the probability scale within each treatment arm Rx and C first as guided by the law of total probability; (b) derive the individual arm effect based on the mixture distribution for Rx and C respectively; and (c) calculate the relative efficacy between Rx and C from the individual arm effect. This is the natural process to derive any efficacy measure. Due to the nice linear relationship for difference of means between subgroups and overall population, most literature emphasizes on this linear form either in the original or log scale as we see implemented in LSMEANS in SAS and the definition of collapsibility. The SME principle on the other hand, proposes to follow the natural, step by step process without shortcuts. As it turns out, that is the simplest solution to the illogical issue. Standardization in Hernán and Robins (2020) shares the similar idea of obtaining marginal efficacy estimates by mixing within Rx and C first before calculating relative efficacy between arms. However, SME first proposed in Ding, Lin, and Hsu (2016), is a general principle applicable regardless of endpoint type (continuous, binary, and time-toevent) and analysis method (parametric, and nonparametric) while standardization is proposed for binary endpoint under parametric modeling (although it covers any number of continuous or categorical covariates). The other difference is that SME aims at providing simultaneous inference for subgroups and overall population assisting decision making with complete efficacy profile, while current literature including standardization is limited in producing inference for the overall population only.
For randomized clinical trials including primary objectives of evaluating differential efficacy in pre-specified biomarker subgroups, we suggest to follow standard practice in nononcology where one first tries to fit the data with parametric models such as Weibull or mixture Weibull model including treatment arm (T), subgroup (G) and interaction terms (T*G). Due to the series of issues associated with data-driven variable selection strategies as discussed in the literature (Copas and Long 1991;Derksen and Keselman 1992;Chatfield 1995;Harrel 2015) including high R 2 , biased regression coefficient and standard error estimation among others and the fact that a large p-value does not necessarily indicate the null is true in general, we recommend to always include treatment-by-subgroup interaction term in the model for inference in the subgroups and overall population. If separate weibull models are used for each subgroup, then only T is needed in the model. Model diagnostics will be performed and if model assumptions are clearly violated, alternative semiparametric or nonparametric methods can be used. The important step is to follow SME to produce the point estimate as well as simultaneous CIs for the efficacy of both subgroups and the overall population regardless of the analysis method. Based on the simple simulation, if weibull model is correct, the parametrically derived CI is shorter than KM based CI with similar coverage. On the other hand, if weibull model is incorrect, the parametric CI is still shorter but the coverage is slightly lower than 95% while KM based CI is slightly higher than 95%. This joint inference after multiplicity adjustment is then used to interpret efficacy results of the trial and determine the prognostic or predictive value of the biomarker.
Two Phase 3 randomized clinical trials in NSCLC with PD-L1+ and PD-L1− as the biomarker subgroups are used as case examples to illustrate how to apply SME in data analysis. We fit a Weibull model to the KeyNote-189 OS data, separate Weibull models (i.e., one for each biomarker subgroup) to CheckMate-9LA OS data, and a mixture Weibull model to the KeyNote-189 PFS data. The point estimates and 95% simultaneous confidence intervals are provided for the common logic respecting efficacy estimands. We also demonstrate the correct estimation process based on nonparametric Kaplan-Meier estimates. The 95% simultaneous confidence intervals are derived based on the correlation among the three efficacy estimators. These estimates are guaranteed to follow logic and the simultaneous CI provides tighter boundaries than the 98.3% confidence interval which is calculated by splitting the 5% alpha as typically done for multiplicity adjustment.
A user friendly graphical interface encompassing all aforementioned methods in this article are being developed based on R shiny package and will be made publicly available once completed. Packaged R codes can be shared upon request. This tool expands on the one developed in Ding, Lin, and Hsu (2016) under Weibull model, and to our knowledge, it will be the most comprehensive computer package that provides simultaneous inference on the subgroups and overall population with proper multiplicity adjustment for these time-to-event efficacy estimands in randomized clinical trials.

Supplementary Materials
The appendices in the Online Supplementary Materials provide additional technical details and case examples for Section 3.