An issue about the efficacy for the time-to-event outcome based on accelerated failure time model with interaction of unrecognized heterogeneity and main effect

Abstract The accelerated failure time model is an alternative for analyzing survival data when the proportional hazards model fails to capture the relationship between the hazard and the covariates, that is, when the proportionality assumption is not suitable for the data. In this paper, we address the design of a trial in which the relationship between the failure time and the main effect follows the accelerated failure time model and unrecognized heterogeneity interacts with the main effect. The test statistic for the main effect is used to determine the total sample size for the trial, and the proposed criteria are used to rationalize the partition of the sample size across regions.


Introduction
For time-to-event outcomes, conventional methods neglect the effect of unrecognized heterogeneity. However, a growing number of genetic studies have shown that tumors of the same histologic diagnosis comprise different subtypes that are distinct with respect to clinical endpoints such as response to treatment and survival. Several studies have pointed out the problem of unrecognized heterogeneity that divides patients into two distinct groups (e.g., Lagakos and Schoenfeld (1984); Struthers and Kalbfleisch (1986); Chastang, Byar, and Piantadosi (1988); Schmoor and Schumacher (1997)). Li et al. (2002) discussed the use of frailty hazard models for unrecognized heterogeneity. They derived analytic approximations for the asymptotic relative efficiency of the simple log-rank test relative to the optimally weighted log-rank test and for the power of the simple log-rank test when applied to subjects with unobserved heterogeneity. In doing so, they extended the results of Lagakos and Schoenfeld (1984) to examine the impact of unmeasured heterogeneity on the efficiency and power of the simple log-rank test for treatment effect.
For time-to-event outcomes, the conventional method for sample size determination is based on the proportional hazards model. If the proportionality assumption fails to capture the relationship between the hazard and the covariates, for example when time-dependent covariates exist in the data, the proportional hazards model is not suitable for analyzing the survival data. The accelerated failure time model is an alternative method for such data. In this paper, we extend Ko (2014, 2015a, 2015b) and address the design of a trial in which the relationship between the failure time and the treatment effect follows the accelerated failure time model and unrecognized heterogeneity interacts with treatment. The proposed criteria are used to rationalize the partition of the sample size. This paper is organized as follows. We first introduce the accelerated failure time model for survival data, the log-rank test, and the determination of the sample size by the log-rank test. We then establish two criteria to examine whether the overall results can be applied to the region of interest, and calculate the assurance probability of each criterion for a multi-regional trial. A simulation study and an example illustrating the proposed method follow, and the paper closes with a discussion.

Accelerated failure time model applied in the treatment response interacted with unrecognized heterogeneity in a clinical trial
The accelerated failure time model was presented to model the main effect, $\beta$, of the covariate directly on the length of the survival time. In this model, $T$ is the survival time, $X$ is a time-independent covariate, and $\varepsilon$ is the random error. Suppose that $S_0$ is the baseline survival function of $T$ given $X = 0$. Then $S_0$ is also the survival function of $U = \exp(\varepsilon)$. The accelerated failure time model is extended to subjects with unobserved heterogeneity, such as the genetic type or an environmental factor, that interacts with the main effect. In the extended model, $\kappa$ denotes the frailty among subjects, with $\kappa \sim N(0, \sigma^2)$, and $\alpha$ is the effect of the interaction between the frailty and the main effect. Based on Ko (2014, 2015a, 2015b) and Tseng, Hsieh, and Wang (2005), the hazard rate function for an individual with the main effect can be expressed as
$$\lambda(t) = \lambda_0\left(\psi(t; \beta, \alpha, X)\right)\dot{\psi}(t; \beta, \alpha, X),$$
where $\lambda_0(\cdot)$ is the hazard function for $S_0$ and $\dot{\psi}$ is the first derivative of $\psi$, given by
$$\dot{\psi}(t; \beta, \alpha, X) = \exp(\beta X + \kappa + \alpha \kappa X).$$
The link function, $U$, is then defined, and the induced hazard function based on (3) is obtained from it. Tseng, Hsieh, and Wang (2005) parameterized $\lambda_0$, the hazard function of the baseline failure times $U$, to estimate the baseline hazard function. The induced hazard function based on (3) can thus be rewritten in a form involving $\Omega$, where
$$\Omega = \frac{\lambda_0\left(E[\psi(t; \beta, \alpha, X)]\right)}{\lambda_0(t)}.$$
For time-to-event data, the main effect is expressed as the hazard ratio. In the proportional hazards model, the hazard ratio is
$$\frac{\lambda_1(t)}{\lambda_0(t)} = \exp(\beta),$$
where $\lambda_1(t)$ is the hazard function for the test product, $\lambda_0(t)$ is the hazard function for the placebo control, and $\exp(\beta)$ is the hazard ratio for the main effect. The overall main effect is assessed by a hypothesis test on $\beta$.
In the accelerated failure time model, the hazard function can be written as (6). From model (6), we let
$$\exp(\beta^{*}) = \Omega \exp\left(\sigma^2/2 + \sigma^2 \alpha + \sigma^2 \times \alpha^2/2\right) \exp(\beta),$$
so that model (6) can be rewritten as model (8). Li et al. (2002) give the approximate distribution of the log-rank test under (8). It involves $\Lambda(t) = \int_0^t \lambda_0(s)\,ds$, the functions $p_0$ and $p_1$, the constants $A_0$ and $A_1$, and the term $\log \lambda_1(t) + \alpha \left.\frac{\partial}{\partial \alpha}\right|_{\alpha = 0} \log \lambda_1(t)$. Under model (8), we also know that
$$\lambda_1(t) = -\frac{d}{dt} \log L(\Lambda_1(t)) = -\lambda_0(t) \exp(\beta) \frac{L'(\Lambda_1(t))}{L(\Lambda_1(t))},$$
where $\Lambda_1(t) = \exp(\beta)\Lambda(t)$ and $L(\Lambda_1(t)) = \int e^{-\Lambda_1(t) \exp((1 + \alpha)\kappa)}\,dF(\kappa)$ is the Laplace transform for the random variable $\exp((1 + \alpha)\kappa)$. This expression for $\lambda_1(t)$ can be rewritten in terms of a function $V(t, \beta, \alpha)$, which is independent of the time $t$.
Now let $n_1 = n_0 = \frac{1}{2}n$. The total sample size required under (8) with power $1 - i$ at the one-sided type I error level is expressed in terms of $B$ and $Z_q$, where $Z_q$ is the $100 \times q$ percentile of a standard normal distribution. Thus, for a study designed under the incorrect assumption of model (7) when in fact model (8) holds, the actual power is expressed through $\Phi(\cdot)$, the standard normal cumulative distribution function.
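The exponential factor that multiplies $\Omega \exp(\beta)$ in the definition of $\exp(\beta^{*})$ equals $E[\exp((1 + \alpha)\kappa)]$ for a normal frailty $\kappa \sim N(0, \sigma^2)$, since $(1 + \alpha)^2 \sigma^2 / 2 = \sigma^2/2 + \sigma^2\alpha + \sigma^2\alpha^2/2$. The following minimal Python sketch checks this identity numerically. The function names and the integration grid are illustrative assumptions and are not part of the original derivation.

```python
import math


def frailty_factor_closed_form(sigma, alpha):
    """exp(sigma^2/2 + sigma^2*alpha + sigma^2*alpha^2/2), the factor in exp(beta*)."""
    s2 = sigma ** 2
    return math.exp(s2 / 2 + s2 * alpha + s2 * alpha ** 2 / 2)


def frailty_factor_numeric(sigma, alpha, half_width=12.0, steps=20_000):
    """Trapezoidal approximation of E[exp((1 + alpha) * kappa)], kappa ~ N(0, sigma^2).

    Requires sigma > 0. The grid (half_width, steps) is an illustrative choice.
    """
    lo, hi = -half_width * sigma, half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        kappa = lo + k * h
        density = math.exp(-kappa ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp((1 + alpha) * kappa) * density
    return total * h


def hazard_ratio_star(beta, omega, sigma, alpha):
    """exp(beta*) = omega * frailty factor * exp(beta), following the definition in the text."""
    return omega * frailty_factor_closed_form(sigma, alpha) * math.exp(beta)


if __name__ == "__main__":
    # Parameter values quoted for Table 1 in the text.
    print(frailty_factor_closed_form(0.962, 1.0))
    print(frailty_factor_numeric(0.962, 1.0))
    print(hazard_ratio_star(-1.44, 4.0, 0.962, 1.0))
```

The two printed factors should agree closely, which confirms that the interaction enters $\exp(\beta^{*})$ only through the variance of $(1 + \alpha)\kappa$.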

Criteria for the similarity of efficacy in a global trial
Suppose that we are interested in judging whether the main effect is significant in a specific region, say the $s$th region. Let $D_s$ be the observed log hazard ratio in the specific region and $D_{s^c}$ the observed log hazard ratio from regions other than the specific region. $D_s$ and $D_{s^c}$ can be written in a discrete form, as in the log-rank test, in terms of the following quantities. In the specific region, $Y_{1s}$ and $Y_{0s}$ are the total numbers of subjects at risk in the treatment group and the placebo group, and $N_{1s}$ and $N_{0s}$ are the total numbers of observed failures in the treatment group and the placebo group. For regions other than the specific region, $Y_{1s^c}$ and $Y_{0s^c}$ are the total numbers of subjects at risk in the treatment group and the placebo group, and $N_{1s^c}$ and $N_{0s^c}$ are the total numbers of observed failures in the treatment group and the placebo group.
Given that the overall result is significant at the one-sided type I error level, we judge whether the treatment is effective in the $s$th region by the following two criteria.
i. $D_s \ge qD$ for some $0 < q < 1$,
ii. $D_s \ge qD_{s^c}$ for some $0 < q < 1$.
The first criterion focuses on the similarity between the main effect in the region of interest, $D_s$, and the overall main effect in all regions, $D$. When $q$ is close to 1, the main effect in the region of interest is required to be almost the same as the overall main effect. When $q$ is close to 0, the main effect in the specific region is allowed to differ considerably from the overall main effect. The second criterion concerns the consistency between the main effect in the region of interest and the main effect in the other regions, $D_{s^c}$. If $q$ is close to 1, strict consistency is required. If looser consistency is acceptable, a smaller value of $q$ can be chosen.
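The two criteria reduce to simple comparisons once the regional and overall estimates are available. The sketch below illustrates them in Python. The function names and the numerical values are hypothetical, and the estimates are assumed to be expressed on a scale where larger values indicate a greater benefit.

```python
def criterion_i(d_s, d_all, q):
    """Criterion (i): D_s >= q * D, comparing the region with the overall effect."""
    return d_s >= q * d_all


def criterion_ii(d_s, d_sc, q):
    """Criterion (ii): D_s >= q * D_sc, comparing the region with the other regions."""
    return d_s >= q * d_sc


def check_region(d_s, d_all, d_sc, q=0.5):
    """Return the outcomes of criteria (i) and (ii) as a pair of booleans."""
    return criterion_i(d_s, d_all, q), criterion_ii(d_s, d_sc, q)


if __name__ == "__main__":
    # Hypothetical estimates, for illustration only.
    print(check_region(d_s=0.30, d_all=0.50, d_sc=0.70, q=0.5))
```

With the same $q$, a region can satisfy criterion (i) while failing criterion (ii), because $D$ already contains the contribution of the region itself whereas $D_{s^c}$ does not.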

Assurance probability of two criteria in case of three regions
We now describe how to determine the proportion of the sample assigned to the specific region, denoted $p_s$, so that the assurance probabilities of criteria (i) and (ii), given a positive overall result, are maintained at a desired level, say 80%. For simplicity, we consider the case of three regions ($K = 3$). Without loss of generality, we assume that $s = 1$. In other words, we want to see whether the overall results can be applied to the first region. The mathematical expressions of the assurance probabilities for criteria (i) and (ii) are shown in the Appendix. The Japanese Ministry of Health, Labor and Welfare (MHLW) suggests that the results for the specific region (Japan) and for the whole population are consistent when $D_s \ge qD$ with $q \ge 0.5$, so we choose $q = 0.5$ in this study. We consider the case $p_2 = p_3$ for criteria (i) and (ii). The design parameters are a one-sided type I error level of 0.025, $i = 0.2$, $\beta = -1.44$, $\Omega = 4$, $\sigma = 0.962$, and $\alpha = 1.00$. Table 1 shows the assurance probabilities of criteria (i) and (ii) for different combinations of design parameters under model (7), and applies to trials with $B = 0.184$. We consider various combinations of $(p_1, p_2, p_3)$. For a multi-regional trial in Asia, the USA, and Europe, the proportions for the USA and Europe are usually required to be equal. For instance, the first line of Table 1 corresponds to a design with $p_1 = 0.10$, $p_2 = 0.45$, $p_3 = 0.45$, and total sample size $n = 526$, for which the assurance probabilities of criteria (i) and (ii) are 0.735 and 0.724, respectively.
From Table 1, the assurance probability for criterion (i) exceeds 80% when $p_1$ is at least 0.2. For criterion (ii), the assurance probability reaches 80% at $p_1 \ge 0.3$. Table 2 shows the assurance probabilities of criteria (i) and (ii) for different combinations of design parameters under the incorrect assumption of model (7) when in fact model (8) holds. Apart from $\Omega$, the parameter values for Table 2 are the same as those for Table 1. The assurance probabilities of criteria (i) and (ii) in Table 2 are lower than the corresponding values in Table 1. If model (8) is true and the multi-regional trial is designed under model (7), the overall trial has insufficient power and the assurance probabilities of criteria (i) and (ii) cannot reach the 80% level. The power under the incorrect assumption of model (7) is 0.001 when in fact model (8) holds. In addition, the power under the incorrect assumption of model (8) without the interaction between frailty and main effect is 0.002 when in fact model (8) holds.
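To illustrate how an assurance probability depends on the regional proportion, the following Python sketch estimates it by Monte Carlo under a simplified normal approximation. It is not the Appendix expression and does not attempt to reproduce Table 1 or Table 2. The working assumptions are that the regional estimates are independent normal variables on a scale where larger values indicate benefit, that their variances are inversely proportional to the regional proportions, that the overall estimate is the proportion-weighted average, and that the trial is powered at 80% for a one-sided level of 0.025. The function name and its arguments are hypothetical.

```python
import math
import random
from statistics import NormalDist


def assurance_probabilities(p, q=0.5, alpha_one_sided=0.025, power=0.8,
                            reps=20000, seed=1):
    """Monte Carlo estimate of P(criterion holds | overall result significant) for region 1.

    Working assumptions (not taken from the paper): regional estimates D_k are
    independent N(delta, 1 / p[k]) in units of the overall standard error, and
    the overall estimate is D = sum_k p[k] * D_k, which then has variance 1.
    """
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha_one_sided)
    delta = z_a + z.inv_cdf(power)  # true effect giving the requested overall power
    rng = random.Random(seed)
    n_sig = hits_i = hits_ii = 0
    for _ in range(reps):
        d = [rng.gauss(delta, 1 / math.sqrt(pk)) for pk in p]
        d_all = sum(pk * dk for pk, dk in zip(p, d))
        if d_all <= z_a:
            continue  # keep only replicates with a significant overall result
        n_sig += 1
        d_sc = sum(pk * dk for pk, dk in zip(p[1:], d[1:])) / (1 - p[0])
        hits_i += d[0] >= q * d_all   # criterion (i)
        hits_ii += d[0] >= q * d_sc   # criterion (ii)
    return hits_i / n_sig, hits_ii / n_sig


if __name__ == "__main__":
    for p1 in (0.10, 0.20, 0.30):
        rest = (1 - p1) / 2
        print(p1, assurance_probabilities([p1, rest, rest]))
```

Under these assumptions, both estimated probabilities increase with $p_1$, and a smaller $q$ gives a larger probability for criterion (i), which matches the qualitative pattern described in the text.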

Simulation study
To assess the performance of the proposed estimator, we conduct three simulation studies. The first explores the performance of the proposed estimator when there is heterogeneity across regions with a significant interaction with the main effect. The second explores its performance when there is no heterogeneity across regions. The third explores its performance when there is heterogeneity across regions but no interaction with the main effect. For the first simulation, we choose the parameters of model (8) as $\beta = 1$, $\Omega = 4$, $\sigma = 1$, $\alpha = 1$. The parameters for the second and third simulations are also specified on model (8). The mean square error of the 1000 point estimates of the parameter is denoted as MSE.
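The three scenarios can be mimicked with a small data generator. The Python sketch below draws failure times from an accelerated failure time model with a normal frailty and a frailty-by-treatment interaction. It is an illustration under stated assumptions rather than the simulation code used for Tables 3-5: the covariate is taken to be a 1:1 treatment indicator, the baseline time $U$ is taken to be Weibull with hypothetical shape and scale, and the time-independent covariate is assumed to give $T = U / \exp(\beta X + \kappa + \alpha \kappa X)$. Setting `sigma=0` for no heterogeneity and `alpha=0` for no interaction is also an assumption, since the parameter values of the second and third simulations are not given in the text.

```python
import math
import random


def simulate_failure_times(n, beta, sigma, alpha, shape=1.5, scale=1.0, seed=0):
    """Draw (X, T) pairs with acceleration factor exp(beta*X + kappa + alpha*kappa*X).

    Illustrative assumptions: X alternates between 0 and 1, the baseline time U
    is Weibull(shape, scale), and T = U / acceleration factor.
    """
    rng = random.Random(seed)
    data = []
    for j in range(n):
        x = j % 2
        kappa = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        u = rng.weibullvariate(scale, shape)  # first argument is scale, second is shape
        factor = math.exp(beta * x + kappa + alpha * kappa * x)
        data.append((x, u / factor))
    return data


if __name__ == "__main__":
    scenarios = {
        "heterogeneity with interaction": dict(beta=1.0, sigma=1.0, alpha=1.0),
        "no heterogeneity": dict(beta=1.0, sigma=0.0, alpha=0.0),
        "heterogeneity, no interaction": dict(beta=1.0, sigma=1.0, alpha=0.0),
    }
    for name, params in scenarios.items():
        sample = simulate_failure_times(1000, **params)
        print(name, sum(t for _, t in sample) / len(sample))
```

In this parameterization, a positive $\beta$ shortens failure times in the treated arm, and the interaction makes the spread of log failure times larger in the treated arm than in the control arm.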
In Table 3, the proposed estimator performs well when there is heterogeneity across regions with a significant interaction with the main effect. The standard deviation of the 1000 point estimates of the parameter (DSE), the average of the 1000 standard errors of the parameter estimate (ASE), and the mean square error of the 1000 point estimates of the parameter (MSE) are all small. This indicates that the proposed method estimates the parameters accurately and precisely.
In Table 4, the proposed estimator performs well when there is no heterogeneity across regions. The DSE, ASE, and MSE are again small, which indicates that the proposed method estimates the parameters accurately and precisely.
In Table 5, the proposed estimator performs well when there is heterogeneity across regions but no interaction with the main effect. The DSE, ASE, and MSE are small in this case as well. In conclusion, Tables 3-5 show that the proposed method estimates the parameters accurately and precisely in all three situations: heterogeneity across regions with a significant interaction with the main effect, no heterogeneity across regions, and heterogeneity across regions without interaction with the main effect.
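The three summary measures can be computed from the replicated estimates as in the following Python sketch. The function name is hypothetical, and two conventions are assumed: the DSE uses the sample standard deviation (denominator $n - 1$), and the MSE is taken around the true parameter value used to generate the data.

```python
from statistics import mean, stdev


def summarize_replicates(estimates, standard_errors, true_value):
    """Summarize simulation replicates by DSE, ASE, and MSE.

    DSE: sample standard deviation of the point estimates.
    ASE: average of the estimated standard errors.
    MSE: mean squared deviation of the point estimates from true_value.
    """
    return {
        "DSE": stdev(estimates),
        "ASE": mean(standard_errors),
        "MSE": mean([(e - true_value) ** 2 for e in estimates]),
    }


if __name__ == "__main__":
    # Hypothetical replicates, for illustration only.
    est = [0.9, 1.1, 1.0, 1.0]
    se = [0.1, 0.1, 0.1, 0.1]
    print(summarize_replicates(est, se, true_value=1.0))
```

A DSE close to the ASE suggests that the estimated standard errors reflect the actual sampling variability, and a small MSE indicates that bias and variance are both small.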

Example
An example is given to illustrate the proposed method. A randomized, double-blind, active-controlled, multi-regional phase III trial was conducted in advanced nonsquamous non-small-cell lung cancer (NSCLC). Bevacizumab, a monoclonal antibody targeting vascular endothelial growth factor, improves survival when combined with carboplatin/paclitaxel in this disease, and the trial investigated the efficacy and safety of cisplatin/gemcitabine (CG) plus bevacizumab in this setting. Eligible patients had histologically or cytologically documented, advanced (stage IIIB, with supraclavicular lymph node metastasis or malignant pleural or pericardial effusion, or stage IV) or recurrent nonsquamous NSCLC. Patients recruited from three regions (Asia, the USA, and Europe) were randomly assigned to receive cisplatin 80 mg/m$^2$ and gemcitabine 1,250 mg/m$^2$ for up to six cycles plus bevacizumab (12.5 mg/kg) or placebo every 3 weeks until disease progression. The primary end point is progression-free survival (PFS). The sample size was determined as follows. The study requires 80% power to detect a difference in PFS between the treatment and placebo at the 0.05 type I error level, assuming an equal number of patients in each group. In addition, the design allows for unrecognized heterogeneity that interacts with treatment. The expected treatment effect is $\beta = -1.44$. The death rate in two years is assumed to be $B = 0.184$. The accelerated failure time model is assumed for patient survival, with $\Omega = 1.485$. The failure time distribution is assumed to be Weibull.
The frailty effect is $\sigma = 0.945$, and the effect of the interaction between unrecognized heterogeneity and treatment is $\alpha = 1.00$. Suppose that Asia is the region of interest and that we wish to examine whether the overall results from the multi-regional trial can be applied to it, given that the overall treatment effect is statistically significant. As seen in Table 1, the total sample size for this trial is 526. When the proportion of patients recruited in Asia is 20% ($p_1 = 0.20$), the assurance probabilities of criteria (i) and (ii) are 0.805 and 0.783, respectively. When the proportion of patients recruited in Asia is raised to 30%, the assurance probabilities of both criteria reach 80%.

Discussion
The choices of $\Omega$ and $\sigma^2$ influence the results of the hypothesis testing. The parameter $\Omega$ is associated with the selection of the accelerated failure time model. The parameter $\sigma^2$ is associated with the effect of unrecognized heterogeneity. A larger sample size $n$ is required if smaller values of $\Omega$ or $\sigma^2$ are chosen. In addition, the adequate range for the frailty parameter $\sigma^2$ is between 0 and 1, because a large value of $\sigma^2$ means that the frailty effect is neglected. Cui and Sun (2004) proposed a method to check the gamma frailty distribution under the marginal proportional hazards frailty model. The method is used to evaluate the existence of site or regional variability and can be applied to pilot data to determine the analytic model (frailty versus standard Cox).
The log-rank test is chosen to calculate the sample size because it does not weight subjects differentially, which makes it suitable for clinical trial design. Choosing a smaller $q$ yields larger assurance probabilities (see Tables 6-7).
In this manuscript, we focus on the model with an interaction between unrecognized heterogeneity and treatment. That is, through the interaction term, the treatment effect is also treated as a random effect. In data analysis, model checking for the interaction term between unrecognized heterogeneity and treatment is available in commercial software such as SAS, using PROC GLIMMIX.
In this paper, we consider only the case in which the unrecognized heterogeneity follows a normal distribution. In future work, we will explore how unrecognized heterogeneity from other distributions influences the sample size determination for a phase III clinical trial under the accelerated failure time model.