Stochastic curtailment tests for phase II trial with time-to-event outcome using the concept of relative time in the case of non-proportional hazards

ABSTRACT As part of the drug development process, interim analysis is frequently used to design efficient phase II clinical trials. A stochastic curtailment framework is often deployed wherein a decision to continue or curtail the trial is taken at each interim look based on the likelihood of observing a positive or negative treatment effect if the trial were to continue to its anticipated end. Thus, curtailment can take place due to early evidence of efficacy or futility. Traditionally, in the case of time-to-event endpoints, interim monitoring is conducted in a two-arm clinical trial using the log-rank test, often under the assumption of proportional hazards. However, when this assumption is violated, the log-rank test may not be appropriate, resulting in loss of power and consequently inaccurate sample sizes. In this paper, we propose stochastic curtailment methods for two-arm phase II trials with the flexibility to allow non-proportional hazards. The proposed methods are built on the concept of relative time, assuming that the survival times in the two treatment arms follow two different Weibull distributions. Three methods, conditional power, predictive power and Bayesian predictive probability, are discussed along with corresponding sample size calculations. The monitoring strategy is illustrated with a real-life example.


Introduction
Interim analysis (IA) is routinely used in phase II/III clinical trials to monitor their progress. IA is carried out one or multiple times, depending on the size of the trial, while maintaining protocol guidelines, and is considered an integral part of the drug development process (Fernandes et al. 2009; Jennison and Turnbull 1990). IA helps to maintain the design protocol and to ensure that design parameters remain in compliance with the trial plan (Jennison and Turnbull 1990). Results from IA guide the study sponsors in deciding whether it is prudent to invest further in the current trial or whether the trial should be stopped for early evidence of futility or efficacy. Thus, IA facilitates important economic decision-making and thereby proper utilization of limited study resources. IA should also be incorporated in a trial protocol for ethical reasons, to ensure subjects are not treated with inferior or unsafe treatments (Jennison and Turnbull 1990). Furthermore, IA helps to monitor unpredictable or unrealistic treatment effects in an ongoing experiment and to inform the data monitoring committee (DMC) regarding the scientific validity of the study. One study reported that out of 1772 randomized clinical trials (RCTs), 470 (27%) reported the use of a DMC, and a further 116 (7%) conducted some form of IA without mentioning a DMC (Tharmanathan et al. 2008). Further meta-analysis suggested that 444 (76%) of 586 clinical trials continued as planned and 13% stopped early for efficacy or futility (Tharmanathan et al. 2008).
A vast amount of literature exists on the topic of clinical trial design related to interim monitoring and adaptive designs. Group sequential designs (GSDs) with various stopping boundaries are frequently used for interim monitoring in large phase III trials. A GSD also allows a trial to be monitored at the interim stage for treatment efficacy or futility. That is, in a GSD, at each interim stage (sometimes referred to as a 'look'), a well-defined boundary can be constructed that helps the DMC make an interim decision on whether to stop the trial for early evidence of efficacy/futility or to continue the trial due to lack of overwhelming evidence of a treatment effect (Dmitrienko 2017; Jennison and Turnbull 1999; Phadnis and Mayo 2020; Pocock 1977; Proschan et al. 2006; Whitehead 1999). However, the stochastic curtailment (SC) procedure has become an attractive option for futility monitoring in small phase II trials. The basic idea of an SC-based test is to utilize the data accumulated at the interim looks to decide whether a trial should be stopped or continued, based on the likelihood of observing a positive or negative treatment effect if the trial were to continue to its planned end. The three most popular SC tests in the context of clinical trial design are explained in detail by several authors (DeMets and Ware 1982; Dmitrienko 2017; Jennison and Turnbull 1999), and reviews of contemporary developments of SC-based methods are also available (He et al. 2015; Kundu et al. 2021; Kunzmann et al. 2022). Of these three methods, the conditional power (CP) method is frequentist, predictive power (PP) is half frequentist and half Bayesian, and the Bayesian predictive probability (BPP) approach adopts fully Bayesian ideas to inform decision-making.
In phase II/III oncology clinical trials, time-to-event (TTE) endpoints are extensively used as primary outcome measures. While the literature on SC methods is very well developed for continuous and binary endpoints, only a few methods are available for TTE endpoints (Jiménez et al. 2019; Kuehnapfel et al. 2017; Magirr et al. 2016; Royston et al. 2011). The most popular among these is the SC method utilizing the log-rank test, and this option is available in many standard statistical software packages. However, it is well known that methods based on the log-rank test are optimal when the proportional hazards (PH) assumption holds true. When the PH assumption does not hold, to the best of our knowledge, there are no SC methods readily available in the literature or provided by standard software. The variations of the log-rank test proposed by various authors in the context of fixed two-arm designs (Collett 2015; Freedman 1982; Lachin and Foulkes 1986), therefore, cannot be extended to construct SC tests when the PH assumption is not true. One of the popular methods available in standard software for designing fixed two-arm trials in the case of non-proportional hazards (NPH) is that proposed by Lakatos (1988); however, it has not been extended to suit the SC testing framework. Recent developments in the design of two-arm trials with NPH have seen considerable attention on the restricted mean survival time approach (Royston and Parmar 2013, 2014, 2016; Royston et al. 2011), but its extension to SC tests has not been developed yet. More recently, clinical trial designs for two-arm trials under the NPH scenario were proposed by Phadnis et al. (2017) and Jachno et al. (2019). In the last few years, the Weibull distribution has received considerable attention in the design of fixed single-arm and two-arm clinical trials, with notable contributions by several authors (Wu 2015; Wu and Xiong 2015). In fact, Waleed et al. (2021) developed a theoretical framework for conducting a single-arm trial using the Weibull distribution. In the context of a fixed two-arm randomized controlled trial, Phadnis and Mayo (2021) recently proposed a novel sample size calculation method allowing both non-proportional hazards (non-constant HR) and non-proportional time (non-AFT model), assuming that survival times in the two arms follow two different Weibull distributions.
Motivated by the work of Phadnis and Mayo (2021), in this article we develop a methodology for conducting SC tests for two-arm phase II trials with TTE primary outcomes under the NPH scenario. Our proposed approach allows us to conduct interim monitoring for early evidence of efficacy and futility utilizing a relative time (RT) framework (see section 2 for details), thereby allowing NPH. Specifically, we consider two scenarios: (i) the new treatment confers a small improvement relative to a standard control at the beginning of the trial, but this benefit increases over time, and (ii) the new treatment confers an immediate large improvement relative to a standard control at the beginning of the trial, but this benefit decreases over time.
Before proceeding to the next section, the reader is recommended to read Appendix A.1-A.3, where we briefly revisit the methodological framework for a two-arm RCT with TTE outcomes in the case of non-proportional hazards and non-proportional time developed by Phadnis and Mayo (2021). These Appendix sections explain the notation used throughout our text and form the basis of the main SC method proposed in this paper. While the full details are available in Phadnis and Mayo's paper, Appendix A.1-A.3 provides a quick overview of the relative time concept, the modeling framework setting up the hypothesis, and the sample size calculations accounting for administrative censoring and loss to follow-up.
The rest of the paper is organized as follows. In section 2, we present the main methods for conducting SC procedures and their use in interim monitoring of a two-arm phase II trial. We then present a simulation study examining various scenarios and their effects on the calculation of CP, PP and BPP, followed by the corresponding results. Finally, we summarize and discuss the advantages and limitations of our proposed method.

Methods
SC is a valuable tool used in the monitoring of clinical trials. SC tests provide trialists the ability to decide whether to terminate a trial prematurely or to continue the study based on an assessment of the observed interim data. This decision is based on the likelihood of observing a positive or negative treatment effect if the trial were allowed to continue to its end. The important distinction between SC and GSD methods is that while GSD methods rely on the calculation of stopping boundaries based on the notion of repeated significance testing using the currently available data, SC methods are more geared toward making predictive inferences (tied to the final trial outcome) given the data available at the interim (Dmitrienko et al. 2017). The SC method was first proposed by Lan et al. (1982), and applications of SC tests in RCTs are well developed (Dignam et al. 1998; Halperin et al. 1982; Lan et al. 2007, 2009). In this section, we implement three main SC methods for futility monitoring utilizing the RT design.

Concept of information time in SC tests
Information time plays an essential role in interim monitoring. It quantifies the amount of statistical information gathered during the study duration. Traditionally, information time for the log-rank test is defined as $t = v_m / V_N$, where $v_m$ ($m = 1, 2, \ldots, N-1$) is the variance at the interim calendar time and $V_N$ is the final variance at the end of the study (Wu and Xiong 2015). However, since the final $V_N$ is unknown, it can be replaced by the projected final variance $\hat{V}_N$ (Collett 2015). In the interim monitoring framework, information time is commonly written as $t = d_m / D_N$, where $d_m$ is the expected (observed) number of events accumulated at the interim stage and $D_N$ is the projected final number of events estimated using the fixed sample design. Wu and Xiong (2015) modified the information time as $t = I_m / I_N$, where $I_m$ is the information at interim stage $m$ and $I_N$ is the final projected information. Thus, we can incorporate the information time in our interim monitoring methods. The approximate information at a calendar time $t$ can be estimated as the ratio of the inverse of the variance calculated at the interim stage to the inverse of the approximated final variance, i.e., $I_m = \hat{\sigma}_m^{-2}$ and $I_N = \hat{\sigma}_N^{-2}$ at the final analysis.
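As a small illustration, the event-count version of the information fraction can be computed directly; the event counts used below are hypothetical and chosen only for the example.

```python
def information_fraction(d_m: int, D_N: int) -> float:
    """Information time t = d_m / D_N: the number of events observed at
    the interim look divided by the projected final number of events."""
    if D_N <= 0 or not 0 <= d_m <= D_N:
        raise ValueError("require D_N > 0 and 0 <= d_m <= D_N")
    return d_m / D_N

# hypothetical interim look: 82 of a projected 164 events have occurred
f = information_fraction(82, 164)
print(f)  # → 0.5
```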

CP approach
CP was first introduced by Lan et al. (1982) in their seminal paper. CP is defined as the conditional probability that the final analysis will yield a significant result at the end of the trial, given the observed data at the interim (Dmitrienko 2017; Dmitrienko and Wang 2006; Jennison and Turnbull 1999; Proschan et al. 2006). CP will be calculated under three scenarios: based on the design effect, based on the current trend, and under the null hypothesis. Let $Z_m$ denote the interim test statistic at interim stage $m$. The CP at interim stage $m$ is defined by $CP(\phi) = \Pr\{Z_N \text{ will reject } H_0 \mid Z_m\}$. Here, to follow the conventional mathematical notation of SC-based tests and for ease of readability, we define $\phi$ as the relative time effect size $RT(p_{mid})$, and $Z_m$ ($m = 1, 2, \ldots, N-1$) plays the role of the test statistic $Q_0$ defined in the supplementary material. The test continues to the final stage $N$, where it rejects $H_0$ if $Z_N > Z_{1-\alpha}$ and accepts otherwise. For each $m = 1, 2, 3, \ldots$, the interim statistic can be expressed on the Brownian motion (B-value) scale as $B_f = Z_m \sqrt{f}$. We can also define the variance at interim stage $m$ as $\sigma^2_m = \sigma^2 / d_m$, where $d_m$ is the number of events observed at interim stage $m$ in the control or treatment arm; the corresponding final variance is $\sigma^2_N = \sigma^2 / D_N$, where $D_N$ is the final number of events at the end of the trial in either arm, and the information fraction is $f = d_m / D_N$. This information fraction is equivalent to the information time discussed in section 2.1.
Under the null hypothesis, when $\phi = 0$, the CP can be written as
$CP_m(\phi_{H_0}) = 1 - \Phi\left(\frac{Z_{1-\alpha} - Z_m\sqrt{f}}{\sqrt{1-f}}\right)$.
Similarly, we can use the effect size observed at the interim stage by utilizing the Brownian motion framework (Lachin 2005). Under the current trend, the drift can be estimated as $\hat{\theta} = Z_m / \sqrt{f}$. Here, we are utilizing the linear relationship between the interim test statistic and Brownian motion, $B_f = Z_m \sqrt{f}$ (assuming that the future data derive from the same distribution as the data observed so far (Lachin 2005)); therefore, the CP under the current trend is
$CP_m(\phi_T) = 1 - \Phi\left(\frac{Z_{1-\alpha} - Z_m/\sqrt{f}}{\sqrt{1-f}}\right)$,
and if this is sufficiently high we should continue the trial. Here, $\gamma$ is a prespecified futility index, $Z_{1-\alpha}$ denotes the $100(1-\alpha)$th percentile of the standard normal distribution, and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. We terminate the trial due to futility/lack of efficacy if $CP_m(\phi) < 1 - \gamma = 0.2$, where it is assumed that $\gamma = 0.80$. The type I error probability of an SC procedure with interim analyses is less than $\alpha/\gamma$, i.e., $CP(\phi_0 \text{ reject } H_0) \le \alpha/\gamma$, and the type II error of the SC procedure is no more than $\beta/\gamma_0$. Thus, to ensure type I and type II error probabilities of at most $\alpha$ and $\beta$, we can design the test using type I error rate $\alpha\gamma$ and power $1 - \beta\gamma_0$ at $\phi = \phi_a$ for a one-sided test (Halperin et al. 1982; Jennison and Turnbull 1999). Dmitrienko (2017) reported that when the observed treatment effect differs from the prespecified design effect, the CP test can produce an unreasonably positive outlook for a trial. Hence, an adaptive version of the CP test is used, where the effect size is taken as $\theta_A = \hat{\theta}_m + c \cdot \mathrm{SE}(\hat{\theta}_m)$, with $c = 1$ (Pepe-Anderson test) or $c = 2.326$ (Betensky test). In this paper, we use the notation $CP(\phi_A)$ for the CP computed under this adaptive effect size. Detailed computation is described in Dmitrienko's (2017) book.
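To make the CP variants concrete, the following sketch evaluates them on the B-value scale under the Brownian-motion approximation. The interim statistic and information fraction in the usage lines are hypothetical, and `theta` stands for the assumed drift (zero for the null, the design drift for the fixed design effect, or omitted for the current trend).

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()

def conditional_power(z_m, f, alpha=0.05, theta=None):
    """Conditional power at information fraction f given interim
    statistic z_m, using the Brownian-motion approximation
    B(f) = z_m * sqrt(f) with drift theta.

    theta=None    -> CP under the current trend (theta_hat = z_m / sqrt(f))
    theta=0       -> CP under the null hypothesis
    theta=theta_a -> CP under a hypothesized design drift
    """
    z_alpha = N01.inv_cdf(1 - alpha)
    b = z_m * sqrt(f)                      # observed B-value
    if theta is None:
        theta = z_m / sqrt(f)              # current-trend drift estimate
    mean_final = b + theta * (1 - f)       # E[B(1) | B(f), theta]
    return 1 - N01.cdf((z_alpha - mean_final) / sqrt(1 - f))

# hypothetical interim: z_m = 1.2 at half the information (f = 0.5)
print(round(conditional_power(1.2, 0.5, theta=0), 3))   # null hypothesis
print(round(conditional_power(1.2, 0.5), 3))            # current trend
```

Note that the current-trend value always dominates the null-hypothesis value for a positive interim statistic, since the estimated drift carries the observed benefit forward.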

Predictive power (PP) approach
PP, on the other hand, was introduced by Spiegelhalter et al. (1994, 2004) for trial monitoring. PP is calculated by averaging the CP function (the frequentist component) over the posterior distribution (the Bayesian component) of $\phi$, given its estimate $\hat{Z}_m$ at interim stage $m$ (Dmitrienko 2017; Dmitrienko and Wang 2006; NCSS 2015). The development of the PP approach is given below. Let $Z_m$ and $Z_N$ denote the test statistics computed at the interim and final analyses, respectively. The mathematical formula for PP can be expressed as:

$PP = \int CP(\phi)\, \pi(\phi \mid \hat{Z}_m)\, d\phi$,
where $\pi(\phi \mid \hat{Z}_m)$, whose distribution can be approximated by a normal, is the posterior density of $\phi$ given its estimate $\hat{Z}_m$ at interim stage $m$.
The trial is terminated at the IA due to lack of treatment benefit if the PP is less than $1 - \gamma$ for some prespecified futility index $\gamma$, and the trial continues otherwise (Dmitrienko 2017; Dmitrienko and Wang 2006; NCSS 2015). Under the normality assumption, Spiegelhalter et al. (1994, 2004) showed that the above expression has a closed-form solution. Several sources in the literature utilize a normal approximation for the log-hazard ratio when dealing with TTE outcomes (Dignam et al. 1998; Spiegelhalter et al. 1986, 1994, 2004). Similarly, a normal approximation is also used for the log-odds ratio in the case of binomial responses (Dmitrienko 2017; Dmitrienko and Wang 2006). Therefore, for the sake of simplicity, we also assume that the $m$ observations at the interim stage can be summarized by a statistic $z_m$ whose distribution can be approximated as $z_m \sim N(\phi, \sigma^2/m)$, where $\sigma^2$ is assumed known and $\phi$ is the effect size under the RT design. It is also mathematically convenient and realistic to use a conjugate normal prior distribution, $p(\phi) \sim N(\mu, \sigma^2/n_0)$, where $\mu$ is the prior mean, $\sigma^2$ is the known common variance, and $n_0$ is the prior number of events. The sequence of standardized test statistics $\{z_m : m = 1, \ldots, N-1\}$ is assumed to follow a multivariate normal distribution and has the Markov property, as discussed by Jennison and Turnbull (1990, 1999). Furthermore, after observing $m$ events at the interim stage, we are interested in deciding whether we should continue to accrue $n$ further events, for $m + n = N$ total events. Suppose we wish to make a prediction concerning the future summary statistic $z_n$, where $z_n \sim N(\phi, \sigma^2/n)$. Since we have observed $z_m$, the posterior distribution of $\phi$ conditional on $z_m$ is
$\phi \mid z_m \sim N\left(\frac{n_0\mu + m z_m}{n_0 + m}, \frac{\sigma^2}{n_0 + m}\right)$,
and the posterior predictive distribution of the future data can be written as
$z_n \mid z_m \sim N\left(\frac{n_0\mu + m z_m}{n_0 + m}, \sigma^2\left(\frac{1}{n} + \frac{1}{n_0 + m}\right)\right)$.
Using the predictive distribution of $z_n$, we can calculate the PP. Under a uniform prior distribution, $n_0 \to 0$, and letting the information fraction be $f = \frac{m}{m+n} \approx \frac{I_m}{I_N}$, the PP can be written as
$PP = 1 - \Phi\left(\left(Z_{1-\alpha} - \frac{z_m}{\sqrt{f}}\right)\sqrt{\frac{f}{1-f}}\right)$.
This expression for PP with a uniform prior has been used in the literature (Dmitrienko 2017; Dmitrienko and Wang 2006; NCSS 2015). Please see Appendix A.4 for our detailed derivation.
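A minimal sketch of this closed form follows; the interim values in the usage line are illustrative and not tied to the trial data in Tables 1-4.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()

def predictive_power_uniform(z_m, f, alpha=0.05):
    """Predictive power under a uniform (non-informative) prior:
    conditional power averaged over the flat-prior posterior of the
    drift, which reduces to the closed form
        PP = 1 - Phi((z_{1-a} - z_m / sqrt(f)) * sqrt(f / (1 - f)))."""
    z_alpha = N01.inv_cdf(1 - alpha)
    return 1 - N01.cdf((z_alpha - z_m / sqrt(f)) * sqrt(f / (1 - f)))

# hypothetical interim: z_m = 1.2 at information fraction f = 0.5
print(round(predictive_power_uniform(1.2, 0.5), 3))
```

Because the posterior spreads the drift around its interim estimate, PP is typically less extreme than the current-trend CP for the same interim data.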

BPP approach
Designing clinical trials using Bayesian methods has become very popular and is routinely used for continuous or discrete data monitoring in adaptive designs. IA using BPP requires the statistician to construct prior distributions on certain input parameters in order to draw inferences about the posterior distribution of these parameters. In a general BPP setting, we have a positive outcome if the posterior probability of a clinically relevant improvement exceeds a prespecified threshold, i.e., $P(\phi > \delta \mid \text{data}) > \eta$, where $\eta$ is between 0 and 1 (Dmitrienko 2017; Dmitrienko and Wang 2006). In practice, $\eta$ is set between 0.9 and 0.975, which is equivalent to a frequentist significance level. A high BPP indicates that, given the interim data, the treatment will likely be superior to the control if the trial continues to maintain its accrual and follow-up rate, whereas a low BPP may indicate that the interim data do not carry a strong signal for efficacy. Several papers recommend using the uniform prior distribution for calculating PP and BPP, since the uniform prior ensures that both PP and BPP depend only on the observed data (Dmitrienko 2017; Dmitrienko and Wang 2006; Spiegelhalter et al. 1994, 2004). We can write the BPP as the probability, under the predictive distribution of $z_n$, that the final posterior success criterion is met, i.e., that
$z_n > \frac{n_0 + m + n}{n}\left(\delta + \frac{z_\eta\,\sigma}{\sqrt{n_0 + m + n}}\right) - \frac{n_0\mu + m z_m}{n}$.
Spiegelhalter et al. (1986, 1994, 2004) showed that this predictive probability of the future posterior tail area has a closed form; the derivation is given in Appendix A.5.
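The simulation-based calculation of BPP can be sketched as follows under the conjugate-normal setup above. Here `z_m` is the interim estimate of the effect on the $\phi$ scale, a prior sample size `n0` near zero approximates the uniform prior, and all numeric inputs in the usage line are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()

def bpp_sim(z_m, m, n, mu=0.0, n0=1e-6, sigma=1.0,
            delta=0.0, eta=0.9, n_sim=10_000, seed=1):
    """Simulation-based Bayesian predictive probability with conjugate
    prior phi ~ N(mu, sigma^2 / n0). For each simulated trial, draw the
    summary z_n of the n future events from the posterior predictive
    distribution, then check whether the final posterior satisfies
    P(phi > delta | z_m, z_n) > eta."""
    rng = random.Random(seed)
    post_mean = (n0 * mu + m * z_m) / (n0 + m)
    pred_sd = sigma * sqrt(1 / n + 1 / (n0 + m))   # predictive sd of z_n
    success = 0
    for _ in range(n_sim):
        z_n = rng.gauss(post_mean, pred_sd)
        final_mean = (n0 * mu + m * z_m + n * z_n) / (n0 + m + n)
        final_sd = sigma / sqrt(n0 + m + n)
        if 1 - N01.cdf((delta - final_mean) / final_sd) > eta:
            success += 1
    return success / n_sim

# hypothetical interim: effect estimate 0.5 after 40 events, 40 to come
print(round(bpp_sim(0.5, 40, 40), 3))
```

Swapping the near-flat default `n0` for the skeptical or enthusiastic prior event counts discussed below changes only the `mu` and `n0` arguments.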

Simulation study
In a two-arm randomized survival clinical trial, the research interest is often to assess the treatment effect (improvement in survival benefit of the treatment relative to the control). Phadnis and Mayo (2021) discussed the example of a cholangiocarcinoma trial in a two-arm setting. Here, we show how SC methods can be deployed for that example, hypothetically assuming that interim data were available for the CP, PP and BPP calculations. Thus, we consider a scenario where a clinician wants to perform a one-sided test with 80% power to detect an improvement in median PFS in the treatment arm (6 months) versus the median PFS in the control arm (4 months). The study design parameters are as follows: one-sided test with $\alpha = 0.05$, power ($\omega$) = 0.8, Weibull shape parameter for the control arm ($\beta_0$) ranging from 0.25 to 1.50, median survival in the control arm = 4 months, equal allocation ratio ($r$ = 1), accrual time ($a$) = 12 months, follow-up time ($f$) = 12 months, and dropout rate ($\rho$) = 0.2. IA is conducted at 18 months. Moreover, we consider different values of the Weibull shape parameter to represent increasing ($\beta > 1$), decreasing ($\beta < 1$) and constant ($\beta = 1$) hazards, since in TTE analysis the sample size calculation depends heavily on the shape parameter of the Weibull distribution (Phadnis 2019; Phadnis et al. 2020). Furthermore, varying these parameters, we consider the following scenarios: (i) $RT(p_1 = 0.10) = 1.52$ and $RT(p_2 = 0.90) = 1.98$, implying that at the 10th percentile the treatment arm improves progression-free survival (PFS) or overall survival (OS) by a factor of 1.52, whereas at the 90th percentile the improvement factor is 1.98, indicating a gradual improvement in the treatment effect from the 10th to the 90th percentile rather than an instantaneous effect; (ii) similarly, $RT(p_1 = 0.10) = 2$ and $RT(p_2 = 0.90) = 1.5$, implying that the treatment benefit gradually declines from the 10th to the 90th percentile; and (iii)-(iv) analogous scenarios defined at the 25th and 75th percentiles. A graphical representation of these scenarios is presented in Figure 1. The left panel illustrates the Kaplan-Meier (KM) curves when $RT(p_1 = 0.10) = 1.50$ and $RT(p_2 = 0.90) = 2$, and the right panel depicts the scenario where $RT(p_1 = 0.10) = 2$ and $RT(p_2 = 0.90) = 1.5$. Throughout the analysis, we have considered only one IA, owing to the fact that phase II studies are generally small to moderate in size. However, if needed, a similar kind of analysis can be adopted for studies with multiple interim looks, with an adjustment of the alpha level.
As the calculation of CP, PP and BPP in SC tests requires interim data, we generated this (hypothetical) interim data. We followed the method of Wan (2017) in simulating interim data from two different Weibull distributions with median PFS of 6 months and 4 months in the treatment and control arms, respectively, maintaining a censoring rate of 20%. For the sake of simplicity, we assume that the IA is conducted only after all the patients have been accrued in the study. Interim event times were simulated from Weibull distributions for both the control and treatment arms. The scale parameter of the Weibull distribution for the control arm is $\lambda_0 = 4/(\ln 2)^{1/\beta_0}$, which yields a median PFS of 4 months; similarly, the scale parameter for the treatment arm is $\lambda_1 = 6/(\ln 2)^{1/\beta_1}$, which yields a median PFS of 6 months. In this section, we briefly discuss the prior elicitation. We explore three different priors in our hybrid Bayesian method (PP) and fully Bayesian analysis (BPP): uninformative (uniform), skeptical (weak), and enthusiastic (strong) priors (Fayers et al. 2005). A prior distribution represents the level of skepticism or enthusiasm in a clinician's belief regarding the treatment effect. For the uninformative prior, we assumed a normal distribution centered at the null hypothesis of no treatment difference with an infinite variance (Fayers et al. 2005). Using the method described by Dignam et al. (1998) and Fayers et al. (2005), for a one-sided probability we can calculate the standard deviation $\sigma_{scep}$ of the skeptical prior, chosen so that the prior probability of the effect exceeding the alternative is small. Also, at the interim stage, the variance of the RT effect size can be expressed as $\mathrm{Var}(\hat{\sigma}_m) = \sigma^2/d_m$. This approximation helps us determine the sample size corresponding to the skeptical prior. We can then calculate the common standard deviation $\sigma = \sqrt{\mathrm{Var}(\hat{\sigma}_m) \cdot d_m}$, and the number of events for the skeptical prior can be written as $n_0 = \sigma^2/\sigma^2_{scep}$. Therefore, the skeptical prior is $N(0, \sigma^2/n_0)$. Similarly, the enthusiastic prior assumes a normal distribution with mean $\log RT_{0.5}$ and variance $\sigma^2/n_0$, i.e., the enthusiastic prior is $N(\log RT_{0.5}, \sigma^2/n_0)$. While $CP(\phi)$ and $PP(\phi)$ were calculated based on the explicit formulas mentioned above, $BPP(\phi)$ was calculated using a simulation-based approach, based on 10,000 hypothetical simulated trials for each scenario mentioned above.
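A simplified sketch of how such interim data can be generated is shown below. This is illustrative inverse-CDF sampling with random censoring rather than a full implementation of the Wan (2017) procedure, and the arm sizes and seed are arbitrary.

```python
import random
from math import log

def weibull_scale(median, shape):
    """Weibull scale lambda such that S(median) = 0.5:
    median = lambda * (ln 2)^(1/shape)  =>  lambda = median / (ln 2)^(1/shape)."""
    return median / (log(2) ** (1 / shape))

def simulate_arm(n, median, shape, censor_rate=0.2, seed=7):
    """Draw n Weibull event times by inverse-CDF sampling and mark a
    random censor_rate fraction as censored (a crude stand-in for
    dropout, not the Wan (2017) mechanism). Returns (time, event) pairs."""
    rng = random.Random(seed)
    lam = weibull_scale(median, shape)
    data = []
    for _ in range(n):
        u = 1.0 - rng.random()                    # uniform on (0, 1]
        t = lam * (-log(u)) ** (1 / shape)        # inverse Weibull CDF
        event = rng.random() > censor_rate        # ~80% observed events
        data.append((t, event))
    return data

# control arm: median PFS 4 months; treatment arm: median PFS 6 months
control = simulate_arm(60, median=4, shape=0.75)
treatment = simulate_arm(60, median=6, shape=0.75)
```

With `shape=1` this reduces to exponential sampling, matching the constant-hazard case discussed above.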

Results
In this section, BPP is compared with the frequentist CP methods and the hybrid PP approach. We also discuss the role of the prior in clinical trials with TTE endpoints under the normal approximation. Furthermore, we assessed the power when the PH assumption was violated but the log-rank test was used to design the clinical trial. Tables 1-4 show the results for SC tests monitoring a hypothetical clinical trial. The left-most column represents the time ratio at the $p$th percentile, using the notation $RT(p)$. Table 1 shows the $CP_m(\phi)$ results under two different scenarios with $p_1 = 0.10$ and $p_2 = 0.90$, for varying values of the control Weibull shape parameter (see the first two columns). The fourth column shows the total number of events ($D$) and the sample size ($N$), equally allocated between arms and adjusted for accrual, follow-up and dropout, for the fixed design with 80% power. In the fifth column, $d_m^0$ and $d_m^1$ represent the numbers of events observed at the $m$th interim time point. Under CP, we consider several effect sizes based on the literature review. $CP_m(\phi_{H_0})$ is defined as the CP under the null hypothesis, whereas $CP_m(\phi_{H_a})$ is calculated based on the alternative fixed design effect. Similarly, $CP_m(\phi_T)$ and $CP_m(\phi_A)$ denote the CP under the currently observed effect size and the CP with the adaptive effect size, respectively. The last four columns of Table 1 compare the four CP methods. Similarly, Table 2 displays the PP and BPP calculations for the same inputs as in Table 1, shown under the skeptical, uniform and enthusiastic priors. From Table 1, we can see that for $\beta_0 = 0.75$ with $RT(p_1 = 0.10) = 1.52$ and $RT(p_2 = 0.90) = 1.98$, we have 50 events observed in the control arm and 32 events observed in the treatment arm. $CP_m(\phi_{H_a})$ indicates that at the first interim look there is a 73.3% chance that the trial will be successful by the end of the study period, given that the true effect is observed throughout the rest of the study. Likewise, $CP_m(\phi_T)$ and $CP_m(\phi_A)$ show that there is a high probability that the trial will be effective if continued to the end of the trial period. Based on this information, i.e., looking at the available data at the interim stage, the trial should be continued. From Tables 2 and 4, we can see that BPP is more sensitive to the choice of prior distribution than the PP approach. Under an enthusiastic prior, BPP tends to produce higher success probabilities than PP, and similarly, under a weak prior, BPP tends to produce lower success probabilities than PP. However, when the shape parameter is large, the sample size is very small and BPP did not perform well. A similar strategy can be adopted for the other scenarios shown in Tables 1-4.
Table 5 shows the comparison between the log-rank test and the $RT(p)$ method. We compare the performance of our proposed method under the $RT(p)$ framework to that of the standard log-rank test (which is known to perform best when the PH assumption is met). We consider three cases: (i) $RT(p_1)$ and $RT(p_2)$ are the same, i.e., the Weibull shape parameters of the two arms are equal, so that both the PT and PH assumptions are satisfied; (ii) $RT(p_1 = 0.10) = 1.52$ and $RT(p_2 = 0.90) = 1.98$; and (iii) $RT(p_1 = 0.10) = 2$ and $RT(p_2 = 0.90) = 1.5$. We observe that in case (i), for $\beta_0 = 0.75$ and $\beta_1 = 0.75$, as expected, both approaches have almost identical power for the CP method. However, the PP under the uniform prior has relatively higher power than the log-rank test.

Table 1. Conditional power for futility monitoring with the RT method for $r = 1$ (equal allocation ratio), one-sided test, 80% power, accrual time of 12 months and follow-up time of 12 months, with $p_1 = 0.10$, $p_2 = 0.90$ and control median PFS = 4 months.

Conditional power (CP)
$d_m^0$ and $d_m^1$ are the observed numbers of events in the control arm and treatment arm, respectively, at the interim time.
Table 2. Predictive power and Bayesian predictive probability of futility monitoring for the RT method with $r = 1$ (equal allocation ratio), one-sided test, 80% power, accrual time of 12 months and follow-up time of 12 months, with $p_1$ = 0.10, $p_2$ = 0.90 and control median PFS = 4 months.

Predictive power (PP)
Bayesian predictive probability (BPP)

In case (ii), since the data are generated consistent with the RT method, both SC methods have higher power than the log-rank-test-based curtailment method. However, that is not the case in (iii), where the log-rank test has relatively higher power than the RT method. We suspect that the relatively lower power in that case depends mostly on how $RT(p_1)$ and $RT(p_2)$ are defined when designing a TTE clinical trial.
As elucidated earlier, many authors suggest using SC tests only for futility monitoring, while only a few have implemented SC tests for both efficacy and futility monitoring. Since in our paper we are conducting SC tests for the RT method, we implemented both futility and efficacy criteria to assess the unconditional type I error rate, which was evaluated using the CP approach. Our results indicate that if the RT method allows only futility monitoring, the unconditional type I error rate is adequately maintained, ranging from 0.013 to 0.09. The maximum type I error rate of 0.09 occurs in the extreme case of the Weibull shape parameter ($\beta = 0.5$). However, if the SC test allows both futility and efficacy monitoring, the unconditional type I error is somewhat inflated in most scenarios, ranging from 0.02 up to 0.159, and in the extreme case of the Weibull shape parameter ($\beta = 0.50$) it goes up to a maximum of 0.19. These results suggest that SC tests under the RT-based approach are more appropriate for futility monitoring in a phase II trial. Where both efficacy and futility monitoring are desired, other designs such as a group sequential testing strategy may be more appropriate.
Table 3. Conditional power of futility monitoring for the RT method with $r = 1$ (equal allocation ratio), one-sided test, 80% power, accrual time of 12 months and follow-up time of 12 months, with $p_1$ = 0.25, $p_2$ = 0.75 and control median PFS = 4 months.

Discussion and conclusion
In this paper, we have proposed SC tests using three different methods of calculation, CP, PP and BPP, for phase II trials with TTE outcomes in the case of non-proportional hazards. In doing so, we have extended the novel sample size calculation methodology for fixed two-arm trials developed by Phadnis and Mayo (2021) to accommodate interim monitoring, allowing researchers to make informed decisions on whether they should continue their trial at an interim look. In this work, we have assumed that the treatment and control arms follow two different Weibull distributions, allowing for increasing, decreasing and constant hazards (the exponential distribution). This allows us to operate under the relative time framework, where the most general case represents scenarios in which neither the PT nor the PH assumption is satisfied. That is, neither the hazard ratio nor the time ratio is constant. This 'changing' (non-constant) time ratio requires that the effect size be defined through two pairs of coordinates, $\{p_1, RT(p_1)\}$ and $\{p_2, RT(p_2)\}$, as opposed to the usual 'single effect size' definition found in studies based on the PH assumption or on improvement in median (or mean) survival time. The actual choice of $p_1$ and $p_2$ depends on the specific disease being studied. For example, an oncology trial may use $p_1 = 0.25$, $RT(p_1) = 2$ and $p_2 = 0.75$, $RT(p_2) = 1.5$, reflecting a high treatment benefit at the 25th percentile that tapers off by the 75th percentile. On the other hand, a surgical intervention may use $p_1 = 0.10$, $RT(p_1) = 1.5$ and $p_2 = 0.90$, $RT(p_2) = 2$, reflecting a small treatment benefit immediately after surgery at the 10th percentile that improves as the surgery proves successful, yielding a higher treatment benefit at the 90th percentile.
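As an illustration of this two-coordinate effect size, under the two-Weibull assumption the percentile relation $t_p = \lambda(-\ln(1-p))^{1/\beta}$ lets one back out the treatment-arm Weibull parameters implied by a chosen pair $\{p_1, RT(p_1)\}$, $\{p_2, RT(p_2)\}$. The solver below is our own sketch of this algebra, not code from Phadnis and Mayo (2021).

```python
from math import log

def treatment_weibull(beta0, lambda0, p1, rt1, p2, rt2):
    """Solve for the treatment-arm Weibull (shape beta1, scale lambda1)
    implied by two relative-time coordinates (p1, RT(p1)), (p2, RT(p2))
    against a control Weibull(beta0, lambda0), using
        ln RT(p) = ln(lambda1/lambda0) + (1/beta1 - 1/beta0) * ln(-ln(1-p))."""
    u1, u2 = -log(1 - p1), -log(1 - p2)
    inv_beta1 = 1 / beta0 + log(rt2 / rt1) / log(u2 / u1)
    beta1 = 1 / inv_beta1
    lambda1 = lambda0 * rt1 * u1 ** (1 / beta0 - inv_beta1)
    return beta1, lambda1

# e.g. control shape 0.75 with median 4 months, and the oncology-style
# coordinates RT(0.10) = 1.52, RT(0.90) = 1.98 used in the simulations
lambda0 = 4 / log(2) ** (1 / 0.75)
beta1, lambda1 = treatment_weibull(0.75, lambda0, 0.10, 1.52, 0.90, 1.98)
```

When $RT(p_1) = RT(p_2)$ the solver returns $\beta_1 = \beta_0$, recovering the proportional-time special case.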
The methods available in the literature for conducting SC calculations are largely focused on the log-rank test and are expected to perform best when the PH assumption is satisfied. Thus, our proposal provides statisticians and researchers more options for conducting SC tests using CP, PP and BPP when they suspect that the PH assumption is not valid, but the choice of two different Weibull distributions is appropriate. Another advantage of our method is that, since it is based on the RT framework, decisions at the interim can be taken in a more intuitive way using the metric of 'improvement in longevity' rather than the traditional 'reduction in hazard'. The aim of a phase II trial is to investigate treatments that look promising for subsequent large-sample phase III trials. For example, in the field of oncology, tumor response rate was popularly used as a primary endpoint; in recent times, however, PFS and OS have been used as TTE endpoints, as recommended by Rubinstein (2014). In our experience, researchers working on phase II oncology trials discuss their hypothesis in terms of improvement in median PFS in their first meeting with a statistician. At this point, it is hard for the researchers to define the treatment effect size in terms of a hazard ratio. Thus, the RT(p) framework of our proposed method is well suited to this purpose.
We have shown how SC tests can be performed in three different ways, using the CP, PP and BPP approaches. SC tests carry the advantage that they can be conducted at prespecified interim analyses or even in an unplanned way. The SC method allows the trial to be monitored using the prespecified effect size, the current estimated effect size or the null effect size. Furthermore, SC boundaries are continuous, whereas group sequential boundaries are discrete, and they can easily be visualized at the beginning of the trial (Davis and Hardy 1994). Thus, our proposed methods can be used in a flexible manner when dealing with TTE outcomes with NPH.
Despite their general applicability, there are some limitations inherent to our proposed methods. Our sample size calculations under NPH and NPT depend on the assumption of Weibull distributions in the two trial arms; if this assumption is violated, the performance of the method may be compromised. Another limitation is that our method assumes a point estimate of the shape parameter of the control-arm Weibull distribution is available from historical sources. When no such reliable estimate is available, it may be difficult to execute our method. A simple ad-hoc solution would then be to assume that this shape parameter is exactly 1, implying that the survival time in the control arm follows an exponential distribution. In the absence of any prior historical knowledge, this would still be a reasonable, although possibly inaccurate, assumption. In our future work, we intend to address this issue in a more comprehensive manner. The third limitation is that, in its current form, the method is developed for the scenarios represented in Figure 1. This is a direct consequence of the fact that our method extends the assumptions used by Phadnis and Mayo (2021), wherein the two curves are allowed to cross very early or very late depending on the specific disease/condition under consideration. Thus, it is not intended for two arbitrarily crossing survival curves, and more research needs to be done to accommodate such features. Finally, for our proposed SC tests under the RT-based approach, the overall unconditional type I error rate was well maintained in most scenarios when the design allowed only futility monitoring, whereas it was inflated when the design allowed both efficacy and futility monitoring. We therefore recommend using our proposed SC tests primarily for futility monitoring.
Overall, our proposed methodology offers an attractive option to clinicians and statisticians designing randomized phase II survival trials. In conclusion, both Bayesian and frequentist methods can provide insight for the interim decision on whether a trial should continue or be stopped, by taking into consideration the data observed at the interim stage of the trial.

Disclaimer
Although the examples discussed in this manuscript represent real-life clinical situations, the effect size definitions used in this manuscript are purely hypothetical in nature. We have not used any original datasets from our collaborations on previously funded grants, but we occasionally rely on our published results for the parameter estimates used in this manuscript.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The author(s) reported there is no funding associated with the work featured in this article.
Hence the CP at stage m for rejecting H0 about a parameter φ at the end of the study, given Z_m, is

CP_m(φ) = Φ( [Z_m √I_m − z_{1−α} √I_N + φ(I_N − I_m)] / √(I_N − I_m) ),

where I_m = 1/σ_m^2 is the information time at stage m (m = 1, …, N) and I_N = 1/σ_N^2 is the information time at stage N, and d_0m and d_1m are the observed numbers of events in the control arm and treatment arm, respectively, at the interim time m.
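This conditional power calculation can be sketched numerically with only the standard library. The function name, argument names, and the default critical value z_alpha = 1.6449 (one-sided α = 0.05) are our own illustrative choices; the function follows the standard Brownian-motion formulation of CP under an assumed drift φ for the remainder of the trial, not code from the paper.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def conditional_power(z_m, info_m, info_n, phi, z_alpha=1.6449):
    """CP of rejecting H0 at the final analysis, given interim statistic z_m
    at information info_m, final information info_n, and assumed drift phi
    per unit of information for the remainder of the study."""
    remaining = info_n - info_m
    num = z_m * math.sqrt(info_m) - z_alpha * math.sqrt(info_n) + phi * remaining
    return norm_cdf(num / math.sqrt(remaining))
```

For futility monitoring, this would be evaluated at the interim under the null (phi = 0), the current trend, or the design effect size, and compared against a prespecified futility threshold.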

Table 5. Comparison of RT methods with the log-rank test statistic under different scenarios with r = 1 (equal allocation ratio), α = 0.05, one-sided test, power 80%, accrual time and follow-up time of 12 months, respectively, with control median PFS = 4 months. CP_m(φ_T) = conditional power under current trend; PP_u(φ) = predictive power with a uniform prior.