Group sequential designs for cancer immunotherapy trial with delayed treatment effect

ABSTRACT Cancer immunotherapy trials are frequently characterized by delayed treatment effects, so that the proportional hazards assumption is violated and the log-rank test suffers a substantial loss of statistical power. To increase the efficiency of the trial design, a variety of weighted log-rank tests have been proposed for fixed sample and group sequential trial designs. However, futility interim monitoring is often not recommended in such group sequential designs, because a delayed treatment effect could result in a high false-negative rate. To resolve this problem, we propose a group sequential design using a piecewise weighted log-rank test, which provides an event-driven approach based on the number of events after the delay time. That is, the interim looks will not be conducted until the planned number of events has been observed after the delay time. Thus, it reduces the risk of a false-negative conclusion due to the delayed treatment effect. Furthermore, with an event-driven approach, the proposed group sequential design is robust against the underlying survival, accrual and censoring distributions. Group sequential designs using the Fleming-Harrington-(ρ, γ) weighted log-rank test and a new weighted log-rank test are also discussed.


Introduction
Recent advances in immunotherapies for the treatment of cancer have led to the development of a variety of treatments that show great potential for improving the long-term survival of patients. However, immunotherapy is frequently characterized by a delayed treatment effect compared with standard chemotherapy. The log-rank test (LRT) is the conventional choice for testing the superiority of survival for a treatment over a control. In settings of delayed treatment effect, where the proportional hazards assumption is violated, the LRT suffers a substantial loss of statistical power. Thus, if the trial design is based on the LRT, additional resources will be needed to maintain the study power for immunotherapy trials with delayed treatment effects. Therefore, it is critical to correctly take the delayed treatment effect into consideration when designing randomized phase III immunotherapy trials.
A variety of weighted log-rank tests have been proposed in the literature for fixed sample immunotherapy trial designs. For example, Xu et al. (2016) proposed a piecewise weighted log-rank test (PWLRT), which is a most powerful test for a threshold delayed effect model. Hasegawa (2014) proposed a weighted log-rank test (WLRT) with the Fleming-Harrington class of weights. For ethical reasons, randomized phase III cancer clinical trials are often monitored during the trial so that they can be stopped early for futility and/or efficacy. However, the literature contains only a few discussions of immunotherapy trial designs with delayed treatment effects in the group sequential setting. Zhang and Pulkstenis (2016) proposed a group sequential design based on the LRT. Their design generally requires more subjects or longer follow-up time because of the decreased efficiency of the LRT in the presence of a delayed treatment effect. Hasegawa (2016) proposed a group sequential design using the Fleming-Harrington-(ρ = 0, γ = 1) weighted LRT (FH(0,1)-WLRT), but it is not fully efficient for the delayed treatment effect model. Magirr and Jimenez (2022) proposed a group sequential design using the modestly-weighted LRT, which is similar to an average landmark analysis from the delay time until the end of follow-up; thus, it is also not an efficient test under the threshold delay model. Li et al. (2021) proposed a group sequential design using the PWLRT with early termination for efficacy. None of these group sequential designs recommends futility interim monitoring, because of the delayed treatment effect. However, Korn and Freidlin (2018) argued that there is no need to abandon this important protection for clinical trial participants, because it offers considerable savings in time and patients treated when the experimental treatment is no better than the control.
In this article, we propose a group sequential design using the PWLRT, which provides an event-driven approach based on the number of events after the delay time. Thus, it reduces the risk of a false-negative conclusion due to the delayed treatment effect. Furthermore, with an event-driven approach, the proposed group sequential design is robust against the underlying survival, accrual and censoring distributions. We also developed group sequential designs using the FH(0,1)-WLRT and a new weighted LRT (NWLRT) for the purpose of comparison and to provide alternative choices for the trial design.

Delayed treatment effect model
In this section, we introduce a threshold delayed treatment effect model. Let $\lambda_k(t)$ be the hazard function for group $k = 1, 2$, representing the control and treatment groups, respectively, and let $t_0$ be the true treatment delay time (lag time), which is an unknown parameter that needs to be estimated from the survival data of early clinical trials. Assuming a fixed time-lag effect, the hazard function of the treatment group can be described as follows:

$$\lambda_2(t) = \begin{cases} \lambda_1(t), & t \le t_0, \\ \delta\,\lambda_1(t), & t > t_0, \end{cases} \tag{1}$$

where $\delta$ is a constant hazard ratio (HR) after the lag time $t_0$. The hazard ratio over the entire study duration is given by

$$\frac{\lambda_2(t)}{\lambda_1(t)} = I(t \le t_0) + \delta\, I(t > t_0),$$

where $I(\cdot)$ is an indicator function. The survival function of the treatment group is given as follows:

$$S_2(t) = \begin{cases} S_1(t), & t \le t_0, \\ c\,[S_1(t)]^{\delta}, & t > t_0, \end{cases}$$

where $c = [S_1(t_0)]^{1-\delta}$ is a normalizing constant. We are interested in a one-sided hypothesis for testing the difference of survival distributions between the experimental treatment and control groups, or equivalently in testing the following one-sided hypothesis for the hazard ratio $\delta$ after the lag time:

$$H_0: \delta \ge 1 \quad \text{versus} \quad H_1: \delta < 1.$$

We will discuss how to use WLRTs for designing group sequential trials under the threshold delayed treatment effect model (1).
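As a minimal sketch of the threshold delay model, the following Python code evaluates the treatment-group survival function and its inverse under a Weibull control distribution $S_1(t) = e^{-\lambda t^{\kappa}}$. The function names and the choice of a Weibull control are assumptions made for illustration only; they are not taken from the authors' R code.

```python
import math

def control_survival(t, lam, kappa=1.0):
    """Weibull survival S1(t) = exp(-lam * t**kappa) for the control group."""
    return math.exp(-lam * t ** kappa)

def treatment_survival(t, lam, delta, t0, kappa=1.0):
    """Survival S2(t) under the threshold delay model:
    equal to S1(t) up to t0, and c * S1(t)**delta afterwards."""
    if t <= t0:
        return control_survival(t, lam, kappa)
    # Normalizing constant that makes S2 continuous at t0.
    c = control_survival(t0, lam, kappa) ** (1.0 - delta)
    return c * control_survival(t, lam, kappa) ** delta

def treatment_quantile(u, lam, delta, t0, kappa=1.0):
    """Inverse of S2: the time t with S2(t) = u, for 0 < u < 1."""
    s_t0 = control_survival(t0, lam, kappa)
    if u >= s_t0:
        # Before the delay the treatment group behaves like the control.
        return (-math.log(u) / lam) ** (1.0 / kappa)
    c = s_t0 ** (1.0 - delta)
    return (-math.log(u / c) / (delta * lam)) ** (1.0 / kappa)

# Illustration: exponential control with median 8 months,
# delay of 4 months and hazard ratio 0.65 after the delay.
lam = math.log(2) / 8.0
median_treatment = treatment_quantile(0.5, lam, delta=0.65, t0=4.0)
```

With these inputs the treatment median works out to $4 + 4/0.65 \approx 10.2$ months, because half of the required cumulative hazard accrues before the delay at the control rate and the other half accrues afterwards at the reduced rate.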

Sequential weighted log-rank test
To introduce the sequential WLRTs, assume that during the accrual phase of the trial a total of $n$ subjects in two groups are enrolled in the study, and let $T_i$ and $C_i$ denote, respectively, the failure time and censoring time of the $i$th subject, measured from the study entry time $Y_i$. Let $Z_i = 0/1$ be the control/treatment group indicator for the $i$th patient. We assume that the failure time $T_i$ is independent of the censoring time $C_i$ and entry time $Y_i$, and that $\{(Y_i, T_i, C_i), i = 1, \ldots, n\}$ are independent and identically distributed within each group. When the data are examined at calendar time $t \le \tau$, where $\tau$ is the total study duration, we observe the time to failure $X_i(t) = T_i \wedge C_i \wedge (t - Y_i)^{+}$ and the failure indicator $\Delta_i(t) = I\{T_i \le C_i \wedge (t - Y_i)^{+}\}$.
For the threshold delayed effect model (1), several weight functions have been proposed in the literature. Xu et al. (2016) proposed a piecewise weight function $W_n(t) = I(t > t_0)$, which gives 0/1 weights before/after the lag time $t_0$ and is an optimal weight function under the threshold delay model. Hasegawa (2016) proposed the Fleming-Harrington-(0,1) weight function $W_n(t_j) = 1 - \hat{S}(t_j-)$, where $\hat{S}(\cdot)$ is the Kaplan-Meier estimate from the pooled data of the two groups and $\{t_j\}$ are the unique failure times of the two groups. The FH(0,1)-WLRT avoids explicit dependence on the delay time $t_0$, but it is not fully efficient under the threshold delay model. We propose a new weight function, equation (2), which gives small weights before the lag time and full weights after the lag time, and which is more powerful than the FH(0,1) weight under the threshold delay model.
The variance estimate of $n^{-1/2} S_W(t)$ is given by $n^{-1} D_W(t)$. Thus, the sequential WLRT at calendar time $t$ is given by $Z_W(t) = S_W(t)/\sqrt{D_W(t)}$.

SCPRT procedure
We now apply the sequential conditional probability ratio test (SCPRT) (Xiong 1995) to the weighted log-rank score statistic $S_W(t)$ based on its Brownian motion property. Let $t^*$ be the information time, which is a function of calendar time $t$. Then, for $t > t_0$, it is well known (Lan and DeMets, 1983) that $B_{t^*} = Z_W(t)\sqrt{t^*}$ behaves asymptotically as a Brownian motion on the information time scale. Let $z_{1-\alpha}$ be the critical value of $B_1$ to reject a one-sided null hypothesis at the final stage. Then the conditional maximum likelihood ratio (Xiong et al. 2003) for the stochastic process $B_{t^*}$ on information time $t^*$ is given as follows:

$$L(t^*, B_{t^*} \mid z_{1-\alpha}) = \frac{\max_{s > z_{1-\alpha}} f(t^*, B_{t^*} \mid s)}{\max_{s \le z_{1-\alpha}} f(t^*, B_{t^*} \mid s)},$$

where $f(t^*, B_{t^*} \mid s)$ is the conditional density of $B_{t^*}$ given $B_1 = s$. Taking the logarithm, the log likelihood ratio can be simplified as

$$\log L(t^*, B_{t^*} \mid z_{1-\alpha}) = \pm \frac{(B_{t^*} - z_{1-\alpha} t^*)^2}{2 t^* (1 - t^*)},$$

which has a positive sign if $B_{t^*} > z_{1-\alpha} t^*$ and a negative sign if $B_{t^*} < z_{1-\alpha} t^*$. This equation leads to symmetric lower and upper boundaries for the WLRT at the $k$th interim look, which are given as follows:

$$a_k = z_{1-\alpha}\sqrt{t_k^*} - \sqrt{2a(1 - t_k^*)}, \qquad b_k = z_{1-\alpha}\sqrt{t_k^*} + \sqrt{2a(1 - t_k^*)}, \tag{4}$$

for $k = 1, \ldots, K$, where $K$ is the total number of looks and $a$ is the boundary coefficient. It is crucial to choose an appropriate $a$ for the design such that the probability of the conclusion of the sequential test being reversed by the test at the planned end is small, but not unnecessarily small. Specifically, let $D$ be the event that the conclusion at an interim time will be reversed at the final time, and let $\rho = \max_s P_\delta(D \mid B_1 = s)$ be the maximum conditional probability of discordance. The boundary coefficient $a$ in equation (4) is determined by choosing an appropriate $\rho$ (Xiong et al. 2003). A smaller $\rho$ results in a larger $a$, so that the upper and lower boundaries are further apart, which leads to a larger expected sample size. We recommend $\rho = 0.02$, which results in a SCPRT boundary that is efficient and also preserves the accordance (agreement) of conclusions between the test at early stopping and the test at the planned end.
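The nominal significance levels at the $k$th interim look for testing hypothesis $H_0$ are given by

$$P_{a_k} = 1 - \Phi(a_k), \qquad P_{b_k} = 1 - \Phi(b_k),$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal. We accept or reject the null hypothesis at the $k$th interim analysis if the observed p-value is greater than $P_{a_k}$ or less than $P_{b_k}$, respectively; otherwise the trial continues to the next stage. The power function of a group sequential trial with $K$ looks is given by the following probability:

$$P_K(\delta) = P_\delta(Z_1 \ge b_1) + \sum_{k=2}^{K} P_\delta(a_1 < Z_1 < b_1, \ldots, a_{k-1} < Z_{k-1} < b_{k-1},\; Z_k \ge b_k),$$

where $Z_k$ is the WLRT statistic at the $k$th look and $P_\delta(\cdot)$ is the probability under the hazard ratio $\delta$.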
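The boundary calculation can be sketched in Python as follows. The sketch assumes that, on the scale of the standardized test statistic, the SCPRT boundaries take the form $z_{1-\alpha}\sqrt{t_k^*} \mp \sqrt{2a(1-t_k^*)}$ and that the nominal levels are upper-tail normal probabilities; this assumed form reproduces the three-look values reported in the example section of this article ($a = 2.645$, balanced information times). The function names are hypothetical and only the Python standard library is used.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def scprt_boundaries(info_times, a, alpha=0.05):
    """Lower/upper SCPRT boundaries on the Z scale and their nominal
    one-sided significance levels, for each information time in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)
    looks = []
    for t in info_times:
        half_width = math.sqrt(2.0 * a * (1.0 - t))
        lower = z * math.sqrt(t) - half_width
        upper = z * math.sqrt(t) + half_width
        looks.append({
            "lower": lower,
            "upper": upper,
            "p_futility": 1.0 - _STD_NORMAL.cdf(lower),
            "p_efficacy": 1.0 - _STD_NORMAL.cdf(upper),
        })
    return looks

def interim_decision(z_stat, look):
    """Decision at one look, given the observed standardized statistic."""
    if z_stat >= look["upper"]:
        return "stop: reject H0 (efficacy)"
    if z_stat <= look["lower"]:
        return "stop: accept H0 (futility)"
    return "continue"

# Three looks at balanced information times with a = 2.645 (rho = 0.02).
looks = scprt_boundaries([1 / 3, 2 / 3, 1.0], a=2.645, alpha=0.05)
```

At the final look the two boundaries coincide at $z_{1-\alpha}$, so the procedure always reaches a decision there.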
The proposed group sequential method based on the SCPRT procedure has the following benefits compared with other group sequential procedures. It has the smallest maximum sample size, which is the same as that of the fixed sample design. The overall type I error and power of the SCPRT test are kept nearly the same as those of the fixed sample design. Specifically, for a group sequential procedure with $K$ interim analyses, let $P_0(\delta)$ and $P_K(\delta)$ be the power functions for the fixed sample test and the $K$-stage group sequential SCPRT test, respectively, and let $\rho_{\max} = \max_\delta P_\delta(D)$ be the maximum discordant probability of the group sequential SCPRT procedure. Following Theorem 4.1 in Xiong (1995), for any $\delta$ we have

$$|P_K(\delta) - P_0(\delta)| \le \rho_{\max},$$

which implies that the difference between the two power functions is at most $\rho_{\max}$. Thus, with a small maximum discordant probability $\rho_{\max}$, the power of the fixed sample design is approximately the power of the group sequential trial based on the SCPRT procedure. The recommended maximum conditional probability of discordance $\rho = 0.02$ leads to a maximum discordant probability $\rho_{\max} = 0.0054$. More details on the computation of the maximum discordant probability $\rho_{\max}$ can be found in Xiong et al. (2002).
The maximum conditional probability of discordance can be controlled through the study design, which is particularly important when regulatory agencies require that the study design control the probability of discordance for group sequential trials. The SCPRT procedure has been implemented in user-friendly software, SCPRTinfWin (Xiong 2017), which can be downloaded from http://www.stjuderesearch.org/site/depts/biostats/scprt. We present two-, three- and four-stage SCPRT group sequential boundaries in Table 1 for reference.

Information time for the weighted LRT
The time for interim analysis is determined by the information spent in the trial, that is, the information time. Let $t$ and $\tau$ be the calendar time elapsed in the study and the total study duration, respectively. According to Lan and Zucker (1993), the variance of the test statistic determines the information. Therefore, the information time of the weighted log-rank score statistic can be defined as $t^* = \mathrm{Var}\{S_W(t)\}/\mathrm{Var}\{S_W(\tau)\}$. Assuming the accrual distribution is $A(\cdot)$, it has been shown (Tsiatis 1982) that the asymptotic variance of the weighted log-rank score test is determined by the allocation ratios $\omega_1$ and $\omega_2 = 1 - \omega_1$ of the control and treatment groups, the overall density function $f(x)$ of the two groups, and $d_w(t) = n\int_0^t A(t - x)\, w^2(x) f(x)\,dx$, the expected weighted number of events in the two groups accumulated up to calendar time $t$. Thus, the information time at calendar time $t$ can be calculated as the ratio of the expected weighted number of events up to calendar time $t$ to that up to the end of the study,

$$t^* = \frac{\int_0^{t} A(t - x)\, w^2(x) f(x)\,dx}{\int_0^{\tau} A(\tau - x)\, w^2(x) f(x)\,dx}.$$

This information time can be calculated based on the underlying survival, accrual and censoring distributions, which are inevitably misspecified at the design stage. Thus, we may update the information time at each interim analysis based on the observed number of events. The variance of $S_W(t)$ up to calendar time $t$ can be estimated following Hasegawa (2016), using the unique failure times $t_j$, the observed number of failures $d_j$ at time $t_j$ from the beginning of the trial, and the index set $R_t$ of unique failure times of subjects who fail before calendar time $t$. Thus, the information time at calendar time $t$ can be estimated by

$$\hat{t}^* = \frac{\sum_{j \in R_t} W_n^2(t_j)\, d_j}{\sum_{j \in R_\tau} W_n^2(t_j)\, d_j},$$

where $W_n(t_j) = 1 - \hat{S}(t_j-)$ for the FH(0,1)-WLRT, and $W_n(t)$ is given by equation (2) for the NWLRT. Unfortunately, the total information is not observed until the end of the trial. Thus, calculation of this estimated information time still depends on the underlying survival, accrual and censoring distributions. This is a disadvantage of using the FH(0,1)-WLRT and NWLRT for group sequential trial designs. However, for the PWLRT, the information time at calendar time $t$ can be simplified as follows:

$$t^* = \frac{d_{t_0}(t)}{d_{t_0}(\tau)},$$

which is the ratio of the expected number of events after the delay time $t_0$ up to calendar time $t$ to the total expected number of events after the delay time $t_0$ (see Appendix 1 of supplemental materials). Thus, supposing there are a total of $K$ looks, the information time for the $k$th interim look can be estimated by

$$\hat{t}_k^* = \frac{d_{t_0}^{(k)}}{d_{t_0}^{(K)}},$$

where $d_{t_0}^{(k)}$ is the observed number of events in the two groups after the lag time $t_0$ and up to the $k$th interim look. The final observed information $d_{t_0}^{(K)}$ is not available at the $k$th interim look when $k < K$. Therefore, we recommend using an event-driven approach, that is, the trial will not end until we have observed the total number of events $d_{t_0}^{(K)}$ after the delay time $t_0$. Thus, we conduct the $k$th interim analysis once $d_{t_0}^{(k)}$ events have been observed after the delay time $t_0$, which gives the information time $t_k^*$ for the $k$th interim look. Therefore, the information time and the group sequential design are robust against the underlying survival, accrual and censoring distributions.
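The event-driven schedule for the PWLRT can be illustrated with a short Python sketch. It assumes that the number of post-delay events required at each look is the planned total multiplied by the planned information time and rounded up; the helper names and the toy data are hypothetical.

```python
import math

def events_after_delay(times, events, t0):
    """Count observed failures whose time on study exceeds the delay t0.

    times  -- observed times on study (failure or censoring)
    events -- 1 if the observed time is a failure, 0 if censored
    """
    return sum(1 for x, d in zip(times, events) if d == 1 and x > t0)

def look_schedule(total_events_after_delay, info_times):
    """Post-delay events required at each look (rounded up)."""
    # The small offset guards against floating-point noise in the product.
    return [math.ceil(total_events_after_delay * t - 1e-9) for t in info_times]

def information_time(observed_after_delay, total_events_after_delay):
    """Estimated information time: observed / planned post-delay events."""
    return min(observed_after_delay / total_events_after_delay, 1.0)

# Planned total of 61 post-delay events with three balanced looks,
# as in the example section of this article.
schedule = look_schedule(61, [1 / 3, 2 / 3, 1.0])

# Toy interim data set: two failures occur after a delay of 4 months.
observed = events_after_delay([2.0, 5.5, 9.1, 10.0], [1, 1, 0, 1], t0=4.0)
```

Because each look is triggered by an event count rather than a calendar date, the schedule does not require assumptions about the survival, accrual or censoring distributions.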

Sample size calculation
To preserve the overall type I error rate for a group sequential trial, the maximum sample size is usually determined by multiplying the sample size of a fixed sample test by an inflation factor for the specific group sequential design (Jennison and Turnbull 2000). The inflation factor can be computed numerically using the joint distribution of the sequence of test statistics at the interim looks. However, because the power function of a group sequential design based on the SCPRT procedure is essentially the same as that of the fixed sample test, we only need to calculate the sample size for a fixed sample test; this provides adequate power while preserving the type I error rate for the group sequential trial design using the SCPRT procedure. Therefore, we will derive a simple sample size formula for a fixed sample test under the threshold delayed treatment effect model.
Under a sequence of local alternative hypotheses (Schoenfeld 1981), $\log(\delta) = O(n^{-1/2})$, where $n$ is the total sample size of the two groups, we have shown that the WLRT statistic $Z_W$ is asymptotically normally distributed with mean $\sqrt{n}\,\mu_w/\sigma_w$ and unit variance, where $\mu_w$ and $\sigma_w^2$ are given in equations (10) and (11), respectively (the derivation is given in Appendix 2 of supplemental materials). Thus, the sample size for the fixed sample design and the SCPRT group sequential design can be calculated by the following formula:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2\, \sigma_w^2}{\mu_w^2}, \tag{9}$$

where $z_{1-\alpha}$ and $z_{1-\beta}$ are standard normal quantiles for the type I error rate $\alpha$ and type II error rate $\beta$. The quantities $\mu_w$ and $\sigma_w^2$ are defined through functions $\pi(x)$ and $V(x)$ that depend on the hazard functions $\lambda_1(x)$ and $\lambda_2(x)$, the sample size allocation ratios $\omega_1$ and $\omega_2 = 1 - \omega_1$ of the control and treatment groups, and the common censoring distribution function $G(x)$ of the two groups. Let $p_1$ and $p_2$ be the failure probabilities of the control and treatment groups and let $P = \int_0^{\infty} V(x)\,dx$ be the overall failure probability of the two groups. Then the total number of events required for the study is given by $d = nP$, and the total number of events after the delay time can be calculated by $d_{t_0} = nP_{t_0}$, where $P_{t_0} = \int_{t_0}^{\infty} V(x)\,dx$ is the overall failure probability of the two groups after the delay time $t_0$.
For the PWLRT under the threshold delay model, sample size formula (9) can be approximated as follows. The number of events after the delay time $t_0$ can be calculated by formula (12), in which $\alpha$ and $\beta$ are the type I and II error rates, respectively. It is clear that the power in (12) is driven by the number of events after the delay time $t_0$. Furthermore, from formula (12) it is easy to see that $d_{t_0}$ does not depend on the specific value of the lag time $t_0$. The total sample size for the trial is given by $n = d_{t_0}/P_{t_0}$. We have also shown numerically that the number of events after the delay time calculated by formula (9) is nearly independent of the underlying survival, accrual and censoring distributions (see Table 2 and Figure 1). Numerical results also show that formula (9) provides accurate power and sample size for the trial design even when the hazard ratio is small (which could be the case for the delayed effect model) or the randomization allocation ratio is unbalanced (see Appendix 3 of supplemental materials). Details of the derivation of both formulae (9) and (12) are given in Appendix 2 of supplemental materials. R code for the sample size calculations for the fixed sample design and group sequential design is given in Appendix 4 of supplemental materials.
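For orientation, the following Python sketch computes a post-delay event count with the standard Schoenfeld-type approximation $d_{t_0} \approx (z_{1-\alpha} + z_{1-\beta})^2 / \{\omega_1 \omega_2 (\log \delta)^2\}$. This is an assumption made for illustration and may differ from the authors' formula (12); it is not their R code. The function names are hypothetical, and the failure probability passed to `total_sample_size` is a made-up input.

```python
import math
from statistics import NormalDist

def events_after_delay_required(delta, alpha=0.05, beta=0.10, omega1=0.5):
    """Schoenfeld-type number of post-delay events (assumed approximation).

    delta  -- hazard ratio after the delay time
    alpha  -- one-sided type I error rate
    beta   -- type II error rate (power = 1 - beta)
    omega1 -- allocation ratio of the control group
    """
    z = NormalDist().inv_cdf
    omega2 = 1.0 - omega1
    numerator = (z(1.0 - alpha) + z(1.0 - beta)) ** 2
    return numerator / (omega1 * omega2 * math.log(delta) ** 2)

def total_sample_size(d_t0, p_t0):
    """Total sample size n = d_t0 / P_t0, rounded up."""
    return math.ceil(d_t0 / p_t0)

d_065 = math.ceil(events_after_delay_required(0.65))
d_0465 = math.ceil(events_after_delay_required(0.465))
n_hypothetical = total_sample_size(61, 0.25)  # 0.25 is an illustrative P_t0
```

Under this assumed approximation the count depends only on the hazard ratio, the error rates and the allocation ratio, not on the lag time. For $\delta = 0.465$ it gives 59 events, close to but not identical with the 61 events that the article reports from formula (9) in its POPLAR example.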

Simulation
To evaluate the accuracy of the proposed sample size formula (9), sample sizes are calculated under the Weibull threshold delay effect model with the following parameter settings. The Weibull distribution for the control group is $S_1(t) = e^{-\lambda_1 t^{\kappa}}$ with shape parameter $\kappa = 1$ and 1.5 and median survival time of the control $m_1 = 8$ (months), that is, hazard parameter $\lambda_1 = \log(2)/m_1^{\kappa}$. We consider a delay time $t_0 = 4$ (months); the hazard ratio after the delay time $t_0$ is set to $\delta = 0.6, 0.65, 0.7$; we use uniform accrual with accrual duration $t_a = 8, 15$ and follow-up time $t_f = 8, 13, 18$ (months); and the sample size allocation ratio is set to $\omega_1 = 1/2$ (equal allocation). Sample sizes are calculated with a one-sided type I error rate of 5% and power of 90%. Empirical type I errors and powers are estimated by performing 10,000 simulation runs.
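One simulated trial under these settings could be generated along the following lines. This is a minimal sketch using only the Python standard library, not the authors' simulation code: group labels alternate between arms as a simple stand-in for 1:1 randomization, censoring is purely administrative at the end of the study, and the function name and default values are assumptions.

```python
import math
import random

def simulate_trial(n, m1=8.0, kappa=1.0, delta=0.65, t0=4.0,
                   ta=15.0, tf=18.0, seed=None):
    """Simulate one trial under the Weibull threshold delay model.

    Returns a list of (group, entry_time, observed_time, event) tuples,
    with group 0 = control and 1 = treatment.
    """
    rng = random.Random(seed)
    lam = math.log(2) / m1 ** kappa      # control hazard parameter
    tau = ta + tf                        # total study duration
    h0 = lam * t0 ** kappa               # cumulative hazard at the delay time
    rows = []
    for i in range(n):
        group = i % 2                    # alternate arms (equal allocation)
        entry = rng.uniform(0.0, ta)     # uniform accrual
        # Unit-exponential draw = cumulative hazard at the failure time.
        cum_hazard = -math.log(1.0 - rng.random())
        if group == 0 or cum_hazard <= h0:
            scaled = cum_hazard          # control hazard applies throughout
        else:
            # After the delay the hazard is multiplied by delta.
            scaled = h0 + (cum_hazard - h0) / delta
        failure = (scaled / lam) ** (1.0 / kappa)
        follow_up = tau - entry          # administrative censoring time
        rows.append((group, entry, min(failure, follow_up),
                     int(failure <= follow_up)))
    return rows

trial = simulate_trial(200, seed=1)
```

Repeating such a simulation many times and applying the chosen weighted log-rank test to each data set is one way to obtain empirical type I error rates (with `delta=1.0`) and empirical powers.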
The simulation results (Table 2) for the fixed sample design are summarized as follows. The proposed formula (9) provides accurate sample size estimation, and all three WLRTs preserve the type I error rate and power. The PWLRT is more efficient than the FH(0,1)-WLRT and NWLRT because the PWLRT is a most powerful test for the threshold delay model. For the PWLRT, the number of events after the delay time is nearly constant and does not depend on the lag time; thus, the study power is determined by the number of events after the delay time rather than by the sample size or the total number of events. Therefore, with an event-driven approach (number of events after the delay time) using the PWLRT, the study design is robust against the survival, accrual and censoring distributions. For the FH(0,1)-WLRT and NWLRT, in contrast, the sample size, rather than the number of events after the delay time or the total number of events, determines the study power. Therefore, neither test is robust against the survival, accrual or censoring distributions. The NWLRT is slightly less efficient than the PWLRT but more efficient than the FH(0,1)-WLRT.

Figure 1. Relationship between number of events after delay time, total number of events, sample size (number of patients) and study duration with different delay time, accrual duration and follow-up time using the PWLRT. The solid line is the sample size, the dashed line is the total number of events, and the dotted line is the number of events after the delay time. The figures show the data within the interval [5, 25] (months) only.

Table 3. Two-stage and three-stage group sequential designs using the piecewise weighted log-rank test (PWLRT). Exponential threshold delay model with lag time $t_0 = 4$ and hazard ratio after lag time $\delta = 0.65$, uniform accrual with accrual duration $t_a = 30$ and follow-up $t_f = 5$ (months). Median survival time of the control 8 months, balanced information time for $K$-stage interim looks with $K = 2, 3$ and maximum conditional probability of discordance $\rho = 0.02$, one-sided type I error rate $\alpha = 0.05$ and power of 90%. The empirical type I error ($\alpha$), empirical power (EP), expected sample size, number of events and study duration are estimated based on 10,000 simulation runs (using the event-driven approach).

Table 4. Two-stage and three-stage group sequential designs using the Fleming-Harrington-(0,1) weighted log-rank test (FH(0,1)-WLRT). Exponential threshold delay model with lag time $t_0 = 4$ and hazard ratio after lag time $\delta = 0.65$, uniform accrual with accrual duration $t_a = 30$ and follow-up $t_f = 5$ (months). Median survival time of the control 8 months, balanced information time for $K$-stage interim looks with $K = 2, 3$ and maximum conditional probability of discordance $\rho = 0.02$, one-sided type I error rate $\alpha = 0.05$ and power of 90%. The empirical type I error ($\alpha$), empirical power (EP), expected sample size, number of events and study duration are estimated based on 10,000 simulation runs (using the calendar time approach).
We also conducted simulations to study the operating characteristics of the proposed group sequential designs for the PWLRT, FH(0,1)-WLRT and NWLRT under the exponential delay model. The results are presented in Tables 3-5. First, the sample size calculated from the fixed sample design preserves the type I error rate and provides adequate power for the group sequential designs for all three WLRTs. Second, the group sequential designs reduce the expected sample size, total number of events and study duration compared with the fixed sample designs under both the null and alternative hypotheses, and three-stage designs save only slightly more sample size than two-stage designs. Third, with the SCPRT design, the discordant probability is controlled at the design stage through the maximum conditional probability of discordance $\rho$.
Finally, we conducted simulations to study robustness against a misspecified delay time for the proposed fixed sample and group sequential designs. Table 6 gives the simulation results for single-stage and two-stage designs. We specify the delay time in the design phase as $t_m = 4$. If the treatment effect starts later than the specified time, that is, the true delay time $t_0$ is longer than the specified delay time, then the power of all three tests is reduced by using $t_m$ as the delay time for data analysis. The longer the true delay time, the greater the power loss. On the other hand, if the treatment effect starts earlier than expected, that is, the true delay time $t_0$ is shorter than the specified delay time, the type I error and power of the PWLRT with weight $I(t > t_m)$ are preserved with an event-driven approach, even in the case of no delay at all. However, the powers of the FH(0,1)-WLRT and NWLRT are increased, and the more the specified delay time exceeds the true delay time, the greater the power, resulting in an overpowered study that could yield a significant result for an uninteresting treatment effect.
Due to limited space, simulation results for the fixed sample test and group sequential designs under the Weibull distribution (with $\kappa = 1.5$) with balanced and unbalanced randomization allocation, censoring due to loss to follow-up, and unbalanced or unequally spaced information times are given in Appendix 3 of supplemental materials.

Table 5. Two-stage and three-stage group sequential designs using the new weighted log-rank test (NWLRT). Exponential threshold delay model with lag time $t_0 = 4$ and hazard ratio after lag time $\delta = 0.65$, uniform accrual with accrual duration $t_a = 30$ and follow-up $t_f = 5$ (months). Median survival time of the control 8 months, balanced information time for $K$-stage interim looks with $K = 2, 3$ and maximum conditional probability of discordance $\rho = 0.02$, one-sided type I error rate $\alpha = 0.05$ and power of 90%. The empirical type I error ($\alpha$), empirical power (EP), expected sample size, number of events and study duration are estimated based on 10,000 simulation runs (using the calendar time approach).

Example
We will use the POPLAR trial (Gandara et al. 2018) as an example to illustrate the proposed group sequential design. POPLAR is an open-label phase II randomized controlled trial of atezolizumab versus docetaxel (control) for patients with previously treated non-small cell lung cancer. The deidentified data set of the POPLAR trial is publicly available. The original trial was designed assuming a median overall survival (OS) of 8 months for the control arm and a hazard ratio of 0.65, which translates into a median OS of approximately 12.3 months for the atezolizumab arm under an exponential model. Recruitment lasted 8 months and patients were followed for up to 13 months. Three interim analyses were planned, with two-sided $\alpha$ levels of 0.0001, 0.0001, and 0.001. The final analysis of OS was performed when 173 deaths had occurred in the intent-to-treat population, using a two-sided significance level of 4.88%. The trial enrolled a total of 287 patients (Magirr and Jimenez 2022). Kaplan-Meier survival curves of the treatment and control groups are shown in Figure 2. The curves display a typical late separation pattern often seen with immunotherapy agents. We will illustrate how the trial might be designed more efficiently using a group sequential approach that takes a delayed treatment effect into account. Using a one-change-point Cox regression model (Li et al. 2019), we estimated the threshold lag time for the POPLAR trial as $t_0 = 8.6$ months and the hazard ratio after the lag time as $\delta = 0.465$. We further fit the Kaplan-Meier OS curve for the control group using the Weibull distribution $S_1(t) = e^{-\lambda t^{\kappa}}$, where the estimated shape parameter is $\kappa = 1.2$ and the estimated hazard parameter is $\lambda = 0.0439$. The estimated Weibull distribution fits the Kaplan-Meier curve very well (see Figure 2). Thus, to illustrate the group sequential trial design with consideration of the delayed treatment effect, we assume that the OS time for patients receiving the control follows the Weibull distribution, whereas the OS time for patients receiving atezolizumab follows a piecewise Weibull distribution with a lag time $t_0 = 8.6$ months as follows:

$$S_2(t) = \begin{cases} e^{-\lambda t^{\kappa}}, & t \le t_0, \\ c\, e^{-\delta \lambda t^{\kappa}}, & t > t_0, \end{cases}$$

where $c = e^{-\lambda t_0^{\kappa}(1-\delta)}$ is a normalizing constant, $\kappa = 1.2$ and $\delta = 0.465$. For the trial design, we further assume that the accrual and follow-up durations are the same as in the original trial, that is, uniform accrual with an accrual period of $t_a = 8$ months, follow-up time $t_f = 13$ months and study duration $\tau = t_a + t_f = 21$ months. Using the PWLRT approach, to achieve 90% power with a one-sided type I error of 5%, the total number of events and sample size are $d = 181$ and $n = 272$ (accrual rate $r = 272/8 = 34$ per month), respectively. The number of events required after the lag time $t_0$ is $d_{t_0} = 61$. Suppose we would like to design the trial with three analyses (including the final analysis) at balanced information times $t^* = (1/3, 2/3, 1)$. Using the SCPRT procedure with $\rho = 0.02$ and $K = 3$, the calculated boundary coefficient is $a = 2.645$ (see Table 1 in Xiong et al. 2003); thus the lower and upper boundaries are $(a_1, a_2, a_3) = (-0.928, 0.015, 1.645)$ and $(b_1, b_2, b_3) = (2.828, 2.671, 1.645)$, and the corresponding significance levels are $(P_{a_1}, P_{a_2}, P_{a_3}) = (0.823, 0.494, 0.05)$ for futility and $(P_{b_1}, P_{b_2}, P_{b_3}) = (0.002, 0.004, 0.05)$ for efficacy (Table 1). For the PWLRT, using an event-driven approach, the numbers of events required after the lag time $t_0$ are 21, 41 and 61 at the first, second and final analysis, respectively; the corresponding calendar times for the three analyses are 14.75, 17.67 and 21 months. The simulated empirical one-sided type I error rate and power for this 3-stage group sequential design are 4.97% and 90.3%, respectively; the cumulative stopping probabilities for the three stages are (0.180, 0.526, 1) under the null hypothesis and (0.118, 0.402, 1) under the alternative hypothesis.

Table 6. Simulations to study the sensitivity of the study power to misspecifying the delay time as $t_m = 4$ (months) when the true delay time is $t_0 = 0, 2, 3, 4, 5, 6$ (months). Simulations are conducted under the exponential threshold delay model for a fixed sample test design and a two-stage group sequential design with median survival time 8 (months) for the control group, hazard ratio after delay time $\delta = 0.65$, uniform accrual with accrual duration $t_a = 15$ and follow-up time $t_f = 18$, with nominal one-sided type I error rate 5% and power of 90%. Sample size ($n$) is calculated under the delay time $t_m = 4$. The empirical type I error rate ($\alpha$) and empirical power (EP) are calculated based on 10,000 simulated trials.
With the same setup, the total sample sizes for the FH(0,1)-WLRT and NWLRT are $n = 388$ and $n = 307$ (accrual rates $r = 48.5$ and $r = 38.4$ per month, respectively), corresponding to total numbers of events $d = 258$ and $d = 205$, respectively. Because the powers of the FH(0,1)-WLRT and NWLRT are determined by the sample size rather than by the number of events after the lag time, group sequential trials using these tests are designed with a calendar time approach. For example, using formula (6), the calendar times of the NWLRT with balanced information times for 3 looks are 14.14, 17.44 and 21 months at the first, second and final analysis, respectively. The simulated empirical one-sided type I error rate and power for this 3-stage group sequential design are 5.2% and 89.2%, respectively; the cumulative stopping probabilities for the three stages are (0.176, 0.523, 1) under the null hypothesis and (0.074, 0.354, 1) under the alternative hypothesis. The procedure for the FH(0,1)-WLRT is similar.

Discussion
In a study with a delayed treatment effect, if the interim looks or futility analyses are conducted too early in the trial, the treatment effect may not yet be visible, and the trial could be stopped incorrectly with a negative conclusion. Therefore, in this article we proposed a group sequential design using a piecewise weighted log-rank test (PWLRT), which provides an event-driven approach based on the number of events after the delay time; that is, the interim looks will not be conducted until the planned number of events has been observed after the delay time. Thus, it can reduce the risk of incorrectly accepting the null hypothesis at an interim analysis (a false-negative conclusion) caused by the delay of the treatment effect. Another advantage of the event-driven approach is that the proposed group sequential design is robust against the underlying survival, accrual and censoring distributions. Although the PWLRT does not use the events before the delay time for testing treatment efficacy, in practical applications additional analyses that include events prior to the delay time need to be performed. It is particularly important to present the Kaplan-Meier survival curves with all data to confirm the delayed treatment effect and the full treatment effect after the delay time.
Group sequential designs using the Fleming-Harrington-(0,1) weighted log-rank test and a new weighted log-rank test are also discussed. However, because the powers of the FH(0,1)-WLRT and NWLRT are determined by the sample size rather than by the number of events after the delay time, the event-driven approach does not apply to the FH(0,1)-WLRT and NWLRT. If investigators are uncomfortable using the PWLRT, which gives zero weight to events that occur before the delay time, we recommend using the NWLRT, which is more efficient than the FH(0,1)-WLRT and does not lose much efficiency compared with the most powerful PWLRT.
$t_f$ under the exponential delayed treatment effect model with median survival time of the control group $m_1 = 8$ (months), one-sided type I error rate 5% and power of 90%. The corresponding empirical type I error rates ($\alpha$) and empirical powers (EP) are estimated with 10,000 simulation runs.

Figure 2. Kaplan-Meier and fitted Weibull curves for the POPLAR trial. The step functions are the Kaplan-Meier overall survival curves of the control and treatment groups. The dotted lines are the overall survival curves for the fitted Weibull threshold delay model of the control and treatment groups.
Based on the observed data $\{X_i(t), \Delta_i(t), Z_i;\ i = 1, \ldots, n\}$, let $N_i(t; x) = \Delta_i(t)\, I\{X_i(t) \le x\}$ and $Y_i(t; x) = I\{X_i(t) \ge x\}$. The sequential weighted log-rank score statistic at calendar time $t$ is denoted by $S_W(t)$, in which $W_n(x)$ is a bounded weight function that converges in probability to a deterministic function $w(x)$ that is independent of calendar time $t$.
W(τ) is the information time. Let $B_1$ be $B_{t^*}$ at the final stage with full information, that is, $t^* = 1$. Then the joint distribution of $(B_{t^*}, B_1)$ is bivariate normal with mean $\mu = (\theta t^*, \theta)$ and variance matrix $\Sigma = (\sigma_{ij})_{2 \times 2}$ with $\sigma_{11} = \sigma_{12} = \sigma_{21} = t^*$ and $\sigma_{22} = 1$. Therefore, according to multivariate conditional distribution theory, the conditional density

Table 1. The lower and upper boundaries $a_k$ and $b_k$ and significance levels $P_{a_k}$ and $P_{b_k}$ for the $k$th interim analysis based on the SCPRT procedure with various numbers of stages $K$, overall significance (sig.) level $\alpha$ and maximum conditional probability of discordance $\rho = 0.02$.

Table 2. Sample sizes, number of events after lag time $t_0 = 4$ (months) and total number of events calculated by equation (9) for various hazard ratios $\delta$, accrual durations $t_a$ and follow-up times