Group sequential multi-arm multi-stage survival trial design with treatment selection

ABSTRACT Multi-arm trials are increasingly of interest because, for many diseases, multiple experimental treatments are available for efficacy testing. Several novel multi-arm multi-stage (MAMS) clinical trial designs have been proposed. However, a major hurdle to adopting group sequential MAMS designs routinely is the computational effort of obtaining stopping boundaries. For example, the method of Jaki and Magirr for time-to-event endpoints, implemented in the R package MAMS, requires substantial computation to obtain stopping boundaries. In this study, we develop a group sequential MAMS survival trial design based on the sequential conditional probability ratio test. The proposed method improves on Jaki and Magirr's method in three directions. First, it provides explicit solutions for both futility and efficacy boundaries for an arbitrary number of stages and arms, and thus avoids complicated computation in the trial design. Second, it provides an accurate number of events for the fixed sample and group sequential designs. Third, it uses a new procedure for interim analysis that preserves the study power.


Introduction
When multiple agents are available, traditional randomized controlled trials compare each experimental arm with a control separately. A multi-arm trial allows simultaneous comparison of multiple experimental arms with a common control and provides a substantial efficiency advantage (Burnett et al. 2020; Parmar et al. 2014). Many novel multi-arm trial designs have been proposed, including the group sequential multi-arm multi-stage (MAMS) design (Jaki and Magirr 2013; Magirr et al. 2012; Stallard and Friede 2008), the flexible group sequential MAMS design (Magirr et al. 2014), and the optimal MAMS design (Wason and Jaki 2012). In this article, we adapt the approach used by Magirr et al. (2012) and Jaki and Magirr (2013). In such trials, multiple arms are monitored in a group sequential fashion as follows: a) all arms (including the control) open to enrollment simultaneously, and each treatment arm is compared with the common control; b) if a treatment arm meets a futility criterion at an interim stage, it is dropped from the trial, and the future subjects of the dropped arm are not re-allocated across the remaining arms; c) if a treatment arm meets an efficacy criterion at an interim stage, or all treatment arms are dropped for futility, the whole trial stops; otherwise, the remaining arms continue to the next stage. Ghosh et al. (2017) pointed out that a major hurdle to adopting group sequential MAMS designs routinely is the computational effort of obtaining stopping boundaries that guarantee strong control of the familywise type I error rate (FWER). One example is the method proposed by Magirr et al. (2012), which is implemented in the R package MAMS and is hereafter referred to as MJW. The computational complexity of the MJW algorithm increases exponentially with the number of arms as well as with the number of stages J.
Jaki and Magirr (2013) extended the MJW method for continuous outcomes to time-to-event endpoints. However, in a multi-arm survival trial, dropping arm(s) for futility can change the predefined information times for interim analysis. How to conduct an interim analysis for multi-arm survival trials has not been carefully discussed in the literature. Jaki and Magirr (2013) recommended that interim analyses be conducted after a pre-specified number of events has been observed in the control group. However, their simulation results showed that the empirical power could fall below the nominal level by more than 5%. Hence, to overcome the problems of computational complexity and interim analyses, we have developed new group sequential MAMS survival trial designs based on the sequential conditional probability ratio test (SCPRT) procedure (Xiong 1995; Xiong et al. 2003). The proposed method improves on the method of Jaki and Magirr in three directions. First, it provides explicit solutions for both futility and efficacy boundaries for an arbitrary number of stages and arms, and hence avoids the complicated computation of stopping boundaries required by the MJW method. Second, it uses the combined number of events of the control and a treatment arm to time each interim analysis, which preserves the nominal power. Third, it uses an accurate asymptotic normal approximation for the log-rank test and provides a more accurate sample size and number of events for the trial design.

Joint distribution of log-rank tests
We consider a multi-arm trial that compares $K$ treatment arms to a common control arm and label the arms $k = 0, 1, \ldots, K$, where 0 represents the control. Let $\lambda_k(x)$ and $S_k(x)$ be the hazard and survival functions of arm $k$, respectively, and assume proportional hazards models between each treatment arm and the control, $S_k(x) = [S_0(x)]^{\delta_k}$ or $\lambda_k(x) = \delta_k \lambda_0(x)$, $k = 1, \ldots, K$, where $\delta_k$ is the hazard ratio of the $k$th treatment arm to the control arm. Assume that a total of $n$ patients are randomized to the $K + 1$ arms with allocation ratio $\omega_k = n_k/n$ for the $k$th arm ($k = 0, 1, \ldots, K$), where $n_k$ is the sample size of the $k$th arm and $n = \sum_{k=0}^{K} n_k$ is the total sample size. Let $Z_k$ be the standardized two-sample log-rank test for comparing the $k$th treatment arm to the control arm. It can be shown (Appendix 1 of Supplemental Material) that the asymptotic joint distribution of $(Z_1, \ldots, Z_K)'$ is multivariate normal with mean $\sqrt{n}\mu$, where $\mu = (\mu_1, \ldots, \mu_K)'$, and variance-covariance matrix $\Sigma$, with $\mu$ and $\Sigma$ given by Equations (1) and (2). The function $G(x)$ in these expressions is the censoring distribution, which is determined by the accrual distribution and the loss to follow-up distribution. With an equal allocation ratio, it is easy to verify that under the global null hypothesis $\Sigma$ reduces to the form given in Equation (3).

Familywise error rate
For a multi-arm trial, to control the overall type I error due to multiple comparisons, it is common to consider the FWER, which is the probability of rejecting at least one true null hypothesis across the set of null hypotheses $H_k^{(0)}$, $k = 1, \ldots, K$. Strong control of the FWER at level $\alpha$ means that the FWER does not exceed $\alpha$ for all possible values of $\delta_k$, $k = 1, \ldots, K$. Magirr et al. (2012) have shown that the probability of rejecting at least one true null hypothesis is maximized under the global null $H_G: \delta_1 = \cdots = \delta_K = 1$ for the simultaneous efficacy stopping design (stopping one arm for efficacy terminates the whole trial). Thus, controlling the FWER under the global null hypothesis provides strong control of the FWER. Therefore, we use the Dunnett correction (Dunnett 1955) and choose a critical value $c_\alpha$ that satisfies the following equation to provide strong control of the one-sided FWER at level $\alpha$:
$$\int_{-\infty}^{c_\alpha} \cdots \int_{-\infty}^{c_\alpha} \phi(x_1, \ldots, x_K; \Sigma) \, dx_1 \cdots dx_K = 1 - \alpha,$$
where $\phi(x_1, \ldots, x_K; \Sigma)$ is the multivariate normal density function with mean 0 and variance-covariance matrix $\Sigma$ given by Equation (3). The critical value can be obtained by numerical integration, for example with the method of Genz and Bretz (2009), which is implemented in R. For example, for a one-sided FWER $\alpha = 5\%$ and $K = 3$, the critical value is $c_\alpha = 2.062$.
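As a rough illustration of how the Dunnett critical value can be obtained numerically, the sketch below solves for $c_\alpha$ in Python. It is not the authors' R implementation. It assumes equal allocation under the global null, so that every pair of log-rank statistics has correlation 1/2 (the standard Dunnett structure; Equation (3) itself is not reproduced here), and it replaces the Genz and Bretz algorithm with a one-dimensional Simpson integration. The function names are hypothetical.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_all_below(c, K, lo=-8.0, hi=8.0, steps=2000):
    """P(Z_1 <= c, ..., Z_K <= c), ASSUMING corr(Z_k, Z_l) = 1/2 for k != l."""
    # Write Z_k = (Y_k - Y_0) / sqrt(2) with independent standard normals.
    # Given Y_0 = y, the K events {Y_k <= sqrt(2) * c + y} are independent.
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)  # Simpson weights
        total += weight * norm_pdf(y) * norm_cdf(math.sqrt(2.0) * c + y) ** K
    return total * h / 3.0

def dunnett_critical_value(alpha, K):
    """Solve 1 - P(all Z_k <= c) = alpha for c by bisection."""
    lo, hi = 0.0, 6.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 1.0 - prob_all_below(mid, K) > alpha:
            lo = mid   # FWER still too large, so move the critical value up
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for K in (1, 3, 4):
        print(K, round(dunnett_critical_value(0.05, K), 3))
```

Under these assumptions the result for $K = 3$ should agree closely with the value $c_\alpha = 2.062$ quoted above, and $K = 1$ recovers the usual one-sided normal quantile.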

Power under the least favorable configuration
The power of a multi-arm trial with treatment selection can be defined as the probability that, without loss of generality, $H_1^{(0)}$ is rejected and treatment 1 is recommended under the alternative hypothesis $H_a: \delta_1 = \delta^{(1)}, \delta_2 = \cdots = \delta_K = \delta^{(0)}$, where $\delta^{(1)}$ represents a clinically relevant improvement and $\delta^{(0)}$ ($> \delta^{(1)}$) is an uninteresting treatment effect such that if $\delta_k \ge \delta^{(0)}$, we would prefer not to investigate treatment $k$ further. This is known as the least favorable configuration (LFC) (Dunnett 1984). Let $n$ be the total sample size and $c_\alpha$ be the critical value for rejecting the null hypothesis based on the Dunnett correction. Assume an accrual distribution $A(x)$ with accrual duration $t_a$, a follow-up time $t_f$, total study duration $\tau = t_a + t_f$, a loss to follow-up distribution $G_L(x)$, and censoring distribution $G(x) = A(\tau - x) G_L(x)$. Then the power $1 - \beta$ under the LFC, or the total sample size $n$, can be calculated from the following equation:
$$1 - \beta = \int_{c_\alpha}^{\infty} \int_{0}^{\infty} \cdots \int_{0}^{\infty} \phi(x_1, \ldots, x_K; \sqrt{n} A\mu, A\Sigma A') \, dx_K \cdots dx_1,$$
where $\phi(x_1, \ldots, x_K; \sqrt{n} A\mu, A\Sigma A')$ is the multivariate normal density function with mean $\sqrt{n} A\mu$ and variance-covariance matrix $A\Sigma A'$ under the LFC, $\mu = (\mu_1, \ldots, \mu_K)'$ and $\Sigma$ are given by Equations (1) and (2), and $A$ is a contrast matrix. For example, if $K = 4$, the contrast matrix $A$ is given by
$$A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \\ 1 & 0 & 0 & -1 \end{pmatrix}.$$
The total number of events can be calculated as $d = np$, where $p = \sum_{k=0}^{K} \omega_k p_k$ is the overall failure probability among the $K + 1$ arms, and $p_k$, the failure probability of arm $k$ ($k = 0, 1, \ldots, K$), is calculated as
$$p_k = \int_{0}^{\infty} f_k(x) G(x) \, dx,$$
where $f_k(x) = \lambda_k(x) S_k(x)$ is the density function of the $k$th arm, and $G(x) = A(\tau - x) G_L(x)$ combines the administrative censoring distribution and the loss to follow-up distribution.
Even though the total sample size depends on the underlying survival, accrual, and censoring distributions, the total number of events is very robust against these distributions (see Table 3 and the tables in Appendix 2 of the Supplemental Material).
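To make the event calculation $d = np$ concrete, the following Python sketch evaluates the failure probability $p_k = \int f_k(x) G(x)\,dx$ numerically. It is only a sketch under stated assumptions: exponential survival in every arm, uniform accrual on $[0, t_a]$ (so that $A(\tau - x)$ is a clipped linear function), and an exponential loss to follow-up distribution. The function names and the example design values (median 20 months, $t_a = 40$, $t_f = 20$, log hazard ratio 0.4) are chosen for illustration.

```python
import math

def simpson(func, a, b, steps=6000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def failure_probability(hazard, t_a, t_f, loss_rate=0.0):
    """p_k for an exponential arm, ASSUMING uniform accrual on [0, t_a]."""
    tau = t_a + t_f

    def integrand(x):
        density = hazard * math.exp(-hazard * x)              # f_k(x) = lambda_k * S_k(x)
        accrual = min(max((tau - x) / t_a, 0.0), 1.0)         # A(tau - x)
        loss = math.exp(-loss_rate * x)                       # G_L(x)
        return density * accrual * loss

    return simpson(integrand, 0.0, tau)

def overall_failure_probability(hazards, weights, t_a, t_f, loss_rate=0.0):
    """p = sum_k omega_k * p_k over the K + 1 arms."""
    return sum(w * failure_probability(h, t_a, t_f, loss_rate)
               for h, w in zip(hazards, weights))

if __name__ == "__main__":
    lam0 = math.log(2.0) / 20.0                    # control hazard for a 20-month median
    hazards = [lam0, lam0 * math.exp(-0.4)] + [lam0] * 3   # LFC with K = 4
    weights = [1.0 / 5.0] * 5                      # equal allocation
    p = overall_failure_probability(hazards, weights, t_a=40.0, t_f=20.0)
    print("overall failure probability:", round(p, 4))
    print("events for a hypothetical n = 1000:", round(1000 * p))
```

Without loss to follow-up the exponential case has the closed form $p_k = 1 - (e^{-\lambda_k t_f} - e^{-\lambda_k \tau})/(\lambda_k t_a)$, which offers a convenient check on the numerical integration.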

Group sequential MAMS design
We now consider a $J$-stage group sequential MAMS trial with $K + 1$ arms: $K$ treatment arms and a common control arm. At each interim analysis, the log-rank tests comparing each treatment arm to the control are calculated. If a test statistic is below the futility boundary, the treatment arm is dropped for futility. If a test statistic is above the efficacy boundary, the whole trial is stopped; this is referred to as the simultaneous efficacy stopping design adopted by MJW.

SCPRT boundaries
We now apply the SCPRT (Xiong 1995) to the log-rank score statistic $W(t)$ based on its Brownian motion property. Let $t^*$ be the information time, which is a function of calendar time $t$. Then, under the proportional hazards model $S(t) = [S_0(t)]^{\delta}$, it is well known that the sequential log-rank score $B_{t^*} = W(t)/\sqrt{D(\tau)}$ follows a Brownian motion $N(\theta t^*, t^*)$ with drift parameter $\theta$. Therefore, according to multivariate conditional distribution theory, the conditional density $f(B_{t^*} \mid B_1)$ is the normal density of $N(B_1 t^*, (1 - t^*) t^*)$, which is independent of the drift parameter $\theta$. Let $c_\alpha$ be the critical value of $B_1$ for rejecting the null at the final analysis. Then the conditional maximum likelihood ratio (Xiong 1995) for the stochastic process at information time $t^*$ is
$$L(t^*, B_{t^*} \mid c_\alpha) = \frac{\max_{s > c_\alpha} f(B_{t^*} \mid B_1 = s)}{\max_{s \le c_\alpha} f(B_{t^*} \mid B_1 = s)}.$$
Taking the logarithm, the log likelihood ratio simplifies to
$$\log L(t^*, B_{t^*} \mid c_\alpha) = \pm \frac{(B_{t^*} - c_\alpha t^*)^2}{2 t^* (1 - t^*)},$$
which has a positive sign if $B_{t^*} > c_\alpha t^*$ and a negative sign if $B_{t^*} < c_\alpha t^*$. Setting the absolute value of the log likelihood ratio equal to a constant $a$ leads to lower and upper boundaries for $B_{t^*}$:
$$c_\alpha t^* - \sqrt{2 a t^* (1 - t^*)} \quad \text{and} \quad c_\alpha t^* + \sqrt{2 a t^* (1 - t^*)}. \qquad (8)$$
The $a$ in Equation (8) is the boundary coefficient. It is crucial to choose an appropriate $a$ for the design so that the probability that the conclusion of the sequential test at an interim analysis is reversed by the test at the planned final analysis is small, but not unnecessarily small. An appropriate $a$ can be determined by choosing an appropriate maximum conditional probability of discordance $\rho$. Specifically, let $D$ be the event that the conclusion at an interim time is reversed at the final stage, and let $\theta$ be the drift parameter of the Brownian motion $B_t$. Let $\rho_s = P_\theta(D \mid B_1 = s)$ be the conditional probability of discordance given the final stage observation $B_1 = s$, and let $\rho = \max_s \rho_s$ be the maximum conditional probability of discordance. The boundary coefficient $a$ in Equation (8) is determined by the choice of $\rho$. A smaller $\rho$ results in a larger $a$, so that the upper and lower boundaries are further apart, which leads to a larger expected sample size. We recommend $\rho = 0.02$, which results in an SCPRT boundary that is efficient and also preserves the accordance (agreement) between the conclusion of the test at early stopping and that of the test at the planned end. For balanced information times, the boundary coefficient $a$ for $\rho = 0.02$ is given in Table 1 for $J = 2, \ldots, 10$. For unbalanced information times, $a$ can be calculated using the methods for calculating probabilities of multiple and ordered hittings developed by Xiong [15].
Table 1. SCPRT boundary coefficient $a$ for $J$ interim analyses.
For group sequential interim monitoring, we define $Z_{k,j}$ to be the log-rank test that uses all data up to the $j$th interim look for comparing the $k$th treatment arm to the control, and $t_j^*$ to be the information time at the $j$th interim look, where $j = 1, \ldots, J$ (including the final analysis). The log-rank test $Z_{k,j}$ has the form $Z_{k,j} = B_{t_j^*}/\sqrt{t_j^*}$, where $t_j^*$ is the information time. Thus, based on the SCPRT procedure, the group sequential futility and efficacy boundaries for the standardized sequential log-rank tests $Z_{k,j}$ are
$$l_j = c_\alpha \sqrt{t_j^*} - \sqrt{2 a (1 - t_j^*)}, \qquad u_j = c_\alpha \sqrt{t_j^*} + \sqrt{2 a (1 - t_j^*)},$$
where $c_\alpha$ is the critical value at the final stage with one-sided FWER $\alpha$, and $a$ is the boundary coefficient determined by the maximum conditional probability of discordance $\rho$, that is, the probability that the conclusion obtained by the sequential test at an interim look is reversed by the test at the end of the study. If the log-rank test satisfies $Z_{k,j} < l_j$, the $k$th treatment arm is dropped; if $Z_{k,j} > u_j$, the study is terminated. Otherwise, the $k$th treatment arm goes to the next stage. At the final analysis (with $t_J^* = 1$ and $l_J = u_J = c_\alpha$), if $Z_{k,J} > c_\alpha$, the $k$th treatment arm is declared active against the control. The nominal significance levels at the $j$th interim look for testing hypothesis $H_k^{(0)}$ are $1 - \Phi(l_j)$ for the futility boundary and $1 - \Phi(u_j)$ for the efficacy boundary, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal.
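Because the SCPRT boundaries are explicit, they can be tabulated in a few lines. The Python sketch below assumes the standard SCPRT form on the standardized scale, $l_j, u_j = c_\alpha \sqrt{t_j^*} \mp \sqrt{2a(1 - t_j^*)}$, and one-sided nominal levels $1 - \Phi(\cdot)$. The boundary coefficient `a = 3.0` is an arbitrary placeholder, not a value from Table 1 (which is not reproduced in this text), and the function names are hypothetical.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scprt_boundaries(c_alpha, a, info_times):
    """Futility/efficacy boundaries for Z at each information time.

    Assumed form: c_alpha * sqrt(t) -/+ sqrt(2 * a * (1 - t)).
    """
    rows = []
    for t in info_times:
        centre = c_alpha * math.sqrt(t)
        half_width = math.sqrt(2.0 * a * (1.0 - t))
        lower, upper = centre - half_width, centre + half_width
        rows.append({
            "t": t,
            "lower": lower,
            "upper": upper,
            "p_lower": 1.0 - norm_cdf(lower),   # nominal level at the futility boundary
            "p_upper": 1.0 - norm_cdf(upper),   # nominal level at the efficacy boundary
        })
    return rows

if __name__ == "__main__":
    J = 4
    times = [j / J for j in range(1, J + 1)]      # balanced information times
    # c_alpha = 2.161 is the K = 4 Dunnett value quoted in the text;
    # a = 3.0 is a PLACEHOLDER, to be replaced by the Table 1 coefficient.
    for row in scprt_boundaries(c_alpha=2.161, a=3.0, info_times=times):
        print({key: round(value, 4) for key, value in row.items()})
```

At the final look the two boundaries coincide at $c_\alpha$, which is why the SCPRT design keeps the fixed-sample critical value.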

Interim analysis
The time for an interim analysis is determined by the information spent on the trial, or the information time. For a trial with a time-to-event endpoint, the information is determined by the number of events rather than the sample size. Let $n$ be the total sample size and $p$ be the overall failure probability. Then the total number of events for the trial is $e = np$, and we define $e_0 = e/(K + 1)$ to be the number of events per arm. We may consider the following approaches for conducting interim analyses:
(a) After a specific total number of events across all arms has been observed. This approach is problematic because dropping arm(s) means that more events are required from the remaining arms, or else the subsequent pre-defined information times change.
(b) After an equal number of events has been observed in each arm (across all arms). This is impractical, since the treatment arms and the control do not reach an equal number of events at the same time.
(c) After a specific number of events, combined from a treatment arm and the control arm, has been observed. That is, when we observe a combined number of events $e_j = 2 e_0 t_j^*$ between a treatment arm and the control arm at the $j$th interim look, we conduct an interim analysis for that treatment arm (vs. the control), where $t_j^*$ is the pre-specified information time at the $j$th look, e.g., $t_j^* = j/J$ for an equal number of events per arm per stage. This procedure needs to be performed for all remaining treatment arms at the $j$th stage (these interim analyses are not conducted at the same calendar time, as illustrated in Figure 1).
(d) After a specific number of events in the control arm has been observed. That is, once we observe the pre-specified number of control events, we conduct the interim analyses for all treatment arms versus the control.
Procedure (c) depends on the combined number of events between a treatment arm and the control, instead of the total number of events in the trial. Hence, dropping arm(s) for futility has no impact on the information time, and the proposed procedure (c) can preserve the power of the group sequential MAMS trial when arms are dropped for futility during the trial. Procedure (d) was proposed by Jaki and Magirr (2013) and is implemented in the R package MAMS. However, at an interim analysis, the number of events in a treatment arm may be much smaller than in the control arm, so the overall power may not be preserved.
Figure 1 illustrates the combined number of events in the process of interim analyses for a three-arm ($K = 2$), two-stage ($J = 2$) multi-arm multi-stage trial. Suppose that the required total number of events is $e = 300$, so that the number of events per arm is $e_0 = 300/3 = 100$. Given information times $t_j^* = j/J$ for the $j$th interim analysis, the combined numbers of events between a treatment arm and the control for the first and second analyses are $e_1 = 100$ and $e_2 = 200$, respectively. At calendar time $t_0$, the observed numbers of events are 50, 30, and 40 for the control, arm 1, and arm 2, respectively. Thus, the combined numbers of events are 80 for arm 1 vs. control and 90 for arm 2 vs. control, and no interim analysis is conducted. At calendar time $t_1$, the combined number of events between the control and arm 2 is 100, while the combined number of events between the control and arm 1 is 90. Thus, we conduct the first interim analysis for arm 2 versus the control only, and this analysis results in dropping arm 2 for futility. The trial continues with treatment arm 1 and the control to calendar time $t_2$, when the combined number of events between the control and arm 1 is 100. Thus, we conduct the first interim analysis for arm 1 versus the control, and the results recommend continuation of arm 1. After the first interim analyses for all treatment arms are done, the trial goes on to calendar time $t_3$, when the combined number of events between arm 1 and the control is 200. Thus, we conduct the second analysis (also the final analysis) for arm 1 versus the control.
With the interim analysis approach based on the combined number of events between a treatment arm and the control, each interim analysis is spread over a time window (the interim window), because the interim analysis occurs at different time points for different treatment arms. We recommend continuing to enroll patients during the interim window and deciding whether to stop the trial for efficacy and select an arm only after all treatment arms have completed their interim analysis within this window. If multiple treatment arms are successful within an interim window, the arm with the largest treatment effect (the smallest hazard ratio) is selected.
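The combined-event rule of procedure (c) amounts to simple bookkeeping, which can be sketched in Python as follows. The snippet reproduces the Figure 1 arithmetic ($e = 300$, three arms, $J = 2$). The helper names are hypothetical, and the per-arm counts used at calendar time $t_1$ (55, 35, 45) are an assumed split that is merely consistent with the combined totals 90 and 100 given in the text.

```python
def combined_event_targets(total_events, n_arms, info_times):
    """Targets e_j = 2 * e_0 * t_j for each look, with e_0 = e / (K + 1)."""
    e0 = total_events / n_arms
    return [2.0 * e0 * t for t in info_times]

def arms_due_for_look(control_events, treatment_events, target, done=()):
    """Treatment arms whose combined events with the control reach the target."""
    return [arm for arm, events in treatment_events.items()
            if arm not in done and control_events + events >= target]

if __name__ == "__main__":
    J = 2
    targets = combined_event_targets(300, 3, [j / J for j in range(1, J + 1)])
    print("targets:", targets)                       # combined events per look

    # Calendar time t0 (counts from the text): no arm has reached e_1 yet.
    print(arms_due_for_look(50, {"arm 1": 30, "arm 2": 40}, targets[0]))

    # Calendar time t1 (ASSUMED split giving combined totals 90 and 100).
    print(arms_due_for_look(55, {"arm 1": 35, "arm 2": 45}, targets[0]))
```

Because each target involves only one treatment arm and the control, dropping another arm leaves the remaining targets unchanged, which is the property the text relies on.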

Group sequential FWER and power
To study the FWER and power under the group sequential design, we define the events $A_{k,j} = (Z_{k,j} < l_j)$ and $B_{k,j} = (l_j < Z_{k,j} < u_j)$. The event that $H_1, \ldots, H_K$ all fail to be rejected, denoted $\bar{R}_K$, can be expressed in terms of these events. Thus, the one-sided FWER $\alpha$ under the global null $H_G$ for a group sequential trial can be calculated as
$$\alpha = 1 - P(\bar{R}_K \mid H_G).$$
We now denote by $\pi_J$ the probability under the LFC that no individual null hypothesis $H_k^{(0)}$ is rejected at analyses $1, \ldots, J - 1$ and that, at analysis $J$, $H_1^{(0)}$ is rejected and treatment 1 is recommended. The corresponding event $Q_J$ can be written in terms of $A_{k,j}$ and $B_{k,j}$ (Magirr et al. 2012) and includes the event for selection of treatment 1 at stage $J$. Thus, $\pi_J = P(Q_J \mid H_a)$, and the power ($1 - \beta$) under the LFC $H_a: \delta_1 = \delta^{(1)}, \delta_2 = \delta^{(0)}, \ldots, \delta_K = \delta^{(0)}$ for the group sequential MAMS trial is given by
$$1 - \beta = \sum_{j=1}^{J} \pi_j.$$
To understand the event $Q_J$, consider a MAMS design with $K + 1 = 3$ arms and $J = 2$ looks: for $J = 1$ and for $J = 2$, $Q_J$ can be written out explicitly in terms of the events $A_{k,j}$ and $B_{k,j}$. The events $\bar{R}_K$ and $Q_J$ are complicated. However, for the proposed SCPRT design, it is not necessary to calculate their probabilities theoretically, because the FWER and power can be easily calculated for the fixed sample design, and it can be shown that the overall type I error and power of the SCPRT procedure are approximately the same as those of the fixed sample design. Specifically, for a group sequential procedure with $J$ interim analyses, let $\beta_0(\theta)$ and $\beta_J(\theta)$ be the power functions of the fixed sample test and the $J$-stage group sequential SCPRT test, respectively, where $\theta$ is the drift parameter of the Brownian motion, and let $\rho_{\max} = \max_\theta P_\theta(D)$ be the maximum discordance probability of the group sequential SCPRT procedure. Following Theorem 4.1 in Xiong (1995) and Tan and Xiong (1992), for any $\theta$ we have
$$|\beta_J(\theta) - \beta_0(\theta)| \le \rho_{\max},$$
which implies that the difference between the two power functions does not exceed $\rho_{\max}$. Thus, with a small maximum discordance probability $\rho_{\max}$, the power of the fixed sample design is approximately the power of the group sequential trial based on the SCPRT procedure. The recommended maximum conditional probability of discordance $\rho = 0.02$ leads to a maximum discordance probability $\rho_{\max} = 0.0054$. More details on the computation of $\rho_{\max}$ can be found in Xiong et al. (2003).
Thus, the FWER and power of the fixed sample design provide nearly the same FWER and power for the group sequential MAMS trial.In Section 6, we will conduct simulation studies to demonstrate that the proposed group sequential MAMS design preserves the nominal FWER and power from the fixed sample design.

Implementation in R
The proposed group sequential MAMS design has been implemented in R code (see online Supporting Information), and the SCPRT procedure has been implemented in the software SCPRTinfWin (Xiong 2017), which can be downloaded at http://www.stjuderesearch.org/site/depts/biostats/scprt. Examples of the futility and efficacy boundaries $l_j, u_j$ of the SCPRT procedure with the Dunnett correction for $(K + 1)$-arm, $J$-stage designs are given in Table 2. The number of events and sample size per arm for the fixed sample design can be calculated using the R function 'Size'.

Comparison
In this section, we compare the proposed method with Jaki and Magirr's method for fixed sample and group sequential designs.

Fixed sample design
For the fixed sample design, we consider the LFC $\delta_1 = \delta^{(1)}, \delta_2 = \cdots = \delta_K = \delta^{(0)}$, where $\delta^{(1)}$ ($< \delta^{(0)}$) is the hazard ratio of treatment arm 1 vs. the control arm. Further, we assume equal sample size allocation among all arms and uniform accrual. Total sample sizes and numbers of events for an FWER of 5% and power of 90% are calculated under Weibull survival distributions $S_k(t) = e^{-\lambda_k t^{\kappa}}$ ($k = 0, 1, \ldots, K$) with the following parameter configurations: shape parameter $\kappa = 0.7, 1, 1.5$; number of treatment arms $K = 4$; accrual duration $t_a = 40$ and 10 (months); and length of follow-up $t_f = 20$ and 10 (months). The value of $\lambda_0$ is selected to reflect a median survival of $m_0 = 20$ and 10 (months) for the control; the log hazard ratio $\psi^{(1)} = -\log \delta^{(1)}$ is set to 0.40, 0.45, and 0.60 to represent the clinically relevant improvement (treatment arm vs. control); and $\psi^{(0)} = -\log \delta^{(0)}$ is set to 0 and 0.15 to represent effect sizes too small to justify further investigation. Table 3 presents the calculated sample size per arm ($n_0$) and number of events per arm ($e_0$) for the fixed sample design under the exponential distribution ($\kappa = 1$). Results for the Weibull distribution ($\kappa = 0.7$ or 1.5), with and without loss to follow-up, are given in Appendix 2 (Supplemental Material). The loss to follow-up is assumed to follow an exponential distribution with rate 0.05. Simulations were conducted to study the accuracy of the proposed sample size calculation; the results are given in Table 3 and summarized as follows. First, the number of events per arm ($e_0$) for the proposed method is very robust against the survival, accrual, and censoring distributions. Second, the empirical FWER ($\alpha$) and power ($1 - \beta$) of the proposed method are all close to the nominal levels of 5% and 90%, respectively. In contrast, for Jaki and Magirr's method, the empirical powers are below the nominal level, with the deviation increasing as the target effect size $\psi^{(1)} = -\log \delta^{(1)}$ increases. Therefore, the number of events from the proposed method provides adequate power, whereas Jaki and Magirr's method underestimates the number of events, particularly for relatively large effect sizes.
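The simulation scenarios above are parameterized by the control median $m_0$, the Weibull shape $\kappa$, and the log hazard ratios $\psi$. The Python sketch below shows one way to translate these into rates under $S_k(t) = e^{-\lambda_k t^\kappa}$ with $\lambda_k = \delta_k \lambda_0$, and to draw survival times by inverting the survival function. It is an illustrative helper set with hypothetical names, not the authors' simulation code.

```python
import math
import random

def weibull_rate_from_median(median, shape):
    """lambda_0 such that S_0(median) = exp(-lambda_0 * median**shape) = 1/2."""
    return math.log(2.0) / median ** shape

def hazard_ratio(psi):
    """delta = exp(-psi), since psi = -log(delta)."""
    return math.exp(-psi)

def weibull_survival(t, rate, shape):
    return math.exp(-rate * t ** shape)

def draw_weibull_time(rate, shape, rng):
    """Inverse-transform sample: solve exp(-rate * T**shape) = U for T."""
    u = 1.0 - rng.random()                 # uniform on (0, 1]
    return (-math.log(u) / rate) ** (1.0 / shape)

if __name__ == "__main__":
    rng = random.Random(2024)
    shape, m0 = 1.5, 20.0
    lam0 = weibull_rate_from_median(m0, shape)
    lam1 = hazard_ratio(0.40) * lam0       # treatment arm with psi = 0.40
    print("control S(m0):  ", weibull_survival(m0, lam0, shape))
    print("treatment S(m0):", round(weibull_survival(m0, lam1, shape), 4))
    times = [draw_weibull_time(lam0, shape, rng) for _ in range(5)]
    print("sample control times:", [round(t, 1) for t in times])
```

Under proportional hazards the treatment survival at the control median equals $0.5^{\delta}$, which gives a quick sanity check on the parameterization.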

Group sequential design
We next compare the proposed method with Jaki and Magirr's method for group sequential designs. Jaki and Magirr (2013) considered a two-stage ($J = 2$) multi-arm trial in which four experimental treatments ($K = 4$) are compared to a control with FWER $\alpha = 0.05$ and power $1 - \beta = 0.9$. Patients are assumed to arrive according to a uniform distribution on the interval [0, 40] with a follow-up time of 12 months, and the survival distribution of the control is assumed to be exponential with median survival time $m_0 = 20$ months, under various design parameter configurations. We also assume the proportional hazards model $S_k(t) = [S_0(t)]^{\delta_k}$, where $\delta_1 = \delta^{(1)}$ and $\delta_2 = \delta_3 = \delta_4 = \delta^{(0)}$, and both $\delta^{(1)}$ and $\delta^{(0)}$ are varied. To investigate the operating characteristics of the designs, the number of events per arm is calculated using the proposed method, and simulations are conducted to estimate the FWER, the empirical power, and the average number of events per arm under the null and the LFC. Results for the proposed method are compared to those reported in Table V of Jaki and Magirr's paper. For example, for a two-stage design with information times $t^* = (0.5, 1)$ and O'Brien-Fleming boundaries, the number of events per arm and the sequential boundaries for Jaki and Magirr's method can be calculated using the R function 'tite.mams' in the R package MAMS:
    #### Jaki and Magirr's method ####
    library(MAMS)
    tite.mams(hr = exp(0.3), hr0 = exp(0), K = 4, J = 2, alpha = 0.05, power = 0.9, r = 1:2, r0 = 1:2, ushape = "obf", lfix = 0)
Table 4 gives the five-arm ($K = 4$), two-stage ($J = 2$) designs using O'Brien and Fleming (OBF) and Pocock (P) boundaries (with zero futility boundary) and the triangular test (T) for Jaki and Magirr's method, and SCPRT boundaries for the proposed method, together with the number of events per arm ($e_0$) and per arm per stage ($e_0^{(1)}$). Simulation-based empirical FWER ($\alpha$) and power ($1 - \beta$) are also provided, together with the expected number of events under the global null hypothesis (ANE$_0$) and the LFC (ANE$_1$). The results from Table V of Jaki and Magirr are also included in Table 4 for comparison. Computations to produce the results in Table 4 were carried out in the R package MAMS for Jaki and Magirr's method and using R functions for the proposed method (see online Supporting Information). The results are summarized here. First, with the maximum conditional probability of discordance set to $\rho = 0.02$, the proposed SCPRT method inflates the nominal 5% FWER by roughly 0.8% in relative terms (the actual FWER is roughly 0.0504). The simulated FWER for the proposed method ranges from 0.049 to 0.052 (see Tables 4, 5, and 6). With 10,000 simulation runs, the standard error of simulation for a 5% FWER is $\sqrt{0.05 \times (1 - 0.05)/10000} = 0.00218$, and the 95% confidence interval is (0.046, 0.054). Thus, all simulated FWERs are within the 95% confidence interval, and the proposed method preserves the FWER. In contrast, Jaki and Magirr's method results in roughly 8% to 16% relative deflation of the nominal 5% FWER; its FWER ranges from 0.042 to 0.046. Jaki and Magirr did not explain why their method does not preserve the FWER. It may be due to errors in their simulation code, because our simulation results indicate that the log-rank test should maintain the FWER. Second, their simulated empirical powers are below the target value of 90%, with the deviation from the nominal level increasing as the target effect size $\psi^{(1)} = -\log \delta^{(1)}$ increases. This is because their method times the interim analysis by a pre-planned number of events in the control arm only, which results in a combined number of events between the treatment and control arms that is much smaller than the pre-planned combined number of events. The proposed method preserves power well in all cases because it provides accurate sample size estimation and uses the combined number of events between a treatment arm and the control for an interim analysis. Third, in terms of minimizing the expected number of events per arm, the SCPRT method is better than the Pocock method and similar to the O'Brien and Fleming method. Finally, the SCPRT method has the smallest final critical value, which is the same as that of the fixed sample design. In contrast, the final critical value increases from 2.161 to 2.169, 2.375, and 2.293 for a two-stage design using the O'Brien-Fleming, Pocock, and triangular test boundaries, respectively. This means that the significance levels at the final analysis are no longer at the nominal level of 5%. Instead, they are 4.9%, 3.0%, and 3.6% for the respective methods.

Table 3. Sample size per arm ($n_0$) and number of events per arm ($e_0$) calculated for single-stage ($J = 1$) trials with four treatment arms ($K = 4$), one-sided FWER of 5%, and power of 90% under exponential distributions for various design scenarios: various values of the median survival time of the control $m_0$, accrual duration $t_a$, follow-up time $t_f$, and log hazard ratios $\psi^{(0)} = -\log \delta^{(0)}$ and $\psi^{(1)} = -\log \delta^{(1)}$. Simulations are conducted to estimate the empirical FWER ($\alpha$) and empirical power ($1 - \beta$) based on 10,000 simulation runs. Column groups: design scenario; results for $K = 4$.

Table 4. Group sequential boundaries and operating characteristics for group sequential MAMS trials comparing four treatment arms ($K = 4$) with a control in two-stage designs ($J = 2$) based on the O'Brien and Fleming (OBF), Pocock (P), triangular (T), and SCPRT methods, where $e_0$ is the number of events per arm and $e_0^{(1)}$ is the number of events per arm per stage under the exponential model with median survival time $m_0 = 20$ (months). The empirical FWER ($\alpha$), power ($1 - \beta$), and average number of events per arm under $H_G$ (ANE$_0$) and the LFC (ANE$_1$) are estimated based on 10,000 simulation runs. Note: comparison results are from Jaki and Magirr (2013). Group sequential boundaries and numbers of events were calculated with a one-sided FWER of 5% and power of 90%.

Abbreviations: Prob: probability; cum: cumulative; $e_0$: number of events per arm; $\tau$: total study duration.

Table 6. Simulations are conducted to study the performance of simultaneous efficacy stopping designs under the LFC for various true treatment effects with number of stages $J = 2, 3, 4$ and number of treatment arms $K = 2, 3$, where $\psi_k = -\log \delta_k$ is the log hazard ratio (HR) of the $k$th treatment arm vs. the control. $P_k$ is the probability of selecting the $k$th treatment and declaring its efficacy, estimated based on 10,000 simulation runs.
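Two numerical statements in this subsection can be checked with a short calculation: the Monte Carlo standard error of a simulated 5% FWER, and the familywise significance level implied by a given final critical value. The Python sketch below does both. It assumes, as in the Dunnett setting with equal allocation, that the $K = 4$ final-stage statistics have pairwise correlation 1/2, and it ignores any error spent at the interim look, so the FWER values are fixed-sample approximations. Function names are hypothetical.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulation_standard_error(p, n_runs):
    """Binomial standard error of a simulated proportion."""
    return math.sqrt(p * (1.0 - p) / n_runs)

def final_stage_fwer(c, K, lo=-8.0, hi=8.0, steps=2000):
    """1 - P(all Z_k <= c), ASSUMING pairwise correlation 1/2 (Simpson rule)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * norm_pdf(y) * norm_cdf(math.sqrt(2.0) * c + y) ** K
    return 1.0 - total * h / 3.0

if __name__ == "__main__":
    se = simulation_standard_error(0.05, 10000)
    print("standard error:", round(se, 5))
    print("95% interval:", (round(0.05 - 1.96 * se, 3), round(0.05 + 1.96 * se, 3)))
    # Final critical values quoted in the text for K = 4, J = 2.
    for label, c in [("SCPRT/fixed", 2.161), ("OBF", 2.169),
                     ("Pocock", 2.375), ("triangular", 2.293)]:
        print(label, round(final_stage_fwer(c, 4), 3))
```

The ordering of the resulting levels mirrors the text: the larger the final critical value a boundary family requires, the further the final-analysis level falls below 5%.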
We also present a four-stage ($J = 4$) design with two treatment arms ($K = 2$) (Table 5). A total of $e_0 = 239$ events per arm is required to detect treatment effects $\psi^{(0)} = 0$ and $\psi^{(1)} = 0.30$ with a nominal FWER of 5% and power of 90% under the LFC, with total study duration $\tau = 60$ (months). Simulations were conducted to study the operating characteristics of the four-stage design using an event-driven approach. The FWER and power of the four-stage design are 0.0513 and 0.902, respectively. The cumulative probability of the trial stopping (an arm meets the efficacy criterion or all treatment arms are dropped for futility) under the LFC increases from 0.080 at the first stage to 0.478 at the final stage. The probability of futility (under the LFC) is 0.2% at the first stage and increases to 8.4% at the final stage, and the probability of success (under the LFC) is 7.7% at the first stage and increases to 39.4% at the final stage. The average number of events per arm (under the LFC) is 171, and the average study duration (under the LFC) is 49.42 months. Thus, the four-stage design could save about 28% of the events per arm and 18% of the total study duration under the LFC. Similar results are observed under the global null (for details, see Table 5).

Multiple treatment effects
In this article, the power of the trial is defined under the LFC. However, in a real trial, it is unlikely that only one treatment arm is effective and all other treatment arms have an uninteresting treatment effect. It is more likely that there will be mixed effects among the treatment arms. To study the performance of the simultaneous efficacy stopping design (with treatment selection), we conducted simulations under various mixed treatment effect scenarios (see Table 6) with number of stages $J = 2, 3, 4$ and number of treatment arms $K = 2, 3$. The simulations were conducted under exponential models with the same parameter configuration as in the previous subsections. The simulation results (Table 6) show that when there is a 'quite good' arm, the probability of correctly selecting the best arm ($P_1$) can be substantially reduced. For example, when $K = 2$ and the treatment effects are $\psi_1 = -\log \delta_1 = 0.3$ and $\psi_2 = -\log \delta_2 = 0.25$ for arm 1 and arm 2, respectively, the probability of correctly selecting arm 1 (the best arm) in a two-stage design is only 65%, and the probability of selecting arm 2 ($P_2$), a quite good arm, is as large as 30% for two-, three-, and four-stage designs. Similar results are obtained for three treatment arms ($K = 3$). Therefore, when multiple treatment arms are likely to be effective, the probability of correctly selecting the best arm under the simultaneous efficacy stopping design can be substantially reduced, and there is a risk of selecting a 'quite good' arm instead of the best arm. When multiple treatment arms are effective, an arm-specific efficacy stopping design (in which efficacy stopping stops only the respective treatment arm) is more suitable for finding multiple effective arms.
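The effect of a 'quite good' competing arm on selection can be illustrated with a small Monte Carlo experiment. The Python sketch below is a fixed-sample analogue only, not the group sequential simulation behind Table 6. It assumes pairwise correlation 1/2 between the log-rank statistics, a Schoenfeld-type mean $\psi_k \sqrt{e_0/2}$ for each statistic (an approximation that is not taken from the paper's Equation (1)), $e_0 = 239$ borrowed from the $K = 2$ design described earlier, and $c_\alpha = 1.92$, the standard one-sided 5% Dunnett value for two comparisons (not quoted in the text). All names are hypothetical.

```python
import math
import random

def selection_probabilities(psi, events_per_arm, c_alpha, n_sim=20000, seed=1):
    """Estimate P_k = P(arm k has the largest Z and Z_k > c_alpha)."""
    rng = random.Random(seed)
    K = len(psi)
    # ASSUMED mean of Z_k: psi_k * sqrt(d / 4) with d = 2 * e_0 combined events.
    means = [p * math.sqrt(events_per_arm / 2.0) for p in psi]
    wins = [0] * K
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)       # control-arm noise shared by all comparisons
        z = [m + (rng.gauss(0.0, 1.0) - shared) / math.sqrt(2.0) for m in means]
        best = max(range(K), key=lambda k: z[k])
        if z[best] > c_alpha:
            wins[best] += 1
    return [w / n_sim for w in wins]

if __name__ == "__main__":
    print("LFC   (0.30, 0.00):", selection_probabilities([0.30, 0.00], 239, 1.92))
    print("mixed (0.30, 0.25):", selection_probabilities([0.30, 0.25], 239, 1.92))
```

Under these assumptions the probability of selecting arm 1 drops markedly once arm 2 has a comparable effect, which is qualitatively the pattern described for Table 6.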

Conclusion
In this paper, we developed a group sequential MAMS design with time-to-event endpoints based on the SCPRT procedure. The proposed method provides several improvements over Jaki and Magirr's method. First, the proposed method provides a more accurate estimate of the number of events (sample size). Second, a new interim analysis procedure based on the combined number of events in a treatment arm and the control is proposed, which preserves the study power when arms are dropped for futility in the MAMS trial. Third, the proposed method is computationally simple for group sequential MAMS trial design because it provides analytical solutions for both the futility and efficacy boundaries for an arbitrary number of stages and arms. Furthermore, the proposed group sequential design based on the SCPRT procedure has the following additional merits compared to other group sequential methods. First, the SCPRT has the smallest maximum total number of events (sample size), which is the same as that of the fixed sample design. Hence, it offers greater savings in sample size (or number of events) than other methods used for a multi-arm multi-stage design. Second, the FWER and power of the proposed method are nearly the same as those of the fixed sample design. Thus, it provides consistent conclusions between the fixed and group sequential designs. Third, the maximum conditional probability of discordance is controlled by the study design, which is particularly important when regulatory agencies require sponsors to control the probability of discordance for group sequential trials. In addition, the proposed method is very robust to the accrual, censoring, and survival distributions. Thus, we can use an event-driven approach with the proposed method. This is particularly important in practice because it is extremely difficult to correctly pre-specify the accrual and censoring distributions when the accrual rate cannot be controlled. Therefore, the proposed method improves on Jaki and Magirr's method and provides a simple tool for designing group sequential MAMS trials.
However, when there are multiple effective treatment arms in a MAMS trial, the simultaneous efficacy stopping design with treatment selection is unsuitable because the power (probability) to select the best arm and declare its efficacy can be substantially reduced. Thus, there is a risk of selecting a 'quite good' arm ahead of the best arm. In such cases, an arm-specific efficacy stopping design for a MAMS trial is more suitable for finding multiple effective arms.
multivariate normal distribution with means and K × K variance-covariance (or correlations) matrix where with and

Figure 1. Illustration of the interim analysis for a two-stage (J = 2) group sequential MAMS survival trial with two treatment arms (K = 2), where the time interval [t_1, t_2] is the interim window for the first interim analysis.

Table 1. The maximum conditional probability of discordance ρ and boundary coefficient a for a J-stage (including the final analysis) group sequential SCPRT procedure with balanced information times.
D(t)/D(τ). Now, let B_1 be the B_{t*} at the final stage with full information t* = 1; then the joint distribution of (B_{t*}, B_1) has a bivariate distribution with mean μ = (θt*, θ) and variance matrix

Table 5. Group sequential boundaries, number of events per arm per stage e^(j), and operating characteristics for a group sequential MAMS trial comparing two treatment arms (K = 2) with a control in a four-stage (J = 4) design based on the SCPRT method to detect treatment effects ψ^(0) = 0 and ψ^(1) = 0.30 with a nominal FWER of 5% and power of 90% under the LFC and total study duration τ = 60, where e^(j) is the cumulative number of events per arm at the j-th stage, and α and 1 − β are the empirical FWER and power estimated from 10,000 simulation runs.