A randomized Bayesian optimal phase II design with binary endpoint

ABSTRACT In this paper, we propose a randomized Bayesian optimal phase II (RBOP2) design with a binary endpoint (e.g., response rate). A beta-binomial distribution is used to model the binary endpoint for a two-arm phase II trial. Posterior probabilities of the endpoint of interest are evaluated at each interim look and used in the decision to stop the trial due to futility. Compared with other Bayesian designs, the proposed RBOP2 design has the following merits: (i) strongly controls the type I error rate at a pre-defined level; (ii) optimizes the stopping boundaries, thus maximizing the power to detect treatment effects and minimizing the expected sample size for futile treatment; (iii) does not limit the number of interim looks, thus enabling frequent trial monitoring; and (iv) allows the stopping boundaries to be pre-defined in the protocol and is easy to implement. We conduct simulation studies to compare the proposed design with a group sequential design and other Bayesian randomized designs and evaluate its operating characteristics under different scenarios.


Introduction
A phase II clinical trial aims to determine whether a new experimental therapy shows sufficient evidence of efficacy to warrant its advancement to a large-scale phase III trial (Thall and Simon 1995). The outcomes of phase II clinical trials typically require only short to moderate follow-up and are intended as guidance rather than definitive conclusions. For example, a categorical endpoint (response/nonresponse), rather than overall survival, often serves as the key outcome of a phase II trial of an anti-tumor drug. In addition, phase II trials are usually designed as single-arm trials evaluated against historical controls.
However, phase II trials have not demonstrated a strong capability to identify effective new treatments. Hay et al. (2014) showed that the likelihood of progressing from phase III to marketing approval is only 58% for lead indications and 50% for all indications. The high false-positive rate of phase II trials designed as single-arm trials with historical controls is a major reason for the failure of large-scale phase III trials. Cannistra (2009) showed that reliance on historical controls can lead to biased estimation of the true treatment effect. Baey and Le Deley (2011) used simulation studies to demonstrate that when the estimated response rate of historical controls is biased, the type I error rate can inflate and the power is not controlled. Many factors can compromise the accuracy of estimates based on historical controls and thereby introduce bias, for example changes in imaging methods and techniques, stage migration, improved supportive care, and improvements in standard therapies. In addition, it is often difficult to identify the true population underlying the historical data: some population characteristics in the historical data (e.g., age, sex, and molecular subtypes) may differ from those of the target population. Therefore, randomized phase II trials deserve to be considered more often. Taylor et al. (2006) compared the conventional single-arm design with the randomized design under uncertainty about the historical response rate and concluded that the randomized design is preferable to the single-arm design. Other studies (Pond and Abbasi 2011; Tang et al. 2010) have also compared the performance of the single-arm design with that of the randomized design and have shown that the randomized design can reduce the bias caused by estimation from historical data.
It is desirable to terminate a phase II trial early if the new treatment is futile. Exposing patients to futile treatment may raise ethical issues. Therefore, monitoring efficacy and making a decision to stop the trial due to treatment futility can (i) reduce the number of patients exposed to the ineffective treatment and their duration of exposure and (ii) reduce the duration and resource spending of the trial (Jiang et al. 2020). Simon (1989) proposed a now widely used optimal two-stage design, developed under a frequentist framework, known as Simon's two-stage design. It is a single-arm design with historical controls and can actively control the type I error rate and power at a pre-determined level. Jung (2008) proposed a randomized phase II design analogous to Simon's two-stage design. Sylvester (1988) and Stallard (1998) discussed the single-arm phase II design under Bayesian decision theory using a gain or loss function. Thall and Simon (1994) proposed to use posterior probabilities as an early termination boundary and to monitor data continuously during the trial. In their design, the prior distribution of the binary endpoint for experimental treatment is modeled as a beta distribution. At each interim look, the posterior probability that the response rate of the experimental treatment is more satisfactory than that of historical data is compared with upper and lower probability cut-off values. Zhou et al. (2017) and Zhou et al. (2020) proposed the Bayesian optimal phase II (BOP2) design for a single-arm trial. The BOP2 design can incorporate complex categorical endpoints modeled by a Dirichlet-multinomial distribution. At each interim look, the posterior probability that the response rate of the experimental arm is larger than the historical rate is compared with a probability cut-off value, on which basis the go/no-go decision is made. 
The probability cut-off value is calibrated to ensure strong control of the type I error rate and maximization of power. Tan and Machin (2002) proposed the single-threshold design (STD) and the dual-threshold design for single-arm trials. The true response rate is modeled with a beta-binomial distribution, and the posterior probability is used to support the go/no-go decision-making. Sambucini (2007) proposed a two-stage design for single-arm trials that adopts the STD and the Bayesian predictive approach. Literature on Bayesian randomized designs for phase II is scarce. Zhong et al. (2013) proposed a two-arm two-stage Bayesian design in which the sample size can be re-estimated at each interim look. Cotterill and Whitehead (2015) argued that the binary endpoint used in conventional phase II trials is not always feasible; they proposed a Bayesian sample size calculation method for a time-to-event endpoint under the proportional hazards framework in a randomized phase II trial. Cellamare and Sambucini (2014) developed a two-stage design under the Bayesian predictive probability for the randomized design framework. The endpoint is modeled with a beta-binomial distribution, and the predictive probability that the experimental treatment is more effective than the standard treatment is compared with the futility stopping boundary to make the go/no-go decision. Chen et al. (2017) integrated the Bayesian posterior probability into the single-arm Simon's two-stage design to accommodate a two-arm design: when both arms succeed at the second stage, the Bayesian posterior probability serves as a criterion to pick the winner between the two arms. Yin et al. (2017) proposed a hierarchical Bayesian design (HBD) for a randomized phase II trial with a binary endpoint. The HBD can test new treatments among several groups and reduce the sample size compared with individual designs. Zhao et al. (2022) proposed a two-arm BOP2 design for randomized clinical trials with single, multiple primary, and coprimary endpoints for superiority and noninferiority trials.
In this paper, we extend the BOP2 design developed by Zhou et al. (2017) and Zhou et al. (2020) to accommodate a randomized design setting. The beta-binomial distribution is used to model the binary endpoint for both arms in a phase II trial. At interim looks, the decision to stop for futility is made by evaluating the posterior probability that the experimental treatment is more efficacious than the standard treatment. Compared with other randomized Bayesian designs, the proposed RBOP2 design (i) strongly controls the type I error rate at a pre-determined value; (ii) optimizes the stopping boundaries to maximize power and minimize the expected sample size when the null hypothesis is true; (iii) does not limit the number of interim looks and can monitor the trial frequently; and (iv) allows the stopping thresholds to be pre-defined in the protocol and is easy to interpret and implement. This paper is structured as follows. In section 2, we describe the probability model and design criterion. In section 3, we perform simulation studies to illustrate the operating characteristics of the proposed design. In section 4, we discuss the choice of prior distribution. In section 5, we compare the two-arm BOP2 design proposed by Zhao et al. (2022) with the RBOP2 design using numerical simulations. In section 6, we compare the frequentist group-sequential design with the proposed design using numerical simulations. In section 7, we compare an existing Bayesian randomized two-stage design with the proposed design using numerical simulations. Finally, we provide a concluding discussion in section 8.

Probability model
In a phase II randomized two-arm clinical trial with a binary endpoint (success vs. failure), let p_x be the true probability of success among patients who receive the experimental treatment (arm X) and p_y be the true probability of success among patients who receive the standard treatment (arm Y). Under the Bayesian framework, let p_x and p_y follow beta prior distributions with hyper-parameters (α_0x, β_0x) and (α_0y, β_0y), respectively. The prior probability density functions of p_x and p_y are

π(p_x) = beta(p_x; α_0x, β_0x),  π(p_y) = beta(p_y; α_0y, β_0y).

As the trial progresses, the posterior distribution is updated based on the accumulated data. Suppose that at an interim look, n_x and n_y denote the numbers of patients enrolled and evaluated in arms X and Y, respectively, and the numbers of "successes" are x and y in arms X and Y, respectively. By the conjugacy properties of the beta distribution, the posterior probability density functions of p_x and p_y are

π(p_x | n_x, x) = beta(p_x; α_0x + x, β_0x + n_x − x),
π(p_y | n_y, y) = beta(p_y; α_0y + y, β_0y + n_y − y).

Because arms X and Y are independent, the joint posterior density function of p_x and p_y, denoted π(p_x, p_y | n_x, n_y, x, y), is the product of the two marginal posterior density functions. At a given interim look, the decision to stop for futility is made by comparing a constant C with the posterior probability that p_x exceeds p_y by at least Δ based on the accumulated data and priors, where Δ usually denotes the minimum meaningful clinical difference in practice. This posterior probability can be expressed as

P(p_x ≥ p_y + Δ | x, n_x, y, n_y)
  = ∫_0^1 ∫_{p_y+Δ}^1 π(p_x, p_y | x, n_x, y, n_y) dp_x dp_y
  = ∫_0^1 [1 − Beta(p_y + Δ; α_0x + x, β_0x + n_x − x)] beta(p_y; α_0y + y, β_0y + n_y − y) dp_y,   (1)

where beta(·; α, β) and Beta(·; α, β) denote the density and cumulative distribution functions of the beta distribution, respectively.
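For illustration, the posterior probability (1) can be estimated by Monte Carlo sampling from the two independent beta posteriors. The following is a minimal Python sketch (the authors' implementation is in R and uses the integral form above); the default hyper-parameters are the vague priors used later in the paper.

```python
import random

def posterior_prob_superiority(x, n_x, y, n_y, delta,
                               a0x=0.4, b0x=0.6, a0y=0.2, b0y=0.8,
                               n_draws=200_000, seed=1):
    """Monte Carlo estimate of P(p_x >= p_y + delta | x, n_x, y, n_y)
    under independent beta(a0, b0) priors for the two arms."""
    rng = random.Random(seed)
    ax, bx = a0x + x, b0x + n_x - x      # posterior parameters, arm X
    ay, by = a0y + y, b0y + n_y - y      # posterior parameters, arm Y
    hits = sum(rng.betavariate(ax, bx) >= rng.betavariate(ay, by) + delta
               for _ in range(n_draws))
    return hits / n_draws

# Example: 6/10 successes in arm X vs. 2/10 in arm Y, Delta = 0.2
print(round(posterior_prob_superiority(6, 10, 2, 10, 0.2), 3))
```

In practice the one-dimensional integral in (1) can be evaluated by numerical quadrature; the sampling version above is shown only because it is self-contained.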

Design settings
Based on the probability model in section 2.1, the phase II randomized two-arm clinical trial with a binary endpoint consists of R interim looks and a final analysis. At each interim look, the futility stopping decision is based solely on the value of the posterior probability (1) and the probability cut-off value C. At the final look, if the futility stopping boundary is not crossed, efficacy of the experimental treatment can be claimed. The maximum sample size is denoted by N, with N_x and N_y in arms X and Y, respectively, and the numbers of enrolled patients at the interim looks are denoted by (n_x1, n_y1), . . ., (n_xR, n_yR). At the final look, the trial completes enrolment of all N patients. Suppose that at the r-th interim look, n_r patients are enrolled and evaluated, with n_xr patients in arm X and n_yr patients in arm Y, and that x_r and y_r are the numbers of "successes" in each arm. As described in section 2.1, we compare the posterior probability that p_x exceeds p_y by at least Δ with a probability cut-off constant C to decide on early termination for futility at each interim look. Therefore, at the r-th interim look, the trial is terminated if

P(p_x ≥ p_y + Δ | x_r, n_xr, y_r, n_yr) ≤ C.   (2)

An advantage of the RBOP2 design is that its stopping boundaries can be enumerated before the start of the trial instead of solving (2) once the data are obtained. To this end, a new statistic and its cut-off value are required. Intuitively, the larger the difference between the two observed proportions of "successes", the greater the posterior probability that p_x ≥ p_y + Δ. We therefore define the observed difference of proportions of "successes" as

Δ̂ = x_r / n_xr − y_r / n_yr.

Figures 1 and 2 show the relationship between Δ̂ and the posterior probability (1) at the r-th interim look under different priors (choices of prior distributions are discussed in section 4) for the sample size scenario (n_xr = 10, n_yr = 10).
Figures S1 and S2 in the supplementary materials show the same relationship as Figures 1 and 2 under different sample size scenarios. Generally, they match the intuition that as Δ̂ increases, so does the posterior probability (1). However, because of the discrete nature of the probability model, the posterior probability (1) is not a strictly monotonic function of Δ̂ at a given interim look. For example, in Figure 1(b), when Δ̂ = 0, there are 11 possible combinations of (x_r, y_r), namely (0, 0), (1, 1), (2, 2), . . ., (10, 10). Each combination of (x_r, y_r) leads to a different value of the posterior probability (1), as shown in Figure 1(b) (11 data points exist when Δ̂ = 0). We take a conservative view and define f_r(Δ̂) at the r-th interim look as the largest value of the posterior probability (1) over all outcomes whose observed difference does not exceed Δ̂:

f_r(Δ̂) = max { P(p_x ≥ p_y + Δ | x_r, n_xr, y_r, n_yr) : x_r / n_xr − y_r / n_yr ≤ Δ̂ },

for fixed n_xr and n_yr. Figures 3 and 4 show the relationship between f_r(Δ̂) and the posterior probability (1) as Δ̂ varies at the r-th interim look for the sample size scenario (n_xr = 10, n_yr = 10); Figures S3 and S4 in the supplementary materials show the same relationship under different sample size scenarios. These figures indicate that f_r(Δ̂) is a monotonically increasing function of Δ̂. Therefore, at the r-th interim look, the futility stopping threshold Φ is the maximum Δ̂ that satisfies f_r(Δ̂) ≤ C:

Φ = max { Δ̂ : f_r(Δ̂) ≤ C }.

If Δ̂ ≤ Φ at the r-th interim look, every combination of (x_r, y_r) satisfies inequality (2). Because f_r(Δ̂) is a monotonically increasing function of Δ̂, the stopping threshold Φ can be calculated for a given probability cut-off value C before the start of the trial, making the RBOP2 design easy to understand and implement. The stopping thresholds Φ for the probability cut-off values C = 0.25, 0.5, and 0.75 are also shown in Figures 3 and 4.

Figures 1-4. Posterior probability P(p_x ≥ p_y + Δ | x_r, n_xr, y_r, n_yr) and f_r(Δ̂) when p_y = 0.2 and Δ = 0.2, based on a vague prior for both arms (α_0x = 0.4, β_0x = 0.6, α_0y = 0.2, β_0y = 0.8), as Δ̂ varies at the r-th interim look for the sample size scenario (n_xr = 10, n_yr = 10).
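The enumeration of Φ for a given look can be sketched as follows. This Python sketch (not the authors' R code) estimates the posterior probability for every possible outcome (x_r, y_r), forms the running maximum f_r(Δ̂), and returns the largest Δ̂ for which f_r(Δ̂) ≤ C; the Monte Carlo estimator stands in for exact evaluation of (1) and the default priors are the paper's vague priors.

```python
import random

def post_prob(x, nx, y, ny, delta, a0x, b0x, a0y, b0y, rng, n_draws=2000):
    """Monte Carlo estimate of P(p_x >= p_y + delta | data)."""
    ax, bx = a0x + x, b0x + nx - x
    ay, by = a0y + y, b0y + ny - y
    hits = sum(rng.betavariate(ax, bx) >= rng.betavariate(ay, by) + delta
               for _ in range(n_draws))
    return hits / n_draws

def futility_threshold(nx, ny, delta, C, priors=(0.4, 0.6, 0.2, 0.8), seed=1):
    """Largest observed difference D = x/nx - y/ny such that f_r(D) <= C,
    where f_r(D) is the running maximum of the posterior probability over
    all outcomes with observed difference <= D. Returns None if no D works."""
    rng = random.Random(seed)
    outcomes = sorted(
        (x / nx - y / ny,
         post_prob(x, nx, y, ny, delta, *priors, rng))
        for x in range(nx + 1) for y in range(ny + 1)
    )
    threshold, running_max = None, 0.0
    for d, p in outcomes:
        running_max = max(running_max, p)
        if running_max <= C:
            threshold = d
    return threshold

# Example: thresholds at a look with n_xr = n_yr = 10 and Delta = 0.2
for C in (0.25, 0.5, 0.75):
    print(C, futility_threshold(10, 10, 0.2, C))
```

Because the table of outcome probabilities depends only on (n_xr, n_yr) and the priors, this enumeration can be run for every planned look and the resulting thresholds written into the protocol.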

Optimizing the design parameters
The RBOP2 design can control the type I error rate and power through optimization of the probability cut-off value C. To illustrate the optimization process, we set up hypotheses that reflect the clinical interest under a frequentist framework. The null hypothesis H_0 and alternative hypothesis H_1 are

H_0: p_x − p_y = 0  versus  H_1: p_x − p_y > 0,

where Δ denotes the minimum meaningful clinical difference between p_x and p_y. Under H_0 or H_1, the treatment is considered futile or efficacious, respectively. When the observed difference of proportions of "successes", Δ̂, is larger than the stopping threshold Φ at the final look, we reject H_0 and claim efficacy of the experimental treatment. The type I error rate, α, is defined as the probability of rejecting H_0 when H_0 is true. The power, 1 − β, is defined as the probability of rejecting H_0 when p_x − p_y = Δ, and β is also known as the type II error rate. Zhou et al. (2017) proposed to use a probability cut-off value that is a function of the proportion of information accumulated at each interim look. This probability cut-off value has the form

C(n) = λ (n / N)^γ,

where λ and γ are parameters calibrated by simulation, n = n_x + n_y is the total sample size at a given interim look under the design settings of section 2.2, and N is the maximum sample size. C(n) is designed as a monotonically increasing function of n/N because at the initial stage of the trial the information is not sufficient to judge futility confidently, so a relaxed probability boundary is necessary to prevent premature termination; as the trial accumulates more information and uncertainty is reduced, a stricter probability boundary is necessary to terminate an inefficacious treatment and prevent patients from being exposed to it.
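The cut-off schedule is a one-line function. The sketch below evaluates C(n) at the looks used in section 3 with the optimized values λ = 0.13 and γ = 0.18 reported there; it is illustrative only.

```python
def prob_cutoff(n, N, lam=0.13, gamma=0.18):
    """Interim probability cut-off C(n) = lam * (n / N) ** gamma."""
    return lam * (n / N) ** gamma

# Cut-offs at total sample sizes 50, 70, 90, 110, 130 with N = 130:
# the boundary tightens as information accumulates, reaching lam at n = N.
print([round(prob_cutoff(n, 130), 4) for n in (50, 70, 90, 110, 130)])
```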
In the following discussion, a search algorithm is used to enumerate all possible combinations of λ and γ (and, when needed, N), and the favorable combination is selected. First, assuming a fixed sample size N, the parameters λ and γ are calibrated by simulation to maximize the power while keeping the type I error rate below a pre-determined value. Following Zhou et al. (2017), the algorithm is:

Step 1: Obtain H_0 and H_1 and the desirable type I error rate from clinicians.
Step 2: Determine the values of (λ, γ) that yield the desirable type I error rate. These values can be determined through a grid search.
Step 3: From the sets of (λ, γ) identified in step 2, select the set that yields the maximum statistical power as the optimal parameters.

If the sample size N is not fixed, the parameters λ and γ and the sample size N are chosen to meet the pre-defined type I error rate and power while minimizing E(N | H_0), the expected sample size under H_0. Following Zhou et al. (2017), the algorithm is:

Step 1: Obtain H_0 and H_1 and the desirable type I and type II error rates from clinicians.
Step 2: Determine the values of (N, λ, γ) that yield the desirable type I and type II error rates. These values can be determined through a grid search.
Step 3: From the sets of (N, λ, γ) identified in step 2, select the set that yields the smallest E(N | H_0) as the optimal parameters.
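The fixed-N calibration can be sketched as below. Everything here is an illustrative assumption, not the authors' implementation: the look schedule and grid values are invented for the example, and a fast normal approximation to the posterior probability replaces exact evaluation of (1) purely to keep the loop quick.

```python
import math
import random

LOOKS = [25, 35, 45, 55, 65]   # assumed cumulative patients per arm at each look
N_TOTAL = 130                  # maximum total sample size (both arms)
DELTA = 0.2                    # minimum meaningful clinical difference

def approx_post_prob(x, nx, y, ny, delta=DELTA,
                     a0x=0.4, b0x=0.6, a0y=0.2, b0y=0.8):
    """Normal approximation to P(p_x >= p_y + delta | data); a stand-in
    for the exact integral, used only to keep the calibration fast."""
    ax, bx = a0x + x, b0x + nx - x
    ay, by = a0y + y, b0y + ny - y
    mx, my = ax / (ax + bx), ay / (ay + by)
    vx = ax * bx / ((ax + bx) ** 2 * (ax + bx + 1.0))
    vy = ay * by / ((ay + by) ** 2 * (ay + by + 1.0))
    z = (mx - my - delta) / math.sqrt(vx + vy)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reject_rate(px, py, lam, gam, n_sim=1000, seed=7):
    """Fraction of simulated trials that never cross the futility boundary
    (and therefore claim efficacy at the final look)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        arm_x = [rng.random() < px for _ in range(LOOKS[-1])]
        arm_y = [rng.random() < py for _ in range(LOOKS[-1])]
        stopped = False
        for m in LOOKS:
            x, y = sum(arm_x[:m]), sum(arm_y[:m])
            cutoff = lam * (2 * m / N_TOTAL) ** gam   # C(n) = lam (n/N)^gam
            if approx_post_prob(x, m, y, m) <= cutoff:
                stopped = True
                break
        rejections += (not stopped)
    return rejections / n_sim

# Step 2: grid search for (lambda, gamma) meeting the 5% type I error rate.
grid = [(l, g) for l in (0.05, 0.10, 0.15, 0.20) for g in (0.2, 0.6, 1.0)]
admissible = [(l, g) for l, g in grid if reject_rate(0.2, 0.2, l, g) <= 0.05]
# Step 3: among those, keep the pair maximizing power at p_x - p_y = DELTA.
best = max(admissible, key=lambda lg: reject_rate(0.4, 0.2, *lg)) if admissible else None
print(best)
```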
Unlike other Bayesian designs, the RBOP2 design strongly controls the type I error rate. It therefore possesses frequentist properties that make it easy to apply in practice.

Simulation studies
In this section, we perform a simulation study under the proposed design and show the optimization of the sample size and stopping thresholds. All simulation studies are performed using the R programming language, and the code is available from the authors upon request. All results are based on 100,000 simulations.
We begin by posing a typical clinical question, proposing hypotheses, and suggesting design parameters. Consider the following scenario: i) The purpose of a phase II randomized two-arm clinical trial is to evaluate the efficacy of a new experimental treatment (arm X) against that of the standard treatment (arm Y) in a specified population; ii) The primary endpoint is response rate, and patients are randomized to arms X and Y in a 1:1 ratio; iii) The null hypothesis is p_x − p_y = 0 and the alternative hypothesis is p_x − p_y > 0, where p_x and p_y denote the response rates in arms X and Y, respectively; iv) Given the historical data, the reference response rate in arm Y, p_y, is 0.2 and the minimum meaningful clinical improvement, Δ, is 0.2; v) The type I error rate must be no more than 5% (one-sided), and the target power, 1 − β, is 80% under the minimum meaningful clinical improvement.
Under this design setting, vague priors are selected for both arms (the selection of prior distributions is discussed in section 4). Interim analyses are performed when efficacy evaluation is complete for the first 50 patients and then after every additional 20 patients. The following metrics are used to evaluate the operating characteristics of the RBOP2 design: (i) PRN (%), the percentage of trials that reject the null hypothesis; if the null hypothesis is true, PRN (%) represents the type I error rate, while if the alternative hypothesis is true, PRN (%) represents the power; (ii) PET (%), the percentage of trials terminated early; (iii) ASS, the average sample size used before the trial stops.

Table 1 shows the optimization process and the operating characteristics under each sample size scenario. As expected, the power increases with the sample size. For example, under the alternative hypothesis (Δ = 0.2), the power increases from 63.2% to 86.6% as the maximum sample size increases from 70 to 150. When the maximum sample size is 130 or 150, the power is 80.7% or 86.6% and the ASS is 120.1 or 143.0, respectively; both scenarios meet the pre-specified target power (80%). Under the null hypothesis (Δ = 0), the ASS of the sample size scenario (50, 70, 90, 110, 130), 69.2, is less than that of the sample size scenario (50, 70, 90, 110, 130, 150), 83.0. Therefore, 130 is chosen as the optimized sample size, N_opt. Figure 5 shows the futility stopping thresholds for each interim and final analysis when N_opt = 130.

Table 1. Operating characteristics of the proposed design under different sample size scenarios, with design settings as follows: i) vague priors (α_0x = 0.4, β_0x = 0.6, α_0y = 0.2, β_0y = 0.8); ii) 1:1 randomization ratio; iii) H_0: p_x − p_y = 0, H_1: p_x − p_y > 0, p_y = 0.2, Δ = 0.2.
The stopping thresholds, Φ, are (0.00, 0.0286, 0.0667, 0.0909, 0.0923) for interim looks 1-4 and the final look, and the optimized values are λ = 0.13 and γ = 0.18.
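Once the thresholds are enumerated, the operating characteristics follow from a direct simulation of the threshold rule. The Python sketch below (an illustration, not the authors' R code) uses the thresholds quoted above with looks at 50, 70, 90, 110, and 130 total patients (25, 35, 45, 55, 65 per arm) and estimates PRN, PET, and ASS.

```python
import random

LOOKS = [25, 35, 45, 55, 65]                   # cumulative patients per arm
PHI = [0.00, 0.0286, 0.0667, 0.0909, 0.0923]   # enumerated thresholds

def simulate(px, py, n_sim=10000, seed=11):
    """Estimate PRN (%), PET (%), and ASS under the threshold rule:
    stop for futility at look r if x_r/n_xr - y_r/n_yr <= Phi_r;
    reject H0 if the boundary is never crossed through the final look."""
    rng = random.Random(seed)
    rejections = early_stops = total_n = 0
    for _ in range(n_sim):
        arm_x = [rng.random() < px for _ in range(LOOKS[-1])]
        arm_y = [rng.random() < py for _ in range(LOOKS[-1])]
        used_n = 2 * LOOKS[-1]
        reject = True
        for r, (m, phi) in enumerate(zip(LOOKS, PHI)):
            diff = sum(arm_x[:m]) / m - sum(arm_y[:m]) / m
            if diff <= phi:                    # futility boundary crossed
                reject = False
                if r < len(LOOKS) - 1:         # stopped before the final look
                    early_stops += 1
                    used_n = 2 * m
                break
        rejections += reject
        total_n += used_n
    return (100.0 * rejections / n_sim,
            100.0 * early_stops / n_sim,
            total_n / n_sim)

print("H1 (0.4 vs 0.2):", simulate(0.4, 0.2))
print("H0 (0.2 vs 0.2):", simulate(0.2, 0.2))
```

With enough simulated trials the estimates should fall close to the values reported in Table 1 for N_opt = 130.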

Choices of prior distributions
Under the Bayesian framework, the selection of the prior distribution is very important. In section 2.1, we adopted beta prior distributions. For the hyper-parameters of the beta prior, we propose to use the success rate estimated from historical studies as the mode of the prior beta for control arm Y, and that rate plus the minimum meaningful clinical difference, Δ, as the mode of the prior beta for treatment arm X.
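One simple convention consistent with the hyper-parameters quoted throughout the paper (e.g., α_0x = 0.4, β_0x = 0.6 for a vague prior centered at 0.4, and α_0y = 2, β_0y = 8 for a strong prior of size 10 centered at 0.2) is to match the target rate to the prior mean, with α + β equal to the prior "sample size". The helper below is an illustrative assumption of that convention, not a prescription from the paper.

```python
def beta_hyperparams(rate, prior_n):
    """Beta(alpha, beta) prior with alpha + beta = prior_n (the prior
    'sample size') and mean equal to `rate`."""
    return rate * prior_n, (1.0 - rate) * prior_n

# Reproduces the hyper-parameters quoted in the paper:
print(beta_hyperparams(0.4, 1))    # vague prior for arm X (rate p_y + Delta)
print(beta_hyperparams(0.2, 1))    # vague prior for arm Y (rate p_y)
print(beta_hyperparams(0.2, 10))   # strong prior of size 10 for arm Y
```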
In Table 2, we set a vague prior on treatment arm X (α_0x + β_0x = 1), because evidence of efficacy is scarce for a new treatment, and a strong prior on control arm Y (α_0y + β_0y = 10, 20, or 30), because more data have accumulated for the standard treatment. Table 2 compares the operating characteristics of different weights on the prior distribution for control arm Y while a vague prior is set on treatment arm X. For each scenario, the PRN (%) is no more than 5% under the null hypothesis (p_y = 0.2, Δ = 0), so the type I error rate is strictly controlled at the pre-specified level. Under the null hypothesis (p_y = 0.2, Δ = 0), the ASSs for the Vague Prior, Strong Prior#1, Strong Prior#2, and Strong Prior#3 scenarios are 69.2, 85.1, 100.7, and 124.0, respectively, and the corresponding PETs are 92.2%, 86.5%, 72.6%, and 62.6%. This shows that the design based on vague priors on arms X and Y has more opportunities to stop early if the experimental treatment is futile and can prevent more patients from being exposed to an ineffective treatment. Under the alternative hypothesis (p_y = 0.2, Δ = 0.2), the ASS increases from 125.9 to 149.5 when the prior sample size increases from 10 (Strong Prior#1) to 30 (Strong Prior#3), while the power increases only from 84.1% to 88.2%. In addition, if the true reference response rate for arm Y is underestimated, the type I error rate inflates and the power decreases for all prior settings. For example, if the true response rate for arm Y is 0.3, the type I error rate inflates to 7.0%, 7.0%, 7.5%, and 6.6% under the Vague Prior, Strong Prior#1, Strong Prior#2, and Strong Prior#3 scenarios, respectively.
In Table S1 in the supplementary materials, we compare the operating characteristics of different weights on the priors for both arms, with the same weight used for both arms in each scenario. In Table S2 in the supplementary materials, we compare the operating characteristics of different weights on the prior for treatment arm X, while a vague prior is set on control arm Y. In Table S3 in the supplementary materials, we set strong priors on control arm Y, with the mode of the prior beta underestimated as 0.1 (Strong Prior#1), exactly estimated as 0.2 (Strong Prior#2), and overestimated as 0.3 (Strong Prior#3); a vague prior is set on treatment arm X. The comparison of different prior settings in Tables S1, S2, and S3 shows the same patterns of operating characteristics as Table 2.

Comparing with the two-arm BOP2 design
The two-arm BOP2 design proposed by Zhao et al. (2022) is an extension of the single-arm BOP2 design. It can handle single, multiple primary, and coprimary endpoints for superiority and noninferiority trials in a two-arm setting. In the analysis described in this section, we focus on two-arm randomized trials with a single binary endpoint testing for superiority (with the superiority margin set to zero), with interim analyses subject to a futility stopping rule. The two-arm BOP2 design of Zhao et al. (2022) enumerates all possible outcomes (x_r, y_r) that satisfy inequality (2) as the indicators for go/no-go decisions at the interim analyses. In contrast, the proposed RBOP2 design employs the more conservative approach discussed in section 2.2. Table 3 summarizes the numerical results for the operating characteristics of the RBOP2 design and the two-arm BOP2 design. Under the null hypothesis (p_y = 0.2, Δ = 0), both designs control the type I error rate at the pre-specified level (one-sided 0.05), as expected. The PET and ASS are 92.2% and 69.2, respectively, for the RBOP2 design, superior to those of the two-arm BOP2 design (PET = 87.0%, ASS = 75.3). Under the alternative hypothesis (p_y = 0.2, Δ = 0.2), the power of the RBOP2 design is 80.7%, superior to that of the two-arm BOP2 design (power = 78.7%), although the ASSs are similar for both designs (120.1 and 121.8). Table S4 in the supplementary materials compares the RBOP2 design and two-arm BOP2 design under different sample size scenarios; the results are similar to those presented in Table 3.

Table 2. Operating characteristics of the RBOP2 design using different priors on the control arm, with design settings as follows: i) 1:1 randomization ratio; ii) H_0: p_x − p_y = 0, H_1: p_x − p_y > 0, p_y = 0.2, Δ = 0.2.

Comparing with frequentist group-sequential design
The conventional group-sequential design with a beta spending function to specify futility stopping boundaries at interim looks has a character similar to that of the RBOP2 design. In this section, we compare the operating characteristics of the group-sequential design with interim futility monitoring with those of the RBOP2 design. The hypotheses are set up as described in section 2.3, and the Z test statistic for the difference of two proportions is

Z = (x/n_x − y/n_y) / σ̂,  σ̂ = sqrt( p̂ (1 − p̂) (1/n_x + 1/n_y) ),

where p̂ = (x + y)/(n_x + n_y) is the pooled proportion of "successes". The design parameters are the same as described in section 3. For the RBOP2 design, we choose vague priors for both arms. For the group-sequential design, we use the power family method to allocate the type II error at each stage. The beta spending function has the form

β (n/N)^ρ,

where n = n_x + n_y is the total sample size at a given interim look, N is the sample size at the final look, and β is the type II error rate. The spending function is O'Brien-Fleming (OBF)-like when ρ = 3 and Pocock-like when ρ = 1. Binding boundaries are used because the trial will be terminated for futility. Table 4 shows the numerical results of the operating characteristics of the RBOP2 design and the group-sequential design under the OBF-like (denoted GS#1) and Pocock-like (denoted GS#2) futility boundaries. The sample size and the timing of interim looks are defined identically for the RBOP2, GS#1, and GS#2 designs.

Table 3. Operating characteristics of the RBOP2 design and two-arm BOP2 design, with design settings as follows: i) vague priors (α_0x = 0.4, β_0x = 0.6, α_0y = 0.2, β_0y = 0.8) for RBOP2 and two-arm BOP2; ii) 1:1 randomization ratio; iii) H_0: p_x − p_y = 0, H_1: p_x − p_y > 0, p_y = 0.2, Δ = 0.2; iv) sample size at each look: (50, 70, 90, 110, 130).
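The two frequentist ingredients above can be sketched directly; the following Python illustration (hedged as a sketch of the standard pooled two-proportion statistic and power-family spending, not of any particular software's boundary computation) evaluates the cumulative type II error spent at the looks used in this comparison.

```python
import math

def pooled_z(x, nx, y, ny):
    """Z statistic for the difference of two proportions, pooled variance."""
    p_pool = (x + y) / (nx + ny)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / nx + 1.0 / ny))
    return (x / nx - y / ny) / se

def beta_spent(n, N, beta=0.2, rho=3):
    """Power-family type II error spending function: beta * (n/N)**rho."""
    return beta * (n / N) ** rho

# Cumulative type II error spent at each look
# (OBF-like: rho = 3; Pocock-like: rho = 1), N = 130, beta = 0.2.
looks = [50, 70, 90, 110, 130]
print([round(beta_spent(n, 130, rho=3), 4) for n in looks])
print([round(beta_spent(n, 130, rho=1), 4) for n in looks])
```

The rho = 3 schedule spends almost no type II error at early looks (OBF-like relaxed early boundaries), whereas rho = 1 spends it linearly (Pocock-like).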
Under the null hypothesis (p y ¼ 0:2; Δ ¼ 0), the RBOP2 design can control the type I error rate at the pre-specified level (one-sided 0.05), similar to the group-sequential design. The PET is 92.2% for the RBOP2 design, which is superior to GS#1 (PET = 86.1%) and GS#2 (PET = 90.7%). The ASS is 69.2 for the RBOP2 design, which is superior to GS#1 (ASS = 85.1) and similar to GS#2 (ASS = 69.4). Under the alternative hypothesis (p y ¼ 0:2; Δ ¼ 0:2), the powers for the RBOP2, GS#1, and GS#2 designs are 80.7%, 80.3%, and 78.6%, respectively, and their ASSs are 120.1, 125.6, and 119.9, respectively. The powers and ASSs are similar for the RBOP2, GS#1 and GS#2. If the reference response rate p y is mis-specified and the true value is 0.1, the RBOP2 design still has a higher PET (96.6%) than GS#1 (86.6%) and GS#2 (90.8%), and a smaller ASS (65.7) than GS#1 (84.8) and GS#2 (68.4), under the null hypothesis.

Comparing with Bayesian two-stage design
In this section, we compare the proposed design with an existing Bayesian two-arm two-stage design developed by Cellamare and Sambucini (2014), denoted the CS design in the following paragraphs. In the CS design, the randomization ratio is fixed at 1:1 between the two arms. The difference between the numbers of "successes" in the two arms is used to decide whether to stop early for futility at the end of stage 1 and whether to claim efficacy or futility at the end of stage 2. Beta distributions are used to model the analysis and design priors. These characteristics of the CS design are similar to those of the proposed design under a two-stage, 1:1 randomization scenario. The same notation and hypothesis setting used in the preceding paragraphs are applied as follows: the design priors for arms X and Y in the CS design are used to reflect clinical interests (for example, the reference response rate); p_y and p_x = p_y + Δ are used to construct the design priors for arms Y and X, respectively, with the same prior sample size, n_D = 1; the analysis priors for arms Y and X are centered on the reference response rate, p_y, with a vague prior sample size. For example, in Table VI of Cellamare and Sambucini (2014), the sample size and futility stopping rule are listed under different design parameters. When p_y = 0.05 and Δ = 0.15, the total sample size for both arms is 28 at stage 1 and 64 at stage 2; if x_1 − y_1 < 2, the trial terminates for futility at stage 1, and if x_2 − y_2 < 6, the trial terminates for futility at stage 2.

Table 4. Operating characteristics of the RBOP2 design and group-sequential designs (GS#1: OBF-like, GS#2: Pocock-like), with design settings as follows: i) vague priors (α_0x = 0.4, β_0x = 0.6, α_0y = 0.2, β_0y = 0.8) for RBOP2; ii) 1:1 randomization ratio; iii) H_0: p_x − p_y = 0, H_1: p_x − p_y > 0, p_y = 0.2, Δ = 0.2; iv) β = 0.2; v) sample size at each look: (50, 70, 90, 110, 130).
Table 5 compares the operating characteristics of the CS design and the proposed design under the scenarios illustrated in Table VI of Cellamare and Sambucini (2014). As expected, the RBOP2 design controls the type I error rate strictly at the pre-specified level. In the CS design, by contrast, the type I error rate is not strictly controlled, because that design is fully based on the Bayesian framework. For scenarios in which the CS design's type I error rate is less than 5%, the RBOP2 (α = 0.05) design has higher power than the CS design (except when p_y = 0.2 and Δ = 0.25, where the power of the RBOP2 design is 61.5%, slightly lower than the 62% of the CS design). For scenarios in which the CS design's type I error rate is less than 10%, the RBOP2 (α = 0.1) design has higher power than the CS design.

Table 5. Operating characteristics of the CS design and RBOP2 design, with design settings as follows: i) 1:1 randomization ratio; ii) H_0: p_x − p_y = 0, H_1: p_x − p_y > 0; iii) vague priors (α_0x = p_y + Δ, β_0x = 1 − α_0x, α_0y = p_y, β_0y = 1 − α_0y) for the RBOP2 design; iv) design priors with prior sample size n_D = 1 and analysis priors with a vague prior sample size for the CS design.
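The CS futility rule quoted above for the p_y = 0.05, Δ = 0.15 scenario is a simple two-stage count comparison; the sketch below encodes exactly those quoted cutoffs for illustration (the function name and return strings are ours, not from the CS paper).

```python
def cs_two_stage_decision(x1, y1, x2, y2, s1_cut=2, s2_cut=6):
    """Futility rule quoted from Table VI of Cellamare and Sambucini (2014)
    for p_y = 0.05, Delta = 0.15 (n = 28 at stage 1, n = 64 at stage 2).
    x_i, y_i are cumulative successes per arm at the end of stage i."""
    if x1 - y1 < s1_cut:
        return "stop for futility (stage 1)"
    if x2 - y2 < s2_cut:
        return "futility (stage 2)"
    return "claim efficacy (stage 2)"

print(cs_two_stage_decision(3, 2, 10, 3))   # x1 - y1 = 1 < 2: stage-1 stop
print(cs_two_stage_decision(5, 2, 10, 3))   # 3 >= 2 and 7 >= 6: efficacy
```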

Conclusion
In this paper, a randomized version of the BOP2 design with a binary endpoint is proposed. Unlike other Bayesian designs, the RBOP2 design strongly controls the type I error rate and maximizes power through optimization. It allows the stopping thresholds to change according to the proportion of information accumulated during the trial. An important feature of the RBOP2 design is that the threshold for go/no-go decisions is easy to understand, and the thresholds can be enumerated in the protocol before the start of the trial, making the design convenient to apply in practice. Simulation results show that the RBOP2 design has good operating characteristics. The RBOP2 design does not limit the number of interim looks: if the endpoint can be observed easily and quickly, interim looks can be planned frequently; however, if the endpoint is late-onset, frequent interim looks will delay the completion of the trial and interrupt patient enrolment.
We have assessed the performance of the proposed design under different settings of prior distributions. Vague priors on both the treatment and control arms show more satisfactory characteristics, as they are more effective than other prior settings at preventing patients from being exposed to an ineffective treatment. Moreover, we have compared the two-arm BOP2 design proposed by Zhao et al. (2022) with the RBOP2 design: if the treatment is futile, the RBOP2 design has a higher early stopping probability and a smaller sample size than the two-arm BOP2 design. The frequentist group-sequential design with futility stopping boundaries was also compared with the proposed design. The proposed design controls the pre-specified type I error rate as the group-sequential design does, and it has a higher probability of terminating the trial early when the treatment is futile than the group-sequential designs with OBF-like and Pocock-like boundaries.
The proposed design can be applied to any randomized phase II design with a single binary endpoint. It can be extended to handle complex categorical endpoints, time-to-event endpoints, and scenarios comprising more than two arms.

Disclosure statement
No potential conflict of interest was reported by the author.

Funding
The author reported there is no funding associated with the work featured in this article.