Probability of Study Success (PrSS) Evaluation Based on Multiple Endpoints in Late Phase Oncology Drug Development

Abstract
Phase 2 oncology clinical trials are increasingly designed with multiple primary endpoints, such as progression-free survival (PFS) and overall survival (OS). While this gives a clearer picture of treatment benefit than trials with a single endpoint, it complicates the decision to begin a registration (phase 3) trial following a phase 2 trial. Methods for calculating the probability of study success (PrSS) assume the distribution of OS is well understood, but rarely consider the case of multiple primary endpoints. We introduce the BAMBOO method (BAyesian Model that Brings phase 2 cOmposite endpOints) to address this gap. We propose modeling the phase 2 log hazard ratios for PFS and OS as a bivariate normal random variable, with the unknown correlation parameter estimated with a meta-analysis or via simulation. The PrSS of a phase 3 study with multiple primary endpoints can then be obtained using the posterior predictive distribution of this bivariate normal random variable. We provide methods to include additional surrogate endpoints, such as objective response rate, by extending the model to a multivariate normal of arbitrary dimension. Simulation results suggest the proposed method can better address the known prior-data conflict issue compared with existing approaches, although borrowing historical data with caution is generally recommended under high discrepancy and small sample size. Finally, we provide an example of how BAMBOO can aid in planning a phase 3 trial.


Introduction
Oncology therapies are often associated with delayed clinical benefit and substantially improved overall survival (OS). While OS is considered the gold standard for demonstrating clinical benefit, this endpoint may unrealistically extend the duration of a phase 2 or phase 3 trial. Progression-free survival (PFS) is often used as a surrogate endpoint for OS, though the correlation between the endpoints may be unclear. This unknown correlation creates a risk for sponsors who wish to gate hypothesis tests of OS behind hypothesis tests of PFS. For example, the KEYNOTE-010 study in Herbst et al. (2016) met the OS endpoint (hazard ratio, or HR = 0.71, p = 0.0008) but did not meet the PFS endpoint (HR = 0.88, p = 0.07) for pembrolizumab 2 mg/kg versus docetaxel. In recent years, more and more immuno-oncology (IO) studies have considered PFS and OS as multiple primary endpoints in phase 3 development, for example, KEYNOTE-010 (Herbst et al. 2016), KEYNOTE-189 (Gandhi et al. 2018), and KEYNOTE-407 (Paz-Ares et al. 2018).
Another challenge is to determine the threshold of evidence from a phase 2 study that triggers phase 3 development (often called the "go/no go decision"). A number of methods have been developed in the literature to make the go/no go decision analytically. Wang et al. (2013) developed a two-step approach to make the go/no go decision based on predictive power. It includes a Bayesian modeling step that synthesizes relevant data to derive the distribution of the treatment effect, followed by a second step that evaluates the probability of study success (PrSS) of the phase 3 study via trial simulations. Hong and Shi (2012) proposed using the predictive power approach to assist the phase 3 go/no go decision by evaluating the strength of the phase 2 efficacy effect using the PFS HR, OS HR, or both. Sabin et al. (2014) presented an example of evaluating the OS success of pancreatic cancer based on phase 2 PFS results. Saint-Hilary et al. (2019) provided a general approach to compute the PrSS of the final endpoint in phase 3 based on surrogate endpoints. They also proposed methods to address potential discordance between the prior and the observed responses in a phase 2 trial. However, none of the above approaches has been extended to the case where multiple endpoints are considered in phase 3.
In this article, we considered a specific case where the phase 3 study used both PFS and OS as the study endpoints. We proposed a BAyesian Model that Brings cOmposite endpOints (BAMBOO) of both PFS and OS HRs from the phase 2 interim analysis and final analysis to evaluate the PrSS of the phase 3 study. The BAMBOO approach assumes that the PFS and OS HRs are bivariate log-normally distributed with a covariance matrix known up to an unknown correlation coefficient. The correlation can be estimated with two approaches. When sufficient historical information is available, a meta-analysis can be used to calculate the correlation between PFS and OS on the log hazard ratio scale. However, when the number of historical studies in the targeted disease setting is limited, we proposed a novel simulation approach to derive the correlation between PFS and OS log HRs based on the assumption that an individual patient's OS = PFS + post-progression survival (PPS) time. Using either of these approaches, the PrSS of the phase 3 study, defined as $\Pr(\text{PFS HR}_3 < c_1 \text{ or } \text{OS HR}_3 < c_2)$, can be calculated using the posterior predictive distribution from the Bayesian model. Here $c_1$ and $c_2$ are the critical boundaries below which the PFS HR and OS HR are deemed statistically significant. A bivariate critical success factor (CSF) based on the phase 2 PFS and OS HRs can be obtained with the corresponding phase 3 PrSS. The go/no go region is displayed in a contour plot to facilitate decision making. When the OS events take significantly longer to mature compared with PFS events, the bivariate CSF decision rule can be marginalized to depend only on PFS. In this article, we use IO drug development as the example, but the go/no go decision rule and the BAMBOO approach should be applicable to all cases.
With the rapidly evolving landscape in cancer treatment, sponsors are often eager to trigger a phase 3 trial based on early efficacy signals such as matured overall response rate (ORR) results while they continue to follow up PFS and OS events. The marginalized PFS HR CSF (progression-free survival hazard ratio critical success factor) can be further linked to the ORR improvement (ΔORR) for early decision making.
The rest of the article is organized as follows. Section 2 describes the framework and necessary steps for evaluating the PrSS of phase 3 studies with both PFS and OS as the study endpoints, including the detailed methodology for calculating the PrSS. Section 3 illustrates a lung cancer study design example with the details of determining the correlation between PFS and OS log HR. Section 4 provides simulation studies on operating characteristics of the proposed Bayesian assessment framework under several existing and newly proposed prior choices for comparison. Section 5 gives the summary and discussion.

Methods
Consider designing a phase 3 trial with dual endpoints of PFS and OS, in which the total Type I error $\alpha$ is split between PFS at $\alpha_1$ and OS at $\alpha_2$, where $\alpha = \alpha_1 + \alpha_2$. At the final analysis of the phase 3 study, there are $n_{31}$ PFS events and $n_{32}$ OS events, with corresponding critical values $c_1$ and $c_2$ for the PFS and OS HRs, respectively. The sample sizes, with their corresponding efficacy boundaries, Type I error rate, and statistical power, can be obtained by group sequential methods, which are widely available in common statistical software. The details can be found in supplementary materials. In this section, we present the BAMBOO approach for PrSS evaluation given the sample size and critical boundaries, with joint modeling of multiple endpoints. The key assumption is that the hazard ratios and the correlation between endpoints (log HR) are identical across phases of the trials. The hazard ratio for phase 3 can then be estimated with phase 2 observations and historical data, and the sponsor can gauge the chance of meeting the success criteria. Hence, the proposed framework can assist the go/no go decision for a target PrSS level. We detail the various steps and statistical considerations of leveraging historical information in this section.

General PrSS Calculation
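We first introduce notation. Let $Y_i = (Y_{i1}, Y_{i2})'$ be the log HR of PFS and OS in phase $i$. Assuming 1:1 randomization, the asymptotic distribution is
$$Y_i \mid \mu, \rho \sim N(\mu, \Sigma_i), \qquad \Sigma_i = \begin{pmatrix} \sigma_{i1}^2 & \rho\,\sigma_{i1}\sigma_{i2} \\ \rho\,\sigma_{i1}\sigma_{i2} & \sigma_{i2}^2 \end{pmatrix}, \tag{1}$$
where $\sigma_{ij}^2 = 4/n_{ij}$ and $n_{ij}$ is the number of events for the $j$th endpoint in phase $i$. We suppress the dependence of the variance $\sigma_{ij}^2$ (and hence $\Sigma_i$) on the number of events $n_{ij}$ for notational simplicity. The parameterization in (1) assumes the mean, $\mu$, and correlation, $\rho$, do not change over phases. The prior is factored as $\pi(\mu, \rho) = \pi(\rho)\,\pi(\mu \mid \rho)$. We consider a point-mass prior for $\rho$: $\pi(\rho) = \delta_{\hat\rho}$. The point mass is placed at a reasonable value, $\hat\rho$, which can be determined from literature or calibrated via simulations. Hence, we further suppress the dependence of $\Sigma_i$ on $\rho$ and the prior component $\pi(\rho)$ in the PrSS derivation. However, an additional layer of uncertainty on $\rho$ may be incorporated by replacing the point mass in $\pi(\rho)$ by a continuous probability distribution.
For phase 2 data, Model (1) is
$$\text{Likelihood: } Y_2 \mid \mu \sim N(\mu, \Sigma_2), \qquad \text{Prior: } \mu \sim N(m_0, V_0), \tag{2}$$
$$\text{Posterior: } \mu \mid Y_2 \sim N(m_2, V_2), \qquad V_2 = (\Sigma_2^{-1} + V_0^{-1})^{-1}, \qquad m_2 = w_2 Y_2 + (I_{2\times 2} - w_2)\,m_0, \tag{3}$$
where $w_2 = V_2\Sigma_2^{-1}$ is the weight on phase 2 data $Y_2$, and $I_{2\times 2}$ is the 2-by-2 identity matrix.
Let $S$ be the set of $Y_3$ values that conclude a study success in phase 3, for example, $S = \{Y_3 : Y_{31} < \log c_1 \text{ or } Y_{32} < \log c_2\}$ for the dual-endpoint success criterion introduced above. The posterior parameter distribution for $\mu$ in phase 2 becomes the prior for $\mu$ in phase 3. The phase 3 model and PrSS are:
$$\text{Likelihood: } Y_3 \mid \mu \sim N(\mu, \Sigma_3), \qquad \text{Prior from phase 2 in (3): } \mu \mid Y_2 \sim N(m_2, V_2),$$
$$\text{PPD: } Y_3 \mid Y_2 \sim N(m_2, \Sigma_3 + V_2), \qquad \text{PrSS} = \Pr(Y_3 \in S \mid Y_2) = \int_S f(y_3 \mid Y_2)\,dy_3. \tag{4}$$
The derivation of (4) can be found in supplementary materials. Due to the convenient form of the posterior predictive distribution (PPD) in (4), it is natural to adopt a Monte Carlo approximation to the PrSS for generally defined study success $S$:
$$\text{PrSS} \approx \frac{1}{B}\sum_{b=1}^{B} 1\{Y_3^{(b)} \in S\}, \tag{5}$$
where $Y_3^{(b)}$ is the $b$th sample drawn from the PPD, with a sufficiently large $B$ to approximate the integral. For a target PrSS level, one can work backward to determine the required phase 2 PFS and OS HRs (or one given the other), that is, the CSF cutoff of phase 2 for triggering a phase 3 study.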
Note this general PrSS assessment framework is not restricted to the bivariate case (e.g., PFS/OS), and the results can be generalized to J endpoints if the multivariate normal distribution is assumed on appropriate scales (e.g., log HR).
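The derivation of the posterior in (3) and the PPD in (4) is deferred to the supplementary materials. As a reading aid, the following is a brief sketch of the standard conjugate-normal argument that leads to these forms, not a reproduction of the supplementary derivation. The error term $\varepsilon_3$ is an auxiliary symbol introduced only for this sketch.

```latex
\begin{align*}
\pi(\mu \mid Y_2)
  &\propto \exp\!\Big\{-\tfrac12 (Y_2-\mu)'\Sigma_2^{-1}(Y_2-\mu)\Big\}\,
           \exp\!\Big\{-\tfrac12 (\mu-m_0)'V_0^{-1}(\mu-m_0)\Big\} \\
  &\propto \exp\!\Big\{-\tfrac12\Big[\mu'(\Sigma_2^{-1}+V_0^{-1})\mu
           - 2\mu'(\Sigma_2^{-1}Y_2 + V_0^{-1}m_0)\Big]\Big\}, \\[4pt]
\Rightarrow\quad V_2 &= (\Sigma_2^{-1}+V_0^{-1})^{-1}, \qquad
m_2 = V_2(\Sigma_2^{-1}Y_2 + V_0^{-1}m_0)
    = w_2 Y_2 + (I - w_2)\,m_0, \quad w_2 = V_2\Sigma_2^{-1}, \\[4pt]
Y_3 &= \mu + \varepsilon_3, \quad \varepsilon_3 \sim N(0,\Sigma_3)
      \ \text{independent of}\ \mu \mid Y_2 \sim N(m_2, V_2)
\ \Rightarrow\ Y_3 \mid Y_2 \sim N(m_2,\ \Sigma_3 + V_2).
\end{align*}
```

The identity $I - w_2 = V_2 V_0^{-1}$ follows from $V_2(\Sigma_2^{-1} + V_0^{-1}) = I$, which is why the prior mean enters with weight $V_2V_0^{-1}$.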

Accounting for Prior-Data Conflict
When a diffuse prior is adopted for $\mu$ to represent vague information, for example, $m_0 = (0, 0)$ and $V_0 = 100\,I_{2\times 2}$, the PrSS calculation is primarily driven by phase 2 data. On the other hand, if an informative prior for $\mu$ is adopted, such as a prior constructed from similar historical trials or a completed phase 1b trial, it can have a substantial impact on the PrSS calculation. This is particularly noticeable when the number of PFS or OS events is relatively small, for example, at an interim analysis. One approach to discount the historical data is the power prior (Ibrahim and Chen 2000), which replaces the prior $\pi(\mu)$ in (2) with
$$\pi(\mu \mid d) \propto g(\mu; m_0, V_0)^{d},$$
where $g(\cdot\,; m, V)$ denotes the normal density with mean $m$ and covariance $V$. The power parameter, $d \in [0, 1]$, discounts the influence of historical information and can be prescribed prior to any analysis (e.g., Cho et al. 2018). Discounting historical data under a normal density via the power prior is equivalent to inflating the prior variance $V_0$ to $V_0/d$ for $d \in (0, 1]$. The power prior can include multiple data sources as prior components with the expression
$$\pi(\mu) \propto \prod_{k=1}^{K} g(\mu; m_{k0}, V_{k0})^{d_k},$$
where $(m_{k0}, V_{k0})$ are the mean and covariance matrix for the $k$th component. For a diffuse prior component, $d_k = 1$ is used without further attenuating it. For informative prior components, the discount factors $d_k \in [0, 1]$ are to be prescribed with a rationale or estimated using hierarchical priors, which incur additional hyperparameters. An initial check for prior-data conflict using $\chi^2$-statistics can be found in supplementary materials. An alternative approach is to extend the normal prior density (2), or approximate more general distributions (e.g., Schmidli et al. 2014), with a mixture of normal densities:
$$\pi(\mu) = \sum_{k=1}^{K} \pi_k\, g(\mu; m_{k0}, V_{k0}), \tag{6}$$
where $\pi_k$ is the prior weight for each density with $\sum_{k=1}^{K} \pi_k = 1$. A common choice is a normal mixture density with $K = 2$ components: a diffuse prior for $k = 1$ with rather dispersed $V_{10}$ and typically neutral mean $m_{10} = 0$ (or hazard ratio = 1), and an informative prior for $k = 2$ with $m_{20}$ and $V_{20}$ extracted from historical trials.
Under the phase 2 data likelihood $Y_2 \sim N(\mu, \Sigma_2)$ and the mixture prior in (6), the phase 2 posterior distribution is a normal mixture:
$$\pi(\mu \mid Y_2) = \sum_{k=1}^{K} \tilde\pi_k\, g(\mu; m_{k2}, V_{k2}), \tag{7}$$
with posterior weight adjusted by the phase 2 observed data $Y_2$,
$$\tilde\pi_k = \frac{\pi_k\, g(Y_2; m_{k0}, \Sigma_2 + V_{k0})}{\sum_{l=1}^{K} \pi_l\, g(Y_2; m_{l0}, \Sigma_2 + V_{l0})}, \tag{8}$$
where $g(y; \mu, \Sigma)$ is the normal density with mean $\mu$ and covariance $\Sigma$ evaluated at $y$, and $(m_{k2}, V_{k2})$ are obtained from (3) with $(m_0, V_0)$ replaced by $(m_{k0}, V_{k0})$. The derivation of (7) can be found in supplementary materials. The posterior mixture in (7) will then be used as the prior distribution for phase 3, without further updating $\tilde\pi_k$ based upon phase 3 data, to preserve the convenient form of the phase 3 PPD for PrSS calculation. $\tilde\pi_k$ will be treated as "hyperparameters" in a prior specification. Correspondingly, the phase 3 PPD is
$$f(Y_3 \mid Y_2) = \sum_{k=1}^{K} \tilde\pi_k\, g(Y_3; m_{k2}, \Sigma_3 + V_{k2}). \tag{9}$$
It is worth noting that if the prior weight $\pi_k = 1$, $\tilde\pi_k$ in (9) will not be updated by $Y_2$ and the mixture prior (6) reduces to a single normal prior being the $k$th component, as discussed in the preceding paragraphs. In practice, the $\pi_k$'s can be prescribed and fixed without being adjusted by the phase 2 data, if there are reasonable choices. This is particularly recommended for a diffuse prior component, as its adjusted weight by the phase 2 data in (8) can be almost zero due to its negligible probability mass over the possible range. More discussion on the prescribed weights for the mixture priors by empirically checking prior-data conflict can be found in supplementary materials.
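To make the remark about the diffuse component concrete, the following minimal sketch evaluates the data-adjusted mixture weights in (8) for a two-component prior in the bivariate case. The inputs are illustrative assumptions only: the event counts (47 PFS, 23 OS), the correlation 0.643, and the historical mean and variances are borrowed from the illustration later in the article, and the observed HR pair (0.75, 0.80) and the equal prior weights are hypothetical. The helper names `bvn_pdf` and `posterior_weights` are not from the original work.

```python
import math

def bvn_pdf(y, mean, cov):
    """Bivariate normal density g(y; mean, cov) for a 2x2 covariance."""
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    quad = (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posterior_weights(y2, sigma2, components):
    """Update mixture weights as in (8); components = [(weight, mean, cov), ...]."""
    marginals = []
    for weight, mean, cov in components:
        # Marginal (prior predictive) covariance of Y2 under this component
        pred = [[sigma2[i][j] + cov[i][j] for j in range(2)] for i in range(2)]
        marginals.append(weight * bvn_pdf(y2, mean, pred))
    total = sum(marginals)
    return [m / total for m in marginals]

rho = 0.643                                   # assumed correlation (illustration value)
s1, s2 = 4.0 / 47, 4.0 / 23                   # phase 2 variances, 4 / number of events
sigma2 = [[s1, rho * math.sqrt(s1 * s2)], [rho * math.sqrt(s1 * s2), s2]]

diffuse = (0.5, (0.0, 0.0), [[100.0, 0.0], [0.0, 100.0]])
c0 = rho * math.sqrt(0.065 * 0.039)           # historical covariance (assumed inputs)
informative = (0.5, (-0.323, -0.322), [[0.065, c0], [c0, 0.039]])

y2 = (math.log(0.75), math.log(0.80))         # hypothetical observed phase 2 log HRs
w_diffuse, w_informative = posterior_weights(y2, sigma2, [diffuse, informative])
print(w_diffuse, w_informative)
```

With these inputs the diffuse component receives a weight well below 1%, because its density is spread over a very wide range. This is the behavior that motivates prescribing and fixing the weight of a diffuse component.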

Marginalization in PrSS Calculation
Frequently, a marginal CSF based on a subset of multiple endpoints is desired for early phase decision making even when the PrSS of the phase 3 depends on additional endpoints. For example, the decision to begin a phase 3 study may be made when the OS data information is highly immature, and PrSS evaluation will depend primarily on PFS data information that is available from phase 2. On the other hand, study success can depend on a subset of the jointly modeled endpoints. In both cases, a subset of endpoints will be marginalized out from the joint distribution.
For marginalization in phase 3 of $J$ jointly modeled endpoints, let $Y_3^* = (Y_{3j_1}, \ldots, Y_{3j_L})'$, $L \le J$, be the subset of $L$ endpoints that determine the study success $S$. Write $Y_3^* = EY_3$ with the $L \times J$ projection matrix $E = (e_1', \ldots, e_L')'$, where the $1 \times J$ vector $e_l$ has all entries zero except the $j_l$th entry being 1. Using (4) and the known properties of the normal distribution, $Y_3^* = EY_3 \sim N(Em_2, E\Sigma E')$ with $\Sigma = \Sigma_3 + V_2$. Here $Em_2 = (m_{2j_1}, \ldots, m_{2j_L})' \equiv m_2^*$ collects the rows of $m_2$ that correspond to $\{j_1, \ldots, j_L\}$, and $\Sigma^* = E\Sigma E'$ is the corresponding $L \times L$ subblock of $\Sigma$.
Next, for marginalization in phase 2, we decompose the data $Y_2$ into $Y_2^*$, with $M < J$ endpoints that are available for making the go/no go decision in phase 2, and its complementary set $Y_2^c$, with $J - M$ endpoints that are unobserved or carry little information in the current trial. We present three different prior choices, namely the Imputation, Surrogate, and Singular priors, to cope with such unobserved endpoints when deriving a marginal CSF cutoff from the observed ones in phase 2.

Imputation Prior
Note the mean $m_2 = w_2Y_2 + (I_{J\times J} - w_2)\,m_0$ of the PPD in (4) is linear in the phase 2 data, $Y_2$. After reordering and partitioning $Y_2 = (Y_2^{*\prime}, Y_2^{c\prime})'$, the mean $m_2^*$ can be further written as a linear function of $Y_2^c$:
$$m_2^* = a + b\,Y_2^c, \tag{10}$$
with $a = EV_2V_0^{-1}m_0 + EG^*Y_2^*$ and $b = EG^c$, where $G^*$ consists of the columns of $V_2\Sigma_2^{-1}$ that correspond to $Y_2^*$, and $G^c$ of those that correspond to its complement. To obtain the CSF cutoff for $Y_2^*$, one can integrate out the unobserved endpoints $Y_2^c$ in phase 2 in (10) with a certain prior distribution on $Y_2^c$. The flat prior $\pi(Y_2^c) \propto 1$ leads to an improper marginal distribution. However, under the proper prior $Y_2^c \sim N(m_c, V_c)$, the marginal PPD based on $Y_2^*$ can be written as
$$Y_3^* \mid Y_2^* \sim N\big(a + b\,m_c,\ \Sigma^* + b\,V_c\,b'\big), \tag{11}$$
where $\Sigma^* = E(\Sigma_3 + V_2)E'$. The derivation of (11) can be found in supplementary materials. The PrSS can be evaluated by sampling $Y_3^*$ from the marginal PPD in (11) and checking whether the samples fall in $S$ for Monte Carlo approximation.
The analytical approach to integrate out the unobserved endpoints in (11) can be sensitive to the prior choice, $N(m_c, V_c)$, for $Y_2^c$. While a diffuse prior, $N(m_0, V_0)$, for the mean is generally preferable for providing a data-driven PrSS and CSF cutoff for phase 2, a vague choice of $N(m_c, V_c)$ for the unobserved endpoints is not recommended for the marginalization because (11) can be dominated by the dispersed covariance matrix $V_c$. Therefore, an informative prior for $Y_2^c$ (i.e., a certain amount of knowledge on the unobserved endpoints in phase 2), upon availability, can be used to produce a more reasonable marginal CSF. A convenient choice is the conditional specification
$$Y_2^c \mid Y_2^* \sim N(A_c + B_cY_2^*,\ V_c),$$
where $A_c$, $B_c$ along with $V_c$ can be externally acquired using historical records of $(Y_2^c, Y_2^*)$ for linear regression, with the actual sample size or number of events for each study as weights. For example, when $Y_2^*$ is for PFS and $Y_2^c$ for OS, all of $A_c$, $B_c$ and $V_c$ become scalars and the model reduces to a simple linear regression; when $Y_2^*$ further incorporates ORR, the log hazard ratio of PFS and the log odds ratio of ORR can be jointly modeled via multivariate regression. Since this approach seemingly "imputes" the unobserved $Y_2^c$ using available data on $Y_2^*$, we refer to it as the "Imputation prior" approach for obtaining a marginal CSF for $Y_2^*$ in phase 2.
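The derivation of (11) is left to the supplementary materials. The short sketch below shows one way to see the result, using only the linear form (10) and the laws of total expectation and total variance. It is offered as a reading aid under the notation of this section, not as the supplementary derivation itself.

```latex
\begin{align*}
Y_3^* \mid Y_2^*, Y_2^c &\sim N\big(a + b\,Y_2^c,\ \Sigma^*\big),
  \qquad Y_2^c \sim N(m_c, V_c), \\[4pt]
E\big(Y_3^* \mid Y_2^*\big)
  &= E\big\{E(Y_3^* \mid Y_2^*, Y_2^c)\big\} = a + b\,m_c, \\[4pt]
\operatorname{Var}\big(Y_3^* \mid Y_2^*\big)
  &= E\big\{\operatorname{Var}(Y_3^* \mid Y_2^*, Y_2^c)\big\}
   + \operatorname{Var}\big\{E(Y_3^* \mid Y_2^*, Y_2^c)\big\}
   = \Sigma^* + b\,V_c\,b'.
\end{align*}
```

Because the model is linear and Gaussian in $Y_2^c$, the marginal distribution remains normal, so these two moments determine it. The term $b\,V_c\,b'$ makes explicit why a very dispersed $V_c$ can dominate the predictive variance.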

Surrogate Prior
Alternatively, the conditional prior can be specified through the mean: $\pi(\mu) = \pi(\mu_c \mid \mu_*)\,\pi(\mu_*)$, where $\mu_c$ and $\mu_*$ are the corresponding means of $Y_2^c$ and $Y_2^*$. For example, the surrogate prior approach (Saint-Hilary et al. 2019) adopted a linear representation for $\pi(\mu_c \mid \mu_*)$:
$$\mu_c \mid \mu_* \sim N(A_s + B_s\mu_*,\ V_s),$$
with $(A_s, B_s, V_s)$ obtained via Bayesian meta-analysis. The study success $S$ in Saint-Hilary et al. (2019) is based on the phase 3 version of $Y_2^c$ (unobserved in phase 2) as "final" endpoints, and $Y_2^*$ corresponds to "surrogate" endpoints which are not involved in $S$. In order to extend the surrogate prior approach to generally defined study success $S$ based on multiple endpoints, it is convenient to write the corresponding joint distribution $\mu \sim N(m_2, V_2)$ with
$$m_2 = \begin{pmatrix} m_p^* \\ A_s + B_s m_p^* \end{pmatrix}, \qquad V_2 = \begin{pmatrix} V_p^* & V_p^* B_s' \\ B_s V_p^* & B_s V_p^* B_s' + V_s \end{pmatrix},$$
where $m_p^*$ and $V_p^*$ are the posterior mean and variance for $\mu_*$ based on the phase 2 data $Y_2^*$ and prior $\pi(\mu_*)$. Therefore, the subsequent steps for PrSS based on generally defined $S$ proceed as before.
For the example using PFS ($j = 1$) and OS ($j = 2$) as endpoints, suppose the historical data consist of $n$ studies with log HR values $(x_{r1}, x_{r2})$ for PFS and OS, and number of events $n_{rj}$, $r = 1, \ldots, n$, $j = 1, 2$. The Bayesian meta-analysis model is
$$(x_{r1}, x_{r2})' \mid \xi_r^*, \xi_r^c \sim N\big((\xi_r^*, \xi_r^c)',\ \Sigma_r\big), \qquad \xi_r^c \mid \xi_r^* \sim N(A_s + B_s\xi_r^*,\ V_s).$$
The study-specific covariance $\Sigma_r$ is assumed known or is constructed (using the correlation $\rho$ from a bivariate normal model fitted to all $n$ studies, and $4/n_{rj}$ as the variance for the individual endpoint in each study). Since the meta-analysis model is fitted to the historical data without preceding information, diffuse priors can be used for the parameters in the meta-analysis model, for example, $N(0, 10^2)$ for $\xi_r^*$, $A_s$, $B_s$, and Inverse-Gamma(0.001, 0.001) for $V_s$. The posterior estimates of $(A_s, B_s, V_s)$ are used in the surrogate prior approach.

Singular Prior
A third approach to obtain a marginal CSF for $Y_2^*$ is to allow the number of effective phase 2 events used for PrSS calculation to be zero for the index $j$ that corresponds to the unobserved endpoints $Y_2^c$. Consequently, $\sigma_{2j}^{-2} = n_{2j}/4 = 0$ and the phase 2 data precision matrix $\Sigma_2^{-1}$ is singular, with zero entries for all rows and columns that involve $Y_2^c$, yet the prior covariance matrix $V_0$ helps circumvent the singularity of the posterior $V_2$ required for PrSS calculation. As desired, the phase 2 data input for phase 3 PrSS calculation through $\Sigma_2^{-1}Y_2$ depends completely on $Y_2^*$. Because this approach allows a singular precision matrix for phase 2 data, we refer to it as the "Singular prior." Hong and Shi (2012) covered examples with $J = 2$ endpoints PFS/OS and study success $S$ that depends on OS efficacy.
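The Singular prior mechanics can be sketched for the PFS/OS case as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the (1,1) entry of the singular phase 2 precision matrix is taken to be $\sigma_{21}^{-2}/(1-\rho^2)$, following the weight quoted for this prior in the illustration, and the prior $(m_0, V_0)$ is filled with the historical estimates reported there purely as example inputs. The function names, the observed PFS HR of 0.75, and the choice of OS-only success are hypothetical.

```python
import math
from statistics import NormalDist

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists or tuples."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def singular_prior_posterior(y21, n21, rho, m0, v0):
    """Posterior (m2, V2) when the OS endpoint contributes zero phase 2 events."""
    # Singular phase 2 precision: only the PFS entry is nonzero (assumed form)
    prec2 = [[(n21 / 4.0) / (1.0 - rho ** 2), 0.0], [0.0, 0.0]]
    p0 = inv2(v0)                                   # prior precision
    P = [[prec2[i][j] + p0[i][j] for j in range(2)] for i in range(2)]
    V2 = inv2(P)                                    # posterior covariance, as in (3)
    rhs = [prec2[0][0] * y21 + p0[0][0] * m0[0] + p0[0][1] * m0[1],
           p0[1][0] * m0[0] + p0[1][1] * m0[1]]
    m2 = [V2[0][0] * rhs[0] + V2[0][1] * rhs[1],
          V2[1][0] * rhs[0] + V2[1][1] * rhs[1]]    # posterior mean, as in (3)
    return m2, V2, P

rho = 0.643
c0 = rho * math.sqrt(0.065 * 0.039)                 # illustrative informative prior
m2, V2, P = singular_prior_posterior(
    y21=math.log(0.75), n21=47, rho=rho,
    m0=(-0.323, -0.322), v0=[[0.065, c0], [c0, 0.039]])

det_P = P[0][0] * P[1][1] - P[0][1] * P[1][0]
# PrSS for OS-only success with c2 = 0.824 at 470 phase 3 OS events
prss_os = NormalDist().cdf(
    (math.log(0.824) - m2[1]) / math.sqrt(4.0 / 470 + V2[1][1]))
print(m2, V2[1][1], P[0][0] / det_P, prss_os)
```

The printed posterior variance of the OS mean equals $P_{11}/|P|$, and the posterior OS mean moves with the observed PFS HR only through the off-diagonal entries of the informative prior. This is why the Singular prior needs informative $(m_0, V_0)$.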
In summary, the BAMBOO framework for PrSS evaluation with multiple endpoints and general success criteria can be used with: diffuse priors for data-driven decisions, informative priors based on historical data, and power or mixture priors to include and then discount historical data. To obtain a marginal CSF for the available phase 2 endpoints, the BAMBOO approach also adopts: the Singular prior, which erases the unobserved endpoints by allowing a singular precision matrix; the Surrogate prior, which adopts a conditional prior specification on the mean parameters; and the Imputation prior, which incorporates the predictive distribution based on the data model. Operating characteristics of these prior choices will be elaborated through examples in the subsequent sections.

PrSS Evaluation based on Different Definitions of Study Success
One merit of the proposed PrSS evaluation framework is that it allows a flexible definition of study success. Example PrSS assessment results using (5) for four different definitions of study success are presented in Figure 1. Except (a), all examples were calculated using a diffuse prior for the mean such that the PrSS profiles were primarily driven by the phase 2 data, each a pair of PFS/OS observations taken from a 100 × 100 grid over the study range to interpolate the PrSS surface. The setting for the number of events is the same as the interim analysis in the illustration in Section 3. For example (a), the normal mixture prior is elicited with equal prior weights on one component retrieved from historical data, as will be described in Section 3.2, and another component centered around the neutral mean (0, 0) for the log HR but with the same variance. This specific mixture prior was chosen only for illustrative purposes in Figure 1(a). In this case, if a diffuse prior is adopted for the mean (figure not shown), the PrSS for phase 3 OS does not visibly rely on phase 2 PFS because $V_2$ is dominated by $\Sigma_2$ under a diffuse prior with rather dispersed $V_0$ in (3). As a result, $w_2 = V_2\Sigma_2^{-1}$ is close to the identity matrix with off-diagonal entries zero; hence, the mean $m_{2j}$ of the PPD for $Y_{3j}$-based PrSS mainly depends on $Y_{2j}$ for individual endpoint $j$, regardless of the correlation between those endpoints (log HR).
Figure 1. Example of PrSS under generally defined study success.

Marginalization Example: Determine Phase 2 PFS CSF for Phase 3 OS Success
Consider $S = \{Y_{32} : \log \text{OS HR} < \log c_2\}$, that is, study success solely depends on the phase 3 OS HR. Note that, based on the PPD (4), the marginal distribution of each phase 3 endpoint $j$ is
$$Y_{3j} \mid Y_2 \sim N\big(m_{2j},\ \sigma_{3j}^2 + v_{2j}^2\big),$$
where $\sigma_{3j}^2$ and $v_{2j}^2$ are the $j$th diagonal entries of $\Sigma_3$ and $V_2$, respectively. Let $\Phi(\cdot)$ denote the cumulative distribution function (cdf) of the standard normal distribution. Under the univariate case of study success $S$ which only depends on the $j$th endpoint, the PrSS in (5) reduces to
$$\text{PrSS} = \Phi\!\left(\frac{\log c_j - m_{2j}}{\sqrt{\sigma_{3j}^2 + v_{2j}^2}}\right). \tag{13}$$
As an example, for phase 3 OS ($j = 2$) success, $E = (0, 1)$ for (10), the PrSS in (13) is equivalent to the predictive power (PP) in Hong and Shi (2012). To see this, note that from (3), $V_2 = P^{-1}$ with $P = \Sigma_2^{-1} + V_0^{-1}$, so that the predictive variance of $Y_{32}$ is $\sigma_{32}^2 + v_{22}^2 = \sigma_{32}^2 + P_{11}/|P|$. This yields the predictive power in Hong and Shi (2012), however, without additionally assuming a singular phase 3 precision matrix $\Sigma_3^{-1}$ with all entries 0 except the (2, 2)-entry $\sigma_{32}^{-2}$, as in the proof therein.
In this example, to achieve PrSS = $\beta$ based on phase 3 OS success, the desired phase 2 CSF cutoff for the log PFS HR $Y_{21}^*$ has a closed-form expression for all three marginalization methods, as shown in Table 1. The derivation can be found in supplementary materials.
The CSF cutoff of the phase 2 PFS HR for phase 3 OS success under the different priors in Table 1 can generally differ, as the Singular prior relies on an informative prior specification via $(m_0, V_0)$ for the mean, that is, absolute location information. The CSF for the Surrogate/Imputation prior, on the other hand, can allow a diffuse prior choice of $(m_0, V_0)$, yet it depends on reasonable prior information about the relationship between means, that is, relative location information via regression. The choice of CSF calculation depends on the extent to which prior knowledge is available and incorporated into the PrSS calculation, yet the specification of a singular precision matrix by allowing zero events for the Singular prior can undermine the asymptotic log-normal distribution assumption for the hazard ratios. The CSF cutoff values using different priors will be further compared in the following illustrative examples.

Illustration
In this section, we illustrate the steps to implement the proposed PrSS assessment framework with different prior choices using the PFS/OS example. Assume a novel immuno-oncology (IO) agent has demonstrated a reasonable safety profile and preliminary efficacy in a phase 1 study. A randomized phase 2 study in lung cancer was initiated to further evaluate the efficacy and safety. The phase 2 study randomized 100 patients to the IO agent versus standard of care in a 1:1 ratio. The primary endpoint is PFS and the key secondary endpoints are ORR and OS in phase 2. The interim analysis for collecting data on these endpoints is planned when the first 100 patients have been followed for 6 months. The final analysis is planned when approximately 70% of patients have experienced OS events. The sponsor would like to establish the CSF in ORR, PFS and OS for triggering a phase 3 registration study where the primary endpoints may include both OS and PFS. It is worth noting that the decision to start a phase 3 registrational study is a complex process that includes commercial assessment, regulatory calibration and competitive landscape investigation. We focus on the technical perspective of PrSS evaluation.

Define the Phase 3 Success Boundaries
The first step is to determine the phase 3 study design and the definition of a positive phase 3 study. In this example, a phase 3 study with dual endpoints of PFS and OS is successful when at least one endpoint reaches statistical significance. Assume the target values of 0.65 for PFS HR and 0.75 for OS HR are determined to be clinically meaningful for registration. Moreover, assume the one-sided Type I error 0.025 is split into 0.005 for PFS and 0.02 for OS. The power associated with PFS is 90% and with OS is 85%. The phase 3 study has two interim looks, at data information fractions of 50% and 75%, before the final analysis. This setting translates into hazard ratio boundaries of 0.75 and 0.824 for PFS and OS, respectively (the minimum detectable difference that yields statistical significance). As a result, the phase 3 study with dual endpoints can be set as PFS with critical HR $c_1 = 0.75$ (at 321 PFS events) and OS with critical HR $c_2 = 0.824$ (at 470 OS events). The phase 3 success is defined as $\text{PFS HR}_3 \le 0.75$ or $\text{OS HR}_3 \le 0.824$. Note that, as shown by the sensitivity analysis in the Supplementary Materials, the α-splitting (e.g., $\alpha_1 = 0.005$ for PFS and $\alpha_2 = 0.02$ for OS) within a reasonable range (e.g., $\alpha_1$ from 0.0025 to 0.02), while holding the overall α/power/effect size unchanged, does not have a substantial impact on the critical HR boundaries or on the decision making from the subsequent BAMBOO analysis in this example.
Table 1. Marginal CSF for phase 2 PFS to achieve phase 3 OS success.
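As a rough plausibility check on the critical HR boundaries, the sketch below computes the minimum detectable HR for a single-look 1:1 design using the asymptotic variance $4/n$ of the log HR. This is a simplifying assumption: it ignores the two interim looks of the group sequential design, so it is expected to land close to, but not exactly on, the boundaries quoted in the text. The function name `critical_hr` is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

def critical_hr(alpha_one_sided, n_events):
    """Largest observed HR that is still significant in a single-look 1:1 design.

    Uses Var(log HR) ~ 4 / n_events, so significance requires
    log HR < -z_{1 - alpha} * 2 / sqrt(n_events).
    """
    z = NormalDist().inv_cdf(1.0 - alpha_one_sided)
    return exp(-2.0 * z / sqrt(n_events))

c1_approx = critical_hr(0.005, 321)   # PFS: alpha_1 = 0.005 at 321 events
c2_approx = critical_hr(0.02, 470)    # OS:  alpha_2 = 0.02 at 470 events
print(round(c1_approx, 3), round(c2_approx, 3))
```

Under this approximation the PFS boundary is about 0.75 and the OS boundary about 0.827. The small gap from the reported 0.824 for OS is consistent with the extra stringency that a group sequential final boundary imposes.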

Determine the Correlation between PFS and OS (log HR)
As a second step, since BAMBOO models the joint predictive distribution of PFS and OS, it is important to accurately estimate the correlation between PFS and OS (log HR). Multiple publications in the past decade have discussed how to evaluate this correlation with meta-analysis. The historical studies used in this example are listed in Table 2. Note that when applying a meta-analytic prior to a specific study, it is necessary to evaluate the selection of relevant historical studies. The enrollment criteria and baseline characteristics need to be similar between the historical studies and the future phase 3 study. In addition, it is critical that the mechanism of action of the investigated molecule, and of the comparator molecule, be comparable to the mechanisms of action in the historical trials.
We fit a bivariate normal model using weighted likelihood, with higher weights on studies with a larger number of PFS or OS events observed. However, many of the historical studies did not report the number of events, and among those that did, most PFS and OS analyses occurred when events had accrued in approximately 70% and 60% of the total population, respectively. Due to this proportionality, we specify the weights as the number of patients randomized in each study. More specifically, let $w_r = nN_r / \sum_{r=1}^{n} N_r$, with $N_r$ the number of patients from the $r$th study with observed log HR values $x_{rj}$ of PFS ($j = 1$) and OS ($j = 2$). The estimated parameter $\theta = (m_{01}, m_{02}, \sigma_{01}, \sigma_{02}, \rho)$ is found by maximizing the likelihood $L = \prod_{r=1}^{n} L_r^{w_r}$, or equivalently, the weighted log-likelihood $\ell = \sum_{r=1}^{n} w_r \ell_r$, where $\ell_r = \log L_r$ is the bivariate normal log density of $(x_{r1}, x_{r2})$ with mean $(m_{01}, m_{02})$, variances $\sigma_{01}^2$ and $\sigma_{02}^2$, and correlation $\rho$. The PFS and OS HR data from 18 studies, with the fitted normal model and the corresponding 95% ellipses for the data (outer ellipse) and the mean (inner ellipse), are shown in Figure 2. The estimated means of the log HR are $m_{01} = -0.323$ and $m_{02} = -0.322$, that is, the HR is around 0.724 for both PFS and OS. The estimated variances for the PFS/OS log HR data are $\sigma_{01}^2 = 0.065$ and $\sigma_{02}^2 = 0.039$, and the estimated correlation under the weighted likelihood is $\rho = 0.643$.
However, not all tumor settings have a rich literature to evaluate the correlation between PFS HR and OS HR. We illustrate an alternative way to derive the correlation between PFS and OS HR when historical information is sparse. The alternative method is a simulation-based approach with the assumption that an individual patient's OS = PFS + PPS (Imai, Kaira, and Minato 2017), where PPS is the post-progression survival. The simulation procedure involves four steps as follows:
1. Simulate PFS of both control and treatment arms with proper assumptions (such as exponential distribution);
2. Simulate PPS of the control arm, based on historical data or best clinical judgment;
3. Simulate the PPS HR and hence the PPS of the treatment arm;
4. Calculate the OS time in both control and treatment arms, then derive the HR.
Key assumptions include: (a) the PFS HR is log-normally distributed with mean and variance retrieved from historical data; (b) for the control arm, PFS follows an exponential distribution with the historical median, and PPS follows a zero-inflated exponential distribution with the historical median for the exponential part and probability $q$ of being exactly zero, that is, OS = PFS when the event is death rather than disease progression; (c) the log HR of PPS follows a normal distribution with specified mean 0 and variance $\nu^2$. Hence, the clinical benefit is primarily on OS via PFS, and $\nu^2$ is chosen such that the resulting OS (log HR) matches the historical variability; (d) simulated trials have population size $N$ as in the design. The simulation parameters are chosen based on the historical information, for example, a small $q$ to reflect that the median OS was highly associated with the median PPS, but not with the median PFS, in lung cancer (Imai, Kaira, and Minato 2017). One example of such a simulated cloud of PFS/OS HR pairs with $\nu^2 = 0.05$, $q = 0.1$ and $N = 100$ is shown in Figure 3. The estimated correlation between the simulated PFS/OS log HR pairs is around 0.634, which is similar to the estimated correlation from historical data with the meta-analysis approach.
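The four simulation steps and assumptions (a) to (d) can be sketched as below. Several choices here are assumptions made for illustration and are not stated in the text: there is no censoring, the control-arm medians are 8 months for PFS and a hypothetical 14 months for PPS, and the HR in each simulated trial is estimated with the one-step (Peto) log-rank estimator rather than a Cox model. The resulting correlation therefore need not match the value reported for Figure 3.

```python
import math
import random

def peto_log_hr(times, arm):
    """One-step (Peto) log HR from uncensored event times; arm[i] = 1 for treatment."""
    order = sorted(range(len(times)), key=times.__getitem__)
    n_risk, n_trt = len(times), sum(arm)
    o_minus_e = var = 0.0
    for i in order:
        e = n_trt / n_risk            # expected treatment share of this event
        o_minus_e += arm[i] - e
        var += e * (1.0 - e)
        n_risk -= 1                   # the subject leaves the risk set
        n_trt -= arm[i]
    return o_minus_e / var

def simulate_pairs(n_trials=2000, n_pat=100, med_pfs=8.0, med_pps=14.0,
                   mean_lhr=-0.323, var_lhr=0.065, nu2=0.05, q=0.1, seed=7):
    """Return simulated (PFS log HR, OS log HR) estimates, one pair per trial."""
    rng = random.Random(seed)
    arm = [0] * (n_pat // 2) + [1] * (n_pat - n_pat // 2)
    pairs = []
    for _ in range(n_trials):
        hr_pfs = math.exp(rng.gauss(mean_lhr, math.sqrt(var_lhr)))  # assumption (a)
        hr_pps = math.exp(rng.gauss(0.0, math.sqrt(nu2)))           # assumption (c)
        pfs, os_time = [], []
        for a in arm:
            rate_pfs = math.log(2) / med_pfs * (hr_pfs if a else 1.0)
            rate_pps = math.log(2) / med_pps * (hr_pps if a else 1.0)
            t = rng.expovariate(rate_pfs)                           # step 1
            # steps 2-3: zero-inflated exponential PPS (death without progression)
            pps = 0.0 if rng.random() < q else rng.expovariate(rate_pps)
            pfs.append(t)
            os_time.append(t + pps)                                 # step 4
        pairs.append((peto_log_hr(pfs, arm), peto_log_hr(os_time, arm)))
    return pairs

def correlation(pairs):
    """Pearson correlation of the paired log HR estimates."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

pairs = simulate_pairs()
rho_sim = correlation(pairs)
print(rho_sim)
```

In practice, `med_pps`, `nu2`, and `q` would be tuned so that the simulated OS log HR variability matches historical data, as described in assumption (c).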

Evaluate the Phase 3 PrSS and Phase 2 CSF in PFS and OS
Next, recall that the phase 2 study is ongoing and one interim analysis is planned before the final analysis. The objective is to determine the CSF at the interim and final analyses with phase 3 success defined in Section 3.1. For a given enrollment rate and assumed median PFS and OS, one can easily obtain the number of PFS and OS events at a specific time point. For illustrative purposes, we assume that PFS and OS follow exponential distributions, the median PFS is about 8 months, the median OS is around 22 months, and the enrollment rate is 10 patients per month. Based on simulating the phase 2 trial with 100 patients, the estimated numbers of PFS and OS events are 47 and 23 at the interim analysis and 87 and 68 at the final analysis of the phase 2 study. Plugging in the estimated correlation $\rho = 0.643$ between PFS and OS (log HR) from Section 3.2 and these numbers of events for the covariance matrices, one can calculate $\Pr(\text{PFS HR}_3 < 0.75 \text{ or } \text{OS HR}_3 < 0.824 \mid \text{PFS HR}_2, \text{OS HR}_2)$ given an observed pair of $\text{PFS HR}_2$ and $\text{OS HR}_2$. Assuming a diffuse prior $N(0, 10^2 I_{2\times 2})$ for the mean $\mu$, the PrSS assessment results are illustrated by the contour plots in Figures 4 and 5 for the interim and final analyses, respectively. The horizontal axis corresponds to the PFS HR, and the vertical axis refers to the OS HR in the phase 2 study. The white solid lines indicate PFS HR at 0.75 (vertical) and OS HR at 0.824 (horizontal), where the phase 3 study success criterion is met (i.e., either the phase 3 PFS or OS HR is below the corresponding threshold). The contour lines indicate different PrSS levels in phase 3. In this illustration we are interested in phase 3 PrSS cutoff values from 50% to 85%, while PrSS levels below 50% or above 85% are not further explored. For example, at the phase 2 interim analysis (Figure 4), when $\text{PFS HR}_2 = 0.75$ and $\text{OS HR}_2 = 0.8$, the phase 3 PrSS is near 65%.
More generally, the CSF of PFS/OS HR for a 65% PrSS is any point on the 65% contour line, and all points below the line are acceptable values for a "go" decision. Comparing the interim and final results, the overall contraction of the contour lines from Figure 4 to Figure 5 indicates that one can draw stronger conclusions with increased sample sizes, since the bars for making both "go" and "no go" decisions are lowered (i.e., generally larger PFS/OS HR values as the CSF cutoff for concluding early efficacy, and smaller HR values for futility).
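The posterior predictive calculation behind a single point on these contour plots can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact implementation: the variance of each log HR is approximated by 4/(number of events), which presumes 1:1 randomization; the covariance is $\rho$ times the product of the standard deviations; the phase 3 event counts (321 PFS, 470 OS) are those quoted for the simulation study; and all function names are hypothetical. The bivariate normal probability is obtained by one-dimensional numerical integration so that only the standard library is needed.

```python
import math

def cov_from_events(d_pfs, d_os, rho):
    """2x2 covariance of (PFS, OS) log HR; assumes Var(log HR) ~ 4/events."""
    v1, v2 = 4.0 / d_pfs, 4.0 / d_os
    c = rho * math.sqrt(v1 * v2)
    return [[v1, c], [c, v2]]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def add2(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def matvec(m, v):
    return [m[i][0] * v[0] + m[i][1] * v[1] for i in range(2)]

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(z1, z2, r, steps=400):
    """P(Z1 < z1, Z2 < z2) for a standard bivariate normal with correlation r,
    integrating phi(x) * Phi((z2 - r x) / sqrt(1 - r^2)) by Simpson's rule."""
    lo = -8.0
    if z1 <= lo:
        return 0.0
    h = (z1 - lo) / steps
    s = math.sqrt(1.0 - r * r)
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 1 if i in (0, steps) else (4 if i % 2 else 2)
        dens = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += w * dens * norm_cdf((z2 - r * x) / s)
    return total * h / 3.0

def prss(y2, sigma2, sigma3, m0, v0, cuts):
    """P(Y31 < cuts[0] or Y32 < cuts[1] | Y2 = y2) under a N(m0, v0) prior."""
    p2, p0 = inv2(sigma2), inv2(v0)
    v2 = inv2(add2(p2, p0))                       # posterior covariance V2
    m2 = matvec(v2, [a + b for a, b in
                     zip(matvec(p2, y2), matvec(p0, m0))])  # posterior mean
    s = add2(v2, sigma3)                          # predictive covariance
    s1, s2 = math.sqrt(s[0][0]), math.sqrt(s[1][1])
    z1, z2 = (cuts[0] - m2[0]) / s1, (cuts[1] - m2[1]) / s2
    r = s[0][1] / (s1 * s2)
    return norm_cdf(z1) + norm_cdf(z2) - bvn_cdf(z1, z2, r)

if __name__ == "__main__":
    sigma2 = cov_from_events(47, 23, 0.643)       # phase 2 interim
    sigma3 = cov_from_events(321, 470, 0.643)     # phase 3
    diffuse = [[100.0, 0.0], [0.0, 100.0]]        # N(0, 10^2 I)
    cuts = [math.log(0.75), math.log(0.824)]
    y2 = [math.log(0.75), math.log(0.80)]
    print(round(prss(y2, sigma2, sigma3, [0.0, 0.0], diffuse, cuts), 3))
```

Under these assumptions, the value for PFS HR$_2$ = 0.75 and OS HR$_2$ = 0.8 at the interim analysis should land close to the 65% read from the contour plot; evaluating `prss` over a grid of phase 2 HR pairs would trace out the contour lines themselves.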
Since the OS data can be immature and exhibit large variability at the interim analysis, a marginal CSF for the PFS HR may be desired for early decision-making. Hence, Figures 4 and 5 also show vertical lines at the marginal CSF for the PFS HR corresponding to a 65% PrSS in phase 3 under the three methods in Table 1. In this IO study example:
1. Singular prior is similar to Hong and Shi (2012) but with the phase 2 data matrix adjusted by the correlation $\rho$, putting a weight of $\sigma_{21}^{-2}/(1-\rho^2)$ instead of $\sigma_{21}^{-2}$ on the phase 2 PFS data and hence less weight on the historical mean $m_0$.
2. Surrogate prior $\pi(\mu) = \pi(\mu_c \mid \mu_*)\,\pi(\mu_*)$ is similar to Saint-Hilary et al. (2019) but for general study success $S$ and multiple primary endpoints. For $\pi(\mu_c \mid \mu_*) = N(A_s + B_s \mu_*, V_s)$, the intercept $A_s = -0.155$, slope $B_s = 0.540$, and variance $V_s = 0.086^2$ were estimated using a Bayesian meta-analysis model on the IO literature data. A diffuse prior $\pi(\mu_*) = N(0, 10^2)$ is then used for the mean of the surrogate endpoint PFS (log HR).
3. Imputation prior assumes the unusable OS log HR [...] 163, slope $B_c = 0.496$, and residual variance $V_c = 0.169^2$ were estimated from the IO literature data using a simple regression (weighted by study size). A diffuse prior $\mu \sim N(0, 10^2 I_{2\times 2})$ is then used for $\mu$.
The estimated coefficients for the Surrogate and Imputation priors using the IO literature data are quite similar. The residual variance under the Imputation prior is much larger (hence, less prior impact), which is analogous to the fact that, in regression analysis, the prediction interval for the data is generally wider than the confidence interval for the mean.
From Figures 4 and 5, for the Imputation prior, which analytically integrates out OS HR$_2$ using the historical relationship between the OS and PFS HRs, the bar to meet at the interim look (PFS HR = 0.796) is lowered as more data are accumulated at the final analysis (PFS HR = 0.829) in phase 2. The bar under the Surrogate prior is also slightly lowered from the interim (PFS HR = 0.818) to the final (PFS HR = 0.829) analysis. On the other hand, the Singular prior gives unrealistically low bars because of the prior impact, which suggests that discounting is needed in this application, where the prior and in-trial data information are imbalanced. Figures 4 and 5 also show that when the OS data are used, even if sparse (e.g., 23 events at the interim), the region of plausible PFS HR values that attain a 65% PrSS expands greatly compared with the single cutoff around 0.8 obtained under the Imputation/Surrogate prior with the OS information suppressed. Therefore, the OS data can make a valuable contribution and provide additional insights.

Link the Marginalized PFS HR to the ORR
In addition, for IO studies, the first tumor response usually occurs during the first two tumor assessments. Therefore, the ORR may mature early compared with the PFS HR and OS HR, and it may be desirable to calculate the phase 2 CSF based on ORR. To calculate the ORR CSF, one can incorporate the log-odds ratio of ORR as a third endpoint in the joint Bayesian model. We further introduce a simpler approach that bridges the PFS HR and ORR CSF cutoffs in the supplementary materials.

Simulation Studies
In this section, to assess the accuracy of the decision boundaries from the proposed BAMBOO framework as illustrated in Section 3, we conduct simulation studies to obtain the operating characteristics under different prior choices and various scenarios.

Simulation Design
We consider dual success $S = \{Y_{31} < \log 0.75 \text{ or } Y_{32} < \log 0.824\}$ as in the preceding example. Additional scenarios such as OS success $S = \{Y_{32} < \log 0.824\}$, as considered in the literature (e.g., Hong and Shi 2012), can be found in the supplementary materials. For each defined study success, we consider five scenarios for the true mean HR (i.e., $\exp\{\mu\}$):
1. Dual efficacy: true mean 0.67 for PFS HR and 0.75 for OS HR. Both meet the success criterion;
2. OS efficacy: only OS meets the success criterion, with two sub-scenarios:
• PFS/OS consistency: true mean 0.8 for PFS HR and 0.75 for OS HR. The two means are close to each other;
• PFS/OS inconsistency: true mean 1.25 for PFS HR and 0.75 for OS HR. The two means are far apart, in contrast to the historical means (both around 0.724);
3. Futility: neither the PFS HR nor the OS HR meets the success criterion, with two sub-scenarios:
• Low futility: true mean 1 for PFS HR and 0.85 for OS HR;
• High futility: true mean 1.25 for both PFS HR and OS HR.
The true PrSS is calculated as $P(Y_3 \in S)$ under $Y_3 \sim N(\mu, \Sigma_3)$ using the true mean $\mu$ and the phase 3 covariance matrix $\Sigma_3$ based on 470 OS events and 321 PFS events. For each scenario, the phase 3 PrSS evaluation is based on a phase 2 observation sampled from $Y_2 \sim N(\mu, \Sigma_2)$, where $\Sigma_2$ is constructed from three different sample sizes: I, the interim analysis in phase 2 with 23 OS events and 47 PFS events; F, the final analysis in phase 2 with 68 OS events and 87 PFS events; and E, an equal-phase-3 sample size for phase 2 with 470 OS events and 321 PFS events as a pseudo-example for the large-sample scenario.
In each case, we perform the proposed BAMBOO assessment of phase 3 PrSS using (a) the Diffuse prior $\mu \sim N(0, 10^2 I_{2\times 2})$ and (b) the Informative prior $\mu \sim N(m_0, V_0)$ from the IO literature data in Section 3.2 for the mean parameter $\mu$. We also implement the three methods in Table 1 that use phase 2 PFS only, with the same settings based on the IO literature data as described in Section 3.3. For the Informative and Singular priors, which borrow absolute location information from historical trials, a discount factor of $d = 0.2$ as in Cho et al. (2018) for the power prior is used to attenuate the prior impact, since the prior variance is quite small compared with the phase 2 data variance owing to the limited sample size. Additional choices for $d$, including the adaptive discount using the two empirical methods described in Section 2.1, are investigated with additional simulations in the supplementary materials. For each scenario, the five methods provide PrSS estimates based on the observed phase 2 PFS/OS log HR values sampled from $N(\mu, \Sigma_2)$. This is repeated 5000 times to obtain the bias and root-mean-square error (RMSE). The simulation results are shown in Table 3. A bias/RMSE value closer to zero generally indicates better performance. In addition to the results for the five reported prior choices, Table 3 also indirectly gives the operating characteristics for the mixture prior with prescribed weights: provided the number of simulations is sufficiently large, the corresponding PrSS values, and hence the bias and the mean squared error, are weighted averages of their values under the individual prior components. For all matrices involved in both data generation and PrSS assessment, $\rho = 0.643$ from the historical data is used. The graphical output using predicted mean values for the phase 3 log HR under each method, with additional sensitivity analyses, is also given in the supplementary materials.
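The bias/RMSE loop described above can be sketched for the Diffuse prior in a single scenario. This is a minimal illustration under stated assumptions, not the code used to produce Table 3: the variance of each log HR is approximated by 4/(number of events); the Diffuse prior is replaced by its flat limit, so the posterior mean equals the observed $Y_2$ and the posterior covariance equals $\Sigma_2$; 1000 replicates are used instead of 5000 to keep the run short; and the function names and seed are hypothetical.

```python
import math
import random

RHO = 0.643
CUTS = (math.log(0.75), math.log(0.824))   # dual success thresholds

def cov_terms(d_pfs, d_os, rho=RHO):
    """(var PFS, var OS, covariance) of log HRs; assumes Var ~ 4/events."""
    v1, v2 = 4.0 / d_pfs, 4.0 / d_os
    return v1, v2, rho * math.sqrt(v1 * v2)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(z1, z2, r, steps=400):
    """P(Z1 < z1, Z2 < z2), standard bivariate normal, via Simpson's rule."""
    lo = -8.0
    if z1 <= lo:
        return 0.0
    h = (z1 - lo) / steps
    s = math.sqrt(1.0 - r * r)
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 1 if i in (0, steps) else (4 if i % 2 else 2)
        dens = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += w * dens * norm_cdf((z2 - r * x) / s)
    return total * h / 3.0

def prob_success(mean, cov):
    """P(Y1 < CUTS[0] or Y2 < CUTS[1]) for Y ~ N(mean, cov)."""
    v1, v2, c = cov
    s1, s2 = math.sqrt(v1), math.sqrt(v2)
    z1, z2 = (CUTS[0] - mean[0]) / s1, (CUTS[1] - mean[1]) / s2
    return norm_cdf(z1) + norm_cdf(z2) - bvn_cdf(z1, z2, c / (s1 * s2))

def bias_rmse(true_hr, phase2_events, phase3_events=(321, 470),
              reps=1000, seed=7):
    """Return (true PrSS, bias, RMSE) of the flat-prior PrSS estimate.
    Event tuples are ordered (PFS events, OS events)."""
    rng = random.Random(seed)
    mu = [math.log(h) for h in true_hr]
    s2, s3 = cov_terms(*phase2_events), cov_terms(*phase3_events)
    truth = prob_success(mu, s3)                      # P(Y3 in S) at true mu
    ppd_cov = tuple(a + b for a, b in zip(s2, s3))    # Sigma2 + Sigma3
    errs = []
    for _ in range(reps):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Draw Y2 ~ N(mu, Sigma2) with correlation RHO between components
        y = [mu[0] + math.sqrt(s2[0]) * e1,
             mu[1] + math.sqrt(s2[1]) * (RHO * e1 + math.sqrt(1 - RHO ** 2) * e2)]
        errs.append(prob_success(y, ppd_cov) - truth)
    bias = sum(errs) / reps
    rmse = math.sqrt(sum(e * e for e in errs) / reps)
    return truth, bias, rmse

if __name__ == "__main__":
    # Dual efficacy scenario at the phase 2 interim (47 PFS, 23 OS events)
    print(bias_rmse((0.67, 0.75), (47, 23)))
```

Replacing `(47, 23)` with `(87, 68)` or `(321, 470)` gives the corresponding F and E settings, and other scenarios follow by changing `true_hr`.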
In addition, to assess the robustness of go/no-go decision making, the error rates for different target PrSS values ($\beta = 65\%$ as in the illustrative example, and 50% as a minimum) are empirically calculated from the 5000 replicates and summarized in Table 4. The error rates are defined as the probability of making a "no-go" decision under true efficacy, or of making a "go" decision under true futility. Hence, a smaller value indicates better decision-making performance. Since the error rates can depend on the target PrSS level chosen for the go/no-go decision, we explore more than one choice in Table 4.
Based on Table 3, when both phase 2 PFS and OS data are used, the estimated PrSS under both the Diffuse and Informative priors converges to the true value as the sample size in phase 2 increases. This is because the phase 2 covariance matrix $\Sigma_2$ and $V_2 = (\Sigma_2^{-1} + V_0^{-1})^{-1} \approx \Sigma_2$ approach zero as the number of phase 2 events grows, regardless of the choice of $V_0$. Note that the weight on the phase 2 data, $w_2 = V_2 \Sigma_2^{-1}$, approaches $I_{J\times J}$. As a result, the sampled $Y_2 \sim N(\mu, \Sigma_2)$, and hence $m_2 = w_2 Y_2 + (I_{J\times J} - w_2) m_0$, concentrate tightly around $\mu$. Therefore, the PPD $Y_3 \sim N(m_2, V_2 + \Sigma_3)$ converges to the true phase 3 distribution $Y_3 \sim N(\mu, \Sigma_3)$. When the number of events is small (e.g., at the interim analysis) and the prior information from historical data, with means centered around 0.724 for both the PFS and OS HR, is concordant with the true PrSS values (e.g., dual efficacy with small PFS/OS HR values), the Informative prior produces lower bias/RMSE by incorporating historical information. Such Bayesian borrowing is nevertheless [...] Table 3, the probability of falsely forwarding the drug when the calculated PrSS exceeds the target 65% is 0.367, and this error rate rapidly decreases as the sample size increases; for example, it becomes 0 under the equal-phase-3 sample size (E).
When only the phase 2 PFS data are used, the conclusions based on the three marginalization methods are similar: the more information a method (e.g., the Singular prior) borrows from the historical data, the higher the power gain under prior-data consistency (e.g., the dual or OS efficacy scenarios), but also the higher the chance of a Type I error in falsely concluding drug efficacy (e.g., the futility scenarios), even with a substantial discount. On the other hand, the Surrogate and Imputation priors give results that are robust against prior-data conflict, with smaller bias/RMSE under the futility cases in Table 3, because they borrow relative location information on the PFS/OS log HR from the historical data, which can potentially be extrapolated to points far from the historical mean. Table 3 also shows that the Imputation prior produces smaller RMSE values, and hence better overall PrSS estimation, than the Surrogate prior under low futility with prior-data conflict. The Surrogate prior, however, gives lower bias. Under high futility, the results for both approaches are comparable, with the Imputation prior slightly better. However, both approaches result in large bias under OS efficacy when the inconsistent PFS/OS HR (1.25 vs. 0.75) strongly contradicts the historical relationship (both means around 0.724). In this case, it is challenging for either approach to detect the strong OS efficacy from an observed low efficacy in PFS. Such potential deviation in the PFS/OS relationship should be cautiously assessed when applying the Imputation/Surrogate prior approaches.
Similarly, the results in Table 4 indicate a risk of higher error rates when the ground truth deviates substantially from the assumptions made when using historical data (e.g., prior-data consistency under the Informative and Singular priors, or a robust historical association under the Imputation or Surrogate prior). Conversely, the gain in decision accuracy with limited data can be large when the assumptions hold. Therefore, an empirical check of prior-data conflict and the use of mixture priors can be considered. The use of both OS and PFS data is also recommended when the data suggest an extreme scenario, which inflates the error rates of methods that use PFS data only.
Overall, the simulations suggest that more samples or data information in phase 2 can yield higher precision in estimating PrSS. This also generally supports the use of OS data even when they are less available than PFS data in phase 2. The data-driven decision based on the Diffuse prior can require considerably more samples to achieve high accuracy. However, its bias varies little (always below 30% for I and below 15% for E in absolute terms), whereas each of the other options shows unacceptable bias (>40%) in at least one scenario. The Diffuse prior also has consistently lower bias with increasing sample size, whereas the last three options show increasing bias with increasing sample size for PFS HR = 1.25 and OS HR = 0.75. On the other hand, when an assumption of prior-data consistency is reasonable (e.g., checked using $\chi^2_{1-\alpha^*, J}$ in the supplementary materials), one can choose the Informative prior based on the historical data to improve the PrSS estimation and hence the accuracy of decision-making, while both the power prior with discount factors and the mixture prior with empirical weights can help attenuate the prior impact if evidence for prior-data consistency is lacking. When the phase 2 OS information is immature at early decision points, the Singular prior can be adopted to make decisions based on PFS in the presence of prior-data consistency. In general, when prior-data consistency or conflict is unknown, the Imputation or Surrogate prior can be considered if the historical relationship is not severely contradicted; this assumption is more general than the prior-data consistency in means required by the Singular prior. The Imputation prior requires a linear relationship between the log HRs of PFS and OS. The data to define this relationship can be acquired externally from historical records. However, the prior weights are typically lower than those under the Surrogate prior.
If the historical relationship is strongly believed or practically relevant, the Surrogate prior may be preferred. The Imputation prior also allows extra flexibility by specifying a general predictive relationship between the surrogate and final endpoints, which need not be restricted to a linear association. Such flexibility will be explored in other case studies where the historical data exhibit a nonlinear association.

Summary and Discussion
Late-stage cancer studies frequently involve multiple endpoints. While it is critical to make the go/no-go decision based on the early phase 2 readout, information about the endpoints and their relationship is limited. Recently, immuno-oncology has brought new hope for cancer patients owing to its superior efficacy and tolerable safety profile compared with traditional chemotherapy. As of September 2019, according to cancerresearch.org, there were a total of 2975 active interventional clinical trials of PD-1/L1 monoclonal antibodies. The rapid development of IO therapies has transformed the cancer treatment landscape. It has also brought statisticians more opportunities to offer guidance in making commercial decisions in the midst of data and uncertainties. In addition, the novel mechanism of action of IO agents can lead to transient pseudo-progression followed by long-lasting disease control. This can translate into a potentially large OS benefit that is discordant with a limited PFS benefit. Therefore, some immunotherapy trials adopt both OS and PFS as dual study endpoints, in which meeting either endpoint indicates trial success.
In this article, we considered a practical scenario of a randomized phase 2 trial followed by a confirmatory phase 3 trial, where dual endpoints (PFS and OS) were deemed necessary. The objective is to evaluate the CSF after the interim and final analyses of phase 2. To achieve that, we developed the BAMBOO method to calculate the joint predictive distribution of the PFS and OS HRs based on what we potentially observe in phase 2. The necessary input parameters for BAMBOO are the Type I error ($\alpha$), the target statistical power and clinical benefit (e.g., hazard ratio, median survival time) for each primary endpoint, the target PrSS levels in phase 3, and a trial design for phase 2 (e.g., accrual rate, interim/final analysis plan). Optional input parameters are the historical data for the informative prior and the correlation between endpoints (log HR). The latter can be calibrated via the described simulation-based procedure for PFS/OS if not supplied. The proposed framework then calculates the efficacy boundaries and sample sizes required to define study success and construct the phase 3 covariance matrix. Next, trial simulations based on the phase 2 design are conducted to estimate the number of events at the specified interim and final looks to construct the covariance matrices in phase 2. As a result, the posterior predictive distributions in phase 3 can be obtained, with the corresponding CSF or marginal CSF cutoffs determined based on the target PrSS levels. As demonstrated in both the simulation and the example results, the use of OS data provides valuable information on decision boundaries for phase 3 success that involves OS, even under high variability due to the potentially low number of events in phase 2. This is particularly true in the presence of discordance between the PFS and OS benefits.
When decision-making relies on PFS data adopted as the primary efficacy endpoint in phase 2, owing to early data looks intended to speed up development, the Imputation or Surrogate prior method can potentially alleviate the prior-data conflict issue by using relative location information that can generally be expected to hold. However, the sponsor should carefully check such extrapolation and use the most appropriate information for decision making.
In our illustration, the correlation between the PFS and OS log HRs was calculated based on a meta-analysis of pre-selected historical randomized trials in the targeted disease setting. The example showed that for a set of paired phase 2 PFS/OS HRs and events, one can estimate the PrSS in phase 3 using the contour plot. Intuitively, the accuracy of the estimation is associated with the number of events observed in the phase 2 studies. When the number of OS events is substantially smaller than the number of PFS events, the PFS-OS joint model was reduced to one dimension by marginalization. Our approach extends Hong and Shi's (2012) methods to generally defined study success. The Singular prior approach is also consistent with Hong and Shi's (2012) method for OS success. Occasionally, when both PFS and OS take much longer to mature, we discussed an alternative approach that bridges the decision rule on the PFS/OS HR in phase 2 to ORR by establishing the association between ORR and the PFS HR using a Bayesian meta-analysis of historical data. Arguably, one can also marginalize the CSF to OS and then link OS to ORR if more ORR and OS data are available in the literature. In general, the BAMBOO method can jointly model ORR, PFS, and OS when normality is assumed for both the log HRs and the log odds ratios.
One critical assumption of the BAMBOO method is that the correlation between appropriate scales of the endpoints (PFS, OS, ORR) is correctly specified. However, for IO agents, such correlation can vary across tumor types. For example, in lung cancer studies, the PFS HR and OS HR, or the PFS HR and ORR, have demonstrated a reasonable correlation. On the other hand, when a variety of studies from different tumor types are pooled to evaluate the association among median PFS and median OS or ORR directly, the correlation has historically been very weak. Therefore, when applying the BAMBOO approach to determine the phase 2 CSF and calculate the phase 3 PrSS, it is critical to align on the correlation between endpoints (log HR) via the meta-analysis or simulation approach.
In addition, the BAMBOO method can be extended to scenarios where a phase 1b study is followed by a seamless phase 2/3 design. The BAMBOO approach can serve as the decision rule at the interim analysis of the seamless phase 2/3 design, where the phase 1b data can be incorporated into the prior of the joint Bayesian model. However, further adjustment needs to be considered to incorporate the chance of success and control the overall study Type I error rate (Stallard 2010).
Finally, in this work, we have primarily focused on PrSS assessment based on reliably estimated trial-level statistics (e.g., the hazard ratios of the survival endpoints PFS/OS) for drug efficacy. In actual oncology trials, a number of practical issues arise in teasing out such efficacy from the patient-level survival endpoints, including nonadministrative censoring (e.g., withdrawal due to adverse events), interval censoring for PFS, etc. In the example of IO studies we discussed, known issues include nonproportional hazards due to delayed onset of treatment effect, survival benefit in subpopulations defined by biomarkers (e.g., PD-L1 tumor proportion score), etc. Owing to its flexibility and Bayesian nature, the proposed framework can potentially be coupled with several existing approaches to tackle these practical issues. For example, for nonadministrative/informative censoring, the expanded product limit (PL) estimation method of Kaciroti et al. (2012), which uses a pattern mixture model as a Bayesian nonparametric approach, can be adopted; for interval censoring of PFS, nonparametric estimation for censored data can be used, such as the Expectation-Maximization (EM)-Iterative Convex Minorant (ICM) algorithm (Wellner and Zhan 1997). For delayed onset of efficacy, a likelihood weighted by each individual's follow-up time (Cheung and Chappell 2000) can be considered. Alternatively, common techniques such as multiple imputation, the EM algorithm, and Bayesian data augmentation can be adopted if this is treated as a missing data problem. For subgroup analysis, one can potentially incorporate predictive biomarkers as additional endpoints in the proposed joint modeling framework. Harnessing the current Bayesian borrowing from historical data with patient-level information to tackle these practical issues will be discussed in our future work.

Supplementary Materials
The Supplementary Materials file contains the derivation of the priors, empirical approaches for checking prior-data conflict, design parameters of the proposed framework with sensitivity analysis, simulation results under additional scenarios, and a post-hoc Bayesian meta-analysis model to bridge PFS/ORR critical success factor cutoffs.