Enhancement of Basket Trial Designs with Incorporation of a Bayesian Three-Outcome Decision-Making Framework

Abstract With recent accelerated approvals of histology-agnostic novel agents, basket trials that evaluate an investigational therapy across different histologies are gaining momentum, thanks to a common underlying biological anti-cancer mechanism of action. Statistical models have been proposed to boost statistical efficiency by leveraging information across cohorts to harvest the shared response signal. However, limited research exists on establishing a quantitative decision-making framework for basket trials. A dichotomized "Go/No-Go" decision may lack robustness when an "inconclusive" decision is more appropriate. Accordingly, the three-outcome decision-making (3ODM) framework, which adds a "Consider" zone, has gained popularity. We propose to incorporate 3ODM into basket trials for greater robustness and flexibility, and to formally feed the between-cohort shared signal into 3ODM to improve its performance in basket trials. Our simulation study is the first to compare modeling the log odds ratio (on top of benchmark response rates, RRs) with modeling RRs directly, as done in most current basket trial designs, accounting for potentially drastic differences in reference and target RRs across cohorts. We use the exchangeability-nonexchangeability (EXNEX) model to evaluate the operating characteristics of the EXNEX + 3ODM framework (with and without an interim analysis), although the proposed enhancement readily extends to other basket trial designs.


Introduction
With the accelerated approvals of four histology-agnostic novel agents: pembrolizumab (FDA 2017), larotrectinib (FDA 2018), entrectinib (FDA 2019), and dabrafenib plus trametinib (FDA 2022a), we have seen an increasing trend of conducting basket trials that evaluate a single investigational medicinal product (IMP) and/or combination on multiple types of cancer with a common genetic mutation (e.g., MSI-H or dMMR, NTRK gene fusions, RET fusion-positive, and BRAF V600E) or with action within a common biological pathway of interest, for example, the MAPK pathway. Novel precision medicine trial designs include not only basket trials but also umbrella trials and platform trials, which could potentially expedite drug development by increasing statistical and/or operational efficiencies, as described in the recent FDA guidance for industry on master protocols (FDA 2022b). In this article, our focus is on basket trials, where a "cohort" refers to a group of patients with the same cancer type. It is worth noting that the concept of "cohorts" is very broad and may refer to populations of different disease stages, numbers of prior therapies, biomarkers, or demographic characteristics (FDA 2022b).
For a basket trial that enrolls patients harboring mutations of the same biological pathway but with different histologies, the ultimate objective is to confirm whether the IMP works in a "tumor-agnostic" manner (the ideal scenario that encompasses broad patient populations across different tumor types), is effective only in some histologies (a scenario that facilitates a more "focused" plan for late-phase development), or works only in a particular tumor type (the so-called "nugget effect" scenario). A promising statistical model for basket trials is expected to handle each of these scenarios efficiently, in the sense that, relative to conventional independent analyses, effective indications are more likely to be advanced to late-stage development while ineffective indications can be terminated early based on accumulating evidence. This requires rigorous statistical inference and a robust decision-making framework established in the basket trial setting. Additional work is still needed, as most existing designs adopt only a simplified binary "Go/No-Go" decision rule on individual tumor types, which should be further refined to meet practical needs.

CONTACT: Yue Yang, yue.yang@sanofi.com, Biostatistics and Programming, Sanofi, Cambridge, MA 02141. Supplementary materials for this article are available online; please go to www.tandfonline.com/r/SBR.
Heterogeneity of treatment benefit among potentially heterogeneous patient subgroups is one of the most important elements to consider in any basket trial. Existing methods that model potential heterogeneity of patient subpopulations can be categorized into two classes: (a) cohort-wise analysis assuming independence among cohorts, essentially the conventional multiple-cohort analysis; and (b) "information-borrowing" strategies assuming certain cohorts to be homogeneous/exchangeable. For (a), in the presence of imbalanced enrollment, cohort-wise analysis suffers from limited power, and without proper multiple-comparison controls, an inflated family-wise Type I error rate would be an issue when multiple tests are performed among the independent cohorts. While such cohort-wise analysis is robust to bias, it is inevitably suboptimal when the IMP's effectiveness is consistent across cohorts (which is highly expected a priori given the biological rationale for conducting a basket trial), so that more efficiency can be achieved by appropriate pooling. For (b), inflated false positive or false negative error rates need special consideration when using Bayesian hierarchical models that borrow strength among cohorts to boost the effective sample size. "Whether" and "how" to effectively leverage information from multiple cohorts to make informed decisions remain the crux of basket trial designs.
Recognizing that the aforementioned issues remain to be addressed, in this article we propose an enhancement to existing basket trial designs by incorporating a Bayesian three-outcome decision-making (3ODM) framework. The novelties of this research are briefly described below, and further details are discussed in subsequent sections.
First, for interim and final decision-making in the basket trial context, most existing approaches select somewhat arbitrary posterior probability thresholds to declare a cohort ineffective (e.g., if the posterior probability is <20%) or promising (e.g., if the posterior probability is >90%). In practice, sponsors may find such a binary Go/No-Go outcome less flexible, as it precludes an "indeterminate/consider" zone in which accumulating data may support evaluation of additional evidence (e.g., duration of response, progression-free survival, or even safety profiles) along with the primary efficacy endpoint. The inclusion of a "Consider" zone is especially meaningful when a posterior probability fails to meet the "Go" or "No-Go" threshold by only a small margin, leading respectively to a potential termination of a promising indication or a potential continuation of an indication for which an early futility call should have been made. Furthermore, it is rare for existing Bayesian approaches to formally incorporate both the reference and target response rates (RRs, denoted by p0 and p1, respectively) in the declaration of treatment efficacy, except for the use of a simple average of p0 and p1 in the posterior calculation for early stopping at the interim analysis (e.g., Berry et al. 2013; Jin et al. 2020) or a point estimate of p greater than p1 at the final analysis as part of the efficacy declaration (Neuenschwander et al. 2016). Even when p1 is considered, the decision-making framework of these designs is still binary. This can lead to obviously flawed decisions, as we will describe later using a case example. Our proposal enhances the decision-making component of current basket trial designs by utilizing "dual criteria" to account for both the reference RR p0 and the target RR p1, which yield a "Consider" zone in addition to the conventional "Go/No-Go" zones. It is a natural extension of the single-cohort Bayesian decision-making framework that involves three outcomes.
Second, when two or more cohorts are regarded as "exchangeable", it does not necessarily mean that the achieved RRs are comparable to each other (i.e., in an "absolute value" sense); instead, it should imply that the IMP has relatively similar "treatment effects" on top of each cohort's established standard of care (SoC) benchmark, as measured by, for instance, the log odds ratio (LOR) between the target and reference RRs for each cohort, that is, LOR_j = log[p_1j/(1 - p_1j)] - log[p_0j/(1 - p_0j)], where the subscript j = 1, 2, ..., J is the cohort index. This is fundamentally different from most existing basket trial methods, which may have unintentionally ignored the fact that different cohorts can have drastically different efficacy benchmarks: a retrospective review by Chen, Raghunathan, and Prasad (2019) of FDA-approved oncology indications (2006-2018) based on RR showed that the median RR among 85 indications was 41% (interquartile range: 27%-58%), and for the same IMP (e.g., imatinib) approved for different indications, the actual RR can be as low as 29% for systemic mastocytosis and as high as 83% for dermatofibrosarcoma protuberans. Our proposal of using the LOR directly addresses this matter. The RoBoT approach by Zhou and Ji (2021) bears some similarity to our proposal in this regard, acknowledging that the benchmark RRs may be quite different but that two cohorts may be considered "similar" as long as their RRs are both larger than the corresponding benchmarks. The "shotgun" approach by Jiang et al. (2021a) included the log odds of the null RRs as an offset, and depending on the application, the log odds of the target RRs could also be used. However, no literature has used simulations to compare the performance of modeling the naive RR versus the effect size of the RR (e.g., the LOR), which leads to different measurements of the "level of similarity". This plays a key role in controlling the amount of information borrowed across cohorts.
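As a concrete illustration, the cohort-wise LOR above can be computed directly from a target and reference RR pair. The following is a minimal Python sketch; the two reference benchmarks paired with the imatinib RRs (29% and 83%) are hypothetical, chosen only to show how very different absolute RRs map onto a common effect-size scale:

```python
import math

def log_odds_ratio(p1, p0):
    """LOR_j: log odds of the target RR p1 minus log odds of the reference RR p0."""
    return math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))

# Hypothetical reference benchmarks (0.15 and 0.60) for the two imatinib
# indications mentioned in the text; the LORs, not the raw RRs, are what an
# exchangeability assumption would compare.
print(round(log_odds_ratio(0.29, 0.15), 2))
print(round(log_odds_ratio(0.83, 0.60), 2))
```

Modeling this quantity, rather than the raw RR, is what lets cohorts with 29% and 83% absolute response rates still be treated as candidates for information-borrowing.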
This manuscript adopts the idea of introducing a "Consider" zone and formulates a three-outcome decision-making framework in the basket trial setting. Therefore, the advantages of the 3ODM framework in a single-cohort setting, as discussed by Fisch et al. (2015), Frewer et al. (2016), and Quan et al. (2020), can be extended to basket trial designs, which are gaining popularity in recent oncology development. Furthermore, additional benefits arise when similar cohorts in a basket trial mutually improve the appropriateness of each other's decisions.
The rest of this article is organized as follows: in Section 2, we give a high-level overview of representative categories of basket trial designs and the 3ODM framework (Quan et al. 2020), and then incorporate 3ODM into basket trial designs. In Section 3, we perform comprehensive simulations under different realistic scenarios to evaluate the operating characteristics of our proposal, comparing different incorporations of 3ODM (modeling RR or LOR, with or without an interim analysis), as well as independent cohort-wise analyses using 3ODM. A case example from the ROAR basket trial (Wen et al. 2022) is also included to compare binary and trinary decision-making. Lastly, in Section 4, we summarize key findings, discuss practical considerations, and outline future work.

Overview of Basket Trials
Earlier works of Thall et al. (2003) and Berry et al. (2013) applied Bayesian hierarchical models (BHMs), which allow information-borrowing across cohorts to improve the efficiency of basket trials. Another representative category is the two-stage designs, which typically involve:
• Assessment of basket heterogeneity at the end of stage 1 (when a modest number of patients have been enrolled per cohort).
• Cohort-wise analyses if the "heterogeneous path" is chosen.
• Further enrollment of stage 2 patients with a pooled analysis if the "homogeneous path" is chosen and a futility check is passed.
These designs differ fundamentally in the pooled analysis.

Three-Outcome Decision-Making (3ODM)
Due to the limited sample size in early-phase proof-of-concept (PoC) studies, which are intended for preliminary "signal-finding" rather than "signal-confirming", statistical inference is focused on the estimation of treatment effect by a surrogate endpoint rather than formal hypothesis testing. Bayesian inference, which allows for historical and/or concurrent information-borrowing and quantifies treatment effect with probabilistic statements, has been widely adopted with various decision-making frameworks across the pharmaceutical industry, for example, the critical success factor (CSF), predictive probability of success (PPOS), and 3ODM. Although these approaches differ slightly in some technical aspects and implementation details, what they have in common is the use of a "trinary" decision rule consisting of not only the binary "Go/No-Go" decisions but also a "Consider" or "Grey" zone, in which the early efficacy evidence based on a single endpoint is insufficient to draw a decisive conclusion, so that additional data should be evaluated (e.g., pharmacokinetics, pharmacodynamics, overall benefit-risk assessment, commercial viability per the current and forecast competitive landscape, etc.).
Earlier works of Lalonde et al. (2007), Frewer et al. (2016), and Pulkstenis, Patra, and Zhang (2017) described a decision-making framework with three possible outcomes ("Go", "Consider", and "No-Go"), based on either a frequentist perspective or the Bayesian framework. In this article, we follow Quan et al. (2020), who proposed a revision of the Lalonde and Frewer decision rules by moving some of the "No-Go" decisions to "Consider" decisions so that the chance of directly terminating the development of a valuable drug can be reduced. Denote by θ the parameter for the true treatment effect as measured by, for instance, a binary endpoint of RR. Two important metrics that quantify the effectiveness of a treatment are the Lower Reference Value (LRV) θ_LRV (i.e., the reference RR p0 in earlier text) and the Target Value (TV) θ_TV (i.e., the target RR p1 in earlier text), which describe the smallest clinically meaningful treatment effect (e.g., the RR under the current SoC) and the desired treatment effect (e.g., an RR that could establish a future treatment of choice), respectively. The "strength of evidence" of the true θ associated with the LRV and TV is characterized by the dual criteria, namely the "significance" criterion determined by τ_LRV (often set between 80% and 90%) and the "relevance" criterion determined by τ_TV (often set between 10% and 20%). With these assumptions and observed data, the 3ODM makes a "Go" decision if P(θ > θ_LRV | data) > τ_LRV (significance criterion met) AND P(θ > θ_TV | data) > τ_TV (relevance criterion met), a "No-Go" decision if P(θ > θ_LRV | data) ≤ τ_LRV (significance criterion not met) AND P(θ > θ_TV | data) ≤ τ_TV (relevance criterion not met), and a "Consider" decision in all other cases. This 3ODM framework has been widely used in oncology early-phase expansion cohorts (with multiple arms conducted "in parallel") for the confirmation of preliminary PoC.
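For a binary endpoint, the dual-criteria rule above can be sketched in a few lines of Python. The uniform Beta(1, 1) prior and the exact Beta-Binomial tail identity P(Beta(y+1, n-y+1) > c) = P(Binomial(n+1, c) ≤ y) used below are illustrative implementation choices, not part of the 3ODM framework itself:

```python
import math

def post_prob_exceeds(c, y, n):
    """P(p > c | y responders out of n) under a uniform Beta(1, 1) prior,
    computed exactly via the Beta-Binomial tail identity."""
    return sum(math.comb(n + 1, i) * c**i * (1 - c) ** (n + 1 - i) for i in range(y + 1))

def three_outcome_decision(y, n, lrv, tv, tau_lrv=0.80, tau_tv=0.20):
    """Dual-criteria 3ODM rule: Go / No-Go / Consider."""
    p_lrv = post_prob_exceeds(lrv, y, n)  # significance criterion
    p_tv = post_prob_exceeds(tv, y, n)    # relevance criterion
    if p_lrv > tau_lrv and p_tv > tau_tv:
        return "Go"
    if p_lrv <= tau_lrv and p_tv <= tau_tv:
        return "No-Go"
    return "Consider"

# A 30-patient cohort with LRV = 0.20 and TV = 0.40:
print(three_outcome_decision(14, 30, 0.20, 0.40))  # strong signal -> Go
print(three_outcome_decision(3, 30, 0.20, 0.40))   # weak signal -> No-Go
```

Intermediate evidence, such as 10/30 responders against a TV of 0.50, meets the significance criterion but not the relevance criterion and falls into the "Consider" zone.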
To the best of our knowledge, no literature has discussed such a decision-making strategy involving a "Consider" zone in the basket trial context, although we have seen sponsors in practice consider such scenarios when "moderate" responsiveness, in between the historical control (i.e., θ_LRV) and the clinically meaningful RR (i.e., θ_TV), is assumed in simulation studies (see protocol No. CDRB436X2201 in the Appendix of Wen et al. (2022) as an example, which we also include as a case example discussed in more detail in Section 3.4). However, the power and Type I error evaluations are still based on the LRV only, and in general the lack of correctly classifying such "moderate" cases into the "Consider" zone, as in the 3ODM framework, could have unintentionally discouraged sponsors from evaluating additional evidence and the totality of data for more robust decision-making.

Incorporation of 3ODM in Basket Trials
It is straightforward to extend the 3ODM framework from its original application to a single cohort to the basket trial setting. Regardless of which basket trial method is used, for each cohort j the posterior probabilities of θ_j > θ_LRV,j and of θ_j > θ_TV,j can be calculated and compared with the pre-specified significance and relevance criteria τ_LRV and τ_TV, respectively, with the three-outcome decision rules listed below:
• Go: P(θ_j > θ_LRV,j | data) > τ_LRV AND P(θ_j > θ_TV,j | data) > τ_TV
• No-Go: P(θ_j > θ_LRV,j | data) ≤ τ_LRV AND P(θ_j > θ_TV,j | data) ≤ τ_TV
• Consider: All other cases.
Note that although basket trial designs may differ in the specifics of their information-borrowing strategies, the posterior probabilities above can almost always be obtained and fit into the 3ODM framework (for MUCE and RoBoT, where the posterior probabilities are interpreted as inference about the hypothesis rather than the parameter θ_j, the posterior probability based on the LRV or TV needs to be obtained separately before fitting into the 3ODM framework). Without loss of generality, we used the same τ_LRV and τ_TV for all cohorts, but such "strength of evidence" may be specified differently among cohorts (e.g., for an indication with highly unmet need, or when the sponsor's "risk tolerance" is relatively higher, τ_LRV and τ_TV may be set smaller for that indication).
It is typical to plan for one or multiple interim analyses with early futility and/or efficacy calls in basket trials. In the case of only one interim analysis, it may occur when half of the planned sample size is reached (e.g., 15 patients if 30 patients are planned in a cohort). Given the limited sample size in early PoC studies and the substantial investment committed to initiating a pivotal trial, it is not uncommon for a sponsor to continue enrollment to the full cohort even if an "early Go" threshold is met at the interim, aiming to confirm that the promising efficacy trend will persist with accumulating data (in such cases, preliminary late-phase preparation work could be initiated). A sponsor may also choose to pause or even terminate enrollment of a cohort when an "early No-Go" threshold is met, so that resources can be better allocated to more promising assets (i.e., portfolio reprioritization). Because of the inherently higher uncertainty associated with decision-making at the interim compared to the final analysis at full enrollment, the interim 3ODM significance and relevance criteria τ_LRV and τ_TV could be set at more extreme values than those used for the final analysis, reflecting the intent to make the No-Go/Go thresholds harder to meet unless the limited data truly reveal a disastrous (or extraordinary) outcome at the interim analysis. See Section 3.1.3 for some numerical examples.
We would like to emphasize that our proposal is extensible to almost any basket trial design (like those in Table 1) and quantitative decision criteria that include a "Consider" zone, with slightly modified algorithms for the integration.Comprehensive simulation studies to be discussed in detail below are required to calibrate tuning parameters and evaluate key operating characteristics among selected candidate proposals.The flowchart of "basket design plus 3ODM" is illustrated in Figure 1 (created by the "DiagrammeR" R package, Iannone 2022) with five cohorts, hypothetical 3ODM decisions at interim and final analyses, and EXNEX as an example of basket designs.
It should also be noted that the 3ODM framework outside of the basket trial setting can handle multiple types of endpoints, such as binary, normally distributed, time-to-event, and log-normal endpoints. Its extension to basket trial designs is straightforward.

Figure 1. Flowchart of "basket design plus 3ODM" with an interim analysis. This example shows five cohorts (C1-C5) with an interim analysis decision (by color) of Go/Go/Consider/No-Go/No-Go. Enrollment of C1-C3 ("Go" or "Consider" cohorts) continues with an additional N/2 patients (illustrated by thicker nodes), and the final analysis decision based on all cohorts is Go/Consider/Consider/No-Go/No-Go. N: total sample size planned for each cohort.

Results
Before diving into the simulation studies, we first use one case example to illustrate the limitation of only considering the reference RR in the declaration of treatment efficacy, as adopted in many current basket trial designs. Consider a basket trial with three cohorts of 30 patients each: the LRVs are 0.10, 0.20, and 0.40, and the TVs are 0.25, 0.40, and 0.70, respectively. At the end of the study, 5, 10, and 17 responders are observed in the three cohorts, yielding RRs of 17%, 33%, and 57%, respectively. Using EXNEX with modeling of the LOR, the posterior probabilities P(θ_j > θ_LRV,j | data) are 0.94, 0.98, and 0.98, which result in all "Go" decisions when a "Go" threshold of 90% is specified. However, the probabilities of reaching the TVs (which are of genuine interest to sponsors before committing investment to costly phase 3 studies) are as low as 0.08, 0.16, and 0.03, as revealed by the posterior probabilities P(θ_j > θ_TV,j | data). In such cases, when the efficacy signal by RR alone is moderate, a "Consider" decision should be made using the "dual criteria" of 3ODM, which encourages evaluation of additional evidence.
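The dual-criteria arithmetic in this example can be reproduced approximately with independent cohort-wise Beta posteriors under a uniform prior. This is only a sketch: without EXNEX borrowing, the probabilities differ somewhat from the EXNEX-LOR values quoted above, but the same qualitative gap between the LRV and TV criteria appears:

```python
import math

def post_prob_exceeds(c, y, n):
    """P(p > c | y of n) under a uniform prior (exact Beta-Binomial tail identity)."""
    return sum(math.comb(n + 1, i) * c**i * (1 - c) ** (n + 1 - i) for i in range(y + 1))

lrvs, tvs = [0.10, 0.20, 0.40], [0.25, 0.40, 0.70]
responders, n = [5, 10, 17], 30
for y, lrv, tv in zip(responders, lrvs, tvs):
    p_lrv = post_prob_exceeds(lrv, y, n)
    p_tv = post_prob_exceeds(tv, y, n)
    # p_lrv is high for every cohort, while p_tv stays low for cohorts 1 and 3:
    # a rule based on the LRV alone would overstate the evidence.
    print(round(p_lrv, 2), round(p_tv, 2))
```

Under these independent posteriors, cohorts 1 and 3 clear the significance criterion but miss the relevance criterion, landing in the "Consider" zone rather than "Go".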
We perform comprehensive simulation studies following the "3 R" principle: the wide range of scenarios selected for evaluation is not only realistic and representative of real oncology trials, but also relevant for evaluating the proposal. Specifically, as described in Section 3.1.1, for the true RR specifications across cohorts that are used to generate different scenarios, in addition to the current literature's common practice of using simplified "permutations" of the same null or alternative RRs (LRVs or TVs, respectively) across all cohorts, we make the more realistic assumption that the LRVs for the cohorts are mostly different with a relatively wide range (as are the corresponding TVs), representing the different efficacy benchmarks of current SoCs in reality. We also include cases where the true RRs are determined by a vector of odds ratios on top of the LRVs, which may differ across cohorts in reality; for example, the IMP may extend similar efficacy improvements in terms of the log odds ratio on top of the corresponding benchmarks. In such cases, it is more challenging for traditional basket trial designs to effectively borrow information and make correct decisions, due to the heterogeneity among cohorts as reflected by the different LRVs.
We choose EXNEX (Neuenschwander et al. 2016) as one basket trial design for illustration, but it is straightforward to extend to other design options (see Table 1). Based on the original EXNEX, we evaluate the following four models (A-D):
1. Model A (EXNEX_RR_2ODM): the original EXNEX setup as in Neuenschwander et al. (2016), modeling RR with a binary two-outcome decision rule;
2. Model B (EXNEX_RR_3ODM): EXNEX modeling RR under the 3ODM framework;
3. Model C: EXNEX modeling LOR under the 3ODM framework;
4. Model D: the same as Model C but with an interim analysis.
Pairwise comparisons of the models answer different questions: A versus B compares the binary two-outcome decision rule with the trinary 3ODM (both modeling RR without an interim analysis); B versus C compares modeling RR with modeling LOR (both under the 3ODM framework without an interim analysis); C versus D compares designs without and with an interim analysis (both modeling LOR under the 3ODM framework).
We would like to add a note on the "EXNEX_RR_2ODM" model, as the decision criterion we used differs from the criteria used in Application 1 of Neuenschwander et al. (2016). In the original EXNEX paper, two criteria were required for trial success (for each indication): (a) the estimated response rate (posterior mean) is at least 20%; and (b) the posterior probability that the rate exceeds 0.1 is at least 90% for indications 1 and 2, and at least 80% for indications 3 and 4. Criterion (a) resembles the "relevance" criterion in 3ODM, while criterion (b) resembles the "significance" criterion in 3ODM. For "2ODM" in the current manuscript, we only utilized a single criterion based on the posterior probability P(true response rate > LRV | data), for two reasons. First, except for EXNEX and designs built upon a formal Bayesian hypothesis testing framework, many basket trial designs under the Bayesian hierarchical modeling framework make decisions purely based on this posterior probability, for example, Liu et al. (2017), Jin et al. (2020), and Chu and Yuan (2018a); it is also implemented in the ROAR study (protocol section 18.4.2.1). Second, with this single ("significance") criterion, we are able to better understand the impact of adding the "relevance" criterion in 3ODM.

Data-Generating Mechanism for Basket Trials
We consider a total of five cohorts with a sample size of 30 patients in each cohort. One interim analysis, if pre-specified, is performed when every cohort has enrolled 15 patients (an equal enrollment rate across cohorts is assumed for simplicity). As shown in Table 2, we consider two sets of simulations: Simulation 1 is consistent with the simulation setups of most current literature, assuming the same underlying LRV (e.g., 0.20) and TV (e.g., 0.40) across all cohorts (although the true simulation scenarios vary by permutations of the assumed LRV and TV), while Simulation 2 is based on a more challenging but realistic assumption that most cohorts have different LRVs and TVs, with LRVs ranging from 5% to 35% and TVs ranging from 15% to 60%. In both Simulations 1 and 2, one "global null" (indexed by "-GN"), a few "mixed" (indexed by "-Mix"), and one "global alternative" (indexed by "-GA") scenarios are considered for the true RRs. For the "mixed" scenarios: in Simulation 1, we considered one scenario of two LRVs and three TVs ("1-Mix"), and another scenario of two LRVs, one moderate RR (i.e., 30%, which lies between the LRV and TV), and two TVs ("1-Mix-mRR"); in Simulation 2, in addition to the scenario of two LRVs and three TVs ("2-Mix-1"), we also considered one scenario of alternating LRVs and TVs ("2-Mix-2"), one scenario ("2-Mix-LOR-1") with odds ratios of (1.5, 1.5, 1.5, 0.6, 0.6), reflecting superior efficacy for cohorts 1-3 and mediocre efficacy for cohorts 4 and 5, and one scenario ("2-Mix-LOR-2") with odds ratios of (0.8, 0.8, 1.2, 0.8, 1.2), reflecting mediocre efficacy for cohorts 1, 2, and 4 and superior efficacy for cohorts 3 and 5.
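The LOR-based scenarios above can be generated by converting each cohort's odds ratio into a true RR on top of its LRV. A minimal sketch follows; the five LRVs below are hypothetical placeholders in the stated 5%-35% range, not the exact Table 2 values:

```python
import random

def rr_from_odds_ratio(p0, odds_ratio):
    """True RR obtained by multiplying the LRV's odds by a given odds ratio."""
    odds = odds_ratio * p0 / (1 - p0)
    return odds / (1 + odds)

def simulate_responders(true_rrs, n_per_cohort=30, seed=2024):
    """Draw one trial's responder counts: Binomial(n, p) per cohort."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n_per_cohort)) for p in true_rrs]

lrvs = [0.05, 0.15, 0.20, 0.30, 0.35]    # hypothetical benchmarks in the 5%-35% range
odds_ratios = [1.5, 1.5, 1.5, 0.6, 0.6]  # pattern of the "2-Mix-LOR-1" scenario
true_rrs = [rr_from_odds_ratio(p0, r) for p0, r in zip(lrvs, odds_ratios)]
responders = simulate_responders(true_rrs)
```

An odds ratio of 1.0 recovers the LRV exactly, so the "global null" scenarios are a special case of the same generator.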

Model Specifications for EXNEX
We follow the model setup suggested in Neuenschwander et al. (2016): consider a binary endpoint (e.g., RR) and denote y_j as the number of responders in cohort j, for j = 1, ..., J, where J is the total number of cohorts. EXNEX assumes y_j follows a binomial distribution with parameters n_j (sample size for cohort j) and p_j (response rate for cohort j); that is, y_j ~ Binomial(n_j, p_j) for j = 1, ..., J. Let θ_j be the log-odds parameter for p_j, that is, logit(p_j) = θ_j. The BHM assumes that all θ_j share the same prior distribution with common parameters, which cannot handle potential data heterogeneity, while EXNEX addresses this issue by introducing mixture priors for θ_j.
While EXNEX allows for multiple exchangeable parts, we used the EXNEX model with two exchangeable distributions in all our simulations. Specifically, for j = 1, ..., J, the model assumes

θ_j ~ N(μ_k, τ_k²) with probability π_jk, k = 1, 2,   (1)
θ_j ~ T_j with probability π_j3.   (2)

That is, with probability π_j1, θ_j comes from one exchangeable distribution N_1; with probability π_j2, θ_j comes from the other exchangeable distribution N_2; and with probability π_j3, θ_j comes from the nonexchangeable distribution T_j, which may differ across cohorts, where π_j1 + π_j2 + π_j3 = 1 for all j = 1, ..., J. Both N_1 and N_2 allow information-borrowing across cohorts. For hyperpriors, we assume μ_1 and μ_2 independently follow Gaussian distributions; τ_1 and τ_2 independently follow half-normal distributions with a scale parameter of 1.0; and for indication j, π_j = (π_j1, π_j2, π_j3) follows a Dirichlet distribution.
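A prior draw from this two-component EXNEX mixture can be sketched as follows. The Gaussian hyperprior scale of 2.0, the N(0, 2²) stand-in for the nonexchangeable T_j, and the fixed mixture weights are illustrative assumptions, not the paper's exact hyperparameters:

```python
import math
import random

def sample_theta_prior(pi, rng, nex_mean=0.0, nex_sd=2.0):
    """One prior draw of theta_j under the two-component EXNEX mixture prior."""
    mu1, mu2 = rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)  # Gaussian hyperpriors for mu_1, mu_2
    tau1 = abs(rng.gauss(0.0, 1.0))                       # half-normal(1) scale for N_1
    tau2 = abs(rng.gauss(0.0, 1.0))                       # half-normal(1) scale for N_2
    u = rng.random()
    if u < pi[0]:
        return rng.gauss(mu1, tau1)        # exchangeable component N_1
    if u < pi[0] + pi[1]:
        return rng.gauss(mu2, tau2)        # exchangeable component N_2
    return rng.gauss(nex_mean, nex_sd)     # nonexchangeable T_j (Gaussian stand-in)

rng = random.Random(7)
thetas = [sample_theta_prior((0.45, 0.45, 0.10), rng) for _ in range(10_000)]
# Under the RR parameterization, logit(p_j) = theta_j, so the inverse logit
# maps the draws back to the response-rate scale.
rates = [1.0 / (1.0 + math.exp(-t)) for t in thetas]
```

In a full analysis the posterior over these components (and the Dirichlet weights) would be obtained by MCMC; the sketch only shows the generative structure of the mixture.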
For the modeling of LOR (see the "Introduction" Section for rationales), we propose to define θ_j as the log odds ratio over the reference rate, θ_j = logit(p_j) - logit(p_0j) = log[p_j/(1 - p_j)] - log[p_0j/(1 - p_0j)], and θ_j is modeled by the EXNEX model specified in (1) and (2).

Parameter Specifications for 3ODM
The same LRVs and TVs as shown in Table 2 are used for θ_LRV,j and θ_TV,j. The significance criterion τ_LRV and the relevance criterion τ_TV for the final analysis are set at 80% and 20%, respectively. As discussed in Section 2.3, different values of τ_LRV and τ_TV are used for the interim analysis, making it relatively harder to make either an "early Go" or "early No-Go" decision with limited data at the interim. Specifically, we set the interim 3ODM decision rules as follows:
• Interim Go: P(θ_j > θ_LRV,j | data) > 0.95 AND P(θ_j > θ_TV,j | data) > 0.20
• Interim No-Go: P(θ_j > θ_LRV,j | data) ≤ 0.30 AND P(θ_j > θ_TV,j | data) ≤ 0.05
• Interim Consider: All other cases.
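Because the interim and final rules differ only in their thresholds, the same pair of posterior probabilities can yield different calls at the two looks. A small sketch using the thresholds stated above:

```python
def interim_decision(p_lrv, p_tv):
    """Interim 3ODM call with the stricter interim thresholds."""
    if p_lrv > 0.95 and p_tv > 0.20:
        return "Go"
    if p_lrv <= 0.30 and p_tv <= 0.05:
        return "No-Go"
    return "Consider"

def final_decision(p_lrv, p_tv, tau_lrv=0.80, tau_tv=0.20):
    """Final-analysis 3ODM call with tau_LRV = 80% and tau_TV = 20%."""
    if p_lrv > tau_lrv and p_tv > tau_tv:
        return "Go"
    if p_lrv <= tau_lrv and p_tv <= tau_tv:
        return "No-Go"
    return "Consider"

# Moderately strong evidence: a final "Go" but only an interim "Consider",
# reflecting the deliberately harder-to-meet interim thresholds.
print(final_decision(0.90, 0.25), interim_decision(0.90, 0.25))  # Go Consider
```

This asymmetry is intentional: only disastrous or extraordinary interim data should trigger an early call, while everything in between continues to full enrollment.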
Decision thresholds are chosen based on the aggressiveness and prioritization of the project. For example, for agents that aim to address a highly unmet medical need, or when there is a lack of standard of care on the market, the sponsor may be willing to take a higher risk of a false positive rather than a false negative, which is reflected in the values of τ_LRV and τ_TV that jointly determine the decision.
In Section 3.3, we further discuss the impact of specifying different criterion thresholds at the interim, which is essentially a tradeoff between the total sample size and the aggressiveness of decision-making. For sample size considerations, although early-phase oncology studies are not typically intended to explicitly test a hypothesis (so the sample size is not determined by power and Type I error considerations as in registrational trials), the cohort size in the basket trial setting for the purpose of "signal-finding" still requires careful evaluation. For example, the FDA Guidance on Expansion Cohorts specifically suggested that cohort size be limited to 40 subjects for solid tumors or 20 subjects for hematological malignancies (FDA 2022c). Simulations evaluating the operating characteristics of 3ODM for each cohort should also be conducted to ensure that the selected sample size yields satisfactory decision-making results in terms of "correct Go/No-Go" (e.g., at least 80%) and "incorrect Go/No-Go" (e.g., no more than 10%) probabilities.
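Such a per-cohort operating-characteristic check can be sketched as below for a single 30-patient cohort, reusing a uniform-prior Beta posterior and the final-analysis thresholds (τ_LRV = 80%, τ_TV = 20%). The LRV/TV pair and the number of simulated trials are illustrative, and a cohort-wise posterior is used here rather than an EXNEX fit:

```python
import math
import random

def post_prob_exceeds(c, y, n):
    """P(p > c | y of n) under a uniform prior (Beta-Binomial tail identity)."""
    return sum(math.comb(n + 1, i) * c**i * (1 - c) ** (n + 1 - i) for i in range(y + 1))

def decision(y, n, lrv, tv, tau_lrv=0.80, tau_tv=0.20):
    """Final-analysis dual-criteria 3ODM call for one cohort."""
    p_lrv, p_tv = post_prob_exceeds(lrv, y, n), post_prob_exceeds(tv, y, n)
    if p_lrv > tau_lrv and p_tv > tau_tv:
        return "Go"
    if p_lrv <= tau_lrv and p_tv <= tau_tv:
        return "No-Go"
    return "Consider"

def prob_of_go(true_rr, n=30, lrv=0.20, tv=0.40, n_trials=4000, seed=11):
    """Monte Carlo probability of a 'Go' call when the true RR equals true_rr."""
    rng = random.Random(seed)
    hits = sum(decision(sum(rng.random() < true_rr for _ in range(n)), n, lrv, tv) == "Go"
               for _ in range(n_trials))
    return hits / n_trials

correct_go = prob_of_go(0.40)    # true RR at the TV: should be high
incorrect_go = prob_of_go(0.20)  # true RR at the LRV: should be low
```

Sweeping `n` in such a loop is one way to check whether a planned cohort size meets targets like "correct Go of at least 80%" before the protocol is finalized.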

Metrics for Performance Evaluation
As typically required by regulatory agencies, the operating characteristics of Bayesian designs are routinely assessed by their frequentist properties (i.e., the Bayesian-frequentist hybrid approach) via extensive simulations. Depending on which models are being compared, three performance metrics are used in our simulations: (a) for comparisons of models A versus B, B versus C, and C versus D, the percentages of making a correct No-Go/Consider/Go decision for each of the five cohorts under different scenarios are reported; (b) for the comparison of models B versus C, as both are under the 3ODM framework without an interim analysis, the only difference is the modeling of RR versus LOR, which may differ in estimation accuracy (especially when the LRVs across cohorts are different), so we also report the mean squared error (MSE) for scenarios under Simulation 2 (with varying LRVs), as the results are highly similar in Simulation 1 with an identical LRV (the "order of efficacy" among cohorts remains unchanged after transforming RR into LOR, differing only by an offset); (c) for the comparison of models C and D, as they differ only in the absence or presence of an interim analysis, we also report the average sample size in each cohort to evaluate the potential saving in the number of patients.

Simulation Results
Table 3 shows the percentages of making No-Go/Consider/Go decisions for each of the five cohorts among the four models A-D under the different scenarios indexed in Table 2. All the numerical results are also illustrated by stacked bar charts in Appendix Figure 1, supplementary materials. For clarity of presentation, we discuss model performance in three subsections below, comparing 2ODM versus 3ODM (Section 3.3.1), modeling RR versus LOR (Section 3.3.2), and designs with or without an interim analysis (Section 3.3.3). In the "supplementary materials" file that accompanies this article, more simulation results with two additional models are included.

Results Comparing 2ODM versus 3ODM
Columns "Model A" (EXNEX_RR_2ODM) and "Model B" (EXNEX_RR_3ODM) in Table 3 are compared in this section. We have a few observations:

• In the global null scenarios ("1-GN" and "2-GN"), the correct "No-Go" probabilities are high (85%-86% for Simulation 1, and 72%-90% for Simulation 2) and comparable (differing by at most ~4%) between 2ODM and 3ODM. However, the cohort-wise incorrect "Go" probabilities for 2ODM can be as high as 14%-15% in Simulation 1 and 10%-24% in Simulation 2. This is unacceptable even in the early PoC context: the probability of advancing at least one (ineffective) cohort into future development (i.e., the family-wise error rate, FWER) is as high as 42% in Simulation 1 and 63% in Simulation 2. Under the 3ODM framework, the incorrect "Go" probabilities are consistently smaller and, except for cohort 1 in the "2-GN" scenario (more on this case later in Section 3.3.2), all less than 2.2%. The FWER under 3ODM is as low as 6.9% for Simulation 1 and 26% for Simulation 2. Admittedly, the incorporation of 3ODM does not increase the correct "Go" probabilities (which remain comparable to those under 2ODM), as it simply "re-classifies" incorrect decisions into the "Consider" zone. This is nevertheless practically meaningful, as the "Consider" zone provides a "buffer" against false decisions and encourages additional data support before a final decision is made. In this regard, the issue of inflated false positive or false negative rates incurred by information-borrowing (mentioned in the "Introduction" section) can be mitigated by switching from the "binary" decision-making rule to the "trinary" 3ODM framework.

• In the global alternative scenarios ("1-GA" and "2-GA"), as expected, 3ODM consistently has lower correct "Go" probabilities due to the introduction of the "Consider" zone; however, the incorrect "No-Go" probabilities for either model are kept at or well below 5% in all scenarios except for cohort 1 in the "2-GA" scenario (again, more on this case later in Section 3.3.2).

• In the mixed scenarios (indexed by "-Mix"): on one hand, for the cases where the desirable decision for a cohort does not include "Consider" (i.e., a cohort's RR is either at or below the LRV, or at or above the TV), the same conclusions from the global null and alternative scenarios hold (i.e., 3ODM "re-classifies" incorrect "Go" or "No-Go" decisions into the "Consider" zone, so the operating characteristics under 3ODM will always be better than those under 2ODM). On the other hand, for cases where the true RR lies between the LRV and TV (so the desirable decision is "Consider"), 3ODM is undoubtedly superior. Take cohort 3 in the "1-Mix-mRR" scenario as an example: while 3ODM has a 45.6% probability of making the correct "Consider" decision (the true RR is 30% while LRV = 20% and TV = 40%), 2ODM has a staggering 74.6% probability of making a "Go" decision.
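Because information-borrowing makes the cohort-level decisions correlated, the FWER figures quoted above cannot be recovered from per-cohort error rates under an independence assumption; they must be counted directly from the simulated trials. A minimal sketch with hypothetical decision flags (not the paper's simulation output):

```python
import numpy as np

def fwer(incorrect_go):
    """Family-wise error rate: fraction of simulated trials in which at
    least one truly ineffective cohort receives an incorrect 'Go'.
    incorrect_go: (n_trials, n_cohorts) boolean array."""
    incorrect_go = np.asarray(incorrect_go, dtype=bool)
    return float(np.mean(incorrect_go.any(axis=1)))

# Four hypothetical simulated trials of a five-cohort global null scenario:
flags = np.array([
    [0, 0, 1, 0, 0],  # one false 'Go' -> the trial counts once
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],  # still counts only once, despite two errors
    [0, 0, 0, 0, 0],
], dtype=bool)
print(fwer(flags))  # 0.5
```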

Results Comparing Modeling Response Rate versus Log Odds Ratio
Columns "Model B" (EXNEX_RR_3ODM) and "Model C" (EXNEX_LOR_3ODM) in Table 3 are first compared in this section. In most scenarios, the two models perform similarly, especially for scenarios in Simulation 1. This is because of the assumed common LRV (0.20), which means modeling the RR is equivalent to modeling the LOR (except for an offset). A few exceptions that demonstrate the differences observed in Simulation 2 are listed below:

• Modeling LOR may provide better performance under the more realistic global null or global alternative scenarios (Simulation 2): for cohort 1, in the "2-GN" scenario the incorrect "Go" probabilities are 19.7% and 1.8% for modeling RR and LOR, respectively; also for cohort 1, in the "2-GA" scenario the incorrect "No-Go" probabilities are 16.1% and 5.5%, respectively. The much-increased probability of making an incorrect decision using RR primarily arises from cohort 1's RR being the lowest among the five cohorts: its TV of 15% is even lower than the next lowest LRV of 20%. As a result, the information-borrowing strategy based on the absolute value of RR tends to "excessively" borrow from the other cohorts, which inflates the Type I error rate in the "2-GN" scenario. Modeling LOR in such cases is expected to substantially alleviate the issue: the FWER for the "2-GN" scenario reduces from 26% (modeling RR) to 6.2% (modeling LOR).

• A similar finding is observed for the mixed scenario: in the "2-Mix-2" scenario, the desirable decision for cohort 1 is "No-Go", and the probabilities of an incorrect "Go" decision are 18.7% and 7.2% for modeling RR and LOR, respectively. With the rest of the cohorts having much higher RRs (30%-60%) than cohort 1 (5%), modeling LOR appears to be more reliable by keeping the incorrect "Go" probability lower.
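The scale difference driving these results can be made concrete. Borrowing on the LOR scale aligns cohorts by their improvement over their own benchmarks rather than by absolute RR; a short sketch (Python, with made-up benchmark and response rates):

```python
import math

def log_odds_ratio(rr, benchmark_rr):
    """Log odds ratio of a cohort's response rate over its SoC benchmark."""
    logit = lambda p: math.log(p / (1.0 - p))
    return logit(rr) - logit(benchmark_rr)

# A cohort-1-like case: a very low benchmark (5%) with a modest absolute RR.
# On the RR scale, 15% looks far below the other cohorts; on the LOR scale,
# its improvement is comparable to that of a 35% -> 60% cohort, so borrowing
# is driven by relative improvement instead of absolute level.
print(round(log_odds_ratio(0.15, 0.05), 3))   # ~1.21
print(round(log_odds_ratio(0.60, 0.35), 3))   # ~1.025
```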
Table 4 compares the MSEs between the two models (RR vs. LOR) under all six scenarios in Simulation 2 (the rows labeled "CW", which stands for cohort-wise analyses, will be discussed later in Section 3.3.4). The unit is 1/1000. Except for the "2-Mix-LOR-1" scenario and one cohort in the "2-Mix-2" scenario, modeling LOR generally results in smaller MSEs than modeling RR. It is worth pointing out that the quantities that impact decision-making involve not only posterior mean estimation but also posterior variance estimation: the former is depicted using MSEs here, but the latter, although it can be calculated from the posterior samples, cannot be evaluated for accuracy because the true posterior variance is unknown. Although the estimation accuracy of the posterior mean is important, the posterior variance estimates tend to exert greater influence on the decision-making.

Results Comparing Designs with or without an Interim Analysis
Columns "Model C" (EXNEX_LOR_3ODM) and "Model D" (EXNEX_LOR_3ODM_IA) in Table 3 are compared in this section. Across all scenarios, the probabilities of making correct decisions are highly comparable regardless of whether an interim analysis is conducted: the largest difference is less than 5%, with the only exception of cohort 1 in the "2-Mix-LOR-2" scenario (less than 9%). The advantage of having an interim analysis, with potential early termination of ineffective cohort(s), is the reduction of the total number of patients in the basket trial: even under our current strict simulation thresholds (i.e., a cohort is relatively hard to terminate early unless it shows a truly disastrous outcome), a cohort with 30 planned patients ends up with, on average, five fewer patients if the cohort is ineffective (see Table 5). If the sponsor is willing to relax the thresholds to allow more early terminations at the interim analysis, the expected savings in sample size would be even greater. Of course, this comes with a tradeoff: a cohort may be incorrectly terminated early due to the limited sample size and/or relatively unfavorable patient baseline characteristics that impact efficacy, even though the desirable decision would be "Consider" (or even "Go") if more representative patients were enrolled later into the cohort.
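To illustrate the mechanics of such an interim look (the cutoff and prior below are illustrative assumptions, not the paper's calibrated values), an early "No-Go" can be triggered when the posterior probability of beating the LRV falls below a threshold:

```python
import numpy as np

rng = np.random.default_rng(2024)

def interim_no_go(responders, n, lrv, prob_cutoff=0.10,
                  a0=0.5, b0=0.5, ndraws=100_000):
    """Stop a cohort early for 'No-Go' when P(theta > LRV | data) falls
    below prob_cutoff, under an independent Beta(a0, b0) prior (the paper
    instead uses the EXNEX posterior; the shape of the rule is the same)."""
    draws = rng.beta(a0 + responders, b0 + n - responders, ndraws)
    return float(np.mean(draws > lrv)) < prob_cutoff

# Hypothetical interim look after 15 of 30 planned patients, LRV = 20%:
print(interim_no_go(0, 15, 0.20))  # no responders: stop early (True)
print(interim_no_go(4, 15, 0.20))  # 4/15 responders: continue (False)
```

Relaxing `prob_cutoff` upward makes early termination easier, which is exactly the tradeoff between sample-size savings and the risk of incorrectly stopping a salvageable cohort discussed above.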

Results Comparing with Independent Cohort-Wise Analyses
Although advanced basket trial designs offer potential advantages via information-borrowing, independent cohort-wise analyses are still applicable in the basket trial setting. In addition, although the benefit of information-borrowing is well understood under 2ODM, how it may improve the operating characteristics under 3ODM is yet to be evaluated. Therefore, we include some simulation results in this section. For the same scenarios in Table 2, we report in Appendix Figure 2, supplementary materials, the percentages of making No-Go/Consider/Go decisions when modeling the LOR under the 3ODM framework, either analyzed independently (left panels labeled "LOR_cw" in each scenario) or allowing information-borrowing (right panels labeled "LOR" in each scenario). The cohort-wise analyses were done by setting the prior weights of the exchangeable components to zero, i.e., no information-borrowing, by setting (πj1, πj2, πj3) = (0, 0, 1) in EXNEX; the conclusion should be applicable to other basket designs in which the prior weight on exchangeability can be set to its minimum. We briefly summarize three major observations:

• When all cohorts are relatively homogeneous (i.e., under the global null or global alternative scenarios), the information-borrowing offered by EXNEX (which we believe can be generalized to other basket trial designs) consistently outperforms cohort-wise analyses by maintaining highly comparable true negative and true positive rates, while (on average) reducing false positive rates from 6.5% to 1.4% (Simulation 1) and from 4.7% to 1.3% (Simulation 2), and (on average) reducing false negative rates from 9.6% to 2.0% (Simulation 1) and from 8.3% to 2.6% (Simulation 2). The reduced error rates are achieved by a wider "Consider" zone.

• In the presence of both effective and ineffective cohorts (i.e., mixed scenarios where the truth is either "Go" or "No-Go"), information-borrowing generally offers smaller error rates thanks to the increased "Consider" probabilities, although the percentages
of making correct decisions are slightly lower than those in the cohort-wise analyses. Regardless of the truth, a wider "Consider" zone is consistently observed for EXNEX, most likely due to information-borrowing from heterogeneous cohort(s) when such borrowing should be discouraged.

• When one or more cohorts have moderate efficacy: in scenario "1-Mix-mRR", EXNEX (basket design) yields a much higher percentage of making the correct "Consider" decision (46.1% vs. 18.1% for cohort 3) than the cohort-wise analysis; for scenarios in Simulation 2, the results in "2-Mix-LOR-1" are comparable (<6% difference), while in "2-Mix-LOR-2" EXNEX (basket design) generally yields better results than cohort-wise analyses. In such more realistic scenarios, the information-borrowing strategy generally leads to higher, or at least comparable, "Consider" probabilities than cohort-wise analyses.
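The weight specification can be illustrated with a toy version of the EXNEX prior (component locations and scales below are invented placeholders, not the paper's hyperparameters): a cohort parameter is drawn from a three-component mixture, and weights (0, 0, 1) place all mass on the diffuse nonexchangeable component, i.e., no borrowing:

```python
import numpy as np

rng = np.random.default_rng(7)

def exnex_prior_draws(weights, n=100_000):
    """Draws of a cohort-level log-odds parameter from a simplified EXNEX
    prior: two exchangeable (EX) components and one nonexchangeable (NEX)
    component. Means and sds here are illustrative placeholders only."""
    comps = rng.choice(3, size=n, p=weights)
    means = np.array([-1.4, 0.0, -1.4])  # EX1, EX2, NEX locations (assumed)
    sds = np.array([0.5, 0.5, 2.0])      # NEX is deliberately diffuse
    return rng.normal(means[comps], sds[comps])

# (pi_j1, pi_j2, pi_j3) = (0, 0, 1): every draw comes from the NEX
# component, so the cohort is effectively analyzed on its own.
draws = exnex_prior_draws((0.0, 0.0, 1.0))
print(float(draws.std()))  # close to the NEX sd of 2.0
```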
As for the MSE comparison in Table 4: except for one cohort in the "2-Mix-1" scenario, the MSEs in the cohort-wise analyses are larger than those from EXNEX (whether modeling RR or LOR), indicating that improved accuracy in parameter estimation may be achieved via information-borrowing.

A Case Example
In Appendix 8 of the ROAR study protocol (Wen et al. 2022), analysis results for a total of six hypothetical examples are provided. We pick Example 5, "Mixed Scenario where Roughly Half the Histologies are Declared Successful," with the following data at the final analysis across a total of nine cohorts (for simplicity, in this case example we did not consider an interim analysis). The final decisions concluded in the ROAR protocol (binary) and by our proposed EXNEX + 3ODM (trinary), with modeling of response rates (RRs) and log odds ratios (LORs), are provided in the last three columns of Table 6. The key decision thresholds are shown below; these are typical choices to strike a balance between overly aggressive and overly conservative decisions:

• Final Go: P(θj > θLRV,j | data) > 0.80 AND P(θj > θTV,j | data) > 0.20

• Final No-Go: P(θj > θLRV,j | data) ≤ 0.80 AND P(θj > θTV,j | data) ≤ 0.20

• Final Consider: All other cases.
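The dual criteria above are straightforward to operationalize once posterior draws of θj are available. A sketch using an independent Beta-Binomial posterior in place of the EXNEX posterior (the prior and data below are hypothetical; the rule itself is as stated above):

```python
import numpy as np

rng = np.random.default_rng(1)

def final_decision(responders, n, lrv, tv, a0=0.5, b0=0.5, ndraws=200_000):
    """Trinary dual-criterion rule applied to posterior draws of theta_j.
    A conjugate Beta(a0, b0) posterior stands in for the EXNEX posterior."""
    theta = rng.beta(a0 + responders, b0 + n - responders, ndraws)
    p_lrv = float(np.mean(theta > lrv))  # P(theta_j > LRV | data)
    p_tv = float(np.mean(theta > tv))    # P(theta_j > TV  | data)
    if p_lrv > 0.80 and p_tv > 0.20:
        return "Go"
    if p_lrv <= 0.80 and p_tv <= 0.20:
        return "No-Go"
    return "Consider"

# Hypothetical cohorts of 30 patients with LRV = 20% and TV = 40%:
print(final_decision(12, 30, 0.20, 0.40))  # strong signal -> "Go"
print(final_decision(9, 30, 0.20, 0.40))   # middling signal -> "Consider"
print(final_decision(3, 30, 0.20, 0.40))   # weak signal -> "No-Go"
```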
Comparing the final decisions between the binary "success/futility" rule in the ROAR protocol and the trinary "Go/Consider/No-Go" rule in EXNEX + 3ODM (modeling RR and LOR), we see that:

• For modeling RR: seven of the nine cohorts reached the same conclusions as in ROAR, while for cohorts 3 and 7 the models yielded different decisions. EXNEX + 3ODM concluded with a "Consider" decision for both cohorts, while the hierarchical analysis with a binary decision rule in the ROAR protocol recommended "Success" for both.

• For modeling LOR: six of the nine cohorts reached the same conclusions as in ROAR, while for cohorts 1, 3, and 7 the models yielded different decisions. EXNEX + 3ODM concluded with a "Consider" decision for all three cohorts, while the hierarchical analysis with a binary decision rule in the ROAR protocol recommended "Success" for all three.
Although the truth for this hypothetical example (a single-case scenario) is never known, the findings turn out to be quite interesting: given the high uncertainty from the relatively small sample sizes (n = 6 for cohorts 3 and 7) and the deviation of the observed ORRs from the TVs (38% vs. 60% for cohort 3, and 33% vs. 50% for cohort 7), we think the proposed EXNEX + 3ODM approach yields the more appropriate recommendations ("Consider"). Furthermore, we believe that with modeling of the LOR, the "Consider" decision made for cohort 1 is more reasonable. This is because cohort 9 has a higher LRV/TV (i.e., it is a relatively more "heterogeneous" cohort than the others) and a much higher observed ORR of 83%: when modeling RR, the Bayesian hierarchical information-borrowing from cohort 9 may have "boosted" the efficacy estimate for cohort 1 too much (42%), while modeling LOR appears to have alleviated this "inflation" by concluding with a "Consider" decision for cohort 1.

Discussion and Future Work
Over the past 20 years, there has never been a lack of novel proposals for basket trial designs, but the important decision-making framework in the basket trial setting has somehow received less attention than it deserves. To the best of our knowledge, our research is the first effort to formally incorporate the 3ODM framework, widely used in single-cohort PoC studies, into a broad range of basket trial designs. While it is conventional to accept the sacrifice of slightly reduced correct Go/No-Go decision probabilities when borrowing information under mixed scenarios using traditional basket design models, our simulation studies demonstrated the utility of adding a "Consider" zone to the conventional dichotomized "Go/No-Go" zones via pre-specified dual criteria involving both the reference and target values. The ability to reduce incorrect "Go" probabilities for ineffective cohorts and to correctly classify an "average" cohort into the "Consider" zone should justify the incorporation of 3ODM in a practical basket trial setting, as it addresses the major limitation of the conventional 2ODM framework in such mixed (and probably more realistic) scenarios, despite the reduced correct "Go" probabilities (powers) for effective cohorts in the presence of the "Consider" zone. In addition, motivated by practical case examples where different prognoses could result in a wide range of reference RRs, we further evaluated modeling the log odds ratio of each cohort's RR over the corresponding SoC benchmark, so that stronger information-borrowing occurs among cohorts with similar efficacy improvements over their respective SoCs. Modeling LOR turns out to be more effective when the benchmark RRs are different. We also evaluated performance under 3ODM with modeling of LOR and an interim analysis, where an early "No-Go" decision triggers a stop of enrollment but all accrued data are used at the final analysis. With comparable decision-making probabilities but potential saving
of sample size, planning an interim analysis with appropriate dual-criteria thresholds should be encouraged. Lastly, we found that when 3ODM is used, the error rates are generally lower under EXNEX, which allows for information-borrowing, than under cohort-wise analyses (especially when cohorts are relatively homogeneous, as in the global null or global alternative scenarios), but a wider "Consider" zone is not uncommon, which encourages sponsors to examine additional data evidence.
It should be noted that a "Go" decision out of the basket trial is a necessary but not sufficient condition for the initiation of a pivotal trial. Numerous multifaceted considerations (e.g., from clinical and commercial perspectives) need to be taken into account by the sponsor, along with reaching alignment with external partners and health authorities. The totality of clinical data (e.g., key secondary endpoints that provide an in-depth understanding of the efficacy/safety profile), together with current portfolio prioritizations, projected competitive landscapes, etc., will jointly determine whether it is viable to launch a registrational trial. In the case of a "Consider" outcome, we recommend pre-specifying the "triggered actions," such as using secondary endpoint(s), exploring subpopulations (e.g., defined by biomarkers), prioritizing other assets with clear decisions, putting more emphasis on potentially promising combination regimens than on single agents, or launching a phase 3 trial with less optimistic effect-size assumptions and/or an envisaged sample size reassessment.
Although the 3ODM framework is practically sensible, to the best of our knowledge it does not possess a formal statistical optimality property like the binary-decision Neyman-Pearson hypothesis testing framework (Lehmann and Romano 2022), unless we define a loss function combining the three outcomes and minimize the expected loss. A set of criteria associated with a relatively larger "Go" probability under certain favorable treatment effect scenarios will also have a smaller "No-Go" probability under certain unfavorable treatment effect scenarios. Therefore, extensive simulations are always performed for a specific trial to select the criteria that strike a good tradeoff.
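One way to formalize such a tradeoff, as suggested above, is to attach a loss to each (true state, action) pair and pick the action minimizing posterior expected loss. The loss matrix and posterior below are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative losses; rows = true state (ineffective, marginal, effective),
# columns = action (No-Go, Consider, Go). A 'Go' on an ineffective cohort
# and a 'No-Go' on an effective one are penalized most heavily.
LOSS = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
])

def bayes_action(theta_draws, lrv, tv):
    """Posterior state probabilities from draws of theta, then the action
    minimizing expected loss under the assumed LOSS matrix."""
    p_state = np.array([
        np.mean(theta_draws <= lrv),                        # ineffective
        np.mean((theta_draws > lrv) & (theta_draws < tv)),  # marginal
        np.mean(theta_draws >= tv),                         # effective
    ])
    expected_loss = p_state @ LOSS  # one expected loss per action
    return ("No-Go", "Consider", "Go")[int(np.argmin(expected_loss))]

# e.g., 9/30 responders under a Beta(0.5, 0.5) prior, LRV = 20%, TV = 40%:
theta = rng.beta(9.5, 21.5, 100_000)
print(bayes_action(theta, lrv=0.20, tv=0.40))  # "Consider"
```

The dual-criterion thresholds in Section "A Case Example" can be viewed as a pragmatic stand-in for such a loss-based rule, calibrated by simulation rather than by an explicit loss function.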
There are a few near-term future directions that could be explored. First, although in this article we only discussed four comparisons among models A-D to answer the three primary questions of interest (related to 3ODM, modeling LOR, and interim analysis), depending on practical needs it is straightforward to evaluate other "combinations," such as LOR_2ODM versus LOR_3ODM, RR_3ODM versus RR_3ODM_IA, etc., each varying only one factor. Second, it is also possible to allow an enrollment stop when an early "Go" decision is made at the interim (although, practically speaking, sponsors would prefer to enroll more patients to confirm the efficacy trend); how this practice impacts the operating characteristics would require additional simulations. Third, for convenience we use posterior probabilities for the interim decision-making; however, it may be more appropriate to use predictive probabilities for such early decision-making. Fourth, different enrollment rates are expected across cohorts in practice, leading to imbalanced cohort sizes at the time of the interim analysis; we expect the general conclusions discussed above to remain valid, with the caveat that a minimum cohort size (e.g., at least 10 patients) should be pre-specified before making any concrete decision on that cohort.
Two long-term future directions are briefly included here. First, the extension of 3ODM to basket trials with three-tier decision rules may be incorporated into the platform trial setting as well. Meyer et al. (2022) is one recent study of Bayesian decision rules for platform trials, where a binary "GO/STOP" decision is made at the cohort level. Due to the "perpetual" manner of platform trials, the current information-borrowing strategy may need to be refined to accommodate emerging data from new entries and accumulated data from "graduating" sub-studies. Second, an often neglected yet important aspect of using Bayesian modeling for basket trial data analyses is the determination of convergence to stationarity when running Markov chain Monte Carlo (MCMC) algorithms: a correctly specified statistical model does not necessarily mean that the posterior samples are representative, especially for early PoC basket trial studies with rather limited sample sizes. The issue is mostly alleviated with increasing cohort size, but improvement is less likely from further increasing the number of iterations, chains, or burn-in periods. Statisticians evaluating basket trial designs by simulations should always keep this in mind.
All programs written in the R programming language can be accessed from the corresponding author's GitHub page at https://github.com/Joyrsyy/Basket_Trial_3ODM.

Supplementary Materials
In the supplementary materials, we first provide additional simulation results to complete the comparisons between modeling RR and modeling LOR (either under "2ODM" or under "3ODM and IA"). Second, for better illustration, we include the stacked bar charts for all simulation results.

Figure 1. Flowchart of incorporating the 3ODM framework into basket trial designs with an interim analysis. This example shows five cohorts (C1-C5) with an interim analysis decision (by color) of Go/Go/Consider/No-Go/No-Go. Enrollment of C1-C3 (the "Go" or "Consider" cohorts) continues with an additional N/2 patients (illustrated by thicker nodes), and the final analysis decision based on all cohorts is Go/Consider/Consider/No-Go/No-Go. N: total sample size planned for each cohort.

Table 1. Summary of selective categories of basket trial designs.

Table 2. Scenarios showing the true response rates that generate the simulated basket trials.

Table 3. Simulation results comparing the four models A-D.
NOTE: The smallest MSE of each scenario/cohort is shown in bold. RR: response rate; LOR: log odds ratio; CW: cohort-wise.

Table 6. Hypothetical trial example #5 in the ROAR protocol, with EXNEX + 3ODM results including both modeling of response rate (RR) and log odds ratio (LOR).