A more powerful test for three-arm non-inferiority via risk difference: Frequentist and Bayesian approaches

The need for improved interventions in many legacy therapeutic areas is a high priority, with the potential to reduce both the cost of medical care and poor outcomes for many patients. Typically, clinical efficacy is the primary criterion for evaluating the beneficial effect of a treatment. However, other factors (e.g. fewer side-effects, lower cost burden, less debilitating or less intensive regimens) can make a slightly less efficacious treatment preferable for a subgroup of patients. This often leads to non-inferiority (NI) testing. NI trials may or may not include a placebo arm for ethical reasons. When a placebo arm is included, however, the resulting three-arm trial is more prudent, since it requires less stringent assumptions than a two-arm placebo-free trial. In this article, we consider both Frequentist and Bayesian procedures for testing NI in the three-arm trial with binary outcomes when the functional of interest is the risk difference. An improved Frequentist approach is proposed first, followed by a Bayesian counterpart. Bayesian methods have a natural advantage in many active-control trials, including NI trials, since they can seamlessly integrate substantial prior information. In addition, we discuss sample size calculation and draw an interesting connection between the two paradigms.


Introduction
Well-designed Randomized Control Trials (RCTs) are the accepted standard for measuring an intervention's impact across many diverse disease areas and are thus considered the gold standard for establishing a new treatment regime. In the presence of clinically proven established treatments/therapies, it is not ethically justified to allocate patients to a placebo arm. This gives rise to trials that compare the experimental drug with one or more active comparators. Such trials are considered standard in Comparative Effectiveness Research (CER), which is designed to provide evidence to patients and clinicians on the effectiveness, benefits, and risks of a broad range of interventions. Although multiple treatment comparisons are possible [27], for the sake of simplicity we consider one experimental arm compared with one active reference. In certain situations, however, superiority of the new drug might be in question, and it may be reasonable to test whether the experimental treatment is not worse than the reference by more than a pre-specified margin. These types of active-controlled trials, known as non-inferiority (NI) trials, are intended to show whether the new drug retains a substantial portion of the active control effect, making it preferable to some patients because of its other desirable properties [8]. The choice of the pre-specified margin, also termed the NI margin (δ), is a critical issue in these trials. Although regulatory agencies have provided broad guidelines on the choice of δ [9,10,22,23], it must be examined in light of the past performance of the active control, and it is usually desirable to choose a margin that reflects the largest clinically acceptable loss of effect. Hence, NI trials need to be administered with extreme caution [10,21,32].
In the last few decades, two-arm NI trials (experimental vs. reference) have been developed predominantly under the Frequentist paradigm. However, two-arm trials face some major challenges from the design, analysis, and interpretation points of view. One major concern is that a two-arm NI trial may not support assay sensitivity (AS) directly and requires external validation [8]: without a placebo arm, no direct proof can be established of the efficacy of the reference drug over placebo. To compensate for this, [9] recommends the inclusion of a placebo arm when ethically possible, resulting in the three-arm 'gold-standard' design, which provides greater confidence concerning AS and less concern about external validity. For three-arm trials, [25,31] proposed choosing the NI margin as a pre-specified fraction of the unknown effect size of the reference drug, instead of directly specifying a fixed margin. Later, this approach was extended first by [34] and then by [24] for binary endpoints by considering the difference of proportions (or risk difference). Pigeot et al. [31] suggested that the superiority of the reference drug over placebo should be established first, to satisfy the AS assumption, before carrying out such NI testing. An alternative is the fixed margin approach of [18], which requires joint testing of NI and AS, albeit resulting in a rather conservative intersection-union type test [17]. In this article, we propose an improved Frequentist test based on the conditional principle, following Pigeot's fraction margin approach for a binary outcome. Note that, for binary endpoints, the risk difference is not the only functional of interest. However, the nature of the NI hypothesis, the margin construction, and the resulting methodological formulation for other types of functionals diverge significantly, as shown in our recently published papers (see [4,5]). For related developments in those directions, please see the discussion section.
Clinical trials, particularly NI trials, have long used Bayesian approaches; see, for example, [11,13,16,33] among others. Gamalo et al. [12] considered a Bayesian approach for the analysis of two-arm NI trials with binary outcomes. Ghosh et al. [16] also put forward a novel Bayesian analysis for a three-arm NI trial following Pigeot's fraction margin approach. The existence of prior information is advantageous in an NI trial, and the Bayesian paradigm provides a natural route to incorporate it, helping to reduce the sample size, as well as the cost, by combining that information with the current trial. In this paper, we also propose an exact Bayesian procedure, based on the conditional Frequentist principle, to test NI. We further propose an approximation-based Bayesian approach that gives a closed-form solution for the Bayesian posterior probability, thus avoiding the computational complexity of the exact Bayesian approach at a slight loss of accuracy. All approaches are evaluated on simulated data and on one published dataset from a mental health trial.
The rest of the article is organized as follows. In Section 2, we state the NI hypothesis and describe both the existing and our proposed Frequentist methods for testing it. In Section 3, we propose a novel Bayesian methodology to design and analyze a three-arm NI trial with binary outcomes. Along with conjugate priors, we consider two other prior scenarios incorporating the AS condition. Section 4 presents an interesting connection between Frequentist and Bayesian posterior probabilities in the three-arm trial. In Section 5, we present the algorithm and results for the simulation studies as well as sample size tables. In Section 6, we apply our proposed methodology to a published clinical trial dataset. We conclude the article with a discussion in Section 7.

Three-arm Frequentist NI testing
Following [4,5,24], we construct the three-arm non-inferiority trial for a primary binary endpoint under the experimental (E), reference (R), and placebo (P) arms. Let X_l, l ∈ {E, R, P}, denote the number of successes among n_l subjects in the lth arm, so that X_l ∼ Bin(n_l, π_l) with response probability π_l ∈ [0, 1]. Without loss of generality, we assume that higher response probabilities correspond to greater treatment benefit. In general, for a two-arm NI trial, the risk difference problem with pre-specified NI margin δ < 0 is stated as H_0: π_E − π_R ≤ δ vs. H_1: π_E − π_R > δ. Pigeot et al. [31] and Kieser and Friede [24] proposed expressing δ as f(π_R − π_P), where f is a negative fraction, assuming the AS condition, that is, π_R > π_P. As discussed in [4,5], we can build the three-arm NI hypothesis using δ, and hence f, as H_0: (π_E − π_P)/(π_R − π_P) ≤ θ vs. H_1: (π_E − π_P)/(π_R − π_P) > θ, where θ = 1 + f is a pre-specified fraction of the effect of the reference drug relative to placebo. Under the alternative, the efficacy of the test drug compared to placebo exceeds θ × 100% of the efficacy of the reference drug compared to placebo. Although the choice of θ (∈ [0, 1]) as shown in [31] depends on clinical considerations, here θ is limited to [0.5, 1) so that, for NI, the new drug retains at least 50% of the effect of the active control. Hence, the NI hypothesis for the risk difference can be written as

H_0: π_E − θπ_R − (1 − θ)π_P ≤ 0 vs. H_1: π_E − θπ_R − (1 − θ)π_P > 0. (1)

For this NI test, rejection of the null hypothesis establishes that a pre-defined proportion of the unknown effect of the reference over placebo is maintained by the experimental treatment.

Existing marginal approach
Kieser and Friede [24] developed Frequentist test procedures for NI testing in the three-arm trial with binary outcomes. They constructed the test statistic for the NI hypothesis in (1) from the maximum likelihood estimate (MLE) of the linear contrast π_E − θπ_R − (1 − θ)π_P, namely T = π̂_E − θπ̂_R − (1 − θ)π̂_P, where π̂_l = X_l/n_l is the MLE of π_l, l ∈ {E, R, P}. Different tests can be obtained by using the maximum likelihood (ML) or restricted ML (RML) estimate of the variance of T,

Var(T) = π_E(1 − π_E)/n_E + θ²π_R(1 − π_R)/n_R + (1 − θ)²π_P(1 − π_P)/n_P.

The RML estimates are obtained subject to the constraint π_E − θπ_R − (1 − θ)π_P = 0. Under asymptotic normality, the standardized statistic T/√Var(T) is approximately standard Normal, and an asymptotic level-α Wald-type test rejects the null hypothesis if the standardized statistic exceeds the 100(1 − α)% percentile point z_{1−α}. The power of the test can then be written as

Power = Φ((μ_T^alt − z_{1−α} σ_T^null)/σ_T^alt),

where Φ is the standard Normal distribution function, σ_T^null denotes the standard deviation under H_0, and μ_T^alt, σ_T^alt represent the mean and standard deviation, respectively, under H_1.
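As a concrete illustration, the marginal Wald-type test can be sketched in a few lines of Python. This is a minimal sketch using the unrestricted ML estimates in the variance; the function name and example counts are ours.

```python
import math

Z_975 = 1.959963985  # z_{1-alpha} for a one-sided test at alpha = 0.025

def wald_ni_test(x_E, n_E, x_R, n_R, x_P, n_P, theta):
    """Wald-type test of H0: pi_E - theta*pi_R - (1-theta)*pi_P <= 0,
    using unrestricted ML estimates in the variance of T."""
    pE, pR, pP = x_E / n_E, x_R / n_R, x_P / n_P
    T = pE - theta * pR - (1 - theta) * pP
    var_T = (pE * (1 - pE) / n_E
             + theta ** 2 * pR * (1 - pR) / n_R
             + (1 - theta) ** 2 * pP * (1 - pP) / n_P)
    z = T / math.sqrt(var_T)
    return z, z > Z_975  # reject H0 (declare NI) if z exceeds z_{0.975}
```

For example, with x_E = 80, x_R = 75, x_P = 10 out of n = 100 subjects per arm and θ = 0.8, the standardized statistic is z ≈ 2.63, so NI is declared at α = 0.025.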

A novel Frequentist proposal
It is important to note that a pretest for the superiority of the active control over the placebo should be performed before NI is investigated (see [31]). NI testing is then carried out as a second step only if the AS condition (π_R > π_P) holds. It is often argued [16,24,25,31] that if the active control retains the majority of its effect over placebo then, in practice, the statistical power of the joint test (NI and AS) will be very similar to that of testing NI alone [35]. However, this may not always be true, and traditionally the pre-tested AS condition has not been used further in NI testing under the marginal Frequentist effect-retention approach, except for margin construction. We introduce here a more powerful conditional approach for the risk difference. Since the NI and AS hypotheses are related, this leads to significant power gains in certain situations. Notably, [4,5] proposed a similar approach for the risk ratio and odds ratio, albeit without any theoretical guarantee of a power gain. A major point of this paper is to establish this gain both theoretically and via simulation. For finding the MLE, we truncate the parameter space of (π_E, π_R, π_P) to {(π_E, π_R, π_P): π_E ∈ [0, 1], π_R ∈ [0, 1], π_P ∈ [0, 1], π_R > π_P}. One may develop a likelihood ratio test based on the statistic T conditioned on the AS condition π̂_R > π̂_P under the null hypothesis via a Wald-type test. Following [29], one can improve the convergence via the RML, which requires solving, under H_0,

(π̂_{E,RML}, π̂_{R,RML}, π̂_{P,RML}) = arg max log l(π_E, π_R, π_P),

where log l(π_E, π_R, π_P) is the log-likelihood of (π_E, π_R, π_P). For the odds and risk ratios, [4,5] discussed a strategy using the unrestricted MLE to reduce the computational difficulty; this strategy is well established in many practical applications, as mentioned in [20,26]. Using a similar concept, that is, using T_ML = π̂_{E,ML} − θπ̂_{R,ML} − (1 − θ)π̂_{P,ML}, we can solve the optimization problem numerically for the risk difference.
However, for our case, we consider the statistic restricted by the AS condition π̂_{R,ML} > π̂_{P,ML}, and hence work with the conditional statistic W ≈ T_ML | (π̂_{R,ML} > π̂_{P,ML}), where '≈' represents the approximation.
The exact small-sample distribution of W is non-normal under the current setup; however, [1] proved that W is Normally distributed in the continuous setting. Hence, for the binary case, under asymptotic normality of W, we can similarly prove that (W − μ_w)/σ_w ∼ AN(0, 1), where μ_w and σ_w² are the mean and variance of W, respectively. Chowdhury et al. [4,5] proved a similar lemma for calculating the mean and variance under the conditional approach for the risk and odds ratios; in this paper, we use the same approach to prove the corresponding lemma for the risk difference.

Lemma 2.1: Under the conditional normal approach, the mean μ_w and variance σ_w² of W admit closed-form expressions, with σ_w^null and σ_w^alt denoting the standard deviations under the null and alternative, respectively.

Hence, the critical region of the test under the Frequentist approach is given by W > k*, where k* is obtained by requiring a test of size α: P_{H_0}(W > k*) = α ⇒ k* = μ_w^null + z_{1−α} σ_w^null, where z_{1−α} is the 100(1 − α)% percentile point of the N(0, 1) distribution. In general, α is set to 0.025. Based on Lemma 2.1, note that μ_w^null, μ_w^alt, σ_w^null, and σ_w^alt depend on π_E, π_R, and π_P, with π_E^null satisfying π_E^null = θπ_R + (1 − θ)π_P, where the superscripts null and alt denote the quantities under the null and alternative, respectively. In the simulation study, we followed the approach of [4,5,17] to generate π_R for pre-defined θ, π_E, and π_P such that the equality in the null hypothesis of Equation (1) is satisfied.

Proof: See Supplementary Material 2.
This lemma shows that there is an effective power gain for the conditional test or, conversely, that to attain a fixed power the conditional test requires a smaller sample size. Although, for simplicity, the proof is given for the equal allocation case, it can easily be extended to the more general unequal allocation case. As observed in the simulation study (Section 5), this power gain is substantial when the gap between π_R and π_P is small and negligible when π_R >> π_P. This parallels what was noted at the beginning of Section 2.2. Note that the above theoretical claim of a power gain via the conditional approach for a binary outcome is restricted to the risk difference case only.

Sample size
Using our proposed approach, we can calculate the sample size required for the assessment of NI to attain a desired power at a point alternative π_E = π_E^alt. The power function of the test is derived by fixing π_R, π_P, and θ and considering values of π_E such that the ratio (π_E − π_P)/(π_R − π_P) ∈ [0.5, 1.4]. As described in [4,5], let r_1 and r_2 be the allocation ratios of the sample sizes of the reference and placebo arms, respectively, relative to the experimental arm with sample size n_E = n. Hence, the total sample size can be expressed as N = n(1 + r_1 + r_2) for the allocation ratio n_E : n_R : n_P = 1 : r_1 : r_2. To attain at least 100(1 − β)% power, the sample size n (of arm E) is computed by inverting the power function. In this paper, we set β = 20% and vary π_E^alt to obtain the minimum sample size needed for 80% power.
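For planning purposes, the normal-approximation power function can be inverted numerically by a simple search over n. The sketch below uses the unrestricted-variance Wald approximation with the variance evaluated at the alternative proportions, a common planning simplification; it does not reproduce the paper's conditional (RML-based) calculation, and the function name is ours.

```python
import math

Z_ALPHA = 1.959963985   # z_{1-alpha}, alpha = 0.025

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_size_wald(pi_E, pi_R, pi_P, theta, r1=1.0, r2=1.0, power=0.80):
    """Smallest n_E reaching the target power for the Wald-type NI test,
    with the variance evaluated at the alternative proportions (a rough
    planning simplification, not the restricted-ML calculation)."""
    mu = pi_E - theta * pi_R - (1 - theta) * pi_P   # must be > 0 under H1
    for n in range(2, 200_001):
        var = (pi_E * (1 - pi_E) / n
               + theta ** 2 * pi_R * (1 - pi_R) / (r1 * n)
               + (1 - theta) ** 2 * pi_P * (1 - pi_P) / (r2 * n))
        if norm_cdf(mu / math.sqrt(var) - Z_ALPHA) >= power:
            return n
    return None
```

For instance, with π_R = 0.7, π_P = 0.1, θ = 0.8, and π_E = 0.7 (effect ratio 1.0), this search returns roughly n = 190 per arm for 80% power under equal allocation.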

Three-arm Bayesian NI testing
Any NI trial is by design an active-control trial, where the availability of historical data on one or more arms is more or less guaranteed. Bayesian design [30] offers an interesting pathway to bring this additional information into play, which can lead to substantial savings. Gamalo et al. [12] developed Bayesian procedures for NI testing in a two-arm trial with binary endpoints that allow the incorporation of historical data on the active control via informative priors. In this section, we propose an exact Bayesian and an approximate Bayesian test procedure under the fraction margin approach for the three-arm NI trial via risk difference.

Exact Bayesian approach
We consider three different prior choices: the Conjugate Beta Prior (CBP), the Proper Uniform Prior (PUP), and the Dirichlet Prior (DP), in each of which the AS condition is incorporated explicitly, parallel to the proposed Frequentist approach described earlier. Among the three, the sampling procedure is easiest to implement for the CBP, whereas the DP is computationally more intensive than the other two; for the PUP, the sampling has to be done from the restricted domain. Moreover, the posterior is not available in closed form for the DP. In this section, for illustration, we provide the formal test procedure for NI testing and address sample size calculation under these three prior settings.

Conjugate Beta prior (CBP)
Under the Binomial setting, the usual conjugate prior is the Beta distribution. In this three-arm NI trial, we assume a Beta prior with hyper-parameters α_l ∈ R⁺ and β_l ∈ R⁺, l ∈ {E, R, P}, for the proportion of successes π_l, as proposed in [4]. For the three-arm NI trial with the AS condition, the joint prior distribution of the proportions of successes can be defined as

f(π_E, π_R, π_P) ∝ f(π_E; α_E, β_E) f(π_R; α_R, β_R) f(π_P; α_P, β_P) 1(π_R > π_P),

where f(π; α, β) is the density of the standard Beta distribution. By conjugacy, the joint posterior distribution of the proportions given the numbers of successes can be written as

f(π_E, π_R, π_P | X) ∝ ∏_{l ∈ {E,R,P}} π_l^{α_l + X_l − 1} (1 − π_l)^{β_l + n_l − X_l − 1} 1(π_R > π_P).

Under the conditional approach, posterior samples are generated from this joint posterior distribution subject to the AS condition, that is, π_R > π_P. Based on prior information from placebo-controlled trials, we can choose the values of the hyper-parameters. For an informative prior, the hyper-parameters can be computed by equating the mean or mode (with a suitably small variance) to the anticipated success probabilities. If we do not have substantial knowledge about the parameters, a non-informative prior is the common choice; for the Beta prior this corresponds to α_l = β_l = 1, l ∈ {E, R, P}.
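Under the CBP, conditional posterior draws can be obtained by sampling the three independent Beta posteriors and discarding draws that violate π_R > π_P. A minimal Python sketch follows; the dict-based data layout and function name are illustrative.

```python
import random

def cbp_posterior_samples(x, n, a, b, M=1000, seed=1):
    """Joint posterior draws under the conjugate Beta prior (CBP),
    retaining only draws that satisfy the AS condition pi_R > pi_P.
    x, n, a, b are dicts keyed by arm label E/R/P (illustrative layout)."""
    rng = random.Random(seed)
    out = []
    while len(out) < M:
        draw = {l: rng.betavariate(a[l] + x[l], b[l] + n[l] - x[l])
                for l in ("E", "R", "P")}
        if draw["R"] > draw["P"]:  # AS condition
            out.append(draw)
    return out
```

When the data strongly favor π_R > π_P, the rejection rate is negligible, so this is essentially as cheap as unrestricted sampling.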

Proper uniform prior (PUP)
In this case, the prior distributions are assigned to the parameters π_E, π_R, and π_P so that the restriction 0 < π_P < π_R < 1 is automatically satisfied. We place a joint prior on (π_R, π_P) by putting a Beta distribution on π_P and, conditional on π_P, a truncated Uniform distribution on π_R with support (π_P, 1), so that π_R > π_P. We also put an unrestricted Beta prior on π_E. The joint posterior distribution, obtained by multiplying the joint likelihood with the joint prior, is proportional to the product of two full Beta densities and one truncated Beta density, the latter having support 0 < π_P < π_R < 1, where X denotes the relevant data. The MCMC samples from the posterior for π_E and π_P can be generated from the updated Beta distributions. Given a draw of π_P, the MCMC samples for π_R can be generated from the truncated Beta distribution with support (π_P, 1).
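A sketch of the PUP sampler follows, assuming flat Beta(1, 1) components for π_E and π_P and using simple rejection for the truncated Beta step; note that a Uniform(π_P, 1) prior combined with the Binomial likelihood yields a Beta(x_R + 1, n_R − x_R + 1) posterior truncated to (π_P, 1). Names and the flat-prior setting are illustrative.

```python
import random

def pup_posterior_samples(x, n, M=1000, seed=2):
    """Posterior draws under the proper uniform prior (PUP): pi_E and pi_P
    get Beta posteriors (flat Beta(1,1) priors assumed here); given pi_P,
    pi_R is Beta(x_R + 1, n_R - x_R + 1) truncated to (pi_P, 1), sampled
    by simple rejection."""
    rng = random.Random(seed)
    out = []
    for _ in range(M):
        pi_E = rng.betavariate(x["E"] + 1, n["E"] - x["E"] + 1)
        pi_P = rng.betavariate(x["P"] + 1, n["P"] - x["P"] + 1)
        while True:  # rejection step for the truncated Beta draw
            pi_R = rng.betavariate(x["R"] + 1, n["R"] - x["R"] + 1)
            if pi_R > pi_P:
                break
        out.append((pi_E, pi_R, pi_P))
    return out
```

The rejection loop terminates quickly whenever the data place most of the π_R posterior mass above π_P, which is the relevant regime for an NI trial.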

Dirichlet prior (DP)
In this setup, we put a Dirichlet prior on (π_R, π_P) with support 0 < π_P < π_R < 1, via a suitable transformation of the ordered pair. The joint prior of (π_E, π_R, π_P) can be obtained, as before, by multiplying f(π_R, π_P) by f(π_E), which is Beta(α_E, β_E); the joint posterior of (π_E, π_R, π_P) | X then follows by combining the joint posterior of (π_R, π_P) | X_R, X_P with that of π_E | X_E. This joint posterior is not of any standard form, and hence Metropolis-Hastings acceptance-rejection sampling with a proposal density is required to generate MCMC samples from the posterior [14]. A convenient proposal density is the product of three Beta distributions with appropriately chosen parameters.

Remark 3.1: Following [16,31], we continue to assume that the AS condition, that is, π_R > π_P, is true; accordingly, truncated priors are chosen. This assumption explicitly reflects the fact that the active control still retains some of its effect over placebo. When this assumption is questionable, it is not advisable to carry out a three-arm NI trial; a superiority trial of the new treatment over placebo is more realistic.

Remark 3.2:
Among the three proposed priors under the exact Bayesian approach, the CBP treats the three parameters independently and equally and is the simplest from the computational point of view. In contrast, under the PUP, the parameters π_R and π_P are made dependent, and in the absence of prior information a uniform distribution with restricted support is an obvious choice for π_R to incorporate the AS condition. Under the DP, more flexibility can be achieved by modeling the joint distribution of π_R and π_P; however, the choice of the Dirichlet parameters is an additional burden, along with its computational complexity. While we have considered only proper priors, improper priors are also possible when posterior propriety holds; they are not explored here for brevity.

Test procedure
We formulate the test procedure for assessing the experimental drug against the active control under the risk difference, similar to [4], who proposed the same for risk-ratio and odds-ratio type functionals. Under the NI setup, the common acceptable range of the effect retention fraction θ is [0.5, 1). Hence, we claim NI of the test drug relative to the reference drug if the posterior probability of the alternative hypothesis in (1) exceeds a pre-defined, clinically meaningful threshold, say p*. Borrowing the idea from [16] (Section 3.3), the Bayesian decision rule to claim NI in this setting is defined as

P(π_E − θπ_R − (1 − θ)π_P > 0 | π_R > π_P, X) ≥ p*. (6)

The probability in (6) can be calculated empirically by generating M MCMC samples from the posterior distribution of (π_l | X_l), l ∈ {E, R, P}. The estimated probability is given by

P̂ = (1/M) Σ_{m=1}^{M} 1(π_E^m − θπ_R^m − (1 − θ)π_P^m > 0),

where π_E^m, π_R^m, and π_P^m denote the mth MCMC sample, m = 1, …, M, drawn from the posterior distribution satisfying the AS condition (π_R^m > π_P^m), with M sufficiently large. Note that the slight distinction from the previous approach of [16] (which is the direct Bayesian version of [31]) is the use of the AS condition in the conditioning statement: it not only acts as a gate-keeper but is also used in calculating the posterior probability, yielding greater power (as proved in Lemma 2.2).
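The empirical estimate of the posterior probability, and the resulting NI decision, can be sketched as follows under flat Beta(1, 1) priors (an illustration of the decision rule; the priors, data layout, and function name are ours):

```python
import random

def posterior_ni_probability(x, n, theta, M=5000, p_star=0.975, seed=3):
    """Estimate P(pi_E - theta*pi_R - (1-theta)*pi_P > 0 | pi_R > pi_P, X)
    from flat-prior Beta posterior draws and apply the threshold p*."""
    rng = random.Random(seed)
    kept = hits = 0
    while kept < M:
        pE = rng.betavariate(x["E"] + 1, n["E"] - x["E"] + 1)
        pR = rng.betavariate(x["R"] + 1, n["R"] - x["R"] + 1)
        pP = rng.betavariate(x["P"] + 1, n["P"] - x["P"] + 1)
        if pR > pP:                     # AS condition in the conditioning
            kept += 1
            hits += (pE - theta * pR - (1 - theta) * pP > 0)
    prob = hits / M
    return prob, prob >= p_star         # NI declared if prob >= p*
```

With strong data (e.g. 90, 70, and 10 successes out of 100 in arms E, R, and P, θ = 0.8), the estimated posterior probability is close to 1 and NI is declared at p* = 0.975.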

Sample size
The power under this NI setup can be calculated by estimating the proportion of times, out of n* simulated trials, that the test declares NI for the test drug. Let π_E^alt be the value of π_E under H_1. Mathematically, the estimated power is

Power = (# of times P(π_E^alt − θπ_R − (1 − θ)π_P > 0 | π_R > π_P, X) > p*)/n*,

where the value of π_E is chosen, for known values of π_R and π_P, such that (π_E − π_P)/(π_R − π_P) ∈ [0.5, 1.4]. For NI testing, this ratio equals θ ∈ [0.5, 1) at the boundary of H_0 and exceeds θ under H_1. As discussed in Section 2.3, the minimum sample size n of arm E, and those of the reference and placebo arms under different allocation ratios, can be obtained by requiring the power to be at least 100(1 − β)%. Because random samples are generated from the posterior distribution, some sampling fluctuation in the results is to be expected.

Approximate Bayesian approach
We next propose an approximate Bayesian approach for NI testing that incorporates the AS condition and also explicitly derive the formula for sample size determination. Note, the approximation-based approach gives a closed form of the posterior probability and hence saves the computation time of the MCMC sample generation from the posterior distribution.

Test procedure
We consider a Beta prior for the proportion π_l in each arm, that is, π_l ∼ Beta(α_l, β_l), and the responses are assumed to be Binomially distributed, that is, X_l ∼ Bin(n_l, π_l), l ∈ {E, R, P}. The Frequentist test statistic for testing the hypothesis in (1) is given by T = X_E/n_E − θX_R/n_R − (1 − θ)X_P/n_P. Under the asymptotic normality assumption, T is approximately Normal with mean μ_T = π_E − θπ_R − (1 − θ)π_P. Putting a Normal prior on μ_T, for large samples we can approximate the induced prior moments using μ_l and σ_l², the mean and variance of Beta(α_l, β_l), l ∈ {E, R, P}. Keeping in mind the AS condition, that is, π_R > π_P, we take the prior on ν_T ≡ (μ_T | π_R > π_P). Assuming ν_T ∼ AN(μ*_ν, σ*²_ν), the posterior is ν_T | X ∼ AN(μ̂_T σ̂_T², σ̂_T²), where

μ̂_T = T/σ_T² + μ*_ν/σ*²_ν and σ̂_T² = (1/σ_T² + 1/σ*²_ν)⁻¹,

with σ_T² = Var(T) as in Section 2.1. We refer to [1] for the detailed derivation of μ*_ν and σ*²_ν.
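The Normal-Normal update underlying this approximation has the usual precision-weighted closed form, which can be sketched as follows. Here μ* and σ*² are user-supplied stand-ins for the truncated-prior moments derived in the paper, and the function name is ours.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_bayes_ni(x, n, theta, mu_star, sigma2_star, p_star=0.975):
    """Closed-form approximate Bayesian NI test: T is treated as
    N(nu_T, sigma_T^2) and combined with a N(mu_star, sigma2_star)
    prior on nu_T via the precision-weighted Normal update."""
    pE, pR, pP = x["E"] / n["E"], x["R"] / n["R"], x["P"] / n["P"]
    T = pE - theta * pR - (1 - theta) * pP
    sigma2_T = (pE * (1 - pE) / n["E"]
                + theta ** 2 * pR * (1 - pR) / n["R"]
                + (1 - theta) ** 2 * pP * (1 - pP) / n["P"])
    post_var = 1.0 / (1.0 / sigma2_T + 1.0 / sigma2_star)
    post_mean = post_var * (T / sigma2_T + mu_star / sigma2_star)
    prob = norm_cdf(post_mean / math.sqrt(post_var))  # P(nu_T >= 0 | X)
    return prob, prob >= p_star
```

With a vague prior (large σ*²), the posterior probability is driven almost entirely by the data, which is why the approximate Bayesian and Frequentist power curves nearly coincide under flat priors (Section 5).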

Lemma 3.1:
Under the conditional normal approximation, the mean μ*_ν and variance σ*²_ν of ν_T = π_E − θπ_R − (1 − θ)π_P | π_R > π_P are available in closed form.

The Bayesian decision rule for deciding that the experimental treatment is non-inferior to the active comparator is given by [11]: P(ν_T ≥ 0 | X) ≥ p*, where p* is a pre-specified constant, usually chosen to be 0.975 or 0.95.

Sample size
The sample size n of arm E under the approximate Bayesian approach can be calculated by satisfying two conditions, (C1) and (C2), where the probability in (C1) is the estimated Bayesian version of the average type-I error and that in (C2) is the estimated power of the test, β being the type-II error. The sample size n is determined from (C2) by fixing β so that the test has at least 100(1 − β)% power, while simultaneously satisfying (C1). As in the Frequentist approach, we choose α = 0.025. Here z_{1−p*} denotes the 100(1 − p*)% percentile point of the N(0, 1) distribution. The power function is obtained by varying π_E such that 0.5 ≤ (π_E − π_P)/(π_R − π_P) ≤ 1.4, keeping the other proportions π_R, π_P, and θ fixed. Let us denote μ_T and σ_T² by μ_T^null and σ_T^{2,null}, respectively, under H_0, and similarly by μ_T^alt and σ_T^{2,alt} under H_1. Condition (C1) can then be rewritten in terms of T as in (8), and condition (C2) as in (9). Now n can be solved from (9) by setting β = 20% while simultaneously satisfying (8). We vary π_E^alt (which enters μ_T^alt) to obtain the minimum sample size giving at least 80% power for each π_E^alt. The sample sizes for arms R and P can be obtained from the allocation ratios r_1 and r_2, as discussed earlier.

Bayesian-Frequentist connection in three-arm trial
In this section, we connect the Bayesian and Frequentist approaches by transforming the Bayesian posterior probability of the tested hypothesis into a Frequentist probability of a Bernoulli trial after adjusting the number of events and the population sizes. This section is motivated by the work of [36], who showed a similar connection for the two-arm trial with integer-valued hyper-parameters by linking Frequentist p-values and the Bayesian conditional measure of evidence [2,7]. This work also offers additional insight into the effective sample size gain in the Bayesian setup under a conjugate prior specification. We consider the CBP setting; that is, X_l | π_l ∼ Bin(n_l, π_l), prior π_l ∼ Beta(α_l, β_l), and posterior distribution π_l | X_l ∼ Beta(α_l + X_l, n_l − X_l + β_l), l ∈ {E, R, P}, with the restriction that the hyper-parameters are integers. The Bayesian decision rule to declare NI of the test drug over the reference, given that the AS condition (π_R > π_P) holds, is as given in Section 3.1.4. Define η_RP = π_R − π_P. Since the probability in (10) does not have a closed form, it is approximated by generating posterior samples as

P̂ = (1/M) Σ_{i=1}^{M} g(θc_i, X), (11)

where g(θc_i, X) = P(π_E − π_P > θc_i | X), c_i is the ith sampled value of π_R − π_P | (π_R > π_P), and X denotes the relevant section of the data. To obtain P(π_E − π_P > θc | X), we refer to [36] and present the following two theorems that link the Frequentist and Bayesian approaches and can be used to estimate the probability in (11).
We give the following proposition, based on the identities in the above two theorems, which can be used to obtain the probability P_B(π_E − π_P > θc | X).
Another way of linking the Frequentist and Bayesian approaches follows from identities in [36] that approximate the incomplete Beta integral by a finite sum, given in Equation (13), with x_F = x + a and n_F = n + a + b − 1. The identity in (13) can be used to approximate g(θc | X) = P_B(π_E − π_P > θc | X) in (11), as given in the following proposition.
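For integer shape parameters a and b, one such identity is I_p(a, b) = P(Bin(a + b − 1, p) ≥ a), where I_p denotes the regularized incomplete Beta function; it is exactly this kind of relation that turns a Beta posterior tail into a Binomial (Bernoulli-trial) probability with adjusted counts. The identity can be checked numerically; the quadrature-based Beta CDF below is for illustration only.

```python
import math

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def beta_cdf(p, a, b, steps=200_000):
    """Regularized incomplete Beta I_p(a, b) by midpoint quadrature
    (for illustration; a library routine would be used in practice)."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = p / steps
    s = sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
            for i in range(steps))
    return c * s * h

# For integer a, b:  I_p(a, b) = P(Bin(a + b - 1, p) >= a),
# so a Beta posterior tail equals a Binomial tail with the adjusted
# counts x_F = x + a and n_F = n + a + b - 1.
```

For example, I_{0.4}(3, 5) and P(Bin(7, 0.4) ≥ 3) agree to quadrature accuracy.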

Proposition 4.2:
Taking p = π_E, a = α_P, b = β_P, x = x_P, and n = n_P, P_B(π_P < π_E | x_P, n_P, x_E, n_E) can be approximated by a sum of Gamma functions using the identity in (13). Thus, for a fixed c, P(π_E − π_P > θc | X) can be calculated using Theorems 4.1 and 4.2, where the function h(·) is given in Theorem 4.2.

Proof: See Supplementary Material 4.
Repeating the calculation of P_B(π_E − π_P > θc | X) for each c_i, i = 1, …, M, one can obtain the posterior probability of the NI hypothesis given the AS condition from (11). Similar to [36], the Bayesian test of significance is equivalent to Fisher's exact test with adjusted parameter values. This is characterized in the literature as an effective sample size change [3,28].

Simulation and sample size calculation
In this section, we present simulation studies to evaluate the performance of the Bayesian and Frequentist procedures described above. Power curves are generated for the test under the three different priors of the exact Bayesian approach, as well as under the Frequentist and approximate Bayesian procedures. For the exact Bayesian approach, power curves are compared under informative and non-informative Beta priors. The latter part of the section focuses on sample size calculation for the assessment of NI to attain the desired power under three approaches: (1) the Frequentist normal approximation, (2) the Bayesian normal approximation, and (3) the exact Bayesian approach for three-arm NI testing.

Steps for simulation
The following simulation steps are used to calculate the type-I error and power for the three prior scenarios described earlier: (1) conjugate Beta-Binomial, (2) PUP, and (3) DP. For the CBP setting, we assume a non-informative prior for the proportions in each of the three arms, that is, π_l ∼ Beta(1, 1), l ∈ {E, R, P}. We also consider an informative Beta prior whose mode equals the corresponding parameter value and compare the power under the non-informative and informative priors. For the PUP, we consider non-informative Beta priors for the experimental arm (π_E) and the placebo arm (π_P), while π_R is generated from a truncated Beta with support (π_P, 1]. Finally, for the DP, we put a non-informative Beta prior on π_E and choose suitable values for the Dirichlet parameters. We consider a randomized trial with sample size allocation ratio n_E : n_R : n_P = 1 : r_1 : r_2. The simulation steps, as discussed in [4], are:

S1: Specify n_E, n_R, n_P (or the allocation ratios), π_l, l ∈ {E, R, P} with π_R > π_P, and θ, and vary π_E such that π_P + 0.5(π_R − π_P) ≤ π_E ≤ π_P + 1.4(π_R − π_P).

S2: Generate X_l ∼ Binomial(n_l, π_l), l ∈ {E, R, P}, for the given value of the ratio (π_E − π_P)/(π_R − π_P), i.e., of π_E, to obtain X = {X_E, X_R, X_P}.

S3: For the exact Bayesian approach, generate M MCMC samples from the posterior distribution based on the priors of Section 3, satisfying the AS condition π_R > π_P; for the Frequentist and approximate Bayesian cases, this step is skipped. For the PUP and the DP, the posterior sample values satisfy π_R > π_P automatically because of the built-in restriction.

S4: For each posterior sample, compute the ratio (π_E − π_P)/(π_R − π_P) and estimate the posterior probability that it exceeds θ.

S5: Set Count = 0; increase Count by 1 if the posterior probability exceeds p*, and otherwise leave Count unchanged.

S6: Repeat steps S2 to S5 a large number of times, say n*, and calculate the type-I error or power as the accumulated Count divided by n*. For the type-I error calculation, π_E should satisfy (π_E − π_P)/(π_R − π_P) = θ, and for the power calculation, π_E should satisfy (π_E − π_P)/(π_R − π_P) > θ.

S7: Based on the estimated power from step S6, plot the power curve over a sequence of π_E satisfying 0.5 ≤ (π_E − π_P)/(π_R − π_P) ≤ 1.4.
Note that for Frequentist and Bayesian approximation approaches, S4 and S5 are replaced by the corresponding decision rule as mentioned in Sections 2.2 and 3.2.1, respectively.
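The steps S1-S7 can be sketched end-to-end for the CBP with flat Beta(1, 1) priors and a balanced design. This is an illustrative Monte Carlo sketch with small M and n* for speed; in the paper's terms, π_E at the null boundary gives the type-I error and π_E above the boundary gives the power.

```python
import random

def simulate_power(n, pi_E, pi_R, pi_P, theta,
                   M=500, n_star=200, p_star=0.975, seed=4):
    """Monte Carlo type-I error / power for the exact Bayesian test under
    flat Beta(1, 1) priors with a balanced design n_E = n_R = n_P = n."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_star):                                   # S6: repeat
        x = [sum(rng.random() < p for _ in range(n))          # S2: data
             for p in (pi_E, pi_R, pi_P)]
        kept = hits = 0
        while kept < M:                                       # S3: draws
            pE = rng.betavariate(x[0] + 1, n - x[0] + 1)
            pR = rng.betavariate(x[1] + 1, n - x[1] + 1)
            pP = rng.betavariate(x[2] + 1, n - x[2] + 1)
            if pR > pP:                                       # AS condition
                kept += 1
                hits += (pE - pP > theta * (pR - pP))         # S4: ratio > theta
        count += (hits / M > p_star)                          # S5: decision
    return count / n_star                                     # S6-S7: estimate
```

For instance, with π_R = 0.7, π_P = 0.1, θ = 0.5, n = 100, and π_E = 0.8 (ratio ≈ 1.17), the estimated power is close to 1, while at the boundary π_E = 0.4 (ratio = θ) the rejection rate stays near the nominal level.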

Simulation result
For the CBP and PUP, since the posterior is available in closed form, we chose the number of posterior samples M to be 1000. For the DP, we determined the number of MCMC samples to be M = 1000 by taking every 50th value of 50,000 MCMC samples. We assume a non-informative Beta(1, 1) prior for each of the three arms under the CBP setting. Throughout the simulation study, we consider the following specification of the parameters: π_R = 0.7, π_P = 0.1, and we set π_E such that (π_E − π_P)/(π_R − π_P) ∈ [0.5, 1.4]. We consider several values of n_l, l ∈ {E, R, P}, such that n_E : n_R : n_P = n : nr_1 : nr_2, 'n' being the common sample size. Unequal allocation is also possible, as will be described in Section 5.3. Another important criterion is the choice of p*, which we fix at 0.975. However, as reported in [13], this choice could yield an overly restrictive type-I error in the Bayesian context. One way to alleviate this problem is to perform Bayesian calibration, but this is not pursued here to reduce the computational burden.
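The informative priors used in what follows are constructed so that the prior mode matches the assumed proportion. Since a Beta(a, b) density with a, b > 1 has mode (a − 1)/(a + b − 2), fixing a and solving for b gives a simple recipe; the sketch below reproduces the shape parameters used in the simulation.

```python
# A Beta(a, b) density with a, b > 1 has mode (a - 1)/(a + b - 2).
# Fixing the first shape parameter a, solve for b so that the prior
# mode equals the assumed proportion.
def beta_b_for_mode(a, mode):
    return (a - 1) / mode - a + 2

b_R = beta_b_for_mode(40, 0.7)   # ~17.71, giving the Beta(40, 17.71) prior for E and R
b_P = beta_b_for_mode(2, 0.1)    # 10.0, giving the Beta(2, 10) prior for P
```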
In Figure 1(a), we present four power curves corresponding to θ = 0.8, 0.7, 0.6, and 0.5 with n = 100 under the Bayesian conjugate non-informative Beta(1, 1) prior for each arm. These values of θ correspond to choices of the NI margin representing the maximum clinically relevant difference: δ = −0.2(π_R − π_P), δ = −0.3(π_R − π_P), δ = −0.4(π_R − π_P), and δ = −0.5(π_R − π_P); that is, f is chosen as −0.2, −0.3, −0.4, and −0.5, respectively. This means that, in order to be non-inferior, the experimental drug must retain more than 80%, 70%, 60%, and 50%, respectively, of the effect of the reference drug relative to placebo. We can also infer that the proposed test is more powerful for smaller values of θ, since NI of the experimental drug is then easier to declare. In Figure 1(b), we plot the power curves for a balanced study design with common sample size n = 100 for the Frequentist approach and the approximate Bayesian approach under the non-informative prior. From Figure 1(b), we observe that the two methods produce almost identical power curves. The Bayesian power curve lies slightly above the Frequentist one which, in our experience, is to be expected under a flat prior. In Figure 1(c), we compare the power curves obtained under the Frequentist approach, the exact Bayesian approach under the non-informative prior, and the same approach under an informative prior, with common sample size n = 100 in each arm. For the informative prior, we take E: Beta(40, 17.71), R: Beta(40, 17.71), and P: Beta(2, 10), and for the non-informative prior we use the same Beta(1, 1) as earlier. These informative priors are chosen so that the mode of the Beta distribution equals the value of the corresponding proportion parameter.
We observe that the Bayesian power curve under the non-informative prior is almost identical to the Frequentist power curve, while the Bayesian power curve under the informative prior is much higher. The gain in power from informative priors is also depicted in Figure 2, where we give four plots for n = 20, 50, 100, and 200 comparing the informative and non-informative power curves under the CBP setting. For the DP with Dirichlet parameters (1, 1, 1) and a Beta(1, 1) prior for each arm, the test is too conservative for (π_E − π_P)/(π_R − π_P) ≤ θ and yields a type-I error close to 0. However, in our experience it is possible to choose the Dirichlet parameters so that the type-I error becomes close to 0.025, thus yielding better power as compared to the CBP and PUP. This is depicted in Figure 1(d). We have chosen the Dirichlet parameters (α_1, α_2, α_3) = (2, 3, 6), which gives the marginal distributions π_P ∼ Beta(2, 9) and π_R ∼ Beta(5, 6). To make the power curves comparable, we chose the same priors under the CBP and PUP set-ups; in all three cases, arm E is given a Beta(1, 1) prior. From Figure 1(d), we see that the DP outperforms the CBP and PUP, with the PUP yielding the least power among the three.
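The stated marginals follow from the aggregation property of the Dirichlet distribution: if (γ_1, γ_2, γ_3) ∼ Dirichlet(2, 3, 6) with π_P = γ_1 and π_R = γ_1 + γ_2, then π_P ∼ Beta(2, 9) and π_R ∼ Beta(5, 6). A quick Monte Carlo check is sketched below; the cumulative-sum construction of (π_P, π_R) is our reading of the DP set-up.

```python
import numpy as np

rng = np.random.default_rng(7)

# Aggregation property: for (g1, g2, g3) ~ Dirichlet(2, 3, 6),
#   g1      ~ Beta(2, 3 + 6) = Beta(2, 9)
#   g1 + g2 ~ Beta(2 + 3, 6) = Beta(5, 6)
g = rng.dirichlet((2, 3, 6), size=200_000)
pi_P = g[:, 0]
pi_R = g[:, 0] + g[:, 1]

# Monte Carlo means should be close to the Beta means 2/11 and 5/11.
print(round(pi_P.mean(), 3), round(2 / 11, 3))
print(round(pi_R.mean(), 3), round(5 / 11, 3))
```

Note that this construction gives π_R > π_P for every draw (since γ_2 > 0 almost surely), which is the built-in restriction mentioned in step S3.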

Sample size
We refer to Sections 2.3, 3.1.5, and 3.2.2, respectively, for the sample size determination under the Frequentist, exact Bayesian, and approximate Bayesian approaches. We determine the sample sizes n_l, l ∈ {E, R, P}, by setting the power at (1 − β), with β the pre-specified type-II error. Let us consider n_E = n, n_R = r_1 n, and n_P = r_2 n with r_1, r_2 > 0. To calculate the sample size for each arm, we explore three possible allocations of the total sample size N to the experimental, reference, and placebo arms: (i) (1:1:1) with r_1 = r_2 = 1; (ii) (2:2:1) with r_1 = 1, r_2 = 1/2; and (iii) (3:2:1) with r_1 = 2/3, r_2 = 1/3. The sample size is calculated as the smallest 'n' which satisfies power ≥ 1 − β. To compare the existing Frequentist approach with the proposed conditional one, we first present the sample sizes under both approaches in Table 1. For simplicity, we only consider equal allocation to the three treatment arms.

[Table 1. Sample sizes (n_P and total N) under the existing and conditional Frequentist approaches for θ ∈ {0.8, 0.7} and a range of π_E, with (π_R = 0.7, π_P = 0.1) and (π_R = 0.6, π_P = 0.55).]

We determine the sample size under the two approaches for θ ∈ {0.8, 0.7} with (π_R = 0.7, π_P = 0.1) and (π_R = 0.6, π_P = 0.55). From Table 1, we observe that for π_R = 0.7 and π_P = 0.1 the sample size under the conditional approach is identical to that under the marginal approach, while for π_R = 0.6 and π_P = 0.55 the conditional approach requires a smaller sample size to achieve a power of 80%. This implies that for a smaller difference between π_R and π_P the proposed conditional approach is more powerful, while for a larger difference the two approaches behave similarly, supporting the claim proved in Lemma 2.2. In the rest of the sample size calculations, only the conditional Frequentist approach is considered, since it is more powerful than the marginal approach. In Table 2, we present the sample sizes under our proposed approaches with π_R = 0.7 and π_P = 0.1. Similar to [4,5], we set α = 0.025 for the Frequentist method; for the Bayesian exact and approximate methods, the sample sizes satisfying power ≥ 1 − β also keep the estimated type-I error at most α = 0.025.
In Table 2, the total sample sizes for the three allocations are calculated based on the sample size of the placebo arm, n_P. For example, the total sample size corresponding to the allocation ratio 1:1:1 is 3n_P, whereas for 2:2:1 and 3:2:1 the total sample sizes are 5n_P and 6n_P, respectively. As discussed in [4,5], one might prefer an unbalanced design for ethical reasons and because the difference between E and R is smaller than their differences from placebo. From Table 2, we observe a smaller total sample size for the unbalanced allocation (2:2:1) as compared to the balanced design (1:1:1). Similarly, we notice a minor reduction in sample size for the unbalanced case (3:2:1) as compared to (2:2:1). A similar interpretation can be drawn from Figure 3, where the power curves are shown for the three allocations under the Frequentist and exact Bayesian approaches with the non-informative prior and N = 300. We note that the type-I error rate is exactly 0.025 for the Frequentist approach and is always maintained below 0.025 for the Bayesian approaches (Table 2). In cases where the estimated type-I error is much smaller than 0.025, Bayesian calibration can be performed to improve the sample size, but this is not explored in the current paper.
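The search for the smallest n satisfying power ≥ 1 − β can be sketched with a normal approximation for the unconditional statistic T = π̂_E − θπ̂_R − (1 − θ)π̂_P. This is a simplified illustration only: it plugs the assumed proportions into the variance and treats the null and alternative variances as equal, unlike the restricted estimates used in the paper, and the parameter values below are one example from the simulation range.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf

def approx_power(n_P, ratio, pi, theta, alpha=0.025):
    """Normal-approximation power of the unconditional test based on
    T = pE - theta*pR - (1-theta)*pP, rejecting when T/sigma > z_{1-alpha}.
    `ratio` = (rE, rR, 1) gives arm sizes as multiples of n_P."""
    nE, nR = ratio[0] * n_P, ratio[1] * n_P
    mu = pi["E"] - theta * pi["R"] - (1 - theta) * pi["P"]
    var = (pi["E"] * (1 - pi["E"]) / nE
           + theta ** 2 * pi["R"] * (1 - pi["R"]) / nR
           + (1 - theta) ** 2 * pi["P"] * (1 - pi["P"]) / n_P)
    return NormalDist().cdf(mu / sqrt(var) - z(1 - alpha))

def smallest_n_P(ratio, pi, theta, beta=0.2):
    """Smallest placebo-arm size n_P achieving power >= 1 - beta."""
    n_P = 1
    while approx_power(n_P, ratio, pi, theta) < 1 - beta:
        n_P += 1
    return n_P

pi = {"E": 0.65, "R": 0.7, "P": 0.1}
n_bal = smallest_n_P((1, 1, 1), pi, theta=0.7)   # balanced 1:1:1
n_unb = smallest_n_P((2, 2, 1), pi, theta=0.7)   # unbalanced 2:2:1
```

Consistent with Table 2, the unbalanced 2:2:1 allocation attains the target power with a smaller total sample size (5·n_unb) than the balanced design (3·n_bal).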

Application in a real data
To illustrate a real data application, we revisit the major depressive disorder data described in [19]. This dataset has been analyzed in many articles, including [15,18]. Chowdhury et al. [4,5] used this dataset for binary outcomes with risk- and odds-ratio-type functionals.

Table 2. Frequentist and Bayesian sample sizes to achieve a power of 80% for θ = {0.8, 0.7}, α = 0.025, and π_E ∈ [0.65, 0.9], keeping π_R = 0.7 and π_P = 0.1 under three different allocations.
In this analysis, we implement our proposed methods for the risk difference functional. Briefly, the primary endpoint, the HAMD-17 total score, was a continuous scale measuring the change from baseline at the end of the sixth week, with three arms: duloxetine (n_E = 147), paroxetine (n_R = 148), and placebo (n_P = 145). We consider two binary outcomes, Response and Remission, which are presented in Table 3. As described in [18], Response is defined as a reduction of more than 50% in the total score at the end of week six, and Remission as maintaining a HAMD-17 score of less than 17 at the same end-point. For the existing Frequentist approach, the p-value of the test is calculated as p = 1 − Φ(T_obs/σ_T^null), where T = π̂_E − θπ̂_R − (1 − θ)π̂_P is the Frequentist statistic under the existing approach, T_obs is the observed value of T, and (σ_T^null)² is the variance of T under the null hypothesis. For the conditional Frequentist approach, we calculate the p-value as p = 1 − Φ((W_obs − μ_W^null)/σ_W^null), where W = (π̂_E − θπ̂_R − (1 − θ)π̂_P) | π̂_R > π̂_P is the test statistic for the conditional testing, W_obs is the observed value of W, and μ_W^null and (σ_W^null)² are the mean and variance of W under the null hypothesis as given in Section 2. For the Bayesian approach, we start with non-informative priors and then consider informative priors to compare the results. We use p* = 0.975 to determine NI of duloxetine relative to paroxetine, and the Frequentist p-values are compared with α = 0.025. Assuming a non-informative Beta(1, 1) prior for π_l, l ∈ {E, R, P}, posterior samples for the three rates are generated from Beta distributions as in step S3 of the simulation. We calculate the posterior probability P(H_1 | X) of the NI hypothesis, the quantity estimated in step S4 of the simulation. This is reported in Table 4 for different values of θ ∈ [0.5, 1), in order to assess whether the test drug retains a clinically meaningful fraction of the reference effect. These posterior probabilities are compared with p* to reach the Bayesian decision.
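The marginal p-value computation can be sketched as follows. The counts below are hypothetical placeholders for illustration only (Table 3 is not reproduced here; only the arm sizes follow the trial in [19]), and the null variance is approximated by plugging in the sample proportions, a simplification of the paper's restricted null estimate.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def existing_p_value(x, n, theta):
    """p-value of the existing (marginal) one-sided test: with
    T = pE - theta*pR - (1-theta)*pP, which has mean 0 on the null
    boundary, p = 1 - Phi(T_obs / sigma_T).  The variance uses the
    sample proportions as a plug-in approximation."""
    p = {l: x[l] / n[l] for l in ("E", "R", "P")}
    T_obs = p["E"] - theta * p["R"] - (1 - theta) * p["P"]
    var = (p["E"] * (1 - p["E"]) / n["E"]
           + theta ** 2 * p["R"] * (1 - p["R"]) / n["R"]
           + (1 - theta) ** 2 * p["P"] * (1 - p["P"]) / n["P"])
    return 1 - Phi(T_obs / sqrt(var))

# Hypothetical response counts for illustration; arm sizes as in the trial.
n = {"E": 147, "R": 148, "P": 145}
x = {"E": 80, "R": 84, "P": 51}
pval = existing_p_value(x, n, theta=0.7)
```

The conditional p-value replaces T_obs and the null moments by W_obs, μ_W^null, and σ_W^null from Section 2; the rest of the computation is unchanged.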
We also checked that the AS condition holds with probability close to 1 for both the Response and the Remission outcomes. From Table 4, we observe that the Frequentist p-values decrease while the posterior probabilities increase as θ decreases, implying a greater chance of declaring NI for smaller values of θ, which is compatible with the simulation results in Section 5. Also, the p-values under the conditional approach are smaller than or at most equal to those under the marginal approach, consistent with Lemma 2.2. However, since none of the p-values is smaller than α = 0.025, the null hypothesis cannot be rejected and hence non-inferiority cannot be claimed for any θ. As evident, the Remission data yield lower posterior probabilities than the Response data. Using the non-informative Beta prior, the posterior probabilities are less than the pre-specified cutoff p* = 0.975 and hence NI of E relative to R cannot be claimed. However, when we choose informative Beta priors, E: Beta(40, 34), R: Beta(40, 36), and P: Beta(40, 64), NI is established for θ ≤ 0.55 for the Response data. Similarly, taking the priors E: Beta(40, 77), R: Beta(40, 80), and P: Beta(40, 141), NI is claimed for θ = 0.5 for the Remission data. For the PUP on arm R with Beta(1, 1) priors on arms E and P, we obtain results very similar to the CBP set-up; however, choosing informative priors for arms E and P as in the CBP, one can claim NI for θ = 0.5. Finally, considering the DP with parameters (1, 1, 1) along with a non-informative Beta prior for E, the posterior probabilities are found to be too small, even smaller than under the CBP or PUP, to claim NI. However, if the Dirichlet parameters are chosen to be (60, 22, 73) with Beta(40, 36) for arm E, NI can be claimed for θ ≤ 0.55 for the Response data. Similarly, with Dirichlet parameters (150, 75, 450) and Beta(80, 160) for arm E, NI can be claimed for θ = 0.5 for the Remission data.
We note here that the Dirichlet parameters, as well as the informative priors under the CBP and PUP, are chosen so that the mean of the Beta distribution coincides with the estimate of the corresponding proportion parameter. In practice, the choice of informative priors cannot be set arbitrarily to claim NI; rather, it must be guided by available and verifiable sources.

Conclusion
In this paper, we have presented new Frequentist and Bayesian test procedures for the 'gold standard' three-arm NI trial, which includes a placebo arm. We focused primarily on binary outcomes with the risk difference as the metric of comparison. In the Frequentist set-up, we introduced a more powerful conditional test of NI, which makes more intuitive sense and reduces the required sample size in certain situations. In our proposed methods, we explored the fraction-margin approach with an unknown NI margin δ, which varies with the effect size of the reference treatment. In contrast, the three-arm fixed-margin approach of [18] is based on joint testing, which requires additional attention in decision making as it may result in a biased test (see [6] for the Intersection-Union test and [17]). We provided sample size estimation for the three arms of the NI trial under three types of allocation (to E, R, P) using all three approaches. We have seen that, even with the non-informative prior, the Bayesian normal approximation as well as the exact Bayesian approach yields power greater than or equal to that of the Frequentist approach, and the Bayesian sample sizes are smaller than the Frequentist ones for the desired power of 80%. Our investigation makes it evident that an unbalanced allocation of the total sample size in an NI trial reduces the number of patients required to achieve a fixed power; according to [31], an unbalanced allocation is also desirable from an ethical point of view. Beyond these technical aspects, NI trials call for reflection in several substantive respects, including the choice of δ, the question of whether a placebo can be included as an additional arm of the study, and AS, to give a few examples.
The results of the real clinical trial data suggest that the exact Bayesian methods perform favorably in all situations, and these methods do not rely on any asymptotic approximation. Notably, with binary end-points, the risk difference is not the only functional of interest: one may also frame both two-arm and three-arm hypotheses in terms of log odds and/or relative risk ratios. For the two-arm trial, [30] proposed a fully Bayesian method for such metrics; their method is based on a fixed-margin approach, where margin construction was not the priority. Our group recently published (see [4,5]) conditional Frequentist and Bayesian tests for the risk ratio, odds ratio, and number needed to treat, which use an approach similar to that of the current paper, albeit without a direct mathematical proof of the power gain. The effect of prior misspecification in the NI context is also an open area of research; robust priors in the form of mixture distributions could lead to more stable and less sensitive results. Another interesting direction is a semi- or non-parametric extension of our approach: Ghosh et al. [16] proposed a semi-parametric extension of the Bayesian test procedure for continuous outcomes, which can be further extended to binary responses.

Disclosure statement
No potential conflict of interest was reported by the author(s).