Blinded sample size recalculation in clinical trials with binary composite endpoints

ABSTRACT We consider clinical trials with a binary composite endpoint where the trial is successful when a significant result is achieved for the composite or one prespecified main component. Appropriate sample size planning is challenging in this situation, as in addition to the Type I error rate, power, and target difference, the overall event rates and the correlation between the test statistics have to be specified. Reliable estimates of these quantities, however, are usually hard to obtain, and therefore there is a high risk of not achieving the intended power in a fixed sample size design. In this article, we propose an internal pilot study design where the nuisance parameters are estimated in a blinded way at an interim stage and the sample size is then revised accordingly. We investigate the characteristics of the proposed design with respect to the actual Type I error rate, power, and sample size. The application of the design is illustrated by a clinical trial example.


1. Introduction
Determination of the correct sample size is of utmost importance for clinical trials. A sample size that is too high exposes an unnecessarily large number of patients to interventions with a potentially unfavorable benefit-risk profile, while a number of patients that is too low leads to trials with a small chance of success. Both are unacceptable from an ethical and an economic perspective. Key components required for sample size calculation are the Type I and Type II error rates as well as the target difference, which should be both clinically important and realistic (Fayers et al., 2000). Furthermore, nuisance parameters, such as the variance when comparing normal means, are involved in calculating the required sample size. Established values are commonly employed for the Type I and Type II error rates (conventionally chosen as 0.025, one-sided, or 0.05, two-sided, and 0.20 or 0.10, respectively), and guidance exists on how to derive the target difference in a specific clinical setting (Cook et al., 2015). However, there is usually high uncertainty about the values of the nuisance parameters. Previous studies on the same or a similar research question are commonly used to obtain information on these quantities. However, due to differences in patient populations, interventions, or concomitant medication, these values are often highly variable (see, e.g., the examples given in Friede and Kieser, 2004, 2006; Golkowski et al., 2014; Kieser and Friede, 2003). Performing a pilot study just to obtain estimates of nuisance parameters for calculating the sample size of a subsequent trial is time-consuming and a waste of resources.
To resolve this problem, Wittes and Brittain (1990) proposed the internal pilot study (IPS) design, where the nuisance parameters are estimated at an interim stage when a prespecified portion of the anticipated target sample size has been enrolled. If the data question the initial assumptions on the nuisance parameters, the sample size is revised accordingly. In the first works published on this topic, estimation of the nuisance parameters required unblinding of the data. Later on, as an important advancement, estimators based on aggregated data were proposed, thus leaving the treatment group allocation blinded (e.g., for binary data as proposed by Gould, 1992a). This limits the risk of introducing bias and of compromising the integrity and interpretability of the trial, which would occur if the treatment effect could be guessed at an interim stage. Considerable research has been undertaken on this topic for various designs and scale levels of the outcome variable, for example for normal data (Gould and Shih, 1992b; Kieser and Friede, 2000; Friede and Kieser, 2013; Wachtlin and Kieser, 2013; Golkowski et al., 2014), binary data (Friede and Kieser, 2004; Friede et al., 2007), count data (Friede and Schmidli, 2010; Schneider et al., 2013), and survival data (Ingel and Jahn-Eimermacher, 2014). Furthermore, these methods have gained regulatory acceptance (ICH E9, 1999; CHMP, 2007), and the recent FDA draft guidance "Adaptive Design Clinical Trials for Drugs and Biologics" even states that "Sample size adjustment using blinded methods to maintain desired study power should generally be considered for most studies" (FDA, 2010). In this contribution, we consider clinical trials with a binary composite endpoint.
Such trials are frequently performed, for example in cardiology and neurology, to capture several facets of the treatment effect within a single binary variable, thus increasing the group difference and reducing multiplicity (Einarson et al., 2014; Ferreira-González et al., 2007; Neaton et al., 2005; Rauch et al., 2015). We assume that the trial is successful if a significant result is achieved for the composite endpoint or for one prespecified component. This is a common setting when one of the components constituting the composite is of major importance, for example mortality, and demonstrating superiority for this variable constitutes a definitive clinical advantage. As we will see, when planning such a trial and employing an adequate analysis strategy, not only the overall rates for the composite endpoint and the important component have to be specified, but additionally the correlation between the corresponding test statistics has to be quantified. These values are hard, if not impossible, to obtain from previous studies and are generally uncertain. This is a situation for applying an IPS design, which, however, has not been available for this setting up to now.
In this article, we propose an IPS design for the situation described above and investigate its performance characteristics. In the next section, we introduce the test problem, the multiple testing procedures, and the methodology for sample size calculation. In Section 3, we present a motivating clinical trial example demonstrating the weaknesses of the fixed sample size design in case of uncertain planning assumptions. In Section 4, the IPS design for the setting considered is introduced. Section 5 presents results on the actual Type I error rate of the proposed IPS design, which is crucial for application, especially in a regulatory environment. Furthermore, we explore the characteristics of the design in terms of power and sample size. We conclude with a discussion of the findings and an outlook on future research.
2. Test problem, multiple testing procedures, and sample size calculation

2.1. Test problem and multiple testing procedures

We consider randomized clinical trials where an intervention treatment (I) is compared to a control (C). For simplicity of presentation, we assume equal group sizes and denote the sample size per group by $n$. In the confirmatory analysis, a binary composite endpoint (CE) and a prespecified main component (MC) of the composite are assessed. Thereby, an event is assumed to be harmful, so fewer events are desirable. Without loss of generality, we assume that there is only one other component besides the main component, referred to as the second component (SC). In clinical trial applications, the single components are often not mutually exclusive, which means that overlapping may occur. To account for this overlap, the following event rates are defined. Let $p^{I}_{MC,ex}$, $p^{I}_{SC,ex}$ and $p^{C}_{MC,ex}$, $p^{C}_{SC,ex}$ denote the rates of patients in groups I and C showing exclusively an event in the main or the second component, respectively. In addition, $p^{I}_{MC,SC}$ and $p^{C}_{MC,SC}$ denote the rates of overlap, that is, the rates of patients who experienced both events. The overall event rates are then given by:

$$p^{g}_{MC} = p^{g}_{MC,ex} + p^{g}_{MC,SC}, \qquad p^{g}_{CE} = p^{g}_{MC} + p^{g}_{SC,ex}, \qquad g \in \{I, C\}.$$

The amount of overlap is only of interest if both components are analyzed, as the overlap rate influences the correlation between the components. Since in our work we focus on testing exclusively the MC in addition to the CE, the amount of overlap is not relevant. This way, the random vectors $X^{I}$ and $X^{C}$ of event counts can be assumed to be multinomially distributed with $X^{I} \sim \mathrm{multi}(n; p^{I}_{MC}, p^{I}_{SC,ex})$ and $X^{C} \sim \mathrm{multi}(n; p^{C}_{MC}, p^{C}_{SC,ex})$. In the case of testing more than one component in addition to the CE, the amount of overlap has to be considered directly in the approach by taking into account the correlation between the components. This correlation can be deduced as described by Sozu et al. (2010).
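As a small numerical illustration of this bookkeeping (with hypothetical rates, not taken from the article), the exclusive component rates and the overlap determine the overall rates that enter the multinomial model:

```python
# Hypothetical rates for one treatment group (illustration only):
p_mc_ex, p_sc_ex, p_mc_sc = 0.05, 0.10, 0.03  # exclusive MC, exclusive SC, overlap

p_mc = p_mc_ex + p_mc_sc   # overall main-component rate
p_ce = p_mc + p_sc_ex      # composite rate: event in MC, SC, or both

# Only p_mc and p_sc_ex enter the multinomial model of the event counts,
# X ~ multi(n; p_mc, p_sc_ex); the overlap matters only through p_mc.
print(round(p_mc, 4), round(p_ce, 4))  # -> 0.08 0.18
```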
The test problems for the two primary endpoints are then given by:

$$H^{CE}_{0}: p^{C}_{CE} \le p^{I}_{CE} \quad \text{vs.} \quad H^{CE}_{1}: p^{C}_{CE} > p^{I}_{CE}, \qquad (1)$$

$$H^{MC}_{0}: p^{C}_{MC} \le p^{I}_{MC} \quad \text{vs.} \quad H^{MC}_{1}: p^{C}_{MC} > p^{I}_{MC}. \qquad (2)$$

The trial is successful if at least one of the null hypotheses is rejected, i.e., the following union-intersection test problem is assessed:

$$H_{0}: H^{CE}_{0} \cap H^{MC}_{0} \quad \text{vs.} \quad H_{1}: H^{CE}_{1} \cup H^{MC}_{1}. \qquad (3)$$

Various multiple testing procedures exist for test problem (3) that control the experimentwise Type I error rate $\alpha$. For example, the Bonferroni procedure tests both elementary hypotheses at level $\alpha/2$. To reduce the conservativeness of this procedure, the correlation of the test statistics used for the assessment of the composite endpoint and the main component can be taken into account (Rauch and Kieser, 2012). We assume in the following that the null hypotheses $H^{CE}_{0}$ and $H^{MC}_{0}$ are tested by applying the normal approximation test for rates, where the test statistics $T_{CE}$ and $T_{MC}$ are given by:

$$T_{CE} = \frac{\hat{p}^{C}_{CE} - \hat{p}^{I}_{CE}}{\sqrt{2\,\hat{p}_{CE}(1-\hat{p}_{CE})/n}}, \qquad T_{MC} = \frac{\hat{p}^{C}_{MC} - \hat{p}^{I}_{MC}}{\sqrt{2\,\hat{p}_{MC}(1-\hat{p}_{MC})/n}},$$

with $\hat{p}_{CE} = (\hat{p}^{C}_{CE} + \hat{p}^{I}_{CE})/2$ and $\hat{p}_{MC} = (\hat{p}^{C}_{MC} + \hat{p}^{I}_{MC})/2$ denoting the estimators of the overall event rates. The correlation of the test statistics $T_{CE}$ and $T_{MC}$ is given by:

$$\mathrm{Corr}(T_{CE}, T_{MC}) = \frac{p^{C}_{MC}(1-p^{C}_{CE}) + p^{I}_{MC}(1-p^{I}_{CE})}{\sqrt{\left[p^{C}_{CE}(1-p^{C}_{CE}) + p^{I}_{CE}(1-p^{I}_{CE})\right]\left[p^{C}_{MC}(1-p^{C}_{MC}) + p^{I}_{MC}(1-p^{I}_{MC})\right]}}. \qquad (4)$$

Under $H_{0}$, it holds that $p^{C}_{MC} = p^{I}_{MC}$ and $p^{C}_{CE} = p^{I}_{CE}$, and thus the correlation simplifies to:

$$\mathrm{Corr}_{H_{0}}(T_{CE}, T_{MC}) = \sqrt{\frac{p_{MC}(1-p_{CE})}{p_{CE}(1-p_{MC})}}. \qquad (5)$$

Consequently, the experimentwise Type I error rate can be controlled by choosing the local levels $\alpha_{CE}$ and $\alpha_{MC}$ such that the following equation is fulfilled:

$$1 - \Phi_{2}\left(z_{1-\alpha_{CE}}, z_{1-\alpha_{MC}}\right) = \alpha, \qquad (6)$$

where $\Phi_{2}$ denotes the bivariate standard normal distribution function with correlation given by Equation (5) and $z_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the standard normal distribution (Rauch and Kieser, 2012). This method provides a "gain" in power compared to a simple additive splitting of $\alpha$ and enables a flexible allocation of the local significance level between the composite endpoint and the main component. In the following, we set $\alpha_{MC} = \alpha/2$ and allocate the "gained" level, which is larger than $\alpha/2$, to the composite endpoint.
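The level condition in Equation (6) has no closed-form solution for $\alpha_{CE}$, but it is a one-dimensional root-finding problem. The following sketch (Python/SciPy rather than the authors' R code; following the clinical trial example in Section 3, Equation (5) is evaluated at the control group rates) illustrates the computation:

```python
from math import sqrt

from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def corr_h0(p_ce, p_mc):
    """Correlation of T_CE and T_MC under H0 (Equation (5)) for common rates."""
    return sqrt(p_mc * (1 - p_ce) / (p_ce * (1 - p_mc)))

def local_level_ce(alpha, alpha_mc, rho):
    """Solve Equation (6) for alpha_CE, given alpha_MC and the H0 correlation."""
    z_mc = norm.ppf(1 - alpha_mc)

    def excess(alpha_ce):
        z_ce = norm.ppf(1 - alpha_ce)
        joint = multivariate_normal.cdf([z_ce, z_mc], mean=[0.0, 0.0],
                                        cov=[[1.0, rho], [rho, 1.0]])
        return (1 - joint) - alpha

    # alpha_CE lies between the Bonferroni level alpha/2 and alpha
    return brentq(excess, alpha / 2, alpha)

# TAXUS-IV planning values from Section 3 (control group rates under H0)
rho0 = corr_h0(0.12 + 0.065 / 2, 0.0125 + 0.003 / 2)
alpha_ce = local_level_ce(alpha=0.025, alpha_mc=0.0125, rho=rho0)
print(round(rho0, 4), round(alpha_ce, 5))
```

This should reproduce a correlation of about 0.2809 and a local level $\alpha_{CE}$ of about 0.0133, in line with the values reported in Section 3.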

2.2. Sample size calculation
For sample size calculation, the assumed alternative has to be defined by specifying the overall event rates $p_{CE}$ and $p_{MC}$ and the treatment group differences $\Delta_{CE} = p^{C}_{CE} - p^{I}_{CE}$ and $\Delta_{MC} = p^{C}_{MC} - p^{I}_{MC}$, respectively. Note that by providing overall event rates and group differences, the corresponding event rates in the groups are implicitly defined. The sample size required to achieve a power of $1-\beta$ is then chosen as the minimal value of $n$ that fulfills:

$$1 - \Phi_{2}\left(z_{1-\alpha_{CE}} - \Delta_{CE}\sqrt{\frac{n}{2\,p_{CE}(1-p_{CE})}},\; z_{1-\alpha_{MC}} - \Delta_{MC}\sqrt{\frac{n}{2\,p_{MC}(1-p_{MC})}}\right) \ge 1-\beta. \qquad (7)$$

For solving (7), we need to exploit the bivariate distribution of the test statistics $T_{CE}$ and $T_{MC}$ under the alternative hypothesis. Thus, we require the related correlation under $H_{1}$, which is given by Equation (4).
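The sample size search implied by Equation (7) can be sketched as follows (again a Python/SciPy sketch, not the original R implementation; the local levels and planning values are those of the example in Section 3, so the search should end near $n = 466$ per group):

```python
from math import sqrt

from scipy.stats import multivariate_normal, norm

def power(n, alpha_ce, alpha_mc, p_ce, p_mc, d_ce, d_mc):
    """Disjunctive power P(reject H0_CE or H0_MC) under the alternative."""
    pc_ce, pi_ce = p_ce + d_ce / 2, p_ce - d_ce / 2
    pc_mc, pi_mc = p_mc + d_mc / 2, p_mc - d_mc / 2
    # correlation under H1 (Equation (4))
    num = pc_mc * (1 - pc_ce) + pi_mc * (1 - pi_ce)
    den = sqrt((pc_ce * (1 - pc_ce) + pi_ce * (1 - pi_ce))
               * (pc_mc * (1 - pc_mc) + pi_mc * (1 - pi_mc)))
    rho = num / den
    # shifted critical values under H1, as in Equation (7)
    a = norm.ppf(1 - alpha_ce) - d_ce * sqrt(n / (2 * p_ce * (1 - p_ce)))
    b = norm.ppf(1 - alpha_mc) - d_mc * sqrt(n / (2 * p_mc * (1 - p_mc)))
    return 1 - multivariate_normal.cdf([a, b], mean=[0.0, 0.0],
                                       cov=[[1.0, rho], [rho, 1.0]])

# planning values of the TAXUS-IV example (alpha = 0.025, one-sided)
args = dict(alpha_ce=0.01326, alpha_mc=0.0125,
            p_ce=0.12, p_mc=0.0125, d_ce=0.065, d_mc=0.003)
n = 2
while power(n, **args) < 0.80:
    n += 1
print(n)
```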
In the next section, we will see that it is usually a difficult task to correctly anticipate the overall rates. On the other hand, the required sample size heavily depends on these quantities. Therefore, the fixed sample size design is prone to mis-specification of these nuisance parameters, and alternative approaches are desirable; such an approach is presented in Section 4.

3. Motivating clinical trial example
The TAXUS-IV trial was a randomized, double-blind study comparing a slow-release, polymer-based, paclitaxel-eluting stent (group I) to a bare-metal stent (C) in patients with a single, previously untreated stenosis (Stone et al., 2004). The primary endpoint was the nine-month incidence of ischemia-driven target-vessel revascularization. As secondary endpoints, major adverse cardiac events (CE), defined as a composite of death from cardiac causes, myocardial infarction, or ischemia-driven target-vessel revascularization, and death from cardiac causes (MC) were analyzed. For illustrative purposes, we assume that a study in a similar setting is to be planned with CE and MC as primary endpoints and that the results of TAXUS-IV are used for this purpose. An example of this situation is the TAXUS-V trial, where the same research question as in TAXUS-IV was investigated, but in a patient population with more complex lesions (Stone et al., 2005). The overall rates and treatment group differences observed in TAXUS-IV were $\hat{p}_{CE} = 0.12$ and $\hat{p}_{MC} = 0.0125$ as well as $\hat{\Delta}_{CE} = 0.065$ and $\hat{\Delta}_{MC} = 0.003$, respectively. We assume that the observed differences are both clinically relevant and realistic for a future trial and thus use the values $\Delta_{CE} = \hat{\Delta}_{CE}$ and $\Delta_{MC} = \hat{\Delta}_{MC}$ for sample size calculation. Furthermore, we assume the true overall rates to equal those observed in TAXUS-IV. The correlation of the test statistics under the null and the alternative hypothesis, respectively, is then $\mathrm{Corr}_{H_{0}}(T_{CE}, T_{MC}) = 0.2809$ (evaluating Equation (5) at the control group rates) and $\mathrm{Corr}_{H_{1}}(T_{CE}, T_{MC}) = 0.3049$. For $\alpha = 0.025$, the local levels to be used for the multiple testing procedure described in the preceding section are $\alpha_{CE} = 0.01326$ and $\alpha_{MC} = 0.0125$ (solving Equation (6)), respectively. The quantity $\alpha_{CE} - 0.0125 = 0.00076$ constitutes the "gain" in terms of the significance level, which is achieved by incorporating the correlation under the null hypothesis.
With these definitions, the required sample size to achieve a power of $1-\beta = 0.80$ is $n = 466$ per group. However, in TAXUS-V overall event rates of 0.181 for the composite endpoint and 0.007 for the main component were observed. Using these values for the overall event rates in the sample size calculation (but otherwise leaving the other assumptions unchanged) results in an adjusted alpha level of $\alpha_{CE} = 0.01296$ and a required sample size of 652 per group. Conversely, in a fixed sample size design with 466 patients per group, the power to reject at least one of the two null hypotheses would amount to only 0.6498, which is far below the aspired value.
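The power loss quoted above can be roughly checked with the following sketch (not the original R code): keep the planning values $\alpha_{CE} = 0.01326$, $n = 466$, $\Delta_{CE} = 0.065$, and $\Delta_{MC} = 0.003$, but let the true overall rates be those observed in TAXUS-V:

```python
from math import sqrt

from scipy.stats import multivariate_normal, norm

def disjunctive_power(n, alpha_ce, alpha_mc, p_ce, p_mc, d_ce, d_mc):
    """Left-hand side of Equation (7) for given rates and effects."""
    pc_ce, pi_ce = p_ce + d_ce / 2, p_ce - d_ce / 2
    pc_mc, pi_mc = p_mc + d_mc / 2, p_mc - d_mc / 2
    num = pc_mc * (1 - pc_ce) + pi_mc * (1 - pi_ce)
    den = sqrt((pc_ce * (1 - pc_ce) + pi_ce * (1 - pi_ce))
               * (pc_mc * (1 - pc_mc) + pi_mc * (1 - pi_mc)))
    rho = num / den  # correlation under H1, Equation (4)
    a = norm.ppf(1 - alpha_ce) - d_ce * sqrt(n / (2 * p_ce * (1 - p_ce)))
    b = norm.ppf(1 - alpha_mc) - d_mc * sqrt(n / (2 * p_mc * (1 - p_mc)))
    return 1 - multivariate_normal.cdf([a, b], mean=[0.0, 0.0],
                                       cov=[[1.0, rho], [rho, 1.0]])

# true rates as observed in TAXUS-V, design unchanged
pw = disjunctive_power(466, 0.01326, 0.0125, 0.181, 0.007, 0.065, 0.003)
print(round(pw, 3))  # close to the 0.6498 quoted above
```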
The IPS design described in the next section copes with this uncertainty and thus aims at achieving an improved robustness of power in comparison with the fixed design.

4. Proposed internal pilot study design
The IPS design starts with the calculation of an initial sample size $n_{init}$ per group for a specified global Type I error rate $\alpha$, power $1-\beta$, and treatment group differences $\Delta_{CE}$ and $\Delta_{MC}$, employing anticipated values for the overall rates $p_{CE}$ and $p_{MC}$. After part of the sample has been recruited and data of $n_{1} < n_{init}$ patients per group are available for the primary outcomes, the overall event rates and the correlation are estimated from the pooled sample, i.e., by

$$\hat{p}^{blind}_{CE} = \frac{X_{CE,1}}{2 n_{1}}, \qquad \hat{p}^{blind}_{MC} = \frac{X_{MC,1}}{2 n_{1}},$$

where $X_{CE,1}$ and $X_{MC,1}$ denote the total numbers of first-stage patients with a composite and a main-component event, respectively, and by inserting these blinded rates into Equation (5). Note that the resulting correlation is used both as an estimate for the correlation under $H_{0}$ and for the correlation under $H_{1}$, being aware that these quantities are in fact not the same. This blinded correlation estimate is biased and lies somewhere in between the correlations under $H_{0}$ and $H_{1}$; the bias depends on the true underlying treatment effects and overall event rates. Employing these blinded estimates, the local levels and the required sample size are recalculated ($n_{recalc}$). We consider two different approaches, denoted as unrestricted and restricted recalculation. In the unrestricted case, just the remaining number of patients is recruited if the recalculated sample size exceeds $n_{1}$ ($n_{2} = \max\{0, n_{recalc} - n_{1}\}$). In contrast, in the restricted case the total sample size is additionally capped from above at $2 n_{init}$ ($n_{2} = \min\{\max\{0, n_{recalc} - n_{1}\}, 2 n_{init} - n_{1}\}$) (Birkett and Day, 1994), reflecting a more practical implementation with regard to aspects such as feasibility of recruitment and financing. The final analysis is then based on all patients included before and after the sample size review.
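One blinded review under the restricted rule can be sketched as follows (Python/SciPy rather than the authors' R code; the helper functions mirror Equations (5)-(7)). The interim event counts below are hypothetical and chosen so that the blinded rates roughly match the TAXUS-V values of Section 3, so the recalculated sample size should land near, though not exactly at, the 652 per group derived there:

```python
from math import sqrt

from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def bvn_cdf(a, b, rho):
    return multivariate_normal.cdf([a, b], mean=[0.0, 0.0],
                                   cov=[[1.0, rho], [rho, 1.0]])

def local_level_ce(alpha, alpha_mc, rho):
    """Solve Equation (6) for alpha_CE."""
    z_mc = norm.ppf(1 - alpha_mc)
    f = lambda a_ce: (1 - bvn_cdf(norm.ppf(1 - a_ce), z_mc, rho)) - alpha
    return brentq(f, alpha / 2, alpha)

def required_n(alpha_ce, alpha_mc, p_ce, p_mc, d_ce, d_mc, rho, beta=0.2):
    """Smallest n per group fulfilling Equation (7)."""
    z_ce, z_mc = norm.ppf(1 - alpha_ce), norm.ppf(1 - alpha_mc)
    n = 2
    while True:
        a = z_ce - d_ce * sqrt(n / (2 * p_ce * (1 - p_ce)))
        b = z_mc - d_mc * sqrt(n / (2 * p_mc * (1 - p_mc)))
        if 1 - bvn_cdf(a, b, rho) >= 1 - beta:
            return n
        n += 1

def blinded_review(x_ce, x_mc, n1, n_init, alpha=0.025, d_ce=0.065, d_mc=0.003):
    p_ce = x_ce / (2 * n1)   # blinded overall rate estimates from pooled data
    p_mc = x_mc / (2 * n1)
    rho = sqrt(p_mc * (1 - p_ce) / (p_ce * (1 - p_mc)))  # Equation (5)
    alpha_ce = local_level_ce(alpha, alpha / 2, rho)     # re-adjusted level
    # the same blinded correlation also serves as the H1 correlation here
    n_recalc = required_n(alpha_ce, alpha / 2, p_ce, p_mc, d_ce, d_mc, rho)
    n2 = min(max(0, n_recalc - n1), 2 * n_init - n1)     # restricted rule
    return alpha_ce, n_recalc, n2

# hypothetical interim look after n1 = 233 of n_init = 466 patients per group
alpha_ce, n_recalc, n2 = blinded_review(x_ce=84, x_mc=3, n1=233, n_init=466)
print(round(alpha_ce, 5), n_recalc, n2)
```

The prespecified effects $\Delta_{CE}$ and $\Delta_{MC}$ are reused unchanged in the recalculation, as described above; only the nuisance parameters are updated.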
An important prerequisite for practical application of the above-described procedure is adequate control of the Type I error rate. As we use the normal approximation test for rates in the analysis, strict control of the significance level is only assured asymptotically, even in the fixed sample size design. For an appropriate comparison with the IPS design, we therefore provide the exact level of the normal approximation test, which is given by:

$$\alpha_{act} = \sum_{x^{I}} \sum_{x^{C}} I\left\{T_{CE} \ge z_{1-\alpha_{CE}} \;\lor\; T_{MC} \ge z_{1-\alpha_{MC}}\right\} \, P_{H_{0}}\!\left(X^{I} = x^{I}\right) P_{H_{0}}\!\left(X^{C} = x^{C}\right), \qquad (8)$$

where $I\{\cdot\}$ denotes the indicator function, the sums run over all possible realizations of the multinomial vectors $X^{I}$ and $X^{C}$ defined in Section 2.1, and the test statistics are evaluated at the respective realizations. For the IPS design, the actual Type I error rate is given by:

$$\alpha_{act} = \sum_{x^{I}_{1}} \sum_{x^{C}_{1}} \sum_{x^{I}_{2}} \sum_{x^{C}_{2}} I\left\{T_{CE} \ge z_{1-\tilde{\alpha}_{CE}} \;\lor\; T_{MC} \ge z_{1-\alpha_{MC}}\right\} \prod_{g \in \{I,C\}} P_{H_{0}}\!\left(X^{g}_{1} = x^{g}_{1}\right) P_{H_{0}}\!\left(X^{g}_{2} = x^{g}_{2}\right), \qquad (9)$$

where the additional indices 1 and 2 indicate the stage within the IPS design, the second-stage sample size and the re-adjusted local level $\tilde{\alpha}_{CE}$ are determined by the first-stage realizations, and the test statistics are based on the pooled data of both stages. The power can be computed by evaluating the probabilities in Equation (9) under the specified alternative hypothesis. In the next section, we investigate the properties of the IPS design described above in terms of the Type I error rate, power, and required sample size.
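An enumeration in the spirit of Equation (8) can be sketched for a deliberately small per-group sample size, so that the sum over all multinomial outcomes stays cheap (for simplicity, this sketch uses the Bonferroni split $\alpha_{CE} = \alpha_{MC} = 0.0125$ instead of the correlation-adjusted levels, and the rates are hypothetical):

```python
from math import sqrt

from scipy.stats import multinomial, norm

n = 25
p_mc, p_ce = 0.08, 0.25                # common rates under H0 (hypothetical)
probs = [p_mc, p_ce - p_mc, 1 - p_ce]  # MC event, SC-only event, no event
z_ce = z_mc = norm.ppf(1 - 0.0125)     # Bonferroni local levels

# enumerate all outcomes of one group: (MC events, CE events, probability)
outcomes = []
for x_mc in range(n + 1):
    for x_sc in range(n + 1 - x_mc):
        pr = multinomial.pmf([x_mc, x_sc, n - x_mc - x_sc], n, probs)
        outcomes.append((x_mc, x_mc + x_sc, pr))

def rejects(mc_c, ce_c, mc_i, ce_i):
    """Normal approximation tests; reject if either endpoint is significant."""
    rej = False
    for ev_c, ev_i, z in ((ce_c, ce_i, z_ce), (mc_c, mc_i, z_mc)):
        pbar = (ev_c + ev_i) / (2 * n)       # pooled rate estimate
        if 0 < pbar < 1:
            t = (ev_c - ev_i) / n / sqrt(2 * pbar * (1 - pbar) / n)
            rej = rej or t >= z
    return rej

# Equation (8): sum the outcome probabilities over the rejection region
level = sum(pr_c * pr_i
            for mc_c, ce_c, pr_c in outcomes
            for mc_i, ce_i, pr_i in outcomes
            if rejects(mc_c, ce_c, mc_i, ce_i))
print(round(level, 4))
```

With such small counts the discreteness of the multinomial distribution dominates, which is exactly why the exact level can deviate noticeably from the nominal one.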

5. Results

5.1. Parameter scenarios
Throughout this section, we apply a one-sided overall Type I error rate of $\alpha = 0.025$ and aim at achieving a power of $1-\beta = 0.80$. For our investigations, we consider three parameter constellations for $p_{MC}$, $\Delta_{MC}$, and $\Delta_{CE}$ and vary the overall event rate $p_{CE}$ over a predefined range. Within these three scenarios, a broad range of settings reflecting real-world applications is evaluated. Scenario 1 is motivated by the results of the TAXUS-IV trial presented in Section 3, i.e., we choose the overall rates as $p_{MC} = 0.0125$ and $p_{CE} \in \{0.06, 0.10, \ldots, 0.22\}$ and the treatment group differences as $\Delta_{MC} = 0.003$ and $\Delta_{CE} = 0.06$. For the initial alpha adjustment and the calculation of $n_{init}$, $p_{CE}$ is assumed to be 0.15, which reflects the situation of mis-specification at the planning stage and results in $n_{init} = 547$. Results presented for the fixed design for comparison are based on this calculated $n_{init}$. Scenario 2 represents situations with event rates more in the middle of the distribution, with $p_{MC} = 0.075$, $p_{CE} \in \{0.1, 0.15, \ldots, 0.5\}$, $\Delta_{MC} = 0.05$, and $\Delta_{CE} = 0.1$; here, $p_{CE}$ is initially assumed to be 0.35, which results in $n_{init} = 308$. Scenario 3 was chosen such that small sample sizes result, with $p_{MC} = 0.08$, $p_{CE} \in \{0.15, 0.2, \ldots, 0.5\}$, $\Delta_{MC} = 0.06$, and $\Delta_{CE} = 0.18$; the initial $p_{CE}$ is assumed to be 0.25, which results in $n_{init} = 103$. For the three scenarios, different sizes of $n_{1}$ are investigated, namely $\frac{1}{3} n_{init}$, $\frac{1}{2} n_{init}$, and $\frac{3}{4} n_{init}$ (rounded up to the next integer). The values $\hat{p}^{blind}_{CE}$, $\hat{p}^{blind}_{MC}$, and $\widehat{\mathrm{Corr}}^{blind}(T_{CE}, T_{MC})$ are estimated based on $n_{1}$ and are also used for the re-adjustment of $\alpha_{CE}$. For sample size recalculation, the prespecified differences $\Delta_{CE}$ and $\Delta_{MC}$ are used both under the null and the alternative hypothesis. Note that even for the Type I error, sample size recalculation requires usage of the predefined treatment effects.
All calculations and simulations were performed using R version 3.1.2 (R Core Team, 2014). Results for the Type I error rate in the fixed sample size design are calculated based on Equation (8), and those for power based on (7), given the adjusted $\alpha_{CE}$ and a sample size that is correct only under the initially assumed event rates. Under the IPS design, the calculation time increases considerably, since estimation of the parameters is required for all possible outcomes of the first stage, the adjusted alpha level and the sample size have to be recalculated, and all possible Stage 2 realizations have to be investigated for each Stage 1 constellation. The calculation time can be decreased by excluding parameter constellations with low probability mass. However, several days are still required for the calculation of one specific value (running in parallel on 30 cores), with the time increasing exponentially with the sample size. Since this is not feasible for all the scenarios considered here, we present simulated results in the following sections. These are based on 1,000,000 runs to guarantee high precision. Furthermore, our calculations require the summation of a large number of small probabilities (sample size calculation, sample size recalculation, Type I error, and power). Therefore, the precision of the calculations and simulations has to be considered. Precision can be improved through simple programming steps such as defining event rates as 1/10 instead of 0.1. Furthermore, the R package Rmpfr (Maechler, 2014) can be used to increase the internal precision of values and functions; it had a negligible effect on our results but dramatically increased the calculation time. A comparison of simulated and calculated results confirmed their precision, supporting the practicability of this approach for evaluating the performance of the design.
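As a cheap cross-check of such enumerations, the fixed-design Type I error rate can also be simulated in a vectorized way. The following sketch (Python/NumPy, not the authors' R code) uses Scenario-2-like rates under $H_{0}$ and, for simplicity, the Bonferroni split instead of the correlation-adjusted local levels; the seed and number of runs are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
reps, n = 200_000, 308
p_mc, p_ce = 0.075, 0.35               # common rates under H0 (Scenario-2-like)
probs = [p_mc, p_ce - p_mc, 1 - p_ce]  # MC event, SC-only event, no event
z = norm.ppf(1 - 0.0125)               # Bonferroni local level

def draw(size):
    """Draw per-group event rates (MC, CE) from the multinomial model."""
    x = rng.multinomial(n, probs, size=size)
    return x[:, 0] / n, (x[:, 0] + x[:, 1]) / n

mc_c, ce_c = draw(reps)  # control group under H0
mc_i, ce_i = draw(reps)  # intervention group under H0

def t_stat(pc, pi):
    pbar = (pc + pi) / 2
    se = np.sqrt(2 * pbar * (1 - pbar) / n)
    # guard against the (practically impossible) all-zero/all-one case
    return np.where(se > 0, (pc - pi) / np.where(se > 0, se, 1.0), -np.inf)

reject = (t_stat(ce_c, ce_i) >= z) | (t_stat(mc_c, mc_i) >= z)
print(round(reject.mean(), 4))
```

With 200,000 runs the Monte Carlo standard error of the estimated level is about 0.0004, which is already sufficient to distinguish conservative from anticonservative behavior.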

5.2. Actual Type I error
The left panel of Figure 1 shows the actual Type I error rate of the normal approximation test for the considered scenarios with initial mis-specification of the overall event rate of the composite endpoint, both for the fixed sample size design and for the restricted IPS design. There, $n_{0}$ denotes the sample size per group that would be required in a fixed sample size design to reach the desired power of 80% for the specified alternative.
For the fixed design in Scenario 2, a very conservative Type I error rate is observed for composite event rates lower than the initially assumed rate of 0.35. This is due to the fact that the assumed correlation under $H_{0}$ used for defining the adjusted $\alpha_{CE}$ is lower than the true correlation for $p^{C}_{CE} < 0.35$ (see supplementary material). Therefore, the nominal alpha level is not exhausted. In Scenario 3, the opposite effect is observed: anticonservative performance arises because the adjusted alpha level is too high for higher event rates in the CE. In Scenario 1, the impact of the mis-specified event rate on the actual Type I error rate is smaller in the fixed design, since the correlations are more similar across the investigated event rates and mis-specification does not lead to adjusted alpha levels that are as conservative or anticonservative as in the other scenarios. However, in Scenario 1 the actual levels of both the fixed and the IPS design are roughly constant and considerably below 0.025. This is due to the approximate nature of the distribution of the test statistics, which is better in the middle of the range of event rates than at its edges. For Scenario 2, both designs are slightly anticonservative, with a maximum Type I error rate below 0.0255 for all investigated settings. The same is observed for low event rates in Scenario 3. The results for the unrestricted recalculation show no substantial deviation regarding Type I error rates (see supplementary material).
In conclusion, the exhaustion of the nominal $\alpha$ level is better for the IPS design than for the fixed design, especially in the case of mis-specifications at the planning stage, and no relevant exceedance of the nominal level is observed.

5.3. Power and required sample size
The right panel of Figure 1 shows the power achieved in the three scenarios with initial mis-specification of the overall event rate of the composite endpoint, both for the fixed sample size design and for the restricted IPS design. It is obvious that in cases where mis-specifications at the planning stage result in a higher sample size than necessary, the resulting power in a fixed design is higher than intended. The same arises when applying an IPS design if $n_{1}$ is already higher than the truly required sample size $n_{0}$.
In settings where $n_{1} < n_{0}$, the IPS design meets the anticipated power quite well, with the exception of very small internal pilot studies (Scenario 3, $n_{1} = 35$), which lead to slightly underpowered studies. In the fixed sample size design, by contrast, the loss in power under mis-specification can be considerable. The distribution of the final sample sizes (under $H_{1}$) for the IPS design is depicted as boxplots in Figure 2. The smaller the internal pilot study, the higher the variability of the recalculated sample size. Again, the unrestricted and restricted recalculation approaches show very similar performance characteristics (see supplementary material).

6. Discussion
We proposed an IPS design for clinical trials that aim at demonstrating a significant treatment effect for a composite endpoint or one prespecified main component. The procedure fulfills all major requirements to make it attractive for practical application. First, the design modification is based on blinded data, thus maintaining the trial integrity. Moreover, the procedure is easy to implement, with minimal additional logistics required. Furthermore, the actual Type I error rate was shown to be comparable to that of the fixed sample size design, or even better in terms of alpha exhaustion. In most of the considered scenarios, the actual level of the IPS design lies below the nominal level, and it exceeds it only slightly in the remaining cases. The starting point of our research was the unsatisfactory sensitivity of the fixed sample size design with respect to uncertain planning assumptions. In this respect, the IPS design turned out to be robust against initial mis-specifications of the nuisance parameters.

Figure 2. Simulated empirical distribution of the final sample size. The boxes show the median and interquartile range of the final sample size ($n_{1} + n_{2}$), and the whiskers indicate the minimum and maximum. Results are shown for the investigated Scenarios 1 (top), 2 (middle), and 3 (bottom), applying three different sample sizes of the internal pilot study and using the unrestricted recalculation approach (as described in Section 5.1). Underlying parameters: $\Delta_{CE} = 0.06$, $p_{MC} = 0.0125$, and $\Delta_{MC} = 0.003$ (Scenario 1); $\Delta_{CE} = 0.1$, $p_{MC} = 0.075$, and $\Delta_{MC} = 0.05$ (Scenario 2); $\Delta_{CE} = 0.18$, $p_{MC} = 0.08$, and $\Delta_{MC} = 0.06$ (Scenario 3); with the initial assumption for $p^{C}_{CE}$ of 0.15, 0.35, and 0.34, respectively. Below the plots, the truly required sample sizes per group $n_{0}$ are given in grey.
In fact, even if only crude estimates of these quantities are available upfront, the desired power is achieved as long as the recalculation is not performed too late, i.e., with a sample size that already exceeds the required one. However, with decreasing size of the internal pilot study the precision of the estimates decreases as well, and therefore the recalculated sample size shows higher variability. As already observed in other situations, the power achieved by the IPS design may be slightly lower than aspired when the sample size review is based on a small sample. An adjustment factor may then be used to expand the recalculated sample size (Zucker et al., 1999; Friede and Kieser, 2011). These considerations might help in choosing a value of $n_{1}$. However, a simple and universally valid rule of thumb cannot be provided, especially as the timing of the sample size review is usually determined in part by the recruitment and follow-up times.
We provided formulas for the analytical calculation of the Type I error rate and power, thus enabling exact enumeration of these quantities where they are of interest or required, for instance, by regulatory agencies. For the reasons specified above, we presented simulated results for the IPS design here. The programs for exact calculations and simulations can be obtained from the first author upon request.
This research can be extended in several ways. While we considered unrestricted designs, whose final sample size is exclusively determined by the mid-course estimates of the nuisance parameters, and a restricted approach with an upper boundary for the sample size, the methods can also handle restricted designs with both lower and upper boundaries. Furthermore, a generalization to the situation where more than one component is assessed in the confirmatory analysis may be desirable. Here, the definition of an adequate multiple testing strategy is already a point of discussion in the fixed design. Further difficulties in interpretation arise when the observed effects in the composite and the components point in opposite directions. Requiring a significant effect not only in one of the endpoints but in all of them could be a solution to this problem.
Finally, if the follow-up time required to measure the outcome of the endpoints is long as compared to the recruitment speed, the enrolment of patients has to be interrupted until the result of the sample size recalculation is available; otherwise overrun occurs. Including short-term information in the recalculation of the sample size may then be a valuable option. We are currently investigating whether methods developed for a single primary binary outcome (Wuest and Kieser, 2005) can also be fruitfully applied in clinical trials with multiple objectives.