Efficiency Comparison of Analysis Methods for Recurrent Event and Time-to-First Event Endpoints in the Presence of Terminal Events—Application to Clinical Trials in Chronic Heart Failure

Abstract In clinical trials investigating treatments for chronic heart failure (CHF), a standard primary endpoint is the time to the first occurrence of the composite of hospitalization for heart failure or cardiovascular death (CVD). Since many patients experience several hospitalizations, there is interest in including recurrent hospitalizations in the primary endpoint to better capture the disease burden. Several analysis methods have been proposed for recurrent event endpoints, mostly for the situation without a terminal event such as CVD. Only some methods explicitly account for terminal events, for example, the joint frailty model. We compare the power and Type I error rate of recurrent event methods with those of time-to-first event methods in the presence of terminal events. A special focus is the situation where treatment affects the risk of CVD. Our investigations are based on a simulation study, for which the scenarios are motivated by CHF trials, and on bootstrapping data from the Val-HeFT and PARADIGM-HF trials. We find that recurrent event methods can in many situations increase power considerably. But this is not always the case, for example, when there is a high probability of treatment discontinuation after the first hospitalization. Also, for both recurrent and time-to-first event methods, the Type I error rate can be inflated in the case of a detrimental treatment effect on CVD. Based on the simulation results, we give recommendations on the choice of endpoint and analysis method for CHF trials.


Introduction
Heart failure (HF) is the inability of the heart to pump sufficiently to maintain the blood flow that the body needs. The disease presents a major public health issue with an increasing prevalence due to an aging population. With the availability of better medical care over the past decades, heart failure has developed from a relatively short-term fatal disease to a more chronic one. Subtypes of chronic heart failure (CHF) are often defined according to the value of the left ventricular ejection fraction (LVEF), with patients having lower values of LVEF being referred to as having heart failure with reduced ejection fraction (HFrEF) and those with higher values as having heart failure with preserved ejection fraction (HFpEF). There are currently several treatments approved for HFrEF, but none for HFpEF. See Ponikowski et al. (2016) for further background information.
In clinical trials investigating new treatments for CHF, a standard primary endpoint is the time to the first occurrence of the composite of hospitalization for heart failure (HHF) or cardiovascular death (CVD). This endpoint has the limitation that the chronic nature of the disease, which manifests in recurrent events (recurrent hospitalizations for HF), is only partly captured. The overview of Anker and McMurray (2012) showed that a time-to-first composite event analysis ignores 40%-50% of all HHF and CVD events. There is therefore interest in including recurrent HHF in the primary endpoint of HF trials to more fully capture the disease burden. The use of these additional events might also increase the statistical efficiency and lead to an increase in statistical power or allow a reduction in sample size. This expectation is driven by promising recurrent event post hoc analyses of published HF trials, which originally used the time to first composite event as primary endpoint (Rogers et al. 2012, 2014; Mogensen et al. 2018). With PARAGON-HF, the first large drug trial in HF including recurrent HHF in the primary endpoint was recently reported (Solomon et al. 2019).
The CHMP (2017) guideline on developing treatments for CHF recognizes the clinical meaningfulness of recurrent HHF. At the same time, it mentions that there are considerable challenges in analyzing recurrent event data in the presence of a terminal event, typically death. The FDA draft guidance on endpoints in heart failure (FDA 2019) mentions as well that recurrent HF events can be incorporated into the analysis of HF trials, without discussing statistical details. Statistical challenges arise because obviously no HHF can be observed after death and because HHF and CVD events share, partly unobserved, risk factors. These risks can therefore usually not be assumed to be independent, even conditional on baseline covariates. To help address these analysis challenges, a consortium of industry and academic statisticians requested a CHMP Qualification Opinion on the use of recurrent events. Such an opinion is a qualification of novel methods for medicine development and is typically given for a new biomarker or supporting device, but it can also be applied to statistical methods, as was done for the MCP-Mod method for dose finding (EMA 2014). The final qualification opinion on the use of recurrent events was recently released after public consultation (EMA 2020). The current article as well as the two companion articles in this issue (Schmidli et al. 2021; Wei et al. 2021) are based on the request for the qualification opinion. The focus of the current article is the application of recurrent event methods to CHF trials, investigating the potential gain in statistical efficiency from the use of recurrent HHF in simulation studies and actual clinical trials. There have been previous comparisons of the efficiency of recurrent event and time-to-first event methods, including Metcalfe and Thompson (2006) and Jahn-Eimermacher (2008), which considered the situation without terminal events. A further study investigated the specific situation of heart failure trials, but applied only a limited set of statistical methods; it also did not investigate the behavior of methods in the presence of a treatment effect on CVD. Both of these points are addressed in the current article.
Section 2 introduces the methods and hypotheses to be tested, and Section 3 describes the setup of the simulation study and discusses its results. Section 4 presents an overview of results published in the literature and provides a bootstrap-based power comparison of methods based on the Val-HeFT and PARADIGM-HF trials. Section 5 discusses the results of the current article, also in relation to the two companion articles.

Notations
The time to CVD is modeled by a random variable D and the times of HHF are described by random variables $T_1 < T_2 < \cdots < T_N$, with N being the number of HHF up to time D. In case of N = 0, only $T_0 = 0$ is defined. CVD is assumed to be the only terminal event. The counting processes $N_1(t) := \sum_{k \ge 1} I(T_k \le t)$ and $N_2(t) := I(D \le t)$ describe the number of HHF and the information on CVD up to time t, respectively. Furthermore, let C be a noninformative right-censoring time. The individual's follow-up ends at $F := \min(C, D)$.

Common Test Hypotheses
We compare different statistical analysis methods when applied to data that follow a particular data-generating model. As data-generating model we use a joint frailty model (JFM) and define the test hypotheses within that model. The JFM reflects potential associations between the conditional rates of HHF and CVD by introducing a random term that acts multiplicatively on both of these rates. Furthermore, the model allows for different treatment effects on the rates of HHF and CVD and reflects that the rates for HHF and CVD can differ. It has therefore been proposed for estimating separate treatment effects on HHF and CVD in heart failure trials (Rogers et al. 2016). Within the JFM, the conditional event rate functions of the processes $N_1(t)$ and $N_2(t)$ are modeled. These rates are defined not only conditional on being alive and on the history of both processes, H(t), including the treatment indicator X as a baseline covariate, but also conditional on some positive-valued random variable Z. This frailty term Z may be considered as summarizing the unexplained heterogeneity between subjects. Treatment effects, or more generally covariate effects, are defined conditional on Z, as the conditional rates are assumed to be of the proportional form
$$\lambda_{\mathrm{HHF}}(t \mid Z, H(t), X = x) = Z\,\lambda_{0,\mathrm{HHF}}(t)\,\exp(\beta_1 x), \qquad \lambda_{\mathrm{CVD}}(t \mid Z, H(t), X = x) = Z^{\gamma}\,\lambda_{0,\mathrm{CVD}}(t)\,\exp(\beta_2 x),$$
for some γ ≥ 0 and $\lambda_{0,\mathrm{HHF}}$ and $\lambda_{0,\mathrm{CVD}}$ being nonnegative functions. By modeling a common Z, an association is induced between both processes. Whereas γ = 1 refers to a shared random term, with γ < 1 or γ > 1 the conditional rates of $N_2(t)$ (CVD) vary less or more than those of $N_1(t)$ (HHF), respectively.
We will consider the analysis of two different endpoints: HHF only, which refers to $N_1(t)$ alone, and the composite of HHF and CVD, which refers to both $N_1(t)$ and $N_2(t)$. In the latter, CVD before censoring is handled as an event and terminates the follow-up. The common hypotheses are defined separately for these two endpoints:
Endpoint HHF: $H_0^{\mathrm{HHF}}: \beta_1 = 0$ versus $H_1^{\mathrm{HHF}}: \beta_1 < 0$.
Endpoint HHF+CVD: $H_0^{\mathrm{HHF+CVD}}: \beta_1 = \beta_2 = 0$ versus $H_1^{\mathrm{HHF+CVD}}: \beta_1 \le 0$ and $\beta_2 \le 0$, with at least one inequality strict.
For each of these hypotheses, power and Type I error rate of the statistical methods described in Section 2.3 are investigated separately in a simulation study. Thereby, we define power as the probability that the particular analysis method reveals a statistically significant treatment effect under the alternative hypothesis of the data-generating model ($H_1^{\mathrm{HHF}}$ or $H_1^{\mathrm{HHF+CVD}}$). Note that the particular analysis method might not have been designed exactly for this hypothesis and may rely on different modeling assumptions. Therefore, control of falsely rejecting the null hypothesis of the data-generating model ($H_0^{\mathrm{HHF}}$ or $H_0^{\mathrm{HHF+CVD}}$) is not ensured. Thus, while our main interest is in comparing power, a valid comparison of power can only be made between those methods that are observed to control the Type I error rate, which we define as the probability of falsely rejecting the null hypothesis of the data-generating model. For this reason, we also investigate the Type I error rate for all the considered statistical methods. The test hypotheses refer to $\beta_1$ and $\beta_2$, which describe conditional and not marginal treatment effects.
Note that for the composite endpoint HHF+CVD it could happen that there is no difference between treatment groups in the all-cause hazard even though there is a treatment effect on each cause-specific hazard. This is because the all-cause hazard is the sum of the cause-specific hazards, so cause-specific hazards that differ between treatment groups in opposite directions can still add up to a common all-cause hazard.

Cox Proportional Hazards Regression
As a reference for analyzing time-to-first event only, we use Cox proportional hazards regression. The Cox regression will be applied to analyze time to first HHF and time to first HHF or CVD, respectively, by modeling the hazard rate of the process $N(t) := I(T_1 \le t)$ and the hazard rate of the process $\tilde{N}(t) := I(\min(D, T_1) \le t)$, respectively. The hazard rates of both processes refer to the event rate function conditional on being alive and free of any event, that is, conditional on $D \ge t$ and $T_1 \ge t$. In both cases, a proportional hazards assumption is made,
$$\lambda(t \mid H(t), X = x) = \lambda_0(t)\,\exp(\beta x),$$
with H(t) being the history of the respective process. By maximizing the partial likelihood, we get a treatment effect estimate $\hat{\beta}$.
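As an illustration, this time-to-first event analysis can be carried out with the survival package in R. The following is a minimal sketch only; the data frame dat and its columns (tfe, status, trt) are hypothetical names, not taken from the trials analyzed here.

```r
library(survival)

# Hypothetical one-row-per-patient data: 'tfe' = time to first event or
# censoring, 'status' = 1 if first HHF (or first HHF/CVD) was observed,
# 0 if censored, 'trt' = treatment indicator (1 = active, 0 = placebo).
fit_cox <- coxph(Surv(tfe, status) ~ trt, data = dat)
summary(fit_cox)  # Wald test for beta; exp(beta) is the estimated hazard ratio
```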

Quasi-Poisson Model (QP)
For defining the QP we start by considering a Poisson regression. Depending on the considered endpoint, we define Y as the number of HHFs or the number of HHFs and CVD up to time F, that is, $Y := N_1(F)$ or $Y := N_1(F) + N_2(F)$. We consider Y as being Poisson-distributed conditional on the binary treatment indicator X = x and the length of follow-up F = f, with mean $E(Y \mid X = x, F = f) = f \exp(\beta_0 + \beta x)$. The treatment effect β is estimated by the maximum likelihood approach. Since the Poisson modeling assumption of the variance being equal to the expectation is not plausible here, we relax this assumption by applying a quasi-Poisson regression, also referred to as Poisson regression with correction for overdispersion. With this approach, a dispersion parameter φ is introduced that models the relation
$$\mathrm{Var}(Y \mid X = x, F = f) = \phi\, E(Y \mid X = x, F = f)$$
and thus allows for an increased variability in Y. The dispersion parameter φ is estimated from the data (quasi-Poisson model) instead of being fixed at φ = 1 (Poisson model). Using this estimate, the ML estimate of the standard error is inflated by the factor $\hat{\phi}^{1/2}$. For estimating φ we apply the Pearson statistic (Fahrmeir, Kneib, and Lang 2007) and an alternative estimation method for sparse data (Fletcher 2012).
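In R, this analysis is available through glm with the quasipoisson family, which estimates φ via the Pearson statistic. A minimal sketch with hypothetical column names (y for the event count, f for the follow-up time, trt for treatment):

```r
# Hypothetical per-patient data: 'y' = number of HHFs (or HHFs + CVD) up to
# the end of follow-up, 'f' = follow-up time, 'trt' = treatment indicator.
fit_qp <- glm(y ~ trt + offset(log(f)),
              family = quasipoisson(link = "log"), data = dat)
summary(fit_qp)  # SEs are the Poisson SEs inflated by sqrt(phi-hat),
                 # with phi-hat estimated from the Pearson statistic
```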

Negative Binomial Model (NB)
As another approach to account for overdispersion, we consider the rate of the Poisson distribution for Y to vary by a gamma-distributed multiplicative factor $Z \sim \Gamma(1/\phi, \phi)$ with mean 1 and variance φ. Then, conditional on X = x and F = f but marginalizing out Z, Y is negative binomially distributed with mean $E(Y \mid X = x, F = f) = f \exp(\beta_0 + \beta x)$ and variance $E(Y \mid X = x, F = f)\,\{1 + \phi\, E(Y \mid X = x, F = f)\}$. The regression coefficients and the overdispersion parameter φ are estimated by the maximum likelihood approach, which also provides standard error estimates. As in quasi-Poisson regression, the parameter φ allows for a larger variability in Y, but now under a different modeling assumption.
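A minimal sketch of this analysis uses glm.nb from the MASS package, which parameterizes the variance as $\mu + \mu^2/\theta$, so that its θ corresponds to 1/φ above; the data layout is the same hypothetical one as in the quasi-Poisson sketch:

```r
library(MASS)

# Same hypothetical data layout as for the quasi-Poisson sketch above.
fit_nb <- glm.nb(y ~ trt + offset(log(f)), data = dat)
summary(fit_nb)  # 'theta' reported by glm.nb corresponds to 1/phi
```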

Lin-Wei-Yang-Ying (LWYY) Model
The LWYY model has originally been proposed for modeling recurrent events without a terminal event (Lin et al. 2000). We will apply this approach for both endpoints by modeling the event rate of the process $N_1(t)$ and the event rate of the process $N_1(t) + N_2(t)$, respectively. In both cases, a proportional rates assumption is made,
$$\lambda(t \mid D \ge t, X = x) = \lambda_0(t)\,\exp(\beta x).$$
The regression coefficient β is estimated by root-finding of the partial likelihood score function. Andersen and Gill (1982) proposed a model based on the more restrictive assumption that this proportional form holds for the rate conditional on the full history H(t) of the respective process. Whereas partial maximum likelihood estimation yields the same parameter estimates as for the LWYY method, standard errors should not be estimated by the information matrix in the presence of a frailty term. In this situation, only robust standard errors control the Type I error rate and keep confidence levels, as long as there is no treatment effect on terminal events (Lin et al. 2000). In contrast to the Cox model, the hazards refer to event rate functions conditional only on being alive ($D \ge t$), without a condition on being free of any event ($T_1 \ge t$). As a consequence, the population at risk varies less over time, and events after a first event contribute to the HR estimate. Both will usually affect the treatment effect estimates.
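In R, the LWYY analysis can be obtained from coxph on counting-process data with robust standard errors via cluster(id). A minimal sketch with hypothetical column names:

```r
library(survival)

# Hypothetical counting-process data: one row per at-risk interval
# ('tstart', 'tstop'], 'status' = 1 if an event terminates the interval,
# 'id' = patient identifier, 'trt' = treatment indicator.
fit_lwyy <- coxph(Surv(tstart, tstop, status) ~ trt + cluster(id), data = dat)
summary(fit_lwyy)  # use the 'robust se' column; the naive (information
                   # matrix) SE is not valid in the presence of a frailty
```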

Joint Frailty Model (JFM)
For endpoint HHF+CVD, we focus on methods estimating a single treatment effect only, for example, by considering a composite endpoint. This allows the decision about treatment effects to be based on a single parameter only and circumvents the problem of low event rates, and thus low power, for CVD. For this reason, we do not consider the JFM for the analysis of endpoint HHF+CVD, as only separate effects for HHF and CVD can be estimated. When analyzing the endpoint HHF, we instead also consider the JFM as an analysis method, using its parameter estimate of $\beta_1$ as treatment effect estimate for HHF. Compared to effects estimated by LWYY and Poisson regression, effects estimated by the JFM are defined conditional on Z. Unlike in the NB model, conditional and marginal effects do not coincide in the JFM.
Whereas gamma-distributed frailty terms Z define the data-generating model, we assume log-normally distributed frailty terms within the data analysis for computational reasons.
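As noted in Section 3, fitting the JFM with the function frailtyPenal of the R package frailtypack led to computational problems in our simulations, and a SAS PROC NLMIXED implementation was used instead. For illustration only, a frailtypack-based fit might be sketched as follows; the data layout and the tuning values (number of knots, smoothing parameters) are hypothetical placeholders.

```r
library(frailtypack)

# Hypothetical counting-process data with a terminal event indicator:
# 'tstart'/'tstop' at-risk intervals, 'hhf' = 1 if the interval ends with an
# HHF, 'death' = 1 if follow-up ends with CVD, 'id' and 'trt' as before.
fit_jfm <- frailtyPenal(
  Surv(tstart, tstop, hhf) ~ cluster(id) + trt + terminal(death),
  formula.terminalEvent = ~ trt,
  data = dat, recurrentAG = TRUE,
  n.knots = 8, kappa = c(1e4, 1e4)  # spline baselines; values are placeholders
)                                   # gamma frailty is the package default
```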

Win Ratio
When comparing the disease processes between two subjects, we follow the proposal of Pocock et al. (2011) and label a subject as a winner or loser depending on who had a CVD first. If this cannot be decided from the available data, the labeling depends on which subject experienced the first HHF earlier. When comparing all subjects in the treatment group with all subjects in the control group, the win ratio is estimated as the total number of winners in the treatment group divided by the total number of losers in the treatment group. Standard errors of the win ratio estimate have been derived in Bebu and Lachin (2016) and are implemented in the R package WWR (Luo 2017). While the win ratio is not directly a recurrent event method, it is often proposed as an alternative, and a comparison with the other methods seems of interest.
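To make the pairwise comparison rule concrete, the following is a minimal sketch of the unmatched win ratio point estimate, prioritizing CVD over first HHF within the common follow-up of each pair. All column names are hypothetical; event times are set to Inf if the event was not observed. In practice, the WWR package additionally provides the standard errors of Bebu and Lachin (2016).

```r
# 'trt' and 'ctl' are hypothetical data frames with columns 'fu' (follow-up),
# 'cvd' (time of CVD, Inf if none) and 'hhf1' (time of first HHF, Inf if none).
compare_pair <- function(a, b) {
  tau <- min(a$fu, b$fu)                      # comparable follow-up window
  a_cvd <- ifelse(a$cvd <= tau, a$cvd, Inf)
  b_cvd <- ifelse(b$cvd <= tau, b$cvd, Inf)
  if (a_cvd < b_cvd) return(-1L)              # a had CVD first: a loses
  if (b_cvd < a_cvd) return(+1L)              # b had CVD first: a wins
  a_h <- ifelse(a$hhf1 <= tau, a$hhf1, Inf)
  b_h <- ifelse(b$hhf1 <= tau, b$hhf1, Inf)
  if (a_h < b_h) return(-1L)                  # a had first HHF earlier: a loses
  if (b_h < a_h) return(+1L)
  0L                                          # otherwise a tie
}

win_ratio <- function(trt, ctl) {
  res <- outer(seq_len(nrow(trt)), seq_len(nrow(ctl)),
               Vectorize(function(i, j) compare_pair(trt[i, ], ctl[j, ])))
  sum(res == +1L) / sum(res == -1L)           # winners / losers in trt group
}
```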

Test Decision
For each parametric statistical analysis method, the null hypothesis as defined in Section 2.2 ($H_0^{\mathrm{HHF}}$ or $H_0^{\mathrm{HHF+CVD}}$, depending on the respective endpoint) is considered as being rejected if
$$\hat{\beta} / \mathrm{SE}(\hat{\beta}) < z_\alpha,$$
with $\hat{\beta}$ and $\mathrm{SE}(\hat{\beta})$ estimated by the applied analysis method and $z_\alpha$ the α-quantile of the standard normal distribution. For the win ratio, the null hypothesis is considered as being rejected if $\log(\widehat{\mathrm{WR}}) / \mathrm{SE}(\log \widehat{\mathrm{WR}}) > z_{1-\alpha}$. Note that one-sided test versions are applied, rejecting $H_0$ only if a protective treatment effect is estimated.

Setup of Base Case
For our simulation study, we set up a base case scenario that mimics a typical HFpEF trial where patients may experience the following intercurrent events: CVD, non-CV death (non-CVD), and treatment discontinuation. First, the enrollment time points $s_i$ were assigned equally spaced across 3 years of enrollment. Treatment groups were randomly assigned for each patient according to $X_i \sim \mathrm{Bin}(1, 0.5)$. In case no terminal event such as CVD or non-CVD occurred, subjects were administratively censored at the end of the study by setting the patient's censoring time to $c_i = 5 - s_i$, assuming an overall study duration of 5 years. For the base case scenario, the times from enrollment to CVD and to the next HHF for subject i, i = 1, ..., n, were generated using a JFM as described in Section 2. The inter-event times are exponentially distributed conditional on the gamma-distributed frailties $Z_i = z_i$ (with mean 1 and variance φ), with conditional rates given as
$$\lambda_{\mathrm{HHF}}(t \mid z_i, x_i) = z_i\,\lambda_{0,\mathrm{HHF}}\,\exp(\beta_1 x_i), \qquad \lambda_{\mathrm{CVD}}(t \mid z_i, x_i) = z_i^{\gamma}\,\lambda_{0,\mathrm{CVD}}\,\exp(\beta_2 x_i),$$
where $\lambda_{0,\mathrm{CVD}}$ and $\lambda_{0,\mathrm{HHF}}$ are the respective rates in the placebo group, γ defines the association between the two processes, $x_i$ is the individual treatment indicator ($x_i = 1$ for the active treatment group, $x_i = 0$ for the placebo group), and the hazard ratio $\exp(\beta_2) = \mathrm{HR}$ and risk ratio $\exp(\beta_1) = \mathrm{RR}$ are the treatment effects on CVD and HHF, respectively. As discussed in Section 2, these treatment effects are conditional on $Z_i$. The patient-specific frailties $Z_i$ for the rate of recurrent HHF and $Z_i^{\gamma}$ for the rate of CVD are correlated, but not identical. This is clinically plausible, as HHFs are associated with an increased risk of CVD, which is also influenced by other risk factors such as age and the presence of chronic kidney disease (Setoguchi, Stevenson, and Schneeweiss 2007). For the simulations, we choose γ = 0.75, a value similar to what has been observed when applying the JFM to previous HF trials (Rogers et al. 2016). This means that the frailties and the rates for CVD have smaller variability between patients than the ones for HHF.
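For concreteness, a minimal sketch of this data-generating step for a single patient is given below, omitting non-CV death and treatment discontinuation, which are simulated independently as described later in this section. The function and parameter names are placeholders; the actual parameter values are those given in Table 1.

```r
# Simulate one patient under the base-case JFM: gamma frailty, exponential
# time to CVD, exponential inter-event (gap) times for HHF.
simulate_patient <- function(x, cens, lam0_hhf, lam0_cvd, beta1, beta2,
                             phi, gamma) {
  z  <- rgamma(1, shape = 1 / phi, scale = phi)             # mean 1, var phi
  d  <- rexp(1, rate = z^gamma * lam0_cvd * exp(beta2 * x)) # time to CVD
  fu <- min(d, cens)                                        # follow-up end
  hhf <- numeric(0); t <- 0
  repeat {                                                  # HHF gap times
    t <- t + rexp(1, rate = z * lam0_hhf * exp(beta1 * x))
    if (t >= fu) break
    hhf <- c(hhf, t)
  }
  list(hhf_times = hhf, cvd = (d <= cens), fu = fu)
}
```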
A value of 3.9 CVDs per 100 patient-years was observed in the placebo groups of both the CHARM-Preserved trial and the BNP stratum of the TOPCAT trial (Pitt et al. 2014). Additionally, 9.1 and 8.5 first composite events (HHF or CVD) per 100 patient-years were observed in CHARM-Preserved and the BNP stratum of TOPCAT, respectively. The placebo rates $\lambda_{0,\mathrm{CVD}}$ and $\lambda_{0,\mathrm{HHF}}$ were therefore chosen such that the simulated trials had, on average, observed placebo rates per 100 patient-years of 4.0 for CVD and of 9.0 for the first composite event. The reason for not considering the whole trial population of TOPCAT is that doubts have been raised regarding the reliability of results in Russia and Georgia (Pfeffer et al. 2015). It seemed reasonable to either exclude these countries or restrict to the prespecified BNP stratum, which also had reasonable results. It was also noted that the observed ratio of all composite events (first and recurrent) to first composite events was around 1.8 in several previous HF trials (Anker and McMurray 2012). This ratio depends on the frailty variance φ in the simulation, and a value was chosen such that the average observed ratio in the simulated trials was also 1.8.
To summarize, the simulation parameters $\lambda_{0,\mathrm{CVD}}$, $\lambda_{0,\mathrm{HHF}}$, and φ have been chosen such that the resulting simulated trials have placebo event rates and a ratio of all to first events matching those of reported HF trials. The exact values of the simulation parameters were determined by preceding simulation studies. Their values, as well as those of further parameters for both the base case and the variations described in Section 3.3, are given in Table 1.
For the treatment group, both the treatment effect on recurrent heart failure hospitalizations, $\exp(\beta_1) = \mathrm{RR}$, and on CV death, $\exp(\beta_2) = \mathrm{HR}$, are varied. Namely, we investigated RR = 0.6, 0.7, 0.8, 0.9, 1.0 and HR = 0.6, 0.7, 0.8, 0.9, 1.0, 1.25, while keeping the sample size fixed at N = 4350 patients in total, that is, 2175 patients per arm. This sample size is chosen as it gives approximately 90% power to show a treatment effect for the recurrent composite endpoint with the LWYY model for RR = 0.7 and HR = 0.8, as determined under the data-generating model with the base case parameter values. Time from enrollment to non-CV death was simulated independently according to an exponential distribution, where the rate $\lambda_{\mathrm{nonCVD}}$ was chosen such that the proportion of non-CV deaths among all deaths is around 30%. For the base case, time from enrollment to treatment discontinuation was simulated independently of the joint frailty process for CVD and HHF as exponentially distributed with a rate $\lambda_{\mathrm{trtDisc}}$ such that the proportion of patients discontinuing treatment within 1 year is 5%.
We evaluate the performance of the various statistical methods described in Section 2 based on 10,000 simulated clinical trials in terms of the Type I error rate (one-sided, nominal significance level of 2.5%) and the power. All methods were implemented in R, except for the JFM. The JFM as implemented in the function frailtyPenal of the R package frailtypack (Rondeau, Mazroui, and Gonzalez 2012) was considered, but unfortunately this led to a number of computational problems. Therefore, an implementation in SAS based on PROC NLMIXED was used, similar to the one described in Toenges and Jahn-Eimermacher (2020), with a log-normally distributed frailty for computational reasons. Due to the longer run time, only 1000 simulation runs were considered for the JFM.

Results of Base Case
Figure 1 shows the base case results for the recurrent event endpoint HHF. A first observation is that in all cases there are substantial power gains for the recurrent event methods compared to the time-to-first event analysis. With a neutral effect on CVD (HR = 1.0, left panel), NB and LWYY have roughly the same power for all values of RR, while QP and JFM have a consistently lower power than the other recurrent event methods. For the case of varying HR (right panel), most of the power curves increase as HR approaches 1, so that there is a higher power under a smaller treatment effect on CVD.

With an effective treatment, CVDs are prevented more effectively in the treatment group, including in severely ill patients, who then potentially experience many HHFs. This selection bias is the reason for the increasing power curves and is further discussed, for example, in Aalen, Cook, and Røysland (2015). The only method with an approximately constant power curve is the JFM. By modeling a frailty term that also affects the CVD rate, the JFM seems able to avoid selection bias here, as intended by the model. For the comparison, it should be kept in mind, though, that the JFM is at an advantage, as the data have also been generated by a related JFM. In addition, the JFM targets conditional treatment effects, which were the basis for the simulation, while the other methods target marginal effects.
The results for the recurrent composite endpoint HHF+CVD are displayed in Figure 2. As for HHF, a substantial power gain compared to time-to-first event can be seen for LWYY and NB. In case of HR = 1, LWYY has the highest power for all values of RR, followed by NB, while quasi-Poisson and the win ratio in general do not have a higher power than a time-to-first event analysis. In the situation of a varying effect on CVD, it can be noted that, unlike for HHF, the power curves are now decreasing or approximately stable as HR approaches 1: a protective treatment effect on CVD has a more substantial effect on power than the counteracting selection bias. The win ratio generally has a lower power than the recurrent event methods, roughly comparable to the time-to-first event analysis. By prioritizing CVD in the definition of "winner", it has a higher power than the time-to-first event approach when there is a large treatment effect on CVD, but a lower power in case of no or a small effect on CVD. For HHF+CVD, quasi-Poisson shows a quite low power, lower than all other methods including the time-to-first event analysis. We have investigated this further and found that patients dying early can make an enormous contribution to $\hat{\phi}$ (Wood 2017), which might result in a gross overestimation of φ; for details, we refer to the online supplementary appendix. This also leads to large standard errors for the treatment effect estimate. The estimate according to Fletcher (2012) that we also investigated unfortunately seems to overcorrect and leads to a violation of the Type I error rate (see results in the online supplementary appendix).

The Type I error rates for the base case scenario and both endpoints are given in Table 2. For HHF+CVD, most methods keep the nominal significance level without being too conservative in case of HR = 1.0. The exception is quasi-Poisson, which shows a very conservative Type I error rate, in line with the rather low power seen above. For HR = 1.25, LWYY has a similar Type I error rate as for HR = 1.0, while all other methods have very low Type I error rates.
For HHF, all methods roughly keep the nominal alpha level for HR = 1.0. With a Type I error rate of 0.028, there is a slight increase for the JFM, which could also be due to the lower number of simulation iterations and the resulting higher Monte Carlo error. With a value of 0.015, there is again a conservative Type I error rate for quasi-Poisson, although not as conservative as for the HHF+CVD endpoint. This can be explained by there being fewer patients with a short follow-up time and an event for the HHF endpoint. When the effect on CVD is varied while keeping RR = 1, and thus the effect on HHF neutral, the Type I error rate decreases for a positive effect on CVD (HR < 1) and increases for a detrimental effect on CVD (HR > 1). As discussed above, this can be explained by selection bias. Since the effect of selection bias would continue to grow, the Type I error rate is expected to increase further for values higher than the investigated HR = 1.25. This is the case for all investigated methods, with the JFM potentially being an exception: for the JFM, there is no clearly increasing trend in the Type I error rate with increasing HR. As discussed for power above, the JFM seems able to avoid selection bias here, which is also due to the data-generating model being a related JFM. A consequence of the Type I error rate not being controlled for HR > 1 is that the methods other than the JFM should not be used to test the HHF endpoint unless HR ≤ 1 can be assumed. Note that the power comparison presented above focuses on HR ≤ 1 and thus compares methods in a situation where the Type I error rate is controlled.

Variations
We additionally investigated variations of the base case settings that either also have clinical plausibility for the CHF indication or where the results are of particular interest to better understand the operating characteristics of the methods. Two of these are presented here; further ones can be found in the request for a qualification opinion (EMA 2020). These further scenarios include varying the value of γ and increasing the risk of another event after each HHF.
For the first scenario considered here, treatment discontinuation was assumed to depend on HHF and thus, through the effect on hospitalizations, also indirectly on treatment. It was assumed that treatment is only discontinued directly after a hospitalization event, that is, the higher the number of recurrent events, the more likely treatment is discontinued. This is in contrast to the base case, where treatment discontinuation was simulated independently of HHFs. For the variation, the probability of treatment discontinuation was set to 20% each time an event occurs. After treatment discontinuation, it was assumed that actively treated patients "jump" to the respective rate of the placebo group.
For the second variation of the base case, the overall CV mortality was increased. For the base case scenario, the overall CV mortality in the entire trial is roughly 12.5%, which was increased to 40% for the variation. In this case, the control rate of CVD per 100 patient-years is approximately 19, the control event rate of the first composite event is approximately 20, and the observed ratio of the number of total events to the number of first events is 1.2. Note that this scenario is not clinically plausible for HF, as a higher rate of CVD than HHF cannot be expected. But we were interested in the behavior of the methods under a very high mortality rate to consider how results might generalize to other diseases.
Detailed results for the power and Type I error rate of both variations are given in the online supplementary appendix. For the 20% treatment discontinuation scenario, the power figures show the same shape for the individual methods for both endpoints HHF and HHF+CVD. The difference in power between the recurrent event methods and a time-to-first event analysis is smaller. This is to be expected, as treatment discontinuation after the first hospitalization attenuates the estimated treatment effect for the recurrent event methods. But despite the 20% treatment discontinuation after an HHF, the methods based on recurrent events still have a higher power in most cases, especially NB and JFM for HHF, and NB and LWYY for HHF+CVD. Previous work that investigated treatment discontinuation after the first HHF further also found that recurrent event methods have a higher power as long as the probability of discontinuing was less than 30%. Type I error rates are generally in line with those of the base case scenario.
For the scenario with 40% CV mortality, the power results depend to a much higher degree on the treatment effect for CVD, which is not surprising given the much larger number of CVD events. Power curves are still increasing as HR approaches 1 for HHF and decreasing for HHF+CVD, but the curves are much steeper. Because of the large number of mortality events competing with hospitalizations, there is also generally a low power for HHF, while HHF+CVD shows high power only for a larger treatment effect on CVD. Comparing the recurrent event methods with the time-to-first event analysis, there is less of a difference than in the base case, probably because there are fewer recurrent events. As discussed for the base case, quasi-Poisson is negatively affected by death events, leading to a power close to 0 in most cases in the high mortality scenario. Concerning the Type I error rate, the α level is roughly kept for HR = 1.0 for both endpoints. For HR ≠ 1, the same patterns as for the base case can be seen, but with a more pronounced influence of HR.
At the suggestion of a referee, we additionally investigated the effect of varying the study duration while keeping the number of first events stable. Study duration was varied from 3 to 7 years in both the base case and the 20% treatment discontinuation after HHF scenario. The number of patients was adapted to keep the number of first events stable at the approximately 1030 composite events observed with a 5-year study duration. Enrollment time was kept at roughly 60% of the study duration. Detailed results are available in the online supplementary appendix. The ratio of all events to first events was noted to increase with study duration, as the additional follow-up allows more recurrent events to occur. While power decreases for all methods with longer duration in both scenarios, the recurrent event methods remain at a power advantage across the different study durations. Two factors contribute to the loss in power over time: treatment discontinuations and selection effects. The additional events available to the recurrent event methods do not seem sufficient to compensate for these. With respect to selection effects, Wei et al. (2021) showed that while-alive estimands move toward 1 with longer follow-up time under a JFM and a positive CVD effect. Toenges and Jahn-Eimermacher (2019) made a similar finding for the marginal RR targeted by LWYY. Selection effects can also explain the decreasing power for Cox regression in the treatment discontinuation after HHF scenario.

Published Trial Results
Many of the methods investigated in the simulation study have been applied to heart failure trials (McMurray et al. 2003; Granger et al. 2003; McMurray et al. 2013; Mogensen et al. 2018; Rogers et al. 2014, 2016; Solomon et al. 2019). Tables 3 and 4 show results for recurrent and time-to-first event analyses for the HHF and HHF+CVD endpoints, respectively. The Val-HeFT results are based on the request for a qualification opinion. The only method for which we are not aware of any published result is quasi-Poisson, which is therefore not included in the tables. For most trials, the time-to-first event analysis of the composite HHF+CVD endpoint has been the primary analysis; the only study in the list that used a recurrent event primary endpoint is PARAGON-HF, for which HHF+CVD analyzed with LWYY was primary.
The results for the actual trials confirm the results found in the simulation study insofar as the p-values from the recurrent event methods tend to be smaller than the p-values from the time-to-first event method, indicative of a higher power of recurrent event methods. The difference is especially noteworthy for the CHARM-Preserved and PARAGON-HF trials. For CHARM-Preserved, the Cox regression analysis is (just) not statistically significant at a conventional 5% significance level, while using LWYY or NB for either endpoint, or the JFM for HHF, gives p-values below the formal threshold. For PARAGON-HF, statistical significance was closely missed with the recurrent event methods, but the Cox regression p-values (p = 0.144 for HHF and p = 0.154 for HHF+CVD) are even further from providing formal evidence of efficacy. For the other trials, clearly positive treatment effects are found regardless of which endpoint or method is used. For the win ratio, p-values are generally of similar magnitude as those for Cox regression, consistent with the power seen in the simulation studies.

Bootstrap-Based Power Comparison
In Section 3, the efficiency of analysis methods for recurrent event and time-to-first event endpoints was compared for data simulated from a JFM. In the following, the efficiency of the analysis methods is compared through resampling of clinical trial data, to capture the clinical setting even more closely. For a given sample size m, we calculate the power of an analysis method through the following steps:

1. Draw a bootstrap sample of m patients with replacement from the trial data, retaining each patient's complete event history, follow-up, and treatment assignment.
2. Apply the analysis method to the bootstrap sample and record whether the null hypothesis is rejected.
3. Repeat steps 1 and 2 for b = 1, ..., B bootstrap replications.

The power of the analysis method is given by $\mathrm{Power} = \frac{1}{B} \sum_{b=1}^{B} I(H_0 \text{ rejected in replication } b)$.
For the efficiency comparison, we focus on two clinical trials: the Val-HeFT trial published by Cohn, Tognoni, and the Valsartan Heart Failure Trial Investigators (2001) and the PARADIGM-HF trial published by McMurray et al. (2014). The same methods as in the simulation study are applied to the endpoints HHF and HHF+CVD, except for the JFM, which is not applied for HHF due to the associated computational burden. The number of bootstrap replications is chosen as B = 500,000.
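A minimal sketch of this procedure is given below; trial_data, analyse_method, and the rejection rule are hypothetical placeholders, and in practice each analysis method from Section 2 supplies its own test decision.

```r
# Estimate power by resampling patients with replacement from trial data.
# 'trial_data' is a list of per-patient records (full event history kept);
# 'analyse_method' returns TRUE if the method rejects H0 (one-sided 2.5%).
bootstrap_power <- function(trial_data, analyse_method, m, B = 500000) {
  rejections <- replicate(B, {
    idx <- sample(length(trial_data), size = m, replace = TRUE)
    analyse_method(trial_data[idx])
  })
  mean(rejections)  # proportion of rejections estimates the power
}
```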

Val-HeFT Trial
The Val-HeFT trial was a parallel-group, placebo-controlled, double-blind clinical trial in which 5010 patients suffering from CHF of New York Heart Association (NYHA) class II, III, or IV were randomly assigned to receive either valsartan or placebo in a one-to-one ratio (Cohn, Tognoni, and the Valsartan Heart Failure Trial Investigators 2001). In the Val-HeFT trial, the overall mortality was similar in the two groups. However, the rate of the first occurrence of the composite endpoint, which included death from any cause and HHF in addition to cardiac arrest with resuscitation and intravenous therapy, was reduced by 13%. The results were driven by the effect of treatment on HHF. In Figure 3, the power of the analysis methods is plotted as a function of the total sample size of the bootstrap sample, for the HHF endpoint and the composite endpoint HHF+CVD. The power curves shown in Figure 3 are in alignment with the simulation results for the scenarios with $\exp(\beta_2) = 1$, which corresponds to no treatment effect on CVD, and highlight the increased efficiency of recurrent event methods over time-to-first event methods.

PARADIGM-HF Trial
The PARADIGM-HF trial (ClinicalTrials.gov identifier: NCT01035255) was a multicenter, randomized, double-blind, parallel-group, active-controlled study to evaluate the efficacy and safety of sacubitril-valsartan compared to enalapril on morbidity and mortality in patients with chronic HF and reduced ejection fraction (McMurray et al. 2013). In the PARADIGM-HF study, 8442 patients with chronic HF with reduced left ventricular ejection fraction were randomized one-to-one to either group. The primary endpoint was the time to the first occurrence of the composite endpoint of CVD or HHF. An analysis of recurrent HF hospitalizations was published by Mogensen et al. (2018). In the PARADIGM-HF trial, it was shown that sacubitril-valsartan reduced the rate of the first occurrence of CVD or HHF by 20% compared to enalapril. The effect size was identical for each component of the composite, and the same effect size was observed when including recurrent HHF in the composite. Figure 4 shows the power curves of the analysis methods simulated through bootstrapping data from the PARADIGM-HF trial. It can be seen that the Cox model and the win ratio have a greater power than the recurrent event methods. This can be explained by the fact that the observed rate of drug discontinuation after the first event is about 40% in PARADIGM-HF. It was noted previously, and also in Section 3.3, that an increase in the rate of study drug discontinuation results in a loss of efficiency of recurrent event methods compared to time-to-first event methods. Figure 4 also shows that for the composite endpoint, the power curves of the Cox model and the win ratio are virtually identical. The power of quasi-Poisson is clearly lower than the power of the other methods, since the quasi-Poisson method is conservative. It is worth highlighting that the performance of quasi-Poisson in comparison to LWYY and NB is similar in Figures 3 and 4. Its performance in comparison to the Cox model and the win ratio differs between the settings of Figures 3 and 4, because the Cox model and the win ratio have a higher power in the PARADIGM-HF scenario than in the Val-HeFT scenario. The higher power of both methods in the PARADIGM-HF scenario is due to the treatment effect on CV death present in that trial.

Discussion
In this article, we compared the statistical efficiency of recurrent event methods and time-to-first event methods in clinical trials in CHF. We especially investigated how treatment differences in the risk of the terminal event CVD affect power and Type I error rate of the methods. The comparison was based on a simulation study and a bootstrap analysis of actual clinical trial data. We found that the use of recurrent event methods can in many situations considerably increase power compared to a time-to-first event analysis. But there are also situations where this is not the case, as found for the HHF endpoint in the bootstrap analysis of the PARADIGM-HF trial. Here, the smaller power of recurrent event methods compared to a Cox model for the time-to-first event seemed to be due to the many treatment discontinuations after the first event, as also described previously.

For the HHF endpoint and scenarios where there was a detrimental treatment effect on CVD, the Type I error rates were inflated for all methods except the JFM (Table 2, HR = 1.25). As discussed, this Type I error rate increase is related to selection bias. These methods should therefore not be used to test for an effect on HHF unless it is reasonable to assume that there is no or only a very small treatment effect on CVD. If a positive treatment effect on CVD is plausible, then the applied methods could also be used to test for a treatment effect on HHF, but would be conservative, since Type I error rates were below the nominal α level for HR < 1.0 in Table 2. For the composite endpoint HHF+CVD, we did not see an increase in the Type I error rate for a detrimental effect on CVD in the investigated scenarios. Including CVD as an event could prevent Type I error rate inflation by counteracting the selection bias. We have found that this Type I error rate control seems to hold for parameter settings that are realistic for CHF trials, but it should be noted that situations can be constructed where the Type I error rate is no longer controlled, for example, when the frailty variance is very large. Supporting results for this observation are given by Wei et al. (2021). In settings of JFMs that are realistic for CHF trials, for example, follow-up less than 5 years and a frailty variance θ < 8, Wei et al. (2021) observed that composite while-alive estimands, which are targeted by the quasi-Poisson model and in some situations by the LWYY model for endpoint HHF+CVD, move toward 1 (null effect) with increasing HR. This result supports our findings of decreasing Type I error rate and power with increasing HR.

This article focused on the efficiency aspect of the claim in the qualification opinion, while Schmidli et al. (2021) and Wei et al. (2021) discussed the clinical interpretability and properties of recurrent event estimands within the estimand framework. As mentioned, the quasi-Poisson model was found to target while-alive estimands and thus has a meaningful and causal interpretation. However, since our investigation found it to be very conservative for CHF trials, use of the quasi-Poisson model cannot be recommended from an efficiency point of view. An alternative, which was not investigated as part of this article, would be a Poisson regression with bootstrap confidence limits. The LWYY model targets the same estimand, at least in the situation of a constant conditional rate, which is a plausible assumption for CHF trials. Additionally, it showed high power in the investigated scenarios.
It could therefore be a reasonable alternative for endpoint HHF+CVD in CHF trials, although its behavior in the situation of a nonconstant conditional rate is still to be clarified. The NB model was found by Wei et al. (2021) to address an estimand between the exposure-weighted and patient-weighted composite while-alive estimands. The interpretation of this estimand might thus be more complicated, but it is nevertheless meaningful. With respect to efficiency, the NB model also showed a high power, which was more dependent on the treatment effect for CVD than that of LWYY. Together, this makes the NB model also a reasonable method for HHF+CVD. For endpoint HHF, both the LWYY model and the NB model should only be applied under the caveats mentioned above regarding the treatment effect on CVD. The win ratio was found by Schmidli et al. (2021) to address a causal estimand, but in our investigation it had a high power only when there was a large positive treatment effect on CVD. Since treatment effects on HHF would generally be expected to be larger than those on CVD in CHF, the win ratio does not seem to be the method of choice in terms of efficiency. Unlike the other methods, the JFM reasonably controlled the Type I error rate for endpoint HHF even in the case of a detrimental effect on CVD, unless there were many CVD events, as in the 40% CV mortality variation scenario. The JFM also showed high power, with the caveat that the data-generating model was also a JFM, which gives the model an advantage, at least in terms of Type I error rate control. While Schmidli et al. (2021) found that the treatment effect parameters of the JFM have no obvious causal interpretation, they suggested that a JFM of the potential outcomes could be used for statistical inference on causal estimands. Further research on the details of implementing this approach seems worthwhile.
In choosing a primary estimand and a corresponding analysis method for a CHF trial or in a related indication, a factor to be considered is treatment discontinuation after the first event. If the rate of discontinuation is expected to be high, a time-to-first event approach may be the better option. If inclusion of recurrent events seems worthwhile, then the above considerations may be used as guidance in the choice of estimand and analysis method. For both time-to-first as well as recurrent event analyses, in the presence of treatment discontinuation and patient heterogeneity it would be beneficial to obtain the same number of events during a shorter follow-up time. As highlighted in the qualification opinion, a further important consideration for planning a CHF trial is that the sample size should be large enough to provide sufficient information on mortality. That may mean that the trial should be able to at least rule out a detrimental mortality effect of a certain size, either on CV or overall mortality. It would need to be investigated whether this would lead to an increase in sample size or whether the sample size planned for the primary estimand is already adequate to cover this additional objective.
Other methods commonly applied for recurrent event analysis are the PWP model proposed by Prentice, Williams, and Peterson (1981) and the WLW model by Wei, Lin, and Weissfeld (1989). These have been investigated in the request for a qualification opinion. As neither method was found to have an easily understandable interpretation or an advantage in power, they have not been included in this article. The comparison of methods presented here was otherwise not exhaustive; further methods have been proposed for recurrent event analysis, also for recurrent events in the presence of a terminal event, for example, the method by Mao and Lin (2016).
In further research, methods for analyzing the HHF endpoint that are not affected by selection bias should be studied. The qualification opinion discusses methods that address the patient-weighted while-alive estimand. Initial investigations using a method-of-moments approach gave estimates with a high variance. Statistically efficient methods that address this estimand would also be valuable.

Supplementary Materials
The supplementary material provides further details and simulation results for the quasi-Poisson model with Fletcher's method as well as detailed results for the simulation study variations described in Section 3.3.