Simultaneous inference of a binary composite endpoint and its components

ABSTRACT Binary composite endpoints offer some advantages as a way to succinctly combine evidence from a number of related binary endpoints recorded in the same clinical trial into a single outcome. However, as concerns about the clinical relevance as well as the interpretation of such composite endpoints have been raised, it is recommended to evaluate the composite endpoint jointly with its components. We propose an approach for carrying out simultaneous inference based on separate model fits for each endpoint, yet controlling the familywise type I error rate (FWER) asymptotically. The key idea is to stack parameter estimates from the different fits and derive their joint asymptotic distribution. Simulations show that the proposed approach attains levels closer to the nominal FWER and comparable or higher power than existing approaches, even for moderate sample sizes (around 100-200 observations). The method is compared to the gatekeeping approach, with results provided in the Supplementary Material. In two data examples we show how the procedure may be adapted to handle local significance levels specified through a priori given weights.


Introduction
For a number of reasons, it may be convenient to use composite binary endpoints (i.e., the maximum of a number of binary endpoints) in clinical trials: Such endpoints may allow a parsimonious summary of treatment effects and may also achieve increased statistical efficiency compared to analyses of individual endpoints, in particular if some endpoints are associated with low probabilities of observing the event (Ross, 2007).
Unfortunately, the usage of binary composite endpoints also entails some shortcomings, which have been identified and discussed by a number of authors recently, e.g., Cordoba et al. (2010) and Tomlinson and Detsky (2010). A key problem is how to interpret the composite endpoint clinically, in particular in view of constituent endpoints with incomparable contributions due to varying degrees of relevance of the biological effects captured by these individual endpoints. Subsequent interpretation of results is also complicated as incomparable contributions are not easily disentangled from the composite endpoint (Lim et al., 2008; Mascha and Sessler, 2011). In short, clinical relevance and interpretation may be difficult for composite endpoints (Ross, 2007).
Hence both the composite endpoint and its constituents should be analyzed to ensure a satisfactory clinical evaluation (Turk et al., 2008). A confirmatory statistical analysis, which requires simultaneous evaluation of the composite endpoint and its constituents, is therefore strongly recommended (Huque et al., 2011; Chi, 2005).
Simultaneous confirmatory analysis involves some kind of multiplicity adjustment of the significance levels or p-values considered. Bonferroni adjustment is readily available in this setting, but due to the expected high correlation between the composite endpoint and its components it easily becomes much too conservative. A few approaches for simultaneous inference of binary composite endpoints have been proposed in the literature: Pogue et al. (2010) suggested using generalized estimating equations (GEE) or logistic mixed-effects models to carry out a heterogeneity test evaluating how much the treatment effects differ across endpoints relative to a common effect. Mascha and Sessler (2011) suggested a similar multivariate approach using GEE. A different approach is taken by Rauch and Kieser (2012), who derive a multiplicity adjustment by approximating the correlations between test statistics by means of the (raw) observations and the parameter estimates, subsequently exploiting the asymptotic joint normality of the test statistics.
We present a novel approach for simultaneous assessment of a binary composite endpoint and its component endpoints. Specifically, we consider separate models for each binary endpoint but derive correlations between test statistics through an asymptotic argument that allows us to combine the individual model fits, exploiting the idea by Pipper et al. (2012).
Section 2 contains the theoretical background of the proposed method. In Section 3, the proposed approach is evaluated through an extensive simulation study where its performance is compared to standard Bonferroni adjustment and to the approach proposed by Rauch and Kieser (2012). We also compare the approaches in two data examples. In Section 4, we discuss the advantages and limitations of the proposed approach and also outline some possible extensions.

Methods
We consider a study where J binary endpoints were observed for n subjects: $Y^{(j)}_1, \ldots, Y^{(j)}_n$ for endpoint $j = 1, \ldots, J$. These endpoints were combined into a single composite endpoint, which is considered to be the primary outcome. The original binary outcomes will be treated as secondary outcomes, e.g., Turk et al. (2008). More specifically, we define the composite binary endpoint $Y^{(J+1)}$ as the following maximum: $Y^{(J+1)}_i = \max_{j=1,\ldots,J}\{Y^{(j)}_i\}$, $i = 1, \ldots, n$. By definition the composite endpoint $Y^{(J+1)}$ will also be a binary endpoint (Quan et al., 2007).
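In code, forming the composite endpoint amounts to a row-wise maximum over the component endpoints. The following Python sketch (variable names are ours, purely for illustration) shows this:

```python
import numpy as np

# Each row is a subject, each column one of the J component endpoints Y(1), ..., Y(J).
Y = np.array([[0, 1],
              [0, 0],
              [1, 1],
              [1, 0]])

# Composite endpoint Y(J+1): 1 if the subject experienced at least one component event.
Y_composite = Y.max(axis=1)
print(Y_composite)  # [1 0 1 1]
```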

A novel approach
We will assume that observations for each endpoint may be described by a generalized linear model for binary data (arbitrary link); a logistic regression model would be a common choice. So we assume that $Y^{(j)}_i$ is a binary random variable with probability $\pi^{(j)}_i$ specified through a generalized linear model with link function g:

$$g\big(\pi^{(j)}_i\big) = \sum_{k=1}^{p_j} \beta^{(j)}_k x^{(j)}_{ik} \qquad (1)$$

(McCullagh and Nelder, 1989, Chapter 4). This general model specification allows for inclusion of a treatment factor while at the same time additional covariates may be incorporated.
We assume that $\beta^{(j)}_{k_j}$ denotes the parameter of interest for the jth endpoint ($k_j \in \{1,\ldots,p_j\}$), corresponding to the null hypotheses:

$$H^{(j)}_0: \beta^{(j)}_{k_j} = \beta^{(j)}_{k_j,0}, \qquad j = 1,\ldots,J+1. \qquad (2)$$

The intersection of the null hypotheses in equation (2) is referred to as the global null hypothesis. We denote by $\hat\beta$ the maximum likelihood estimator of the unknown stacked parameter vector $\beta = (\beta^{(1)},\ldots,\beta^{(J+1)})$; each subject contributes an influence term $\Psi_i$, a random variable with finite variance. The multivariate central limit theorem ensures weak convergence $\sqrt{n}(\hat\beta - \beta_0) \to N(0, \Sigma)$ as $n \to \infty$. The variance-covariance matrix $\Sigma$ is defined as the limit in probability of $n^{-1}\sum_{i=1}^{n} \Psi_i \Psi_i^T$ and is estimated by $n^{-1}\sum_{i=1}^{n} \hat\Psi_i \hat\Psi_i^T$, where $\hat\Psi_i$ is the empirical counterpart of $\Psi_i$ with parameter estimates inserted. For additional details, we refer to Pipper et al. (2012).
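The stacking step can be sketched in code. The snippet below is our own Python illustration (the authors' implementation uses R and the multcomp package): it fits a separate logistic regression per endpoint by Newton-Raphson and assembles an estimate of the joint covariance of the stacked parameter estimates from the per-subject influence contributions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Newton-Raphson MLE for a logistic regression.

    Returns the estimate and the per-subject influence terms
    H^{-1} x_i (y_i - p_i), where H is the observed information."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])        # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = X.T @ (X * (p * (1 - p))[:, None])
    infl = (X * (y - p)[:, None]) @ np.linalg.inv(H)   # n x p influence matrix
    return beta, infl

def stacked_covariance(influences):
    """Covariance estimate of the stacked estimates: sum over subjects of
    outer products of the horizontally stacked influence terms."""
    Psi = np.hstack(influences)                        # n x (p_1 + ... + p_{J+1})
    return Psi.T @ Psi
```

Fitting the component endpoints and the composite on the same subjects and calling `stacked_covariance` on the resulting influence matrices yields the full covariance matrix, from which the correlations between the Wald statistics are read off.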
The null hypotheses in equation (2) are evaluated in the usual way by means of Wald test statistics $T_1, \ldots, T_{J+1}$, defined as the estimate divided by its standard error. Each of the corresponding (marginal) p-values, which are obtained with reference to a standard normal distribution, may then be referred to a global multiplicity adjusted significance level $\alpha_{\mathrm{adj}}$. Consequently, the global null hypothesis is rejected if and only if (at least) one $|T_j|$ exceeds the $(1 - \alpha_{\mathrm{adj}}/2)$ percentile of the standard normal distribution (Dickhaus, 2014). The global multiplicity adjusted significance level may be deduced from the asymptotic multivariate normal distribution of $T = (T_1, \ldots, T_{J+1})$ (Hothorn et al., 2008; Hasler and Hothorn, 2011) in such a way that control at a given nominal FWER $\alpha_{\mathrm{nominal}}$, often equal to 0.05, is ensured.
In some applications it can, however, be appropriate to prioritize certain hypotheses. Such a hierarchy among hypotheses can be induced by means of a vector of weights, $w = (w_1, \ldots, w_{J+1})$ with $\sum_{j=1}^{J+1} w_j = 1$, where higher weights are assigned to more important hypotheses. In this case the global adjusted significance level $\alpha_{\mathrm{adj}}$ would be replaced by local (i.e., hypothesis-specific) adjusted significance levels defined as $\alpha^{(j)}_{w,\mathrm{adj}} = w_j \alpha_{w,\mathrm{adj}}$, $j = 1, \ldots, J+1$. The reference significance level $\alpha_{w,\mathrm{adj}}$ will naturally depend on the specified prioritization w and is chosen in order for the $\alpha^{(j)}_{w,\mathrm{adj}}$ to simultaneously control the FWER. Here one would compare the marginal p-value corresponding to hypothesis $H^{(j)}$ with the value of the respective weight-specific local adjusted significance level $\alpha^{(j)}_{w,\mathrm{adj}}$. Consequently, the global null hypothesis would be rejected if and only if (at least) one $|T_j|$ exceeds the $(1 - \alpha^{(j)}_{w,\mathrm{adj}}/2)$ percentile of the standard normal distribution.
This idea was already proposed by Westfall and Soper (1998) and it may be extended to the correlated endpoints considered in the present framework as follows: For a given weight vector w and variance-covariance matrix $\Sigma \in \mathbb{R}^{(J+1)\times(J+1)}$, we define the function $f_{w,\Sigma}: [0,1] \to [0,1]$ of tail probabilities in a multivariate normal distribution as:

$$f_{w,\Sigma}(\alpha) = 1 - \int_{-z_{1-w_1\alpha/2}}^{z_{1-w_1\alpha/2}} \cdots \int_{-z_{1-w_{J+1}\alpha/2}}^{z_{1-w_{J+1}\alpha/2}} \phi(s; 0, \Sigma)\, ds.$$

Here $z_{1-w_j\alpha/2}$ denotes the $(1 - w_j\alpha/2)$ percentile of the standard normal distribution and $\phi(\cdot; 0, \Sigma)$ is the density of the (J + 1)-dimensional normal distribution with mean 0 and variance-covariance matrix $\Sigma$. Given a prioritization represented by w and an estimated variance-covariance matrix $\hat\Sigma$ of the vector T of test statistics, the function $f_{w,\hat\Sigma}$ may then be used to derive the local adjusted weight-specific significance levels $\alpha^{(j)}_{w,\mathrm{adj}}$ from a prescribed FWER $\alpha_{\mathrm{nominal}}$ as follows. As f is invertible, the reference level $\alpha_{w,\mathrm{adj}}$ may be calculated for any specified FWER $\alpha_{\mathrm{nominal}}$ by:

$$\alpha_{w,\mathrm{adj}} = f^{-1}_{w,\hat\Sigma}(\alpha_{\mathrm{nominal}}). \qquad (5)$$

Subsequently, the local multiplicity adjusted significance levels are obtained as $\alpha^{(j)}_{w,\mathrm{adj}} = w_j \alpha_{w,\mathrm{adj}}$. If the test statistics are independent, their variance-covariance matrix is given by the identity matrix I. With equal weighting $w_j = \frac{1}{J+1}$ for $j = 1, \ldots, J+1$ (which corresponds to no prioritization), the above equation would yield $\alpha^{(j)}_{w,\mathrm{adj}} = 1 - (1 - \alpha_{\mathrm{nominal}})^{1/(J+1)}$, implying that the local adjusted significance levels $\alpha^{(j)}_{w,\mathrm{adj}}$ simply recover the Slepian correction (Pipper et al., 2012). In practice equation (5) may be solved by means of a modified bisection method. The method is implemented using R (R Core Team, 2014) and the extension package multcomp (Hothorn et al., 2008); example code can be found in Appendix A.
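As a sketch of how equation (5) can be solved numerically, the Python code below is our own illustration (the paper's implementation uses R with multcomp; a numerical multivariate normal integrator would normally replace the Monte Carlo approximation used here). A single Monte Carlo sample from N(0, Σ) is drawn up front so that the bisection operates on a deterministic function of α:

```python
import numpy as np
from statistics import NormalDist

def local_levels(w, Sigma, alpha_nominal=0.05, n_mc=200_000, seed=1):
    """Find the reference level alpha_{w,adj} solving f_{w,Sigma}(alpha) = alpha_nominal
    by bisection, and return the local levels w_j * alpha_{w,adj}."""
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(w)), Sigma, size=n_mc)  # fixed MC sample

    def fwer(alpha):
        # critical values z_{1 - w_j * alpha / 2}, one per hypothesis
        crit = np.array([NormalDist().inv_cdf(1 - wj * alpha / 2) for wj in w])
        # estimated probability that at least one |Z_j| exceeds its critical value
        return ((np.abs(Z) > crit).any(axis=1)).mean()

    lo, hi = alpha_nominal, 1.0  # fwer(alpha) <= alpha, so the root lies above alpha_nominal
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if fwer(mid) < alpha_nominal:
            lo = mid
        else:
            hi = mid
    return w * 0.5 * (lo + hi)
```

With equal weights and an identity covariance, the local levels reduce to the Šidák-type level $1 - (1 - \alpha_{\mathrm{nominal}})^{1/(J+1)}$ (about 0.017 for three hypotheses at $\alpha_{\mathrm{nominal}} = 0.05$); positive correlation yields larger local levels.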

Results
A simulation study was performed to explore and compare the performance of the proposed method, the standard Bonferroni adjustment and the suggested approach by Rauch and Kieser (2012).
Within a logistic regression setting, we considered binary endpoints $Y^{(1)}, Y^{(2)}$ and their composite defined as their maximum, i.e., $Y^{(3)} = \max\{Y^{(1)}, Y^{(2)}\}$. We were interested in the simultaneous assessment of the null hypotheses:

$$H^{(j)}_0: \pi^{(j)}_C = \pi^{(j)}_T \quad \text{versus} \quad H^{(j)}_1: \pi^{(j)}_C \neq \pi^{(j)}_T, \qquad j = 1, 2, 3.$$

Here $\pi^{(j)}_C$ and $\pi^{(j)}_T$ denote the event probability for endpoint $j \in \{1, 2, 3\}$ in the control and treatment group, respectively. Translated into the framework introduced in Section 2, this setting corresponds to choosing g as the logit function and $J = p_1 = p_2 = 2$. The first covariate is constant, $x^{(j)}_{i1} = 1$, and represents the intercept $\beta^{(j)}_1$, while the second covariate is the treatment indicator. The parameter of interest is $\beta^{(j)}_2$ and its value under the null hypothesis $H^{(j)}_0$ of no treatment effect is $\beta^{(j)}_{2,0} = 0$. As indicated in the previous section, Wald test statistics may be applied to investigate the hypotheses of interest. Note, however, that in this specific setting (inclusion of only a treatment factor without additional covariates), it is possible and advantageous to apply the generally more powerful score test for hypothesis assessment.
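In this two-group setting without further covariates, the score test reduces to the familiar pooled z-test for comparing two proportions. A minimal Python sketch (the function name is ours):

```python
import math

def score_test_two_proportions(events_t, n_t, events_c, n_c):
    """Pooled (score) z-statistic for H0: pi_T = pi_C.
    Its square equals the usual Pearson chi-square statistic for a 2x2 table."""
    p_t, p_c = events_t / n_t, events_c / n_c
    p_pool = (events_t + events_c) / (n_t + n_c)              # variance evaluated under H0
    se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se0

z = score_test_two_proportions(30, 100, 45, 100)
print(round(z, 3))  # about -2.191
```

The Wald version would instead plug the unpooled estimates into the standard error, which is what makes the score test slightly more powerful here.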

Settings for the simulation study
The global two-sided significance level was set to α = 0.05. Estimation of FWER and power was each based on 10,000 simulated data sets. Each data set contained n independent observations of the two endpoints $Y^{(1)}, Y^{(2)}$ for the control group and an additional n independent observations for the treatment group, such that the total sample size was 2n. The group sample size n took the values 50, 100, 200 and 500, and the component endpoints were correlated with predefined correlation ρ (correlations were equal in the control and treatment group).
For the simultaneous confirmatory assessment of the treatment effect in the composite and its constituents in the simulation study, we considered unequally weighted hypotheses: More weight was assigned to the null hypothesis $H^{(3)}_0$ related to the composite endpoint by choosing $w_3 = 0.5$, whereas its constituents $Y^{(1)}$ and $Y^{(2)}$ were weighted according to $w_1 = w_2 = 0.25$. As indicated above, this specific setting facilitates the use of the score test for hypothesis assessment, which results in a slight gain of power as compared to the Wald test; the simulation results are therefore based on the score test.
As mentioned in Appendix B, the correlation ρ between the components $Y^{(1)}, Y^{(2)}$ is given by:

$$\rho = \frac{\pi^{(1)\cap(2)} - \pi^{(1)}\pi^{(2)}}{\sqrt{\pi^{(1)}\big(1-\pi^{(1)}\big)\,\pi^{(2)}\big(1-\pi^{(2)}\big)}}, \qquad (7)$$

where $\pi^{(1)\cap(2)} = P\big(Y^{(1)} = Y^{(2)} = 1\big)$ denotes the overlap rate of the components. In Rauch and Kieser (2012), the authors approximate ρ by assuming a zero overlap rate, thereby introducing a bias into their estimation procedure. For the sake of completeness our simulation study includes both their proposed approximate approach with hypothesized zero overlap rate (in the sequel denoted by RK1) and their approach without this approximation (henceforth denoted by RK2). This will en passant quantify the effect the approximation has on the performance of their method.
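Formula (7) also indicates how correlated component endpoints can be generated in a simulation: solving for the overlap fixes the four cell probabilities of the joint distribution of $(Y^{(1)}, Y^{(2)})$. A Python sketch (our illustration; it assumes the requested ρ is compatible with the marginals so that all cell probabilities are non-negative):

```python
import math
import random

def sample_correlated_binary(pi1, pi2, rho, n, seed=0):
    """Draw n pairs (Y1, Y2) with marginals pi1, pi2 and correlation rho,
    using the overlap probability pi12 = P(Y1 = 1, Y2 = 1) implied by formula (7)."""
    pi12 = rho * math.sqrt(pi1 * (1 - pi1) * pi2 * (1 - pi2)) + pi1 * pi2
    cells = [(1, 1), (1, 0), (0, 1), (0, 0)]
    probs = [pi12, pi1 - pi12, pi2 - pi12, 1 - pi1 - pi2 + pi12]
    assert all(p >= 0 for p in probs), "rho incompatible with the marginals"
    rng = random.Random(seed)
    return rng.choices(cells, weights=probs, k=n)
```

The assertion makes the restriction on admissible ρ values, noted in the settings below, explicit: not every correlation is attainable for given marginal probabilities.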

Specific settings for FWER estimation
The event probability $\pi^{(1)}_C$ for the first endpoint in the control group was chosen among {0.1, 0.4}. We assumed the second endpoint to correspond to the less severe and more frequent event; therefore, its event probability under control treatment was set to $\pi^{(2)}_C = \pi^{(1)}_C + 0.1$. Given fixed values for $\pi^{(1)}_C$ and $\pi^{(2)}_C$, formula (7) and the fact that the overlap $\pi^{(1)\cap(2)}_C$ lies within the interval [0,1] naturally yield restrictions on the possible values of the components' correlation ρ (confer Appendix B). Choices of ρ in line with the above settings were {0.2, 0.4, 0.6}.

Specific settings for power estimation
The baseline value for the success probability of the first endpoint was set to $\pi^{(1)}_C = 0.4$ and, as above, $\pi^{(2)}_C$ was defined by $\pi^{(2)}_C = \pi^{(1)}_C + 0.1$. It was assumed that treatment causes a reduction in the incidence of events, the reduction being stronger in the less severe (and more frequent) endpoint.

For the component endpoints and the composite, the proposed method and RK2 show comparable performance for detecting a treatment effect across all parameter constellations. The simplified method RK1 loses efficiency by ignoring the present overlap and in particular drops power for high endpoint correlations. However, all three approaches show an improvement as compared to standard Bonferroni adjustment, and the gain in power increases for high correlations as well as for low treatment effects. For instance, for a correlation of 0.8 and a treatment effect of r = 0.05 (in the first component endpoint), and in a regime in which the asymptotics can be assumed to hold (n = 200, 500), the proposed method achieves a gain in power of more than 50% (corresponding to an absolute gain of 0.027) for the composite endpoint, as compared to standard Bonferroni correction. In the same setting, but with weakly correlated endpoints, there is still a gain of 12-16% in the composite endpoint.
Using the proposed approach, we considered simultaneous assessment of the treatment effect based on the composite endpoint and its two constituents (assuming a FWER of α = 0.05). The results of the simultaneous analysis are shown in Table 5: The second column displays the estimated odds ratios (OR), followed by the respective multiplicity adjusted 95% confidence intervals for both equally and unequally weighted endpoints. The subsequent column provides the unadjusted (marginal) p-values. The local significance levels for the proposed method and for the Bonferroni correction are presented in the right-most columns. Here two sets of weights have been applied, $w_1 = w_2 = w_3 = 1/3$ (equal weights) and $w_1 = w_2 = 1/4$ and $w_3 = 1/2$ (unequal weights with double weight on the composite endpoint). Observe how prioritizing the composite (by assigning the weight $w_3 = 1/2$) leads to a shortening of its OR confidence interval.

Table 4. Overview of the chosen parameter constellations and the (thereby implicitly defined) overlap rates.

Table 5. Simultaneous assessment of the composite endpoint and its constituents. The second column contains the estimated odds ratios, followed by the corresponding multiplicity adjusted 95% confidence intervals for the proposed method and both weightings. The subsequent column shows the raw p-values (marginal models); the adjusted local significance levels $\alpha^{(j)}_{w,\mathrm{adj}} = w_j \alpha_{w,\mathrm{adj}}$ for the proposed method and the Bonferroni correction are given in the right-most columns.

Related to the previous example is the randomized controlled trial conducted by Solomon et al. (2007). The study targeted the assessment of three different educational strategies for osteoporosis management. Primary care physicians and their patients were randomly allocated to a control group (Treatment A, usual care) or one of three intervention groups: patient intervention only (Treatment B), physician intervention only (Treatment C), or both interventions (Treatment D).
Intervention comprised thorough education in osteoporosis diagnosis, prevention, BMD testing and treatment. The primary endpoint $Y^{(3)}$ was defined by means of the secondary endpoints $Y^{(1)}$ (undergoing a BMD test) and $Y^{(2)}$ (obtaining osteoporosis medication) in the same manner as in the first example. The empirical Pearson correlation between the component endpoints was $\hat\rho = 0.4$. To investigate which type of intervention has a significant effect on osteoporosis management, many-to-one comparisons (Dunnett contrasts) to the control group (usual care) were calculated. The assessment of the intervention effect is conducted simultaneously for the composite endpoint and its components.
For the analysis, both equal α allocation and weighted allocation among the three endpoints were considered. For the latter, the two individual endpoints were assigned the weights $w_1 = w_2 = 1/4$ while double weight was allocated to the composite, $w_3 = 1/2$. Within a single endpoint, all comparisons to the control group were weighted equally, resulting in a weight of 1/12 per comparison for the first and second endpoint and 1/6 per comparison for the composite. Table 6 shows the results of the simultaneous analysis. The estimated odds ratios with multiplicity adjusted 95% confidence intervals based on the proposed approach, for both equally and unequally weighted endpoints, are given in the first columns. Unadjusted p-values from the marginal models for each comparison to the control are listed in the subsequent column, while the last four columns display the local significance levels based on equal and unequal weights, both for the proposed approach and for Bonferroni adjustment. Again observe that the chosen weighting of endpoints leads to shorter confidence intervals for all many-to-one comparisons of the composite.

Discussion
We have proposed a flexible and versatile approach for simultaneous inference of a binary composite endpoint and its individual binary components. The proposed method controls the FWER asymptotically and shows a marked increase in power for strongly correlated components as compared to standard Bonferroni adjustment and the approach proposed by Rauch and Kieser (2012), which is based on a related maximum test. The gain in efficiency is most pronounced for small sample sizes and small treatment effects. Moreover, the proposed approach gave results similar to those obtained with a common gatekeeping procedure, in which the test based on the composite outcome serves as the gatekeeper deciding whether or not to proceed with testing the individual outcomes (Dmitrienko et al., 2010). A comparison of the estimated power for the composite endpoint and its constituents, for the gatekeeping approach and the proposed method, can be made on the basis of the simulation results shown in Supplementary Table 3 and Supplementary Table 4. A further interesting extension of our simulation study would be a comparison to exact, finite sample size methods, such as a permutation-based global gatekeeping approach.
The advantages of our method are (i) its simplicity, in particular that an explicit formulation of the correlation matrix is not needed, (ii) the availability of convenient software through the function mmm() in the R package multcomp (Hothorn et al., 2008), (iii) flexibility by providing both multiplicity-adjusted p-values and simultaneous confidence intervals, (iv) full flexibility for the individual models (per endpoint) to include different covariates next to the treatment variable of interest (e.g., adjustment for baseline imbalance) and (v) simultaneous assessment of differently scaled endpoints (e.g., proportions and continuous measurements). Moreover, the overall significance level may be split according to pre-specified weights following the approach by Westfall and Soper (1998).
It is a limitation that the proposed method relies on asymptotic properties; in particular, strong FWER control is only guaranteed asymptotically and may not hold for small sample sizes (Pipper et al., 2012). However, the simulation results indicate that the method may in practice still be useful for quite small sample sizes, and for specific applications it may be advisable to explore the possible inflation of the FWER due to finite sample sizes via simulations. Although the proposed method does not involve any a priori assumptions or specifications of the covariance structure, using it for sample size calculations will require a priori specification of correlations. This may be a limitation in practice.
The proposed methodology may also be extended to composite endpoints used in cardiovascular trials and increasingly seen in studies on rheumatoid arthritis, asthma and diabetes (Unnikrishnan et al., 2013). Cardiovascular death, nonfatal myocardial infarction (MI) and nonfatal stroke are widely accepted primary endpoints in cardiovascular studies. While the outcome death typically has a low incidence, nonfatal components such as nonfatal stroke and nonfatal MI have substantially higher event rates. Therefore, using the time to first event of death, MI, and stroke as a composite endpoint results in an increased event rate (Yusuf et al., 2000). Generally, we may consider composite endpoints of the form $\min_{j=1,\ldots,J}\{T_j\}$, where $T_1, \ldots, T_J$ denote the individual event times (which of course may be subject to censoring). The proposed method extends in a straightforward way to such endpoints.

Table 6. Simultaneous assessment of the composite endpoint and its components. The first columns of numbers provide the estimated odds ratios with multiplicity adjusted 95% confidence intervals for the proposed method (for both equally and unequally weighted endpoints). The subsequent column contains the raw p-values. The adjusted local significance levels $\alpha^{(j)}_{w,\mathrm{adj}} = w_j \alpha_{w,\mathrm{adj}}$ for the proposed method and the Bonferroni correction, for equally and unequally weighted endpoints, are shown as well.
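As a minimal illustration of the time-to-first-event extension (our sketch, not part of the paper's implementation), the composite observation per subject is simply the smallest event time subject to right censoring:

```python
def composite_observation(event_times, censor_time):
    """Time-to-first-event composite under right censoring.
    Returns the observed time min(min_j T_j, C) and the event indicator."""
    t_first = min(event_times)              # composite event time min_j T_j
    observed = min(t_first, censor_time)
    event = int(t_first <= censor_time)     # 1 = composite event observed, 0 = censored
    return observed, event

print(composite_observation([5.2, 2.1, 8.0], 3.0))  # (2.1, 1): first event before censoring
print(composite_observation([5.2, 4.4], 3.0))       # (3.0, 0): censored before any event
```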