A framework to assess the value of subgroup analyses when the overall treatment effect is significant

ABSTRACT Although subgroup analysis has been developed and widely used for many years, it is still not clear whether and how we should perform subgroup analyses when the overall treatment effect is significant. In this paper, we develop a framework to assess and compute the long-term impact of different strategies for performing subgroup analysis. We propose two performance measures: the average gain for future patients (E) and the probability of recommending a change to a worse treatment at the individual patient level (P). Five families of decision rules are applied under different assumptions for the individual treatment effect (TE) variation. Three distributions reflecting optimistic, moderate, and pessimistic scenarios are assumed for the true treatment effects across studies. This framework allows us to compare subgroup analysis decision rules, and we demonstrate through simulation studies that there are decision rules for subgroup analysis which can decrease P and increase E simultaneously compared to the situation of no subgroup analysis. These rules are much more liberal than the usual superiority testing, which typically implies a dramatic decrease in E.


Introduction
The choice of relevant patient populations is an important step in the benefit assessment of a new drug. It may result in the application of subgroup analyses in a clinical trial, even if the overall effect was significant. A nice illustration of this point is the debate regarding the effect of Clopidogrel in the CAPRIE trial (Bender et al., 2010; Hasford et al., 2010). The trial was a randomized triple-blind clinical trial including patients with atherothrombosis, as diagnosed by the disease manifestations of myocardial infarction (MI), stroke, or symptomatic peripheral arterial disease (PAD), randomly assigned to be treated by either Clopidogrel or Aspirin using block randomization with center and disease group (Stroke, MI, and PAD) as stratification factors. The main objective was to test the superiority of Clopidogrel vs. Aspirin in the secondary prevention of vascular events. The primary endpoint was the first occurrence of an event, either MI, ischemic stroke, or vascular death. The primary analysis defined in the protocol was an intention-to-treat (ITT) analysis of the primary endpoint in the total cohort of patients. A sample size of 19,185 patients was chosen to detect an overall relative risk reduction (RRR) of 11.6% with 90% power using a two-sided log-rank test at the 5% significance level (CAPRIE Steering Committee, 1996).
The results from the ITT analysis showed a statistically significant TE (p = 0.043) in terms of an RRR of 8.7% in favor of Clopidogrel (95% confidence interval [CI] 0.3–16.5), which was consistent with the assumed TE (11.6% RRR) in the sample size calculation. In an additional analysis, the CAPRIE investigators separately examined the effect of the treatments on the primary outcome in each of the three strata of the disease group (Fig. 1), suggesting some heterogeneity. Formally, this corresponded to a borderline statistically significant interaction (p = 0.042) (CAPRIE Steering Committee, 1996). Interestingly, based on these two analyses, several institutions came to different decisions. In the approval of Clopidogrel for the European market, the European Medicines Agency (EMA) accepted the primary endpoint analysis of the overall population because neither strong heterogeneity nor a deficient definition of the overall study population had been found in the CAPRIE trial (European Medicines Agency, 2000; Hasford et al., 2010). Similarly, the FDA adhered to the ITT analysis for CAPRIE and approved Clopidogrel accordingly for all patients with atherothrombotic diseases (Food and Drug Administration, 1998). In a cost-benefit assessment for the health care system in the UK, NICE concluded with respect to efficacy in accordance with the primary ITT analysis of the overall population of the CAPRIE trial. However, it considered that the balance between clinical effectiveness and cost-effectiveness did not justify replacing aspirin with Clopidogrel to prevent vascular events (NICE, 2005). In a benefit assessment for the German system, IQWiG acknowledged the superiority of Clopidogrel only in the subgroup of patients with PAD but performed no evaluation of cost-effectiveness (IQWiG, 2006).
The desire to perform subgroup analysis after reaching a significant overall effect also appears in multi-regional clinical trials (MRCTs), where the overall effect is the global TE in the whole population regardless of region. Since the same treatment may work differently in patients from different countries and regions, subgroup analysis by country or region is often required by regulatory authorities in MRCTs (Pocock, 2013). However, the situation is slightly different here, as the subgroup analyses are already intended when planning the study and can be taken into account in the sample size calculation (Ikeda and Bretz, 2010).
Although subgroup analyses in clinical trials have been a matter of debate and methodological investigation for at least two decades (ICH Steering Committee, 1999; Pocock et al., 2002; Yusuf et al., 1991), we now meet a rather new and different situation. Traditionally, subgroup analyses were discussed in the context of trials failing to demonstrate an overall effect, and the focus was on limiting the risk of generating spurious signals of treatment superiority by testing many (small) groups and focusing too much on the estimates of maximally observed effects. If we apply subgroup analyses after demonstrating an overall effect, we may be more concerned about failing to demonstrate an existing TE in a subgroup due to insufficient power. There may be too much focus on confidence limits or the minimal observed effects, even if the spread of the treatment effect over subgroups may mainly reflect random fluctuation. This new situation has not yet attracted much attention, although an EMA guideline proposal on subgroup analysis in 2010 already mentioned the topic, and the 2014 version of the draft guideline includes a section on it (European Medicines Agency, 2010, 2014).
The main objective of this paper is to provide and use a framework to assess and compute the long-term effect of different strategies for performing subgroup analyses when the overall TE is significant. The formal framework and decision rules considered are presented in Section 2. Results are displayed in Section 3. Section 4 summarizes main results and discusses potential limitations. A conclusion in Section 5 finalizes the paper.

Notations
Our framework is based on considering a sequence of clinical trials, all comparing some standard treatment to a new experimental treatment. The sample size calculation for all studies is based on the same assumed effect θ_A, resulting in identical sample sizes N. The patient population of each study can be divided into K subgroups of equal size, and we have a decision rule φ to decide on the superiority of the experimental treatment in the whole study population, and a decision rule ϕ to make such a decision at the subgroup level.
In the sequel, we denote the studies by s = 1, …, S, the subgroups in each study by g = 1, …, K, and the individual patients in each subgroup by i = 1, …, I with I = N/K. For each patient, we assume an individual true TE θ_sgi, because it is unrealistic for every patient to have the same characteristics apart from the group partition; thus, each individual patient may react differently to the same treatment. Further, θ_sg denotes the true TE in subgroup g of study s, and θ_s denotes the true TE in study s.

Performance measures
In evaluating the performance of particular subgroup decision rules, we propose two measures. The point of departure of these measures is the general goal of clinical trials to improve the overall outcome of patients by accepting new effective treatments for whole patient groups, without putting too many patients at risk of being harmed due to heterogeneity of treatment effects within such groups. Due to the stochastic nature of the results of clinical trials, we have to balance between these two subgoals. On one hand, we would like to improve the overall outcome as much as possible by accepting as many effective treatments as possible. On the other hand, we want to minimize or control the risk of recommending that a patient switch to a treatment which is actually inferior, i.e., with θ_sgi < 0.
To measure the overall improvement, we consider the overall gain. We assume that, if the experimental treatment is declared superior in a subgroup, all patients in that subgroup will receive the new treatment instead of the standard treatment in the future. Hence, based on our simulation, this gain is defined as

E = (1 / (S·K·I)) Σ_{s=1}^{S} Σ_{g=1}^{K} Σ_{i=1}^{I} ϕ_sg · θ_sgi,

where ϕ_sg denotes the result of applying the decision rule ϕ in subgroup g of study s.
In the case of θ being a difference in survival or success probability, E approximates the average gain in survival or success probability over all future patients, if we always follow the recommendations made by the decision rules. If θ is a difference in the expected value of a continuous outcome, e.g., a quality of life score, then E is the average gain in this score over all future patients.
To measure the risk of recommending an inferior treatment, we consider the fraction of patients with a negative true TE in the subgroups with a positive decision, i.e.,

P = ( Σ_{s=1}^{S} Σ_{g=1}^{K} Σ_{i=1}^{I} ϕ_sg · 1{θ_sgi < 0} ) / ( Σ_{s=1}^{S} Σ_{g=1}^{K} Σ_{i=1}^{I} ϕ_sg ),

so that P approximates the probability of recommending an inferior treatment among all patients changing treatment as a consequence of the study results. In general, we aim to accept as many treatments with a positive effect as possible for as many patients as possible. At the same time, we should keep the risk of recommending new treatments to patients who do not benefit from them as low as possible. Consequently, we should aim at maximizing E and minimizing P. Good decision rules should imply a reasonable balance between P and E.
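Given simulated individual effects and subgroup decisions, the two performance measures can be computed directly. The following is a minimal sketch under assumed inputs; the array names (`theta`, `decision`) and the toy generation of their contents are illustrative, not the authors' Stata implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated quantities:
# theta[s, g, i]  : true individual treatment effects theta_sgi
# decision[s, g]  : subgroup rule result phi_sg (1 = superiority declared),
#                   assumed to already be 0 whenever the overall rule rejected
S, K, I = 1000, 2, 140
theta = rng.normal(0.1, 0.1, size=(S, K, I))
decision = rng.integers(0, 2, size=(S, K))

# E: average gain over all future patients if recommendations are followed
E = np.mean(decision[:, :, None] * theta)

# P: fraction of patients with a negative true TE among all patients
# switching treatment because of a positive decision
switched = np.broadcast_to(decision[:, :, None], theta.shape).astype(bool)
P = np.mean(theta[switched] < 0)
```

Patients in subgroups with a negative decision contribute zero gain to E and are excluded from the denominator of P, mirroring the definitions above.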

Subgroup decision rules
In this paper, we consider five families of subgroup decision rules. All these rules are only applied if the overall rule φ decides on superiority, i.e., if φ = 1.
In the first family ϕ_{<,α} with α ∈ (0.0, 1.0), we consider the null hypothesis of usual superiority testing, namely that the new treatment is inferior to the standard treatment, i.e., ϕ^{<,α}_sg = 1 if and only if the lower bound of the two-sided (1 − α) confidence interval for θ_sg is above 0. In the second family ϕ_{>,α} with α ∈ (0.0, 1.0), we take the opposite view: we do not require evidence for the superiority of the new treatment, but deny a subgroup effect only if we have evidence for inferiority, i.e., ϕ^{>,α}_sg = 0 if and only if the upper bound of the (1 − α) CI for θ_sg is below 0.
For the limiting case α = 1.0, in both families the decision rule reduces to comparing the estimate with zero, i.e., ϕ^E_sg = 1 if θ̂_sg ≥ 0. In the second family, for the limiting case α = 0.0, we approach the situation that no subgroup analyses are performed, i.e., we decide on superiority for all subgroups as soon as we decide on overall superiority, i.e., ϕ^N_sg = φ. Since there is some tradition in subgroup analysis of only performing subgroup-specific tests in the presence of evidence for heterogeneity of the subgroup-specific TEs (Brookes et al., 2004), we also introduce the family ϕ_{I,δ,α}. Here ϕ^{I,δ,α}_sg = 1 if and only if the null hypothesis H_0: θ_s1 = ⋯ = θ_sK can be rejected at level δ and the lower bound of the (1 − α) CI for θ_sg is above 0. Finally, we consider a family popular in the analysis of multi-regional trials, where the estimate in a subgroup should be at least some fraction of the overall TE estimate (JPFSB, 2007): ϕ^{F,γ}_sg = 1 if and only if θ̂_sg ≥ γ·θ̂_s. Table 1 summarizes the main properties of the families considered. To illustrate the differences among the five families of decision rules, we reconsider the CAPRIE trial, where quite different decisions can be reached depending on the rule applied.
Starting with the first family ϕ_{<,α} with α = 0.05, we conduct the subgroup analysis in the widely used way of performing a superiority test in each subgroup at the 5% level, or equivalently comparing the lower bound of the 95% CI with 0. As we can see from Fig. 1, a treatment effect is then claimed only in the PAD group. If we weaken this criterion by allowing a larger α, and consequently a narrower confidence interval, we observe in the Stroke group a CI of [−3.5, 17.0] for α = 0.1 and of [0.04, 14.0] for α = 0.26. Hence, we can claim a treatment effect in both the PAD and the Stroke groups if α ≥ 0.26. This also holds for the limiting case α = 1.0, i.e., for ϕ^E, where we only compare the estimate with 0, as (only) in these two groups the effect estimates are above 0 (cf. Fig. 1). We can become even more liberal and consider the second family ϕ_{>,α}, comparing the upper bound of the (1 − α) CI with 0. We start with α = 1.0, i.e., with ϕ^E, and then decrease α. In the MI group, for α = 0.68, the 32% CI is [−7.3, −0.2], i.e., the upper bound for the MI group starts to be negative. Hence, for α < 0.68, we would accept Clopidogrel in all three groups.
Since the interaction between treatment effect and disease group is statistically significant (p = 0.042), using the decision rule family ϕ_{I,δ,α} with δ = 0.05 is equivalent to using ϕ_{<,α}. Consequently, for α = 0.05, Clopidogrel is only accepted in the PAD group. When using the last family of decision rules ϕ_{F,γ} with γ = 0.5, as suggested in the Japanese guideline on MRCTs (JPFSB, 2007), we would reject Clopidogrel only in the MI group, as not only the largest estimate, found in the PAD group, but also the estimate of 7.3 observed in the Stroke group is larger than half of the overall estimate, i.e., 8.7/2 = 4.35 (CAPRIE Steering Committee, 1996).
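Assuming that the subgroup effect estimate, the CI bounds at the chosen α, the overall estimate, and the interaction p-value are available per subgroup, the five families can be sketched as simple predicates. This is a hypothetical illustration with invented function names, not the authors' Stata implementation:

```python
def rule_superiority(ci_lower):          # phi_{<,alpha}: lower CI bound above 0
    return ci_lower > 0

def rule_no_inferiority(ci_upper):       # phi_{>,alpha}: reject only if upper bound below 0
    return not ci_upper < 0

def rule_estimate(est):                  # phi_E (limiting case alpha = 1.0)
    return est >= 0

def rule_no_subgroup(overall_decision):  # phi_N (limiting case alpha = 0.0)
    return overall_decision

def rule_interaction_gate(p_interaction, delta, ci_lower):  # phi_{I,delta,alpha}
    return p_interaction < delta and ci_lower > 0

def rule_fraction(est, overall_est, gamma):  # phi_{F,gamma}
    return est >= gamma * overall_est
```

With the CAPRIE numbers quoted above, for instance, `rule_fraction(7.3, 8.7, 0.5)` accepts the Stroke group, while `rule_no_inferiority(-0.2)` rejects the MI group at α = 0.68.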

Assumptions on subgroup specific TEs and individual TEs
Given the study-specific TE θ_s, we assume for the subgroup-specific effects

θ_sg = θ_s + η_sg with η_sg ~ N(0, σ²_G),

with N denoting a centered normal distribution, ensuring that the average θ_s· of the subgroup-specific effects is identical to θ_s (for more details see Appendix I of the Supplementary Material). Within each subgroup, it would not be reasonable to assume identical TEs for each patient, because the subgroup categorization considered typically reflects only one factor such as age, gender, or a disease characteristic. Hence, it is likely that there are also other factors with an influence on the treatment effect, resulting in additional variation from patient to patient. Hence, for the individual TEs we assume again

θ_sgi = θ_sg + ε_sgi with ε_sgi ~ N(0, σ²_GI),

ensuring that θ_sg· = θ_sg.
To facilitate the interpretation of our results, we reparameterize the variance components σ²_G and σ²_GI in the following way: with σ²_I = σ²_GI + σ²_G, we denote the overall variance of the individual TEs in a single study (i.e., the conditional variance given θ_s). We express the between-subgroup variation σ²_G as a fraction of the overall variation of the individual TEs, i.e., we define

R² = σ²_G / σ²_I.

R² can be easily interpreted as an explained variation, i.e., how much of the inter-individual variation in TEs can be explained by between-subgroup variation. Further, we relate the overall individual TE variation to the assumed effect by introducing

τ = σ_I / θ_A,

in order to obtain a quantity which is independent of the scale of the TE. The following interpretation of τ may be helpful: if the true effect θ_s is equal to the assumed effect θ_A for a single trial, the choice τ = 0.5 implies that two standard deviations (SD) of θ_sgi equal θ_A, and hence corresponds to a situation where about 2.5% of the patients have a negative TE. Similarly, τ = 1.0 corresponds to a situation where about 15.8% of the patients have a negative TE.
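The reparameterization can be inverted to recover the variance components used in the simulation from the interpretable parameters R² and τ; a minimal sketch (names illustrative):

```python
import math

def variance_components(theta_A, tau, R2):
    sigma_I = tau * theta_A                   # overall SD of the individual TEs
    sigma_G = math.sqrt(R2) * sigma_I         # between-subgroup SD
    sigma_GI = math.sqrt(1.0 - R2) * sigma_I  # residual within-subgroup SD
    return sigma_G, sigma_GI
```

For example, with θ_A = 0.2, τ = 0.5, and R² = 0.25, this yields σ_I = 0.1, σ_G = 0.05, and σ_GI ≈ 0.087, so that σ²_G + σ²_GI = σ²_I as required.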

Assumptions on distribution of true study effects
The success rates of clinical trials (in the sense of reaching a significant positive TE for the new treatment) vary substantially among different areas (Dent and Raftery, 2011; Djulbegovic et al., 2008, 2013; Kumar, 2005). Hence, it is necessary to consider different scenarios for the true TE. We consider three scenarios in this paper.
In the first one, we assume that on average the true TE is identical to the assumed effect, and although there is variation from study to study, the true TE is negative in only 2.5% of the trials. We call this the optimistic scenario. Assuming a normal distribution of the true TE, it can be expressed as

θ_s ~ N(θ_A, (0.5·θ_A)²).

Next, we consider a scenario where on average the true TE is half of the assumed effect. Keeping the assumption of normality and the degree of variation of the true TE, this means that the true effect is negative in 15.8% of all trials, and at least as large as the assumed effect in another 15.8%. This moderate scenario is given by

θ_s ~ N(0.5·θ_A, (0.5·θ_A)²).

Last, we consider a scenario where on average there is no effect, and in only 2.5% of all trials the true effect is at least as large as the assumed effect. This pessimistic scenario is given by

θ_s ~ N(0, (0.5·θ_A)²).

Table 2 summarizes the main properties of the three scenarios considered.
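The stated percentages follow directly from the normal distribution function. A quick numerical check (a sketch using θ_A = 0.2 as in the later outcome scenarios; the dictionary names are illustrative):

```python
from statistics import NormalDist

theta_A = 0.2
sd = 0.5 * theta_A  # common SD of the true study effects in all three scenarios
means = {"optimistic": theta_A, "moderate": 0.5 * theta_A, "pessimistic": 0.0}

# fraction of trials with a negative true TE in each scenario
neg_fraction = {name: NormalDist(mu, sd).cdf(0.0) for name, mu in means.items()}
```

This reproduces roughly 2.3% negative trials in the optimistic scenario (quoted as "only 2.5%"), 15.9% in the moderate scenario, and 50% in the pessimistic one.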

Scenarios for outcomes
It is an implicit assumption in our considerations that the values of E and P are mainly determined by the structure of our framework described so far, as long as φ and ϕ are based on asymptotically most powerful tests. They should not depend on the concrete choice of outcome scales, nuisance parameters of the outcome distribution, and inference procedures. However, to compute E and P, a concrete choice has to be made.
In this paper, we present the results for the case of a binary outcome with the risk difference as the effect measure, i.e., θ_sgi = π^E_sgi − π^S_sgi, with π^S_sgi and π^E_sgi denoting the probability of a success for patient i in subgroup g in study s, treated by the standard treatment or the experimental treatment, respectively. π^S_sgi is chosen as the constant value 0.4, implying that there is no association of the individual TEs with the prognosis under the standard treatment. The decision rule φ at the overall level is based on the χ² test, confidence intervals for the risk difference are based on Newcombe's method No. 10 (Newcombe, 1998), and interaction tests are based on logistic regression. The power calculation of each single study is based on assumed success rates of 0.6 and 0.4 in the two treatment groups, i.e., θ_A = 0.2. Consequently, 90% power requires a sample size of N = 280 for each study. As we assume a normal distribution for θ_sgi, we cannot avoid that π^E_sgi falls outside the interval [0, 1] for some patients, and hence we change θ_sgi to max(0, min(1, π^E_sgi)) − max(0, min(1, π^S_sgi)) in computing the values of E and P.

Table 2. Overview of the three scenarios for the distribution of the true study effect.

Scenario    | Distribution                  | Implication for the distribution of the true study effect
Optimistic  | θ_s ~ N(θ_A, (0.5·θ_A)²)      | Negative in only 2.5% of the trials
Moderate    | θ_s ~ N(0.5·θ_A, (0.5·θ_A)²)  | Negative in 15.8% of all trials, and at least as large as the assumed effect in another 15.8%
Pessimistic | θ_s ~ N(0, (0.5·θ_A)²)        | At least as large as the assumed effect in only 2.5% of all trials
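The clipping of the success probabilities described for the binary-outcome case can be sketched as follows (illustrative names, not the authors' code):

```python
import numpy as np

def clipped_effect(pi_S, theta_sgi):
    # pi^E may fall outside [0, 1] because theta_sgi is normal; both
    # probabilities are clipped to [0, 1] before recomputing the effect
    pi_E = pi_S + theta_sgi
    return np.clip(pi_E, 0.0, 1.0) - np.clip(pi_S, 0.0, 1.0)
```

With π^S = 0.4, an individual TE of 0.8 is thus truncated to an effective gain of 0.6, and a TE of −0.5 to an effective loss of −0.4.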
We focused on the risk difference in the main analysis, as it gives a simple interpretation of E. To check our implicit assumption, we considered six further scenarios in a sensitivity analysis, allowing an association between prognosis and individual TEs, a different overall prevalence, the use of the log odds ratio as effect measure, and continuous outcomes with Cohen's d as TE measure. Details are outlined in Appendix II and results are shown in Figures 5–7 in Appendix III of the Supplementary Material.

Results
We start by illustrating the use of E and P applied to ordinary clinical trials with no subgroup analyses. Fig. 2 shows the histograms of the individual true TEs in the three scenarios introduced in Section 2.5, as well as those of patients in studies resulting in a decision on superiority. In the optimistic scenario, the vast majority of trials result in such a decision, and consequently the value of E is close to the assumed TE of 0.2. The fraction of patients suffering from the recommendation of an inferior treatment is very small, i.e., within 3%. In the moderate scenario, less than 50% of the trials result in a positive decision; consequently, the value of E is less than half of the assumed effect, and P increases to 7%. In the pessimistic scenario, only very few trials result in a positive decision. As in these few trials the true TE tends to be positive, we still have a small overall gain, indicated by a value of 0.02 for E. The control of P also still functions to some degree, limiting it to 14%.
Plots of P and E using different decision rules for the optimistic, moderate, and pessimistic scenarios with τ = 0.5 and τ = 1 and five values of R² are shown in Fig. 3. In these plots, we connected the points belonging to the two families of superiority testing ϕ_{<,α} and inferiority testing ϕ_{>,α}, such that we have a route from the case of no subgroup analysis, over the case of comparing the estimate with 0, to the case of superiority testing at levels of 20% and 5%. Although the magnitude of P and E differs across the three scenarios and depends on τ, the general patterns observed are very similar. Several important aspects can be read from Fig. 3. We first look at the rather liberal approaches, i.e., the family ϕ_{>,α} with α ∈ (0.0, 1.0), starting with α = 0.0, i.e., the case of no subgroup analysis (ϕ^N), and moving to α = 1.0, corresponding to the case of only comparing the estimate with 0 (ϕ^E). In Fig. 3, this means starting from the point in the upper right corner and following the lines until the diamonds. For both values of τ and all choices of R² in all scenarios, we can observe a reduction in P and a slight increase or decrease in E. The reduction in P increases when R² and τ increase, reflecting that the overall fraction of patients with a negative TE increases. The decrease in E is largest for R² = 0, as we then perform subgroup analyses without any need. An increase in E can be observed if τ = 1 and for large values of R², i.e., we remove more patients with a negative TE than with a positive TE using subgroup analysis. The %E and %P columns in the middle part of Table 3 quantify the changes observed for E and P in Fig. 3, comparing ϕ^E with ϕ^N. The decrease in P is less than 0.5% if R² = 0, between 3% and 11% if R² = 0.25, and between 11% and 27% if R² = 0.5.
The decrease in E is always less than 0.75%, and in the case of substantial heterogeneity of TEs, i.e., when τ and R² are large, the increase may be up to 9%.
Then we continue with the more stringent approaches, i.e., the family ϕ_{<,α} with α ∈ (0.05, 1.0), starting with α = 1.0, i.e., the case of comparing the estimate with 0, and moving to α = 0.05, i.e., significance testing at the 5% level in each subgroup. In Fig. 3, this means following the lines from the diamonds to the circles at the bottom; the points in between refer to superiority testing at 20%. We can always observe that P and E are both decreasing, i.e., we reduce the gain as well as the fraction of patients being recommended an inferior treatment. However, the reductions are now much more pronounced compared to moving from ϕ^N to ϕ^E only. The %E and %P columns in the right part of Table 3 quantify these reductions. The reduction in P is in the range of 9–29% if R² = 0, and between 64% and 73% if R² = 0.5. The reduction in E is in the range of 16–40% if R² = 0, and between 6% and 26% if R² = 0.5. So whether superiority testing at 5% is a good idea or not depends on the degree of heterogeneity and on how we balance the undesirable reduction in E against the desirable reduction in P.

The next aspect depicted in Fig. 3 is the value of using interaction tests as gatekeepers, i.e., specific members of the family ϕ_{I,δ,α}. Two instances of this strategy are considered: performing subgroup tests at α = 5% if the interaction test was significant at either δ = 5% or δ = 15%. The results are marked by a cross or a plus, respectively. We can immediately see that the cross and plus are never above the line in the corresponding color, indicating that we can always find a significance level such that superiority testing yields a larger value of E and a smaller value of P simultaneously. This suggests that this strategy does not offer a general advantage. Similarly, Fig. 3 depicts the value of requiring the preservation of a pre-specified fraction of the overall estimate, i.e., specific members of the family ϕ_{F,γ}, represented by the filled and hollow triangles, referring to fractions of γ = 0.5 and γ = 0.75, respectively. Again, these symbols are never above the corresponding lines; hence, we cannot conclude a general advantage compared to superiority testing. However, they tend to be closer to the curves than the symbols referring to the interaction gatekeeper strategy.

Figure 3. The results for the decision rules ϕ_{<,α} and ϕ_{>,α} are connected by a line starting with ϕ_{>,0.0}, i.e., no subgroup analysis, in the upper right corner and ending with ϕ_{<,0.05} (marked by a circle). The two points marked in between correspond to ϕ^E (marked by a diamond) and ϕ_{<,0.2}. The cross and plus correspond to ϕ_{I,0.05,0.05} and ϕ_{I,0.15,0.05}, respectively. The filled and hollow triangles correspond to ϕ_{F,0.5} and ϕ_{F,0.75}. Note that the y- and x-scales vary from plot to plot.
So far, we have only considered the case of 2 subgroups. The upper part of Fig. 4 shows the corresponding plots for the moderate scenario with τ = 0.5, but comparing the cases of 2, 3, or 4 subgroups. We observe patterns similar to Fig. 3. The reduction in P does not seem to depend on the number of groups, but the degree of reduction in E increases with the number of subgroups. For example, for R² = 0.25 and superiority testing at the 5% level, we observe a reduction by 24.8% in the case of 2 groups, by 39.1% in the case of 3 groups, and by 49.6% in the case of 4 groups. Note that the rule of comparing the estimate with 0 is still associated with a moderate loss in E. With an increasing number of subgroups, the interaction gatekeeper approach tends to become much worse than superiority testing, and the discrepancy to superiority testing also increases for the fraction-of-estimate approach.
Finally, we consider the effect of increasing the sample size N in the lower part of Fig. 4, simulating the situation that we can assess a benefit based on two or four studies which are homogeneous enough to be pooled. Again, we observe the same patterns. Not surprisingly, the increase in power leads to an increase in E, and applying superiority testing at the 5% level in two subgroups implies a level of E similar to no subgroup analysis in a study of half the size. In contrast, the impact on P is less pronounced, but we can observe a slight increase with larger sample sizes. The discrepancy between superiority testing and the interaction gatekeeper approach seems to decrease slightly when increasing the sample size, whereas the discrepancy to the fraction-of-estimate approach tends to increase slightly. Selected values corresponding to Fig. 4 can be found in Table 4 in Appendix III of the Supplementary Material.

Table 3. Selected results shown in Fig. 3. % denotes the relative decrease compared to ϕ^N.

Discussion
In this paper, we propose a framework that allows us to compare different subgroup analysis strategies applied to clinical trials which have demonstrated an overall effect. As expected, subgroup analysis can help reduce the number of patients who suffer from an incorrect recommendation. The stricter the criterion we employ for the subgroup analysis, the smaller the fraction of patients with an incorrect recommendation among all patients with such a recommendation. However, we often have to pay the price of reducing the overall gain E while reducing P. If there is no subgroup variation, we have the maximal reduction in E and the minimal reduction in P. Even in the case of substantial subgroup variation, e.g., if the subgroups explain at least 50% of the overall variation in TEs from patient to patient, strict decision rules like superiority testing at the 5% level can imply a substantial reduction of E, in particular when more than 2 subgroups are considered.
The simple decision rule of requiring a positive TE estimate in a subgroup implies only a small loss in E in the worst case, and if both large subgroup variation and large individual variation exist, we could even obtain a gain in E. Simultaneously, we always reduce P to a non-negligible degree in the case of a certain degree of subgroup variation. So here we have, or are close to, a "win-win" situation. However, using subgroup analysis based on the family of superiority testing, even with a moderate significance level like 20%, typically implies a non-negligible reduction in E. So any attempt to use some type of superiority testing in subgroup analysis destroys this "win-win" situation. Consequently, the use of superiority testing can typically only be justified if there is an external decision that a reduction in P outweighs the reduction in E.

Figure 4. Plot of E vs. P for different numbers of subgroups K (3 plots in the upper part) and using different overall sample sizes N (3 plots in the lower part) for the moderate scenario and τ = 0.5. For the explanation of symbols see Fig. 3. Note that the y- and x-scales vary from plot to plot.
Interaction tests are often performed as a gatekeeper for subgroup analysis. However, we found within our framework that decision rules using the interaction gatekeeper approach generally do not offer any advantage. Similarly, requiring a certain fraction of the overall effect (a decision rule adopted by the Japanese guideline for global clinical trials) does not offer advantages either.
Although our framework is based on considering single trials, it also throws some light on the situation of performing benefit assessments based on a meta-analysis. On a qualitative level, subgroup analysis seems to behave very similarly in a meta-analysis with respect to the relation between different decision rules and their impact on E and P. On a quantitative level, we can reach higher levels of E. Of course, we have to be aware that the gain in E is probably due to larger sample sizes allowing the demonstration of smaller TEs, whose clinical relevance might be questionable.
In interpreting the results, some limitations of our framework have to be taken into account. First, we assume that σ²_I, i.e., the inter-individual variation of the TEs, is independent of the true TE θ_s of the trial. It seems more realistic to assume that σ²_I decreases as θ_s approaches 0, since if a treatment is not effective on average, it is unlikely that there are large effects at the individual level. However, assuming proportionality between σ²_I and θ_s is also too strong. Our choice of assuming no association between σ²_I and θ_s can be regarded as a conservative choice, implying an overestimation of P, as we assume more negative individual TEs than would be expected in reality.
Second, our framework assumes that decisions on making a treatment available for all patients are based on a single trial. This is usually not the case; for example, the FDA and EMA guidelines assume the existence of two pivotal studies (European Medicines Agency, 2000; Food and Drug Administration, 1998). However, the example of the CAPRIE trial illustrates that decisions are sometimes based on one single, large trial. Furthermore, our considerations for the case of pooling two or four studies suggest that many considerations and conclusions will be similar if more than one study is available.
Third, our framework assumes that a proof of superiority implies a decision to make the treatment available for all patients, whereas a failure to do so implies no access to the treatment in the future. The latter may be the case if we consider decisions on drug approval by regulatory agencies (Maggioni et al., 2007), but then the first assumption is not fulfilled, as a proven gain in efficacy has to be balanced with the safety profile of the new treatment. If we consider decisions on reimbursement instead of drug approval (Paget et al., 2011), then even a failure to demonstrate superiority may not imply that the drug is unavailable to all patients. For example, it may be available to those who are willing to pay for it themselves. In Germany, a negative decision on an additional benefit in a health technology assessment performed after drug approval implies that the drug may still be sold, but only at the price of the comparator, and it is up to the pharmaceutical company to decide whether to accept this or remove the drug from the market completely.
Fourth, in our framework we neglect that treatments are developed for patient populations of different sizes. Our results can change if there is a correlation between true TEs and the size of the target population. Fifth, for simplicity, we made the strong assumption that all subgroups in a trial are of equal size, which will not hold in most clinical trials. Similarly, we considered only one partition of the study population into subgroups. In most trials, several factors like age, gender, or baseline characteristics may define different partitions, adding the complexity of patients with contradictory treatment recommendations. Sixth, we assume in our framework that TEs observable in clinical trials will also be observed when the treatment is later applied as part of standard care. Finally, we did not include all approaches to subgroup analysis in our considerations. However, our framework allows the investigation of alternative approaches, e.g., Bayesian methods (Jones et al., 2011).

Conclusion
Subgroup analysis is a great temptation to improve the benefit assessment of new drugs by allowing consideration of smaller relevant patient groups. However, there is a risk of overlooking superior new treatments due to insufficient power, which may decrease the overall efficiency of the clinical trial culture in improving the average outcome. The simple rule of comparing effect estimates, instead of confidence intervals, with 0 in subgroup analyses of trials with a significant overall effect may supply a good compromise. It allows us to reduce the fraction of patients suffering from being recommended a worse treatment without diminishing the overall gain achievable by the current clinical trial culture. Stricter decision rules may be justified if there is a priori a high likelihood of substantial subgroup variation, or if the avoidance of incorrect treatment-change recommendations at the individual level is given more weight than the improvement of the average outcome.
(4) Finally, we consider continuous outcomes and Cohen's d as the effect measure, i.e., θ = (μ_E − μ_S)/σ. The t-test was used as the overall test, CIs were t-test based, and a regression model was used to assess interactions. The assumed effect was 0.2; assuming σ = 1 implies a sample size of 1052. The data were generated as.

Results of scenario (a) are shown in Figure 7 in Appendix III of the Supplementary Material.

Software
Software in the form of Stata code, together with sample simulation data sets is available on request from the corresponding author (E-mail: sun@imbi.uni-freiburg.de).