Estimands for Continuous Longitudinal Outcomes in the Presence of Treatment Discontinuation—A Simulation Study in Hyperkalemia Treatments

Abstract The ICH-E9 (R1) addendum guideline requires a precise description of the treatment effect of interest, named “estimand,” reflecting the clinical question posed by the trial objective, in each specific therapeutic setting. In clinical trials evaluating maintenance treatments, such as hyperkalemia treatments, one of the intercurrent events is planned treatment discontinuation due to abnormal laboratory values (outside the acceptable range). When using longitudinal data for evaluation, the mixed model for repeated measures, responder analysis, and survival analysis are often used to deal with intercurrent events. Under composite variable or hypothetical strategies, prioritized composite outcome or the last rank carried forward can also be considered. In this study, we propose a prioritized composite outcome in clinical trials evaluating maintenance treatments with treatment discontinuation. In addition, the estimand for each analysis was defined, and operating characteristics were evaluated. Simulation studies indicate the usefulness of the prioritized composite outcome using the composite variable strategy and the mixed model for repeated measures using the hypothetical strategy.


Introduction
A maintenance treatment is administered to maintain a certain laboratory test value within a range specified as the treatment target (hereafter referred to as the "target range"). For example, the target range of serum potassium for hyperkalemia treatment was set at 3.8-5.1 mmol/L in several clinical trials (Weir et al. 2015). For the evaluation of maintenance treatments in clinical trials, however, the treatment is discontinued when a value is outside the limits defined for patient safety (hereafter referred to as the "acceptable range"). An event in which a value is outside the acceptable range is referred to as a "threshold breach." For example, in a phase III confirmatory study for hyperkalemia treatment (OPAL study), patients discontinued their planned treatment due to a breach of the threshold of 3.8-5.5 mmol/L (acceptable range), particularly in the placebo group. The cumulative proportion of treatment discontinuation was approximately 40% through the trial duration. Because of treatment discontinuation, measurements under treatment cannot be observed. The last rank carried forward (LRCF) method (O'Brien, Zhang, and Bailey 2005) was applied for patients who discontinued treatment in the primary analysis of the OPAL study, and it was thoroughly reviewed in the FDA statistical review (FDA 2014), NEJM, and the EMA assessment report (EMA 2017).
When considering estimands in clinical trials for maintenance treatments, several strategies specified in ICH E9 (R1) (ICH 2019) can be applied for addressing the intercurrent event of treatment discontinuation due to threshold breaches. Under the treatment policy strategy, the observed value for the variable of interest is used regardless of whether intercurrent events occur. This strategy is derived from the intention-to-treat (ITT) principle, as specified in ICH E9 (ICH 1998) and ICH E9 (R1) (ICH 2019); however, it is not always considered optimal for regulatory and clinical decision-making (ICH 2019).
Under the hypothetical strategy, a scenario is envisaged in which the intercurrent event would not occur and the treatment effect is considered under that scenario (ICH 2019). Under the composite variable strategy, information on intercurrent events is included in the definition of an endpoint. The composite variable strategy can realize the ITT principle, where the value for the original variable of interest after intercurrent events might not be meaningful, but where the intercurrent event itself meaningfully describes the outcome of the patient and can avoid the complexity caused by intercurrent events (ICH 2019).
We could apply the following statistical methods when evaluating maintenance treatments using longitudinal laboratory values: the mixed model for repeated measures (MMRM) (Mallinckrodt, Clark, and David 2001), the last observation carried forward (LOCF), responder analysis, survival analysis, and LRCF. The MMRM could be argued to be used under a hypothetical strategy that assumes treatments would be continued even after an intercurrent event. The LOCF uses the assumption that outcomes do not change even after an intercurrent event under the hypothetical strategy. However, the LOCF is not recommended at present, because of the lack of clinical validity of the assumption and the problem of single imputation (Molenberghs et al. 2004). The responder analysis uses a binary composite variable; therefore, information can be lost since a continuous variable is dichotomized (Snapinn and Jiang 2007). The survival analysis considers the time of occurrence of intercurrent events under the composite variable strategy. The LRCF carries forward relative ranks of variables at the time of treatment discontinuation as hypothetical ranks at the time of primary evaluation (O'Brien, Zhang, and Bailey 2005). No conditions or missing mechanisms for reasonable statistical inference have been provided for the LRCF (Fan et al. 2016), and its estimand has not been discussed.
In recent years, prioritized composite outcome (PCO) has been considered as a new composite outcome in many diseases such as human immunodeficiency virus (Finkelstein and Schoenfeld 1999; Wittkop et al. 2010), schizophrenia (Henderson, Diggle, and Dobson 2000), rheumatoid arthritis (Wong et al. 2007), scleroderma-associated lung disease (Elashoff, Li, and Li 2007; Tseng and Wong 2011), cardiovascular disease (Lim et al. 2008; Rogers et al. 2014; Abdalla et al. 2016), amyotrophic lateral sclerosis (Berry et al. 2013), antibiotics (Evans et al. 2015), and kidney transplantation (Fergusson et al. 2018). PCO is defined to reflect the priority or importance of multiple endpoints. PCO can use the information of multiple endpoints more efficiently, particularly when its components include a continuous variable, while the responder analysis and survival analysis cannot, because they dichotomize continuous variables. Certain effect measures for PCO have been proposed (net-benefit (Buyse 2010), win ratio (Pocock et al. 2012), and win odds (Brunner, Vandemeulebroecke, and Mütze 2021)). Although PCO has not been applied for maintenance treatments, it can be used as a composite variable strategy in clinical trials with treatment discontinuation due to threshold breaches.
The ICH-E9 (R1) addendum guideline provides a structured framework for precisely describing the treatment effect of interest, named the "estimand." The estimand and its aligned estimators should be clarified in each therapeutic setting. However, no research has been conducted in the setting of maintenance treatments. In this study, on the basis of the addendum, we describe estimands in clinical trials for maintenance treatments with treatment discontinuation due to threshold breaches. Moreover, we propose a suitable definition of PCO for the illustrative clinical trial. The performance of these analysis methods was evaluated by simulation studies using the OPAL study as an example. The remainder of this article is organized as follows. Section 2 presents an overview of the OPAL study. Section 3 clarifies related estimands and reviews some common and proposed analysis methods. Section 4 presents the results of simulation studies based on the OPAL study. The discussion and conclusion are presented in Section 5.

Illustrative Clinical Trial for Hyperkalemia Treatments
The OPAL study is a phase III confirmatory study of a nonabsorbed potassium binder for hyperkalemia.
The trial is a randomized withdrawal study that consists of two phases: an initial phase and a randomization phase. All participants with hyperkalemia were administered the investigational drug (active drug) in the initial phase, and only those who achieved the target potassium level of 3.8-5.1 mmol/L were randomly assigned to the active drug or placebo. The time of primary evaluation in the randomization phase was 4 weeks. For ethical reasons, dose adjustment of the active drug and concomitant medication was allowed after the first occurrence of the serum potassium value being outside the acceptable range of 3.8-5.5 mmol/L. Therefore, the subsequent potassium values after the change from the planned treatment were not used for the primary analysis. The primary endpoint in the randomization phase was the change in serum potassium from the baseline at week 4 or that at the visit associated with the first threshold breach. For treatment comparison, analysis of covariance (ANCOVA) was applied using rank as an outcome variable. The rank for a subject who presented a threshold breach before week 4 was defined on the basis of the amount of change at the time of the threshold breach and was carried forward to week 4 (LRCF). All patients who continued their treatment without threshold breaches until week 4 were assigned ranks excluding those already reserved from week 1 to week 3 for the patients with threshold breaches. The median difference between the groups was estimated using associated changes, which were calculated on the basis of the carried-forward ranks and the observed changes. As a supplemental analysis, the time to recurrence, as defined by a serum potassium value of 5.5 mmol/L or higher, was evaluated. The median changes in serum potassium value were 0.00 mmol/L in the active drug group and 0.72 mmol/L in the placebo group. The median difference between the groups was 0.72 mmol/L (95% CI, 0.46-0.99; p < 0.001).
The cumulative recurrence proportions through the trial duration were approximately 10% in the active drug group and approximately 40% in the placebo group. The most common reasons for treatment discontinuation were upper threshold breach (4% in the active group and 31% in the placebo group) and lower threshold breach (5% in the active drug group and 2% in the placebo group). The proportion of treatment discontinuations for reasons other than threshold breaches was small, and no difference was observed between the groups.

Estimands and Aligned Analysis Methods for Hyperkalemia Treatments
In this section, we consider estimands and corresponding analysis method candidates for the illustrative clinical trial for hyperkalemia treatments based on the estimand thinking process outlined in ICH E9 (R1). Section 3.1 organizes candidate estimands and their five attributes: treatment, population, intercurrent event, variable, and population-level summary (of treatment effect). Section 3.2 explains the corresponding analysis methods and their details. All attributes described in Section 3.1 and aligned analysis methods described in Section 3.2 are summarized in Table 1.

Attributes of Each Estimand and Strategies for Addressing Intercurrent Event
Motivated by the OPAL study, we consider estimands in randomized, placebo-controlled clinical trials for maintenance treatments in the target population of patients with hyperkalemia whose serum potassium levels were normalized in an initial treatment phase. The trial objective is to confirm the superiority of the active drug over placebo in maintaining the target normal potassium level in these patients. There are several candidates for estimands associated with this objective, and their attributes are given below. In such trials, the serum potassium values (longitudinal outcomes) are measured over time, and treatment is discontinued when the serum potassium value goes outside the thresholds. We considered only the upper and lower threshold breaches of the serum potassium value as "intercurrent events" because the proportion of the occurrence of other intercurrent events was low in the illustrative trial and the impact of threshold breach as an intercurrent event is the primary focus of this research.
In clinical trials, the threshold is strictly set in accordance with ethical considerations. In clinical practice, visits and dose adjustments within 4 weeks are rare, and the treatment would not immediately change even if the thresholds of 3.8 and 5.5 mmol/L are breached. Therefore, "hypothetical strategies" could be considered under the assumption that the treatment would be continued even after the threshold is breached within 4 weeks. Some researchers mentioned the estimand in cases where the data after a switch should not be used in the estimators (Leuchs et al. 2015). However, the intercurrent event itself has clinical meaning; therefore, evaluation under the "composite variable strategy" is also relevant. While the treatment policy strategy is considered a gold standard for outcome studies, it was used neither in the OPAL study nor in other Phase III pivotal trials in the same disease category (FDA 2016).
On the basis of the above argument, we consider hypothetical and composite variable strategies. In the hypothetical strategy, we assumed a scenario, as in clinical practice, in which the patient could continue the planned treatment despite a breach of the acceptable range of 3.8-5.5 mmol/L, in contrast to the situation in a clinical trial, in which the planned treatment is changed because of threshold breaches. The change in serum potassium level from baseline at week 4 was considered as the "variable," and mean/median differences between groups were considered as effect measures ("population-level summary" of treatment effect). In the composite variable strategy, in addition to the serum potassium level at the time of primary evaluation (week 4), the occurrence of threshold breaches and/or time to threshold breaches were considered as components of the composite variable; threshold breaches of the acceptable range are included in the definition of the endpoint and are actively used for evaluation. PCO, response, and time to first threshold breach are considered as appropriate "variables" under the composite variable strategy. Each variable was accompanied by aligned effect measures ("population-level summary" of treatment effect): win ratio, win odds, and net benefit for PCO; difference in response proportion at week 4 for response; and hazard ratio and restricted mean survival time (RMST) difference for time to first threshold breach.
In this section, various attributes for estimand candidates were explained. By combining the attributes in each column organized in Table 1, the corresponding estimand is constructed.

Analysis Methods
This section describes analysis methods for each effect measure under the hypothetical and composite variable strategies. Six existing methods (the responder analysis, survival analysis, MMRM, LOCF, LRCF, and linear extrapolation) and the proposed PCO were employed. Linear extrapolation was set as a comparator under the hypothetical strategy to evaluate the impact of the assumption in the LRCF method.

Prioritized Composite Outcome
The overall treatment benefit based on the clinically defined PCO was evaluated under the composite variable strategy. The PCO, which is related to generalized pairwise comparisons (Buyse 2010), can be defined as ranks (Evans et al. 2015). In this study, the following order of worse to better outcomes (priority of multiple outcomes) was used to assign ranks: (a) time to the first upper threshold breach of serum potassium ≥5.5 mmol/L, (b) time to the first lower threshold breach of serum potassium <3.8 mmol/L, and (c) serum potassium value at week 4. In both outcomes (a) and (b), all patients who breached the thresholds at the same time were treated as a tie regardless of their serum potassium value. The win ratio (Pocock et al. 2012), win odds (Brunner, Vandemeulebroecke, and Mütze 2021) and net benefit (Buyse 2010) were calculated as effect measures. We performed the test proposed by Finkelstein and Schoenfeld (1999), which is based on the Wilcoxon rank sum test.
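To make the pairwise logic concrete, the following is a minimal Python sketch of the comparison and of the three effect measures, not the implementation used in this study. The dictionary keys (`upper_week`, `lower_week`, `k_week4`) are hypothetical names, and the direction of priority (c) is an assumption made only for illustration: a lower serum potassium value at week 4 is treated as the better outcome.

```python
from itertools import product

NO_BREACH = float("inf")  # sentinel week meaning "no breach within 4 weeks"


def compare(a, b):
    """Return +1 if patient a has the better outcome, -1 if worse, 0 if tied."""
    # Priorities (a) and (b): a later first breach, or no breach, is better.
    for key in ("upper_week", "lower_week"):
        ta = a.get(key, NO_BREACH)
        tb = b.get(key, NO_BREACH)
        if ta != tb:
            return 1 if ta > tb else -1
        if ta != NO_BREACH:
            return 0  # both breached at the same visit: tie, values ignored
    # Priority (c): week 4 value. ASSUMPTION: a lower value is better.
    if a["k_week4"] == b["k_week4"]:
        return 0
    return 1 if a["k_week4"] < b["k_week4"] else -1


def pco_measures(active, placebo):
    """Compare every active patient with every placebo patient."""
    results = [compare(a, p) for a, p in product(active, placebo)]
    wins, losses, ties = results.count(1), results.count(-1), results.count(0)
    denom = losses + 0.5 * ties
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_ratio": wins / losses if losses else float("inf"),
        "win_odds": (wins + 0.5 * ties) / denom if denom else float("inf"),
        "net_benefit": (wins - losses) / len(results),
    }


# Toy data (illustrative only): three patients per group.
active = [{"k_week4": 4.5}, {"k_week4": 4.9}, {"upper_week": 3}]
placebo = [{"upper_week": 1}, {"upper_week": 3}, {"k_week4": 4.7}]
measures = pco_measures(active, placebo)
print(measures)
```

In this toy example there are 6 wins, 2 losses, and 1 tie among the 9 pairs, so the win ratio is 3.0, the win odds are 2.6, and the net benefit is 4/9. Because ties are split between wins and losses, the win odds lie closer to the null value than the win ratio.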

Responder Analysis
Success of the treatment up to week 4 was evaluated using a binary composite variable (response/nonresponse) under the composite variable strategy. Response was defined as an event where the serum potassium level remains within the target range of 3.8-5.1 mmol/L at week 4 without any threshold breach (outside the acceptable range of 3.8-5.5 mmol/L) throughout the 4 weeks. Nonresponse was defined as an event where either a threshold breach occurred over the 4 weeks or the serum potassium level went outside the target range at week 4. The risk difference was calculated as an effect measure. The test was performed using Fisher's exact test.
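The definition of response and the accompanying test can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the boundaries of the target range are taken as inclusive, a breach is taken as a value <3.8 or ≥5.5 mmol/L, visits that are unobserved after discontinuation are coded as `None`, and the function names are hypothetical. The one-sided Fisher p-value is computed directly from the hypergeometric distribution.

```python
from math import comb

LOWER, UPPER = 3.8, 5.5            # breach if value < 3.8 or >= 5.5
TARGET_LOW, TARGET_HIGH = 3.8, 5.1  # ASSUMPTION: inclusive target range


def is_responder(weekly):
    """weekly: serum potassium at weeks 1-4; None marks unobserved visits."""
    if any(v is None or v < LOWER or v >= UPPER for v in weekly):
        return False  # a breach (or discontinuation) means nonresponse
    return TARGET_LOW <= weekly[-1] <= TARGET_HIGH


def fisher_one_sided(resp_a, n_a, resp_p, n_p):
    """P(at least resp_a responders in the active group | fixed margins)."""
    total, responders = n_a + n_p, resp_a + resp_p
    upper = min(n_a, responders)
    tail = sum(
        comb(responders, x) * comb(total - responders, n_a - x)
        for x in range(resp_a, upper + 1)
    )
    return tail / comb(total, n_a)


# Toy data (illustrative only).
active = [[4.6, 4.7, 4.9, 5.0], [4.4, 4.5, 4.6, 4.6],
          [4.8, 5.0, 5.0, 4.9], [4.6, 4.8, 5.0, 5.3]]
placebo = [[5.0, 5.6, None, None], [4.9, 5.2, 5.5, None],
           [4.7, 5.0, 5.2, 5.4], [4.0, 3.7, None, None]]

resp_a = sum(is_responder(p) for p in active)
resp_p = sum(is_responder(p) for p in placebo)
risk_difference = resp_a / len(active) - resp_p / len(placebo)
p_value = fisher_one_sided(resp_a, len(active), resp_p, len(placebo))
print(risk_difference, p_value)
```

The fourth active patient never breaches the acceptable range but ends at 5.3 mmol/L, outside the target range, and is therefore a nonresponder; this is the kind of information that the dichotomization discards.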

Survival Analysis
Time to the first threshold breach was evaluated using the composite variable strategy. An event for the time-to-event endpoint was defined as a breach of either the upper or lower threshold (<3.8 or ≥5.5 mmol/L) within 4 weeks, whichever came first. A patient who experienced no threshold breach before or at week 4 was treated as censored. The hazard ratio was calculated as an effect measure, and the test was performed using the log-rank test. The difference in restricted mean survival time (RMST) was also calculated as an effect measure, and the test was performed using the approximated Z-test.
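The derivation of the time-to-event variable and of the RMST difference can be sketched as follows. This Python sketch is illustrative only and relies on one simplification that holds in the simulation setting described in this article but not in general: censoring occurs only at week 4, so the RMST truncated at 4 weeks equals the mean of the observed times. The function names are hypothetical.

```python
LOWER, UPPER = 3.8, 5.5  # breach if value < 3.8 or >= 5.5
TAU = 4                  # truncation time for the RMST (weeks)


def time_to_first_breach(weekly):
    """weekly: values at weeks 1-4. Return (week, event_indicator)."""
    for week, value in enumerate(weekly, start=1):
        if value is not None and (value < LOWER or value >= UPPER):
            return week, True
    return len(weekly), False  # no breach: censored at week 4


def rmst(group):
    """RMST up to TAU. ASSUMPTION: censoring happens only at TAU, so the
    area under the Kaplan-Meier curve equals the mean of min(time, TAU)."""
    times = [min(time_to_first_breach(p)[0], TAU) for p in group]
    return sum(times) / len(times)


# Toy data (illustrative only).
active = [[4.6, 4.7, 4.9, 5.0], [4.4, 3.7, None, None],
          [4.8, 5.0, 5.0, 4.9], [4.6, 4.8, 5.0, 5.3]]
placebo = [[5.6, None, None, None], [4.9, 5.5, None, None],
           [4.7, 5.0, 5.2, 5.4], [4.9, 5.2, 5.7, None]]

rmst_difference = rmst(active) - rmst(placebo)
print(rmst(active), rmst(placebo), rmst_difference)
```

With censoring before week 4 (for example, withdrawal for other reasons), the Kaplan-Meier curve would have to be integrated explicitly instead of averaging the times.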

Mixed Model for Repeated Measures
Under the assumption that the subjects who presented a threshold breach would continue treatment, the treatment effect was evaluated following the hypothetical strategy. The variable was the change in serum potassium level at week 4. The MMRM models the mean (expected value) of the variable, that is, it assumes an average trajectory in the scenario where all patients could continue the planned treatment despite a threshold breach. In other words, the trajectory of an individual patient with a threshold breach was not explicitly envisioned. The MMRM employed a model with baseline serum potassium value, categorized treatment, visit and its interaction with treatment, and with an unstructured covariance structure for repeated measurements; the model was estimated using the restricted maximum likelihood approach. For patients who had a threshold breach, serum potassium values up to the time of the threshold breach were used for analysis. The mean difference in the change in potassium values from baseline at week 4 was used as an effect measure. The test was performed using the t-test based on the least squares method.
In the simulation, the missing at random (MAR) assumption held because the value at the time of the threshold breaches was observed. Therefore, not accounting for unmeasured variables, the MMRM produced fairly unbiased estimates.

LOCF-ANCOVA
The variable was the change in serum potassium level at week 4. This method was based on the assumption that the medical condition and laboratory values at the time of threshold breach would remain unchanged thereafter under a hypothetical strategy. The potassium value at week 4 for the patient presenting a threshold breach was imputed by the last observed value. As an effect measure, the mean difference in the change in potassium value from baseline at week 4 was calculated. The test was performed using the t-test based on the least squares method via ANCOVA with the baseline serum potassium value as a covariate.
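The imputation step can be illustrated with a short Python sketch. It shows only the LOCF imputation and an unadjusted mean difference (placebo minus active, a direction chosen here for illustration); the ANCOVA adjustment for baseline is omitted, and the function name and data layout (baseline followed by weeks 1 to 4, with `None` after discontinuation) are assumptions.

```python
def locf_change(series):
    """series: [baseline, week 1, ..., week 4]; None after discontinuation.
    Return the change from baseline at week 4 after LOCF imputation."""
    observed = [v for v in series if v is not None]
    return observed[-1] - series[0]  # last observed value stands in for week 4


# Toy data (illustrative only).
active = [[4.6, 4.6, 4.7, 4.5, 4.6], [4.5, 4.4, 3.7, None, None]]
placebo = [[4.6, 5.0, 5.6, None, None], [4.7, 4.9, 5.1, 5.0, 5.2]]

active_changes = [locf_change(s) for s in active]
placebo_changes = [locf_change(s) for s in placebo]

# Unadjusted mean difference; the analysis in the text uses ANCOVA instead.
mean_difference = (sum(placebo_changes) / len(placebo_changes)
                   - sum(active_changes) / len(active_changes))
print(active_changes, placebo_changes, mean_difference)
```

For a patient with an upper threshold breach, the imputed week 4 value is by construction at or above 5.5 mmol/L, which is one way to see why the assumption of unchanged values after the breach matters for the estimate.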

Last Rank Carried Forward
Although the illustrative trial did not explicitly mention its strategy for addressing the intercurrent event, the LRCF is based on the assumption that ranks for changes in serum potassium level from the baseline at threshold breach would remain unchanged at subsequent visits (O'Brien, Zhang, and Bailey 2005). Under this assumption, for example, when the trajectories follow an increasing trend, the trajectories of individuals with a threshold breach are assumed to follow the same increasing trend. According to the LRCF assumption, ranks for subjects with a threshold breach at the time of threshold breach are carried forward to subsequent visits. Therefore, we regarded the LRCF as a method under the hypothetical strategy that yields the value of the variable for subjects with a threshold breach as if the allocated treatment would continue even after the threshold breach.
A brief description of how the ranks are assigned follows. Patients with a threshold breach are assigned their designated ranks within each week from week 1 to week 4. If a patient has a breach at week 1, the rank is assigned on the basis of the change at week 1, and that rank is reserved and carried forward to week 4; at week 2, the reserved ranks can no longer be used by other patients, and so on. At week 4, ranks are assigned to all patients who reach week 4, excluding the ranks reserved in the previous weeks.
The variable was the change in serum potassium value from the baseline at week 4. To estimate the effect measure, we first defined the associated change, which handles the missing value due to a threshold breach by considering the ranks, as follows. The missing value at week 4 for patients with a threshold breach was interpolated using the observed serum potassium value at week 4 for subjects without a threshold breach, considering the ranks that were carried forward. When the missing value could not be interpolated because there was no subject with a more extreme rank, the maximum observed potassium value in the trial was imputed for subjects with an upper threshold breach, and the minimum observed potassium value was imputed for those with a lower threshold breach. As an effect measure, the median difference in the associated change in potassium value from the baseline at week 4 was calculated using the Hodges-Lehmann estimator. The test was performed using Wilcoxon's rank sum test.

Linear Extrapolation
The treatment effect under the assumption that the linear relationship between the change in serum potassium from the baseline and time would remain unchanged even after a threshold breach was evaluated under the hypothetical strategy. The variable was the change in the serum potassium value from the baseline at week 4. The missing potassium value at week 4 for the patients who discontinued treatment was linearly extrapolated. As an effect measure, the median difference in the change in the potassium value from the baseline at week 4 via Hodges-Lehmann estimator was calculated; Wilcoxon's rank sum test was used.
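The extrapolation and the Hodges-Lehmann estimator can be sketched in Python as follows. The text does not specify how the line is fitted, so this sketch assumes, purely for illustration, an ordinary least-squares line through each patient's observed (week, change from baseline) points, evaluated at week 4; the function names and data layout are hypothetical.

```python
from statistics import median


def extrapolated_change(series, target_week=4):
    """series: [baseline, week 1, ..., week 4]; None after discontinuation.
    ASSUMPTION: a per-patient least-squares line through the observed
    (week, change) points is extended to the target week."""
    baseline = series[0]
    if series[target_week] is not None:
        return series[target_week] - baseline  # observed: nothing to impute
    points = [(w, v - baseline) for w, v in enumerate(series) if v is not None]
    n = len(points)
    mean_w = sum(w for w, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    sxx = sum((w - mean_w) ** 2 for w, _ in points)
    sxy = sum((w - mean_w) * (c - mean_c) for w, c in points)
    slope = sxy / sxx
    return mean_c + slope * (target_week - mean_w)


def hodges_lehmann(x, y):
    """Two-sample Hodges-Lehmann estimate: median of all differences x_i - y_j."""
    return median(xi - yj for xi in x for yj in y)


# Toy data (illustrative only).
placebo = [[4.6, 5.1, 5.6, None, None], [4.6, 4.8, 5.0, 5.2, 5.1],
           [4.5, 4.7, 4.9, 5.0, 5.2]]
active = [[4.6, 4.6, 4.5, 4.6, 4.6], [4.5, 4.5, 4.6, 4.6, 4.6],
          [4.7, 4.6, 4.6, 4.5, 4.5]]

placebo_changes = [extrapolated_change(s) for s in placebo]
active_changes = [extrapolated_change(s) for s in active]
hl_difference = hodges_lehmann(placebo_changes, active_changes)
print(placebo_changes, active_changes, hl_difference)
```

The first placebo patient rises by 0.5 mmol/L per week before breaching at week 2, so the extrapolated change at week 4 is about 2.0 mmol/L, a value well beyond anything that would be observed under continued follow-up in the acceptable range.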

Trial Design
With reference to the randomization phase of the OPAL study, we considered a randomized controlled study comparing an active drug (A) with a placebo (P) for maintenance of longitudinal outcomes (potassium). Potassium data at five time points were available: at randomization (baseline; 0) and at 1, 2, 3, and 4 weeks after randomization. The serum potassium level at week 4 was of primary interest. Treatment discontinuation due to threshold breaches was considered the intercurrent event. The upper and lower thresholds of serum potassium for treatment discontinuation were set to 5.5 and 3.8 mmol/L, respectively, which were the same as those in the OPAL study. No intercurrent event other than treatment discontinuation due to threshold breaches was considered.

Data Generation (Base Case Scenarios)
Similar to the simulation studies for the design of the OPAL study, the multivariate normal distribution was used. Scenarios of mean serum potassium levels for each group were defined by a combination of trajectories. The correlation structure used in the design of the OPAL study was adopted. Measurements after treatment discontinuation due to the threshold breaches were treated as missing monotonically. The missing data mechanism was MAR because the observed value at the threshold breach determined the missingness. The following three trajectories of mean serum potassium levels were considered to construct the simulation scenarios:
• Trajectory 1 (T1): early return to high serum potassium level (4.6, 4.9, 5.1, 5.1, and 5.1 mmol/L)
• Trajectory 2 (T2): late return to high serum potassium level (4.6, 4.6, 4.6, 4.6, and 5.1 mmol/L)
• Trajectory 3 (T3): retention of normal serum potassium level (4.6, 4.6, 4.6, 4.6, and 4.6 mmol/L)
T1 and T3 were based on the OPAL study, and T2 was based on a simulation study by Siddiqui, Hung, and O'Neill (2009).
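The data-generating mechanism can be sketched in Python as follows (the analyses in this article were performed in SAS). The trajectories and thresholds are those given above; the marginal standard deviation (0.45) and the correlation structure (correlation 0.7 raised to the power of the visit distance) are placeholder assumptions, because the actual values are given only in Supplementary Table 1. The sketch also assumes that threshold breaches are checked at the post-baseline visits only.

```python
import math
import random

TRAJECTORIES = {
    "T1": [4.6, 4.9, 5.1, 5.1, 5.1],  # early return to high level
    "T2": [4.6, 4.6, 4.6, 4.6, 5.1],  # late return to high level
    "T3": [4.6, 4.6, 4.6, 4.6, 4.6],  # retention of normal level
}
LOWER, UPPER = 3.8, 5.5
SD, RHO = 0.45, 0.7  # ASSUMED placeholders, not the values used in the study

# ASSUMED covariance: correlation decays as the time interval increases.
COV = [[SD * SD * RHO ** abs(i - j) for j in range(5)] for i in range(5)]


def cholesky(a):
    """Lower-triangular factor L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


CHOL = cholesky(COV)


def censor_after_breach(values):
    """Set all visits after the first post-baseline breach to None."""
    out = list(values)
    for week in range(1, len(values)):
        if values[week] < LOWER or values[week] >= UPPER:
            for later in range(week + 1, len(values)):
                out[later] = None  # monotone missingness after the breach
            break
    return out


def simulate_patient(trajectory, rng):
    """Draw one multivariate normal profile and apply discontinuation."""
    z = [rng.gauss(0.0, 1.0) for _ in range(5)]
    mean = TRAJECTORIES[trajectory]
    complete = [mean[i] + sum(CHOL[i][j] * z[j] for j in range(i + 1))
                for i in range(5)]
    return censor_after_breach(complete)


rng = random.Random(2024)
placebo = [simulate_patient("T1", rng) for _ in range(20)]
active = [simulate_patient("T3", rng) for _ in range(20)]
```

Because the value at the breach visit itself is retained and only later visits are set to missing, missingness depends on observed data alone, which is the sense in which the mechanism is MAR.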
The following four major scenarios (Figure 1) were examined in the simulation studies:
• Scenario 1: no effect at all time-points (the same trajectory in both groups)
• Scenario 2: no effect at week 4, but effective at weeks 1, 2, and 3 (T2 vs. T1)
• Scenario 3: no effect at weeks 1, 2, and 3, but effective at week 4 (T3 vs. T2)
• Scenario 4: effective at all time-points (T3 vs. T1)
The correlation structure used in the design of the OPAL study was also used for this simulation study, in which the correlation decreased as the time interval increased (Supplementary Table 1). With reference to the parameters of the OPAL study, the mean difference in the change in serum potassium at 4 weeks was 0.5 mmol/L, and the standard deviation for the change in serum potassium at 4 weeks was 0.55 mmol/L in both groups. Under these parameters, the sample size was set to 20 subjects in each group, which gives a power of 80% in a two-sample t-test. The number of simulations was 100,000 for each scenario. All analyses were performed using SAS 9.4.
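The stated power can be checked approximately with a small Monte Carlo experiment. The Python sketch below is illustrative only: it simulates complete data (no threshold breaches), uses a pooled-variance two-sample t-test at the one-sided 2.5% level, and hard-codes the critical value 2.0244, the upper 2.5% point of the t distribution with 38 degrees of freedom. The number of replications and the seed are arbitrary choices.

```python
import math
import random

T_CRIT = 2.0244  # upper 2.5% point of the t distribution with 38 df


def t_statistic(x, y):
    """Pooled-variance two-sample t statistic for mean(x) - mean(y)."""
    n_x, n_y = len(x), len(y)
    mean_x, mean_y = sum(x) / n_x, sum(y) / n_y
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    pooled_var = (ss_x + ss_y) / (n_x + n_y - 2)
    return (mean_x - mean_y) / math.sqrt(pooled_var * (1 / n_x + 1 / n_y))


def simulated_power(n=20, diff=0.5, sd=0.55, reps=20000, seed=1):
    """Proportion of simulated trials in which the t-test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        placebo = [rng.gauss(diff, sd) for _ in range(n)]
        active = [rng.gauss(0.0, sd) for _ in range(n)]
        if t_statistic(placebo, active) > T_CRIT:
            rejections += 1
    return rejections / reps


power = simulated_power()
print(power)
```

The estimate should fall close to 80%, up to Monte Carlo error of a few tenths of a percentage point at this number of replications.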

Other Scenarios to Evaluate the Impact of the Proportions of the Intercurrent Event
To evaluate the impact of the proportions of the threshold breaches, the trajectories of the mean values in both groups were lowered or raised at the same time without changing the differences in the mean values between the groups. The shift in the mean values for the lowered (raised) trajectory was set to −0.2 (+0.2) at all time-points. Hereafter, the original setting is referred to as S(±0), and each shift is referred to as S(−0.2) and S(+0.2). In addition, to evaluate the impact of the intraindividual variation on the results, the same examination was performed with a weaker correlation structure (Supplementary Table 1).

Performance Measures
Type I error rate, power, and bias were evaluated in the simulation studies for all seven statistical analysis methods described in Section 3.2. Type I error rate and power were calculated as the rejection proportion. The one-sided significance level was 2.5%. In bias evaluation, the mean was used for difference measures and the median for ratio measures. Under the hypothetical strategy where the treatment would continue after the threshold breaches, true differences in the serum potassium value at week 4 were 0 for Scenarios 1 and 2 and 0.5 for Scenarios 3 and 4. Under the composite variable strategy, the true value was known only in Scenario 1; the win ratio, win odds, and hazard ratio were 1, and the net benefit and risk difference were 0. In Scenarios 2-4, the true value was not known because the data were generated without considering the true value under the composite variable strategy.
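These summaries over simulation replicates can be sketched in Python as follows. The function names are hypothetical, a strict inequality is assumed for rejection, and reporting the bias of a ratio measure as the median estimate minus the true value is an assumption of this sketch; the text specifies only that the median was used for ratio measures.

```python
from statistics import mean, median


def rejection_proportion(p_values, alpha=0.025):
    """Share of replicates rejecting at the one-sided level alpha.
    This is the Type I error rate under the null and the power otherwise."""
    return sum(p < alpha for p in p_values) / len(p_values)


def bias(estimates, truth, ratio_scale=False):
    """Mean-based bias for difference measures, median-based for ratios."""
    center = median(estimates) if ratio_scale else mean(estimates)
    return center - truth


# Toy replicate results (illustrative only, not simulation output).
p_values = [0.001, 0.030, 0.020, 0.400]
mean_differences = [0.45, 0.55, 0.50]   # true value 0.5 in Scenarios 3 and 4
win_ratios = [0.9, 1.0, 1.3]            # true value 1 in Scenario 1

print(rejection_proportion(p_values))
print(bias(mean_differences, truth=0.5))
print(bias(win_ratios, truth=1.0, ratio_scale=True))
```

Using the median for ratio measures avoids the distortion that a few very large ratios (or an infinite win ratio when there are no losses) would cause in a mean.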
Additionally, the cumulative proportion of threshold breaches until each visit was calculated over simulations in each scenario to evaluate the impact of the proportion on both the rejection proportion and bias of each method.

Summary of Threshold Breaches
Cumulative proportions of the threshold breaches in each trajectory with the standard correlation structure are presented in Figure 2. Although the proportions of both upper and lower threshold breaches were slightly higher in the weak correlation structure, the overall trend was not different (data not shown).
A substantial proportion of lower threshold breaches were observed only in T2 with S(−0.2) and T3 with S(−0.2). Therefore, the use of S(−0.2) in Scenarios 2 (T2 vs. T1) and 4 (T3 vs. T1) was associated with a higher proportion of lower threshold breaches in the active drug group than in the placebo group. Using S(−0.2) in Scenario 3 (T3 vs. T2), the lower threshold breach proportions did not differ between the groups; however, the upper threshold breach proportion was higher in the placebo group. In scenarios using any combination of trajectories in S(±0) and S(+0.2), most of the breaches were upper threshold breaches. The proportion of threshold breaches was greater in the placebo group in all scenarios but Scenario 1 and greater in S(+0.2) than in S(±0).

Operating Characteristics of Various Analysis Methods
The rejection proportions and estimates in the standard correlation structure are summarized in Tables 2-5; those in the weak correlation structure were similar and are presented in Supplementary Tables 2-5. The simulation results for the scenarios are summarized as follows:
Scenario 1: No Effect at all Time-points (Table 2)
The proportions of threshold breaches in both groups were equal because the trajectory of each group was the same. The Type I error rates in all analysis methods were maintained in all shifts; however, those in the responder analysis were conservative in all shifts, those in the RMST analysis were conservative in S(−0.2), and those in the MMRM became more conservative with increasing shift. No bias was found for any effect measure.
Scenario 2: No Effect at Week 4, but Effective at Weeks 1, 2, and 3 (Table 3)
In S(−0.2), the proportion of lower threshold breaches was higher in the active drug group; however, the threshold breach proportions were similar in both groups when the upper and lower threshold breaches were not distinguished. In S(±0) and S(+0.2), almost all threshold breaches were upper threshold breaches, the proportion of which was higher in the placebo group, and the proportion of threshold breaches in each trajectory was greater in S(+0.2) than in S(±0).
In this scenario, it is possible to consider two purposes of treatment evaluation: (1) both the serum potassium value at the time of primary evaluation (week 4) and the occurrence of intercurrent event (threshold breach) are of interest, and (2) only the serum potassium value at the time of primary evaluation is of interest.
For the first purpose, the PCO, responder analysis, and survival analysis can be evaluated under the composite variable strategy. In this situation, the rejection proportion can be interpreted as the power. The methods in decreasing order of their power were survival analysis (RMST) > survival analysis (HR) > PCO > responder analysis. The power was less than 80% for the RMST in S(+0.2) and almost zero for the responder analysis in all shifts. The power of the PCO and survival analysis in S(+0.2) was greater than that in S(±0). The risk difference was almost zero in all shifts. In S(−0.2), all estimates for PCO favored the placebo group. The win odds were slightly closer to the null value than the win ratio.
When considering the second purpose, the Type I error rate and bias were evaluated under the hypothetical strategy. Therefore, the MMRM, LOCF, LRCF, and linear extrapolation can be evaluated. The Type I error rate was maintained in the MMRM, but inflated in the LOCF, LRCF, and linear extrapolation. The order of magnitude of the inflation of Type I error rate was LOCF<LRCF≈linear extrapolation, whereas that of the bias of the estimates in favor of the active drug was LOCF<LRCF<linear extrapolation. As the shifts increased, the Type I error rate and bias of these three estimators increased.
Scenario 3: No Effect at Weeks 1, 2, and 3, but Effective at Week 4 (Table 4)
Although the proportion of upper threshold breaches at week 4 was higher in the placebo group, there was no difference in the occurrence of threshold breaches between the groups before week 4.
For the composite variable strategy, the methods in decreasing order of their power were PCO>responder analysis>survival analysis (HR)>survival analysis (RMST); the maximum powers of the PCO, responder analysis, survival analysis (HR) and survival analysis (RMST) among the three shifts were 91%, 73%, 39%, and 2%, respectively. The powers of the responder and survival analyses were lower in S(−0.2) and higher in S(+0.2). The PCO in S(+0.2) exhibited lower power, and the estimates were more attenuated than in S(±0). The win odds were slightly closer to the null value than the win ratio.
Among the analysis methods under the hypothetical strategy, the power of the MMRM was the highest. For the LOCF, LRCF, and linear extrapolation, the power was approximately 90%. The power of all methods except the LOCF was lower in S(−0.2) and S(+0.2) than in S(±0). There was no bias in the MMRM estimates. There was a bias of 0.02-0.05 toward the null in the LOCF, LRCF, and linear extrapolation, and the bias was exaggerated particularly in S(−0.2) and S(+0.2).
Scenario 4: Effective at all time-points (Table 5)
The proportion of threshold breaches in the placebo group was higher throughout all shifts. In S(−0.2), the proportion of lower threshold breaches was higher in the active drug group, and the proportion of upper threshold breaches was higher in the placebo group. In S(±0) and S(+0.2), the proportion of upper threshold breaches was higher in the placebo group; there were almost no lower threshold breaches.
Under the composite variable strategy, the order of the powers from the highest to the lowest was PCO>responder analysis>survival analysis in most cases. Particularly in S(−0.2), the powers of the PCO, responder analysis, survival analysis (HR), and survival analysis (RMST) were merely 71%, 22%, 6%, and 3%, respectively. The power decrease from S(+0.2) to S(±0) for the PCO was 3%, while those for the responder analysis, survival analysis (HR), and survival analysis (RMST) were 12%, 31%, and 34%, respectively. The power decrease from S(±0) to S(−0.2) for the PCO was 21%, whereas those for the responder analysis, survival analysis (HR), and survival analysis (RMST) were 44%, 47%, and 35%, respectively. All estimates in S(−0.2) were attenuated compared to those in S(±0) and S(+0.2), and the magnitude of attenuation was greater in the responder analysis and survival analysis. The win odds were slightly closer to the null value than the win ratio.
Among the analysis methods under the hypothetical strategy, the LOCF exhibited the highest power of approximately 98%. The MMRM, LRCF, and linear extrapolation exhibited a power greater than 90%. There was no bias in the MMRM estimates, whereas the others showed a bias in the direction of overestimation. The magnitude of the bias was LOCF<LRCF<linear extrapolation. Bias in the LRCF and linear extrapolation were

Discussion
This study evaluated the statistical methods for the evaluation of maintenance treatments through continuous longitudinal outcomes in situations that involved treatment discontinuations due to threshold breach as an intercurrent event. We described the estimands and proposed a definition for the PCO. By clarifying strategies and considering the intercurrent event due to threshold breach within the framework of ICH-E9(R1), we organized all attributes of estimands, assumptions, and effect measures and conducted simulation studies to evaluate the operating characteristics of all analysis methods. In this study, we focused on the composite variable strategy and hypothetical strategy for considering intercurrent events.
In the composite variable strategy, the PCO, responder analysis, and survival analysis considered different types of information: types of threshold breaches, time to threshold breaches, the serum potassium value at the final time point, and the priority of the aforementioned information. Therefore, their clinical meanings are different. The PCO was defined by the information on the type, time, and continuous potassium value, with the priority (clinical importance) of multiple components. In contrast, the responder analysis only considered response/nonresponse, and the survival analysis only considered the time to threshold breaches. The conservativeness of the Type I error rates was due to the small sample size for the responder analysis (Crans and Shuster 2008) and to low event rates for the RMST, especially in S(−0.2) (less than 20% in T1 with S(−0.2) in Figure 2). Additional simulations with 40 or 200 subjects for each group showed that the conservativeness in the RMST decreased as the number of subjects increased (data not shown). The power of the PCO was the highest among these methods because the continuous variable as one component of the PCO presented the most information. The power of the responder analysis was low because of dichotomization of the information and the conservativeness of Fisher's exact test under a small sample situation (Crans and Shuster 2008). The lower power of the survival analysis compared to that of the responder analysis in Scenarios 3 and 4 may be partly because the survival analysis uses only the acceptable range, which is wider than the target range, making it difficult to capture the difference between the groups; in contrast, the responder analysis uses the target range at week 4. The magnitude of the proportion of threshold breaches influenced the power. Particularly in S(−0.2), a decrease in the power with an increase in the proportion of lower threshold breaches for the active drug was confirmed.
The decrease in the power of the PCO was the smallest because the PCO, which can differentiate the type of threshold breach, alleviated the adverse influence of the lower threshold breaches. In addition, the decrease in power of the PCO, as observed in S(+0.2) in Scenario 3 compared to S(±0) in the same scenario, is attributed to loss of information by the increase in threshold breaches or use of less information with respect to the continuous variable in defining the PCO. The estimates of effect measures for the PCO favored the placebo group in S(−0.2) in Scenario 2 because the PCO detected the excessive decrease in serum potassium level, which was included as the lower-priority component. The win odds were only slightly closer to the null than the win ratio because the proportions of ties due to threshold breaches were small (1-10%). The win ratio (Pocock et al. 2012), win odds (Brunner, Vandemeulebroecke, and Mütze 2021), and net benefit (Buyse 2010), which have been proposed as effect measures for the PCO, are being used as a primary analysis in the cardiovascular field (Maurer et al. 2018); however, it appears that they have not yet been generally accepted (Ferreira et al. 2020). If the definitions of the PCO and its effect measures are accepted in a particular clinical setting, the PCO is recommended for maximizing power and reducing patient burden. Other methods with lower power should be used as supplementary analyses to interpret the magnitude of the estimates clinically, because they capture different facets of clinical meaning.
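The relationship among the three effect measures can be illustrated with a minimal sketch. It assumes that each patient's prioritized outcome has already been reduced to a single ordinal score (higher is better); the scores, function names, and data are hypothetical and do not reproduce the PCO definition or the simulated data of this study.

```python
from itertools import product

def pairwise_counts(active, placebo):
    """Compare every active patient with every placebo patient.

    Returns (wins, losses, ties) from the active arm's perspective.
    The scores are hypothetical ordinal summaries; higher is better.
    """
    wins = losses = ties = 0
    for a, p in product(active, placebo):
        if a > p:
            wins += 1
        elif a < p:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties

def effect_measures(wins, losses, ties):
    """Win ratio, win odds, and net benefit from pairwise counts."""
    total = wins + losses + ties
    win_ratio = wins / losses                       # ties ignored
    win_odds = (wins + 0.5 * ties) / (losses + 0.5 * ties)  # ties split evenly
    net_benefit = (wins - losses) / total           # difference in proportions
    return win_ratio, win_odds, net_benefit

# Hypothetical scores for four patients per arm
w, l, t = pairwise_counts([3, 3, 2, 1], [3, 2, 1, 1])
wr, wo, nb = effect_measures(w, l, t)
print(w, l, t, round(wr, 3), round(wo, 3), nb)
```

Because half of each tie is added to both the numerator and the denominator, the win odds lie between the null value of 1 and the win ratio whenever ties are present, and the two measures converge as the proportion of ties shrinks.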
In some scenarios, the RMST differences were inconsistent with the HRs. This is because the RMST does not reflect the event at the final time point in the area under the survival curve. In Scenario 2, the RMST differences were optimistic because the disadvantages of the active drug at the final time point were disregarded. In Scenario 3, the treatment effect on the RMST difference was not shown because the difference in threshold breaches at the final time point between the groups was not taken into account. In Scenario 4, the RMST differences were conservative because the advantage of the active drug at the final time point was not considered. When events within a limited time period are assessed by the survival analysis, especially without the proportional hazards assumption, the RMST difference may be used as an effect measure. However, it is hazardous to adopt the RMST difference without adequate consideration of the impact of the events at the final time point. Generalized pairwise comparison (Buyse 2010) and the PCO may be alternatives for evaluation in that case.
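The insensitivity of the RMST to events at the final time point can be seen in a minimal sketch. It assumes no censoring, so that the RMST up to the truncation time equals the mean of min(T, tau); the data, the truncation time of week 4, and the function names are hypothetical and are not taken from the simulation settings.

```python
import math

def rmst_no_censoring(event_times, tau):
    """RMST on [0, tau] as the mean of min(T, tau).

    Valid only without censoring (an assumption of this sketch);
    math.inf marks a subject with no event.
    """
    return sum(min(t, tau) for t in event_times) / len(event_times)

def event_proportion(event_times, tau):
    """Proportion of subjects with an event at or before tau."""
    return sum(t <= tau for t in event_times) / len(event_times)

TAU = 4  # hypothetical final scheduled visit (week 4)

# Hypothetical arms: no breaches under the active drug,
# five of ten placebo patients breach exactly at the final visit.
active = [math.inf] * 10
placebo = [4] * 5 + [math.inf] * 5

rmst_diff = rmst_no_censoring(active, TAU) - rmst_no_censoring(placebo, TAU)
prop_diff = event_proportion(active, TAU) - event_proportion(placebo, TAU)
print(rmst_diff, prop_diff)
```

In this toy example the event proportions differ by 50 percentage points while the RMST difference is zero, because a drop in the survival curve at the truncation time adds no area under the curve.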
For the hypothetical strategy, we considered the MMRM, which is the most common method (Fletcher, Tsuchiya, and Mehrotra 2017); LOCF; LRCF, which was employed in the illustrative trial; and linear extrapolation. The four methods have a common estimand, where patients would continue their treatments even after threshold breaches. However, the assumptions for dealing with intercurrent events are different; the MMRM relies on the MAR assumption without an explicit assumption for the transition after the intercurrent event, whereas linear extrapolation, the LRCF, and the LOCF explicitly specify the transition. Because the MAR assumption can hold in the considered situation, the MMRM estimates were unbiased, the Type I error rate by the MMRM was well controlled, and the power was interpretable (Siddiqui, Hung, and O'Neill 2009). Type I error rates of the MMRM became conservative with increasing shift because of the small sample size (Ukyo et al. 2019). In contrast, bias was observed in the other three methods. Molnar et al. (2009) reported that the direction of bias due to the LOCF depends on the natural course of the disease, the proportion of threshold breaches, and the difference between groups. In this study, we found that the degree and direction of biases were based on the following factors: the magnitude of deviation from the shape of the trajectories, the direction of deviation, and the magnitude of variance. In Scenario 4, the methods in decreasing order of bias were linear extrapolation, which assumed a linear transition; the LRCF, in which the relative change continued; and the LOCF, in which the value at the time of discontinuation continued. This order arose because the true shapes of the trajectories of the serum potassium value were set to be steady in the middle of the study.
As for the direction, the estimated treatment effect according to the LOCF, LRCF, and linear extrapolation approaches was exaggerated in Scenarios 2 and 4 because the differences in the potassium values between the groups at the midpoint were added to the treatment effect at the final time point; it was underestimated in Scenario 3 because the same amount of impact on the estimator from the same transition in both groups at the midpoint diluted the treatment effect at the scheduled final time point. Although the LRCF and linear extrapolation assumptions did not hold, the powers of both estimators were similar because the impact of the difference in assumptions on the rank endpoints was similar when using the nonparametric test. The shift did not substantially affect the power of any analysis method under the hypothetical strategy. However, only in S(+0.2) in Scenario 4, a slight decrease in power was observed in the MMRM due to an increase in the cumulative proportions of threshold breaches. According to the above observations, the MMRM was determined to be the only appropriate method to estimate, without bias, the difference in serum potassium value between groups that would be observed if the treatment continued until the scheduled final time point regardless of threshold breach.
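How explicit transition assumptions can generate bias when the true trajectory flattens can be sketched for a single patient. The sketch covers only the LOCF and a linear extrapolation that is assumed here to extend the line through the last two observed values; the exact extrapolation rule used in the simulation is not restated in this section, the LRCF is omitted because it operates on ranks across patients, and all values and function names are hypothetical.

```python
def locf(observed, n_visits):
    """Carry the last observed value forward to the final visit."""
    return list(observed) + [observed[-1]] * (n_visits - len(observed))

def linear_extrapolation(observed, n_visits):
    """Extend the line through the last two observed values.

    Assumption of this sketch: the slope is taken from the last two
    visits; with a single observation it falls back to LOCF.
    """
    if len(observed) < 2:
        return locf(observed, n_visits)
    slope = observed[-1] - observed[-2]
    filled = list(observed)
    while len(filled) < n_visits:
        filled.append(filled[-1] + slope)
    return filled

# Hypothetical patient: baseline 5.5 mmol/L, week 1 value 5.0 mmol/L,
# then discontinuation; five scheduled visits (baseline and weeks 1-4).
observed = [5.5, 5.0]
print(locf(observed, 5))                  # value held at 5.0
print(linear_extrapolation(observed, 5))  # keeps falling by 0.5 per visit
```

If the true, unobserved trajectory were to plateau near 5.0 mmol/L after week 1, the linear extrapolation in this noise-free toy case would overstate the decrease at the final visit by 1.5 mmol/L, which is the kind of mechanism that can make a linear transition assumption exaggerate a treatment effect.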
The choice of estimand or strategies depends on the clinical consideration of upper and lower threshold breaches. A hypothetical strategy may be employed when the aspect of interest is the ability of the active drug to decrease the serum potassium level. Threshold breaches may not need to be actively considered in this strategy. Among the effect measures under the hypothetical strategy, only the MMRM can be used without bias because we considered the situation where threshold breaches satisfied the MAR assumption. In contrast, the threshold breach may be actively considered for obtaining a comprehensive comparison, wherein the information on treatment discontinuation is included. In this case, the composite variable strategy can be used, and the PCO, in particular, has the advantage of distinguishing the type of threshold breaches, unlike other methods under the composite variable strategy. In addition, the performance of the PCO was best because the PCO can use the information from continuous variables without loss of information by dichotomization. The PCO and MMRM, which can maximize efficiency and reduce patient burden, are recommended for use if the clinical meaningfulness of these endpoints is accepted. However, the power was affected by the proportion of threshold breaches in both methods, which indicates the importance of performing simulations in consideration of threshold breaches. Our simulation was not a pure comparison between estimands; rather, we evaluated the operating characteristics of combinations of estimands and their aligned estimators that can be used in real settings. This is because the target range for the responder analysis and the acceptable range for both the PCO and survival analysis were different under the composite variable strategy, and the assumption of distributions and the adjustment of covariates were different among the methods under the hypothetical strategy.
This study has certain limitations. First, this research focused on a situation where only a single intercurrent event occurred. Even in a situation where there is only one intercurrent event of interest, such as a threshold breach, there is room for deliberation regarding an approach for considering the threshold breach. Second, we used our arbitrary definitions for endpoints using the composite variable strategy, where we combined two different types of information: the serum potassium level within the target range based on clinical considerations and threshold breaches over the acceptable range for considering treatment discontinuation. An approach to reflect both types of information in the definition of composite endpoints should be further scrutinized on a case-by-case basis with clinicians. Third, we did not evaluate the variance of the estimate in the simulation. In actual clinical trials, we can use bootstrap variance for all estimators, but in our simulation, we prioritized examining more simulation settings over incurring the high computational burden of bootstrapping.
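As a rough indication of what the bootstrap variance mentioned above involves, the following sketch resamples patients with replacement within each arm and takes the sample variance of the re-estimated effect. The estimator (a difference in means), the data, the number of replicates, and all names are hypothetical choices for illustration only; any of the estimators discussed in this study could be passed in its place.

```python
import random
import statistics

def mean_difference(active, placebo):
    """Hypothetical estimator: difference in arm means."""
    return statistics.mean(active) - statistics.mean(placebo)

def bootstrap_variance(active, placebo, estimator, n_boot=1000, seed=0):
    """Nonparametric bootstrap variance of a two-arm estimator.

    Patients are resampled with replacement within each arm, the
    estimator is recomputed, and the sample variance of the
    replicates is returned.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    replicates = []
    for _ in range(n_boot):
        a = rng.choices(active, k=len(active))
        p = rng.choices(placebo, k=len(placebo))
        replicates.append(estimator(a, p))
    return statistics.variance(replicates)

# Hypothetical week-4 serum potassium values (mmol/L)
active = [4.6, 4.9, 4.4, 5.0, 4.7, 4.5]
placebo = [5.2, 5.4, 4.9, 5.6, 5.1, 5.3]
var_hat = bootstrap_variance(active, placebo, mean_difference)
print(var_hat)
```

Each simulated trial would require its own set of bootstrap replicates, so the cost multiplies with the number of simulation runs and settings, which is the computational burden referred to above.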
In conclusion, the PCO under the composite variable strategy and the MMRM under the hypothetical strategy were suggested as candidates for the primary analysis in clinical trials that evaluate maintenance treatments with threshold breaches as an intercurrent event. Because the MMRM and the PCO provide different types of information as results, employing them as supplementary analyses is also relevant. The MMRM can provide useful information on the transition of the therapeutic effect at each time point. The treatment effect for the PCO is described by the win ratio, the win odds, and the net benefit, but the clinical meaningfulness of these effect measures is not yet widely established; further accumulation of experience with the PCO is important.

Supplementary Materials
The supplementary materials contain the correlation structure for data generation in the simulation (Supplementary Table 1) and the simulation results under the weak correlation structure (Supplementary Tables 2-5).