Analysis of survival outcomes using likelihood ratio test in trials incorporating patient's treatment choice

Methods for designing and analyzing multiple arms survival trials that incorporate patient's treatment choice are needed. In these trials, patients are randomized into two groups, random and choice. Participants in the choice group choose their treatment, which is not a current standard practice in randomized clinical trials. In this paper, we propose a new method based on the likelihood function to design and analyze these trials with time to event outcomes in the presence of non-informative right censoring. We use simulations to evaluate the methods for Weibull outcomes, complete and censored. Finally, we provide an illustration for designing a study in which we discuss some design considerations and demonstrate the methods.


Introduction
In some studies, time from treatment or diagnosis to an important clinical event is of interest.One complicating factor in analyzing these time to event outcomes is censoring which happens when the actual time to event is not observed.Right-censoring is the most commonly encountered type of censoring in clinical trials.It can occur when a participant starts at enrollment but terminates before the end of the study without observing the event of interest either due to withdrawal from the study or end of study participation.Another form of right censoring occurs when not all participants join the study at the same time and will consequently have different follow-up times.Participants with the shorter follow-up time who do not experience the outcome of interest by the end of the study are rightcensored.Data collected from individuals who are lost to follow-up or never experienced the event of interest cannot be ignored as we know their survival time is at least as long as the period documented.In this paper, we focus on random right-censored observations.Randomized clinical trials (RCTs) are considered the gold standard for clinical trials because they minimize factors that contribute to outcome variability, selection bias and biased ascertainment of outcomes [10].Some of the challenges facing RCT are patient's recruitment, retention and compliance, particularly since the informed consent procedure requires the treating physician to inform the patient about the risks and benefits associated with the trial, the alternative therapy available and the right to withdraw.Clinical trial designs that incorporate patient's treatment choices have been gaining popularity over the last few decades.These types of design allow the researcher to estimate a treatment preference effect (measure the extent to which participant choice modifies the treatment effect) which can provide additional insights into factors of the study beyond that of the treatment alone and emulate real-life situations where the patient's opinion is taken into consideration when deciding on treatment regimes.
One of the reasons for the recent increase in the use of designs that incorporate participant's treatment choice is the creation of Patient-Centered Outcomes Research Institute [11] through the Patient Protection and Affordable Care Act of 2010, more funds are being allocated to studies that center on patients.But, this design has been used even before 2010.In 1996, Torgerson et al. [14] ran a study using this design to look at the effect of an exercise program vs the standard treatment to reduce disability in patients with sub-acute back pain.Probably one of the more cited papers that focused on patient's treatment choice is by Janevic et al. [4].This paper looked at the role of choice in health education.The most recent study that we found using a design that allows for the incorporation of patient's treatment choice is by Molt et al. [8].In this latter paper, the investigators look at exercise programs to improve walking in patients with multiple sclerosis.
The advantage of such designs is that they allow the investigator an opportunity to assess the effect of choice on the study outcome.Consequently, in this type of design, the main focus shifts away from testing for the main treatment effect to testing for preference effect, defined as the interaction between treatment and choice.Otherwise, results of analyses for this design focusing only on the treatment effect but ignoring the choice effect may be misleading.To illustrate, if the treatment effect in the choice group has the opposite direction of the treatment effect of the random group, then the result may show no treatment effect.Another case may be that the treatment effect for the two groups is in the same direction but one group has larger treatment effect than the other.These types of information are precisely why this trial design focuses on preference (or interaction) effect.
Clinical trial designs that incorporate patient's treatment choice usually randomize patients into two groups, random and choice.Participants in the random group are randomized again between treatments following standard RCT, while participants in the choice group choose their treatments.Currently, there are available methods based on the ANOVA approach to analyze this type of design with Normal outcomes [12,16], binary [1] and count [13].Also, Long et al. [6] suggested methods based on the likelihood, but did not extend it for survival outcomes with censoring.
To date there has been no publications that looked at this type of design for survival data.So, we propose a novel approach based on likelihood method to analyzing clinical trials that account for patient's treatment choice for time to event outcomes with censoring.We focus on the Weibull distribution which accommodates decreasing, constant and increasing hazard (Figure 1).We evaluate the performance of this method via simulations with respect to simulated type I error and power.In the complete data scenario, we compare the proposed method to ANOVA which is currently the method being used in practice.Finally, we illustrate how to use the proposed method to design a study and discuss some design considerations that will provide guidance to researchers.

Study design and model assumptions
The following methods can be applied to studies with multiple treatments.For convenience, we assume two treatments A and B. We denote random variables and their observed values with upper-and lower-case letters, respectively.Assume N participants are randomized into choice and random groups with θN in the choice group and (1 − θ)N in the random group, where 0 < θ < 1 is set a priori.φ, unknown, is the the proportion of participants in the choice arm who chose treatment A versus treatment B. Let T and C be two indicator variables for treatment received and treatment selected, respectively.R is a group randomization indicator for the arm of the study the participant was randomized into, r i = 1 for the random group and zero otherwise.
We are making the assumption that all the participants in the choice group will make a choice and receive the treatment of their choice.We are also assuming that proportion of patients selecting a treatment is constant across all participants, i.e P(c i = 1) = φ.Let Y represent time to the event of interest and S the censoring indicator with s i = 1 for censored patient and zero otherwise.We are assuming that we have non-informative censoring, that is, the censoring mechanism does not contain information about the parameters of the event time distribution.The conditional expected values of Y for the choice and random groups are respectively, with where e β C0 and e β R0 are the mean time to event for participants on treatment B for choice and random groups respectively (i.e.t i = 0), and e β CT and e β RT are the ratios of mean time to event for participants on treatment A relative to those on treatment B (i.e.t i = 1).Effect sizes of interest are the ratio of the mean time to event of the two treatments between two groups.
Let f C and f R be the probability density functions of the outcome for the choice and random groups respectively.We write the likelihood as, To account for right-censoring, the likelihood is written as, where S C and S R are survival functions corresponding to the densities f C and f R , respectively.We use the maximum likelihood estimators (MLEs) to estimate the effects of treatment, choice, choice on treatment and other parameters of interest.Because there are no closed forms for the MLEs, we use numerical methods to obtain the estimates, in particular, we ran R-function OPTIM [15] with quasi-Newton algorithm, bounded Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B).

Weibull
One of the most commonly used distributions for failure time is Weibull.Weibull distribution depends on two parameters, shape 'a' and scale 'b'.The shape parameter plays a role in the failure rate [5].a < 1, a > 1 and a = 1 respectively indicate decreasing, increasing and constant hazard rates.The probability function of a Weibull is written as, , where is the gamma function.The survival distribution function is S(y) = e −(y/b) a and the hazard function is h(y) = ay a−1 /b a .
An advantage of using the Weibull distribution is its shape parameter.For example, in the case of viral disease, if a patient does not exhibit symptoms within a specific time period, the hazard of developing these symptoms decrease over time.Also, in many long-term studies such as chronic heart disease, the hazard of death increases as age increases.For our purposes in this paper, we make the assumption that the treatment does not affect the shape parameter, thus a will be constant across all the arms of the study.Investigators can usually make conjectures based on a pilot study or previous studies if the hazard is increasing, decreasing or constant.However, this pre-specification may not be accurate, therefore it is important to verify that the assumptions are realistic considering the data collected.Consequently, the parameters we want to estimate are β = (β C0 , β CT , β R0 , β RT , a).The scale parameter of the Weibull distribution is estimated using the estimates of the mean and shape parameter via the following equation, with κ = C, R depending on the arm of the study.
In survival studies, hazard ratios are of interest.Given that we fixed the shape parameter across all the arms of the study, the hazard ratio reduces to the ratio of the scale parameters, which consequently reduces to the ratio of the means.For example, the hazard ratio in either the choice or the random arm (κ = C, R) for treatment A vs treatment B: We can also investigate the hazard ratios for either treatment between the random and choice groups.

Test statistics
To test effects of interest, we chose to use the likelihood ratio test (LRT), , where H 0 represents the null hypothesis For large sample size, −2 log(λ(y)) converges to a chi-square distribution with d degree of freedom calculated as the difference between the number of free parameters specified under the unrestricted model and those under the null hypothesis (for details see Collett [3]).
One of the aims of conducting trials that incorporate patient's treatment choice is to study the effect of choice on treatment, that is, preference effect.To test the hypothesis of no preference effect, we want to test if the treatments under consideration are the same in both groups, random and choice.The null hypothesis is written as H 0 : β R0 = β C0 and β RT = β CT versus the alternative that at least one of the equalities fail.In the case of no preference effect, both the random and choice groups can be combined and treatment effect can be tested using traditional methods.

ANOVA approach
Previously published papers focused on ANOVA methods to analyze trials incorporating patient's treatment choice.Using the same mathematical computations performed by Rücker [12], Tuner et al. [16], Cameron et al. [1] and Shi et al. [13], one can show that the ANOVA approach can be used for data following Weibull distributions in complete data setting.Some important considerations when using the ANOVA methods to analyze these designs are the model assumptions needed.Mainly, the assumption that if given the choice, the proportion of participants choosing one treatment versus another in the random group is the same as in the choice group.This latter assumption is not needed for the likelihood method proposed in this paper.

Simulations to evaluate the performance of LRT methods
This section consists of three parts.First, we examine the estimates of the MLEs using the Quasi-Newton algorithm with L-BFGS-B.Second, we evaluate the LRT with regard to simulated type I and power using complete Weibull outcomes (no censoring) and compare the results to simulated type I and power using the ANOVA approach.Finally, we allow for random censoring in the model, and look at simulated type I error and power.
To determine the number of repetitions needed for the simulations, we relied on formulae from Morris et al. [7], and decided to run 4000 repetitions to accommodate our various goals.Weibull data was used for all simulations.The three cases of hazard rates decreasing, constant and increasing were used with shape parameters a = 0.5, a = 1 and a = 1.5, respectively.θ , the percent of participants randomized to the choice group was set to 50% for all simulations [17].
To simulate censored data, we simulated two independent Weibull distributions for event and censor times.Observed outcomes are chosen as the minimum between the two data set, with censoring defined when the observed outcome come from the censored data.We chose to fix the shape parameter and varied the scale parameter.If b was the scale for the random event distribution defined by Equation ( 6) for a given set of βs and a, then b * the scale parameter for random censoring is computed as, where p is the percent of censoring.Complete computational details are available in the supplementary materials.
To examine the estimates, we focused on the coefficients in the model.Each data set consisted of 400 observations.Sample size for the choice group choosing treatment A was randomly generated from a binomial with n = 200 and proportion of 40%.We set the parameters to (β C0 , β CT , β R0 , β RT ) = (2.5, 1, 2.5, 1.5) (Justification for the choice of values is in the next section).For the censored case, we simulated data from Weibull with 30% censoring using Equations ( 6) and (7).Tables 1 and 2 show the results of these simulations.The estimates are close to the true parameter values with the standard error slightly larger in the case of censoring as expected.Overall, the performance of the estimators is satisfactory.

Evaluating the LRT
To assess the performance of the suggested tests, simulation studies were performed.Given that one of the main aims of conducting RCT studies that accommodate patient's treatment choice is to study if the effect of treatment is the same for both the random and choice groups, we decided to focus on testing this effect in our simulation.In other words, we focused on testing the null hypothesis H 0 : β R0 = β C0 and β RT = β CT , that is, the effect size defined as the ratio of mean time to event between the random and choice groups for both treatment A and B are equal to 1, versus the alternative that at least one equality fails, that is, testing for the preference effect.For all type I error simulations, φ, the proportion of participants in the choice arm choosing treatment A vs B, was varied between 0.1 to 0.9 with 0.1 increments.Sample sizes considered were 100, 200 and 400.Time to event outcomes are common in diseases like Multiple Sclerosis, where the outcome of interest may be time to relapse or time to fall.Using this as a motivation to set up the parameter values, suppose we expect patients in a study to have on average the first relapse after 12 weeks.Given that Equations (3) and ( 4) are on the log scale, we set the parameters (β C0 , β CT , β R0 , β RT ) to (2.5, 0, 2.5, 0).Empirical type I errors were computed as the proportion of rejection using the cutoff based on chi-squared distribution with an area of 0.05 to the right and two degrees of freedom.All simulations were run 10 times using different seeds to account for simulation variations.
For the power simulations, φ was set to 40%, and sample size was fixed to 400.The model was based on a log link function (Equations 3 and 4), and the parameters were set to β C0 = β R0 = 1.8, β RT = 0 and β CT = log(x) with x varying between 0.1 and 1.9 in increments of 0.1.This implies that the effect size, ratio of mean time to event between random and choice for treatment 'B' varies between 0.1 and 1.9, while all other effects size were set to one.Simulations were run 10 times using different seeds to account for simulation variations.In the case of complete data, power based on LRT was compared to power based on ANOVA and the standard sample size formula in Equation ( 8) (For further details see Turner et al. (2014) [16]).
where π is preference effect, σ standard deviations for the four arms of the study with double indices used for the choice group and a single index for the random group, d 1 the difference between the means of the arms on treatment A and d 2 difference between means of the arms on treatment B.

Survival outcomes without censoring
From Figure 2, we can see that in the cases of constant and increased hazard, simulated type I error for both LRT and ANOVA performed comparably for the large sample of 400.
For the smaller sample of 100 with decreasing hazard, ANOVA starts conservative while LRT is overinflated, however, both approach 0.05 as the sample size increases.For the other two cases of hazard for a sample of 100, ANOVA seems to perform better.This may be due to the fact that for ANOVA, sample means converge faster than the MLEs (numerically estimated) used in LRT.This issue dissipates with the increase in sample size as expected.Simulated power for the three types of hazard functions show that LRT has higher power values relative to the ANOVA and standard sample size formula (Figure 3), giving it an advantage of requiring a smaller sample for the same power and effect size.For instance for N = 400 in Figure 3, the power to detect an effect of 0.5 for the case of decreasing hazard is close to 15% using ANOVA while it is close to 50% using LRT.The performance of all three methods are more comparable in the case of increasing hazard.When comparing the simulated powers based on LRT for different hazard functions, we noticed that decreasing hazard has lower power requiring a larger sample.We attribute these observations to the shape of the Weibull distribution.Distribution of Weibull with decreasing hazard has a heavier right tail with more extreme values when compared to constant and increasing hazard.In the cases of non-decreasing hazard the distribution follows a more symmetrical distribution with smaller variability.

Survival outcomes with random censoring
From Figure 4, we can see that for all types of hazard, simulated type I error of the LRT is over-inflated.The model performs better for the larger sample of 400.We note here, that for all samples only 70% of data are expected to be uncensored.We re-ran the analysis with N = 800 and the model performed better, simulated type I was very close to the preset 0.05 critical value for decreasing and non-decreasing hazards (plots available in supplementary materials).
For simulated power, Figure 5 shows that power gets better as the shape parameter increases.More specifically, for the sample of 400 with 30% censoring, power for the case of increasing hazard quickly reaches 80% power for small effects.When comparing simulated power with and without censoring (Figure available in supplementary materials), both have similar pattern across types of hazards but simulated power for censored case is consistently lower than complete case.As mentioned before, this is expected given that around 30% of the data are censored, and it can be remediated for by increasing the sample size, which is a common practice for survival studies (Collett [3], Section 10.3).

Design considerations
Multiple considerations need to be taken into account when designing time to event trials that allow some participants to choose their treatment.The first parameter that needs to be considered is the proportion of participants that will be randomized to the choice arm.According to Walter et al. [17], the optimal allocation is close to 50%, but other investigators chose different values depending on their needs.For example, Clarks et al. [2] in their Women Take PRIDE study used 3:2 randomization ratio with larger sample size in the random arm, because they added an additional 'usual care' arm to the random arm.Another parameter to consider is φ, the proportion of participants choosing one treatment  vs another.As we will show in this illustration, the choice of φ can have a large impact on the sample size.Ideally, investigators are encouraged to run either a pilot study or a survey, surveying patient's choice of the treatment if such information is not available in the literature.The other parameters to consider are similar to traditional survival trials such as expected proportion of censoring, the effect size and expected time to the event of interest.

The illustration
To illustrate how we can set up a study to detect a pre-selected preference effect, we use a hypothetical example motivated by the study design of an ongoing clinical trial (STEP study by Molt et al. [8]), where the outcome of interest is time to event.It is worth noting that this is not one of the original study outcomes and does not reflect any of their results.The STEP study looks at the effectiveness of an exercise training program for participants with multiple sclerosis (MS) with the goal of improving patient's walking performance.The investigators in the study use a design that incorporates participant's treatment choice to compare between two mode of deliverance of the program either with in-person supervision, facility based (FB), or home based with remote supervision via telerehabilitation(TELE).Patients are followed for up to 52 weeks.The primary outcome of the trial is walking speed based on Timed-25-Feet Walk.
For our illustration, we use a time to event outcome.In particular, we consider the time to first fall after randomization.It is estimated that around half of people with MS fall at least once in a three month period and over one-third report multiple falls in a similar time frame (Nilsagard et al. [9]).We propose a 26 weeks, prospective, intent-to-treat study with 12 weeks of intervention with no crossover study comparing the efficacy of real time telerehabilitation exercise versus a facility-based program (Figure 6).
In some studies like health education and exercise program, testing for the treatment effect may not be the primary aim.Instead, investigators may believe that patient's motivation may affect the outcome of interest, and would want to study the effect of choice on treatment (preference effect).Even if the effect of intervention is the main goal of a study, because the proportion of the participants was allowed to choose their treatment, the preference effect needs to be considered.If the preference effect is not significant, then both the choice and random groups can be combined to test the intervention effect.In this illustration, the primary endpoint is to test the effect of choice on treatment (preference effect), that is, test the ratios of mean time to first fall between the random and choice groups for treatment A and treatment B. Regarding censoring, any participant who is lost to follow-up or does not experience the outcome of interest by the end of the study will be censored.In our illustration, we expect 10% of the participants to drop out of the study, and 20% not to experience a fall by the end of 26 weeks, resulting in 30% censoring.
Using a 50-50 allocation between random and choice groups (θ = 0.50), we present two different scenarios: (1) 40% of the participants are expected to choose telerehabilitation vs facility-based treatment and (2) assuming that only 20% of the participants are expected to choose telerehabilitation.As illustrated in Figure 6, suppose we are going to track the time it takes a patient to first fall after randomization and we hypothesize that it takes a randomized patient to FB program 8 weeks on average to first fall.Further, we are going to hypothesize that for randomized patients, we are not going to see a difference between the two groups (no treatment effect in the random arm).On the other hand, given that patient's inner motivation may affect the outcome of an exercise program, we will hypothesize an improvement for patients who choose telerehabilitation and that it is going to take patients who chose TELE and get TELE four weeks longer to first fall, resulting in a 50% increase in mean time until the first fall.
In addition to considering different proportions of participants choosing TELE vs FB, we assume different hazard rates.First we hypothesize that we have a constant hazard of falling for all patients (i.e. a = 1), then we are going to assume that the program worked well enough to yield a decreased hazard of falling (i.e. a = 0.7).Given that we do not have a formula for the sample size, results are based on simulations.From Table 3, we see that in the case of constant hazard and assuming that 40% of the participants chose TELE, we expect to reach around 80% power with a sample of 800, but if only 20% choose TELE, the required sample size for the power near 80% will be 1200.Under decreasing hazard scenario, a power of 80% is not reached even with a sample size of 1200 (with 240 in TELE vs 360 in FB in the choice group).One possibility to address this issue is to increase the follow-up time so there will be more events to be observed.In case a large proportion of participants choose TELE (80%), a power of 83% in the case of constant hazard can be reached for a sample as small as 600.We emphasize here the importance of pilot studies and/or surveys, as one may design an underpower study if they over estimate the proportion of participants choosing one of the interventions.

Discussion
Survival analysis is important in clinical trials when the outcome of interest is time to an event such as time to death, time to cure or time to a recurrence of a disease.Previously available work dealing with designs that incorporate patient's treatment choice was mainly focused on ANOVA approach and did not provide a way to incorporate censoring in the case of time to event outcomes nor inclusion of covariates in the model.In this paper, we propose new methods based on likelihood ratio to analyze trials for time to event outcomes which can easily accommodate censoring.For future research, we plan on extending our work to incorporate covariates in the model and perform the needed numerical simulations to assess its performance and make the program available for researchers to use.
In the case of complete Weibull data, we found that both methods (ANOVA and LRT) performed similarly well for large samples.In the case of censored data, simulated type I error rates of LRT did not perform well for the smaller sample of 100.Due to the presence of censoring, sample size needs to be inflated more than in the case of no censoring to get similar performance with respect to type I error and power.Another factor to consider is the proportion of participants choosing one treatment over the other (φ).In the extreme cases of φ (near 0 or near 1), the simulated type I error is inflated because of the small sample in the choice group.In the case of complete data with sample size at least 400, simulated type I error are similar for LRT and ANOVA.However, the simulated power of LRT is better than ANOVA.This gives the proposed LRT method an advantage of reaching needed power using a smaller sample.With respect to different shapes of hazard, we noticed an increase in power with an increase in the value of the shape parameter.
When planning a study, we suggest running few scenarios and increase the sample size until the power needed for the effect specified is reached.We also recommend running either a pilot study or conducting a survey, to get an idea of the proportion of patients preferring one treatment vs another, because the bottom line is how many subjects do we have in each of the four groups.
As final notes, we only focused on the Weibull distribution in this paper, programs used can be extended to incorporate other survival distributions like lognormal.Also, in this work, we assumed that the shape parameter is fixed across all the study arms, even though the hazard rate may be affected by different treatment.Our proposed methods can be extended to allow for different shape parameters for different arms of the study but we expect the computations to be more complex and interpretations of effect sizes to be more challenging.In addition, sample size calculations which are based on simulations will result in a time-consuming process.As in any likelihood-based inference, the proposed likelihood approach is sensitive to distribution misspecification We included two example simulations in the supplementary materials (Figures S3(a) & S4(a)) to illustrate the issue.We recommend that researchers follow the common practice of checking the distribution of the observed data before applying the proposed method (for illustration see Figures S3(b) & S4(b) in the supplementary materials).

Figure 1 .
Figure 1.Weibull distributions with scale parameter 10 and different shape parameters.

Figure 3 .
Figure 3. Simulated power for preference effect using complete Weibull data with different shape parameters and N = 400.

Figure 4 .
Figure 4. Simulated type I error at 0.05 level for preference effect using Weibull data with 30% censoring, and different shape parameters.

Figure 5 .
Figure 5. Simulated power for preference effect using Weibull data with 30% censoring, N = 400 and different shape parameters.

Figure 6 .
Figure 6.Illustration of the study design using the assumed mean time for each group.Facility Base (FB), Telerehabilitation (TELE), Proportion of participants choosing TELE in the choice arm (φ).

Table 1 .
Summary of the simulated distribution of the MLEs based on 4000 replications of Weibull data without censoring using different shape parameters, a and N = 400.

Table 2 .
Summary of the simulated distribution of the MLEs based on 4000 replications of Weibull data with 30% censoring using different shape parameters, a and N = 400.

Table 3 .
Power for preference effect for a hypothetical rehabilitation study in both cases of constant (a = 1) and decreased hazards (a = 0.7) of falling.
a Proportion of participants choosing TELE in the choice arm b Telerehabilitation c Facility Base