A general framework for parametric survival analysis

Parametric survival models are increasingly being used as an alternative to the Cox model in biomedical research. Through direct modelling of the baseline hazard function, we can gain greater understanding of the risk profile of patients over time, obtaining absolute measures of risk. Commonly used parametric survival models, such as the Weibull, make restrictive assumptions about the baseline hazard function, such as monotonicity, which is often violated in clinical datasets. In this article, we extend the general framework of parametric survival models proposed by Crowther and Lambert (Journal of Statistical Software 53:12, 2013) to incorporate relative survival, and robust and cluster robust standard errors. We describe the general framework through three applications to clinical datasets, in particular illustrating the use of restricted cubic splines, modelled on the log hazard scale, to provide a highly flexible survival modelling framework. Through the use of restricted cubic splines, we can derive the cumulative hazard function analytically beyond the boundary knots, resulting in a combined analytic/numerical approach, which substantially improves the estimation process compared with using numerical integration alone. User-friendly Stata software is provided, which significantly extends the parametric survival models available in standard software. Copyright © 2014 John Wiley & Sons, Ltd.


Introduction
The use of parametric survival models is growing in applied research [1-5], as the benefits become recognised and more flexible methods become available in standard software. Through a parametric approach, we can obtain clinically useful measures of absolute risk, allowing greater understanding of individual patient risk profiles [6-8], which is particularly important with the growing interest in personalised medicine. A model of the baseline hazard or survival allows us to calculate absolute risk predictions over time, for example, in prognostic models, and enables the translation of hazard ratios back to the absolute scale, for example, when calculating the number needed to treat. In addition, parametric models are especially useful for modelling time-dependent effects [4,9] and when extrapolating survival [10,11].
Commonly used parametric survival models, such as the exponential, Weibull and Gompertz proportional hazards models, make strong assumptions about the shape of the baseline hazard function. For example, the Weibull model assumes a monotonically increasing or decreasing baseline hazard. Such assumptions restrict the shapes of hazard function that can be captured and are often simply not flexible enough for those observed in clinical datasets, which frequently exhibit turning points in the underlying hazard function [12,13].
Crowther and Lambert [14] recently described the implementation of a general framework for the parametric analysis of survival data, which allowed any well-defined hazard or log hazard function to be specified, with the model estimated using maximum likelihood utilising Gaussian quadrature. In this article, we extend the framework to relative survival and also allow for robust and cluster robust standard errors. In particular, throughout this article, we concentrate on the use of restricted cubic splines to demonstrate the framework, and describe a combined analytic/numeric approach to greatly improve the estimation process.
Various types of splines have been used in the analysis of survival data, predominantly on the hazard scale, which results in an analytically tractable cumulative hazard function. For example, M-splines, which by definition are non-negative, can be applied directly on the hazard scale because of the positivity condition. Kooperberg et al. [15] proposed using various types of splines on the log hazard scale, such as piecewise linear splines [15,16]. In this article, we use restricted cubic splines to model the log hazard function, which by definition ensures that the hazard function is positive across follow-up, but has the computational disadvantage that the cumulative hazard must be obtained by numerical integration. Restricted cubic splines have been used widely within the flexible parametric survival modelling framework of Royston and Parmar [17,18], where they are modelled on the log cumulative hazard scale. The switch to the log cumulative hazard scale provides analytically tractable cumulative hazard and hazard functions; however, when there are multiple time-dependent effects, there are difficulties in interpreting time-dependent hazard ratios, because these will vary over different covariate patterns, even with no interaction between the covariates [18].
In Section 2, we derive the general framework and extend it to incorporate cluster robust standard errors and incorporate background mortality for the extension to relative survival. In Section 3, we describe a special case of the framework using restricted cubic splines to model the baseline hazard and time-dependent effects, and describe how the estimation process can be improved through a combined analytical and numerical approach. In Section 4, we apply the spline-based hazard models to datasets in breast and bladder cancer, illustrating the improved estimation routine, the application of relative survival, and the use of cluster robust standard errors, respectively. We conclude the paper in Section 5 with a discussion.

A general framework for the parametric analysis of survival data
We begin with some notation. For the i th patient, where i = 1, …, N, we define t_i to be the observed survival time, where t_i = min(t*_i, c_i), the minimum of the true survival time, t*_i, and the censoring time, c_i. We define an event indicator d_i, which takes the value 1 if t*_i ⩽ c_i and 0 otherwise. Finally, we define t_0i to be the entry time for the i th patient, that is, the time at which the patient becomes at risk.
Under a parametric survival model, subject to right censoring and possible delayed entry (left truncation), the overall log-likelihood function can be written as

log L = Σ_{i=1}^{N} log L_i    (1)

with log-likelihood contribution for the i th patient

log L_i = d_i log f(t_i) + (1 − d_i) log S(t_i) − log S(t_0i)    (2)

where f(t_i) is the probability density function and S(.) is the survival function. If t_0i = 0, the third term of Equation (2) can be dropped. Using the relationship

h(t) = f(t) / S(t)    (3)

where h(t) is the hazard function at time t, substituting Equation (3) into Equation (2), we can write

log L_i = d_i log h(t_i) + log S(t_i) − log S(t_0i)    (4)

Now given that

S(t) = exp( − ∫_0^t h(u) du )    (5)

we obtain

log L_i = d_i log h(t_i) − ∫_{t_0i}^{t_i} h(u) du    (6)

The log-likelihood formulation of Equation (6) implies that, if we specify a well-defined hazard function, where h(t) > 0 for t > 0, and can subsequently integrate it to obtain the cumulative hazard function, we can then maximise the likelihood and fit our parametric survival model using standard techniques [19]. When a standard parametric distribution is chosen, for example, the exponential, Weibull or Gompertz, and for the moment assuming proportional hazards, we can directly integrate the hazard function to obtain a closed-form expression for the cumulative hazard function. As described in Section 1, these distributions are simply not flexible enough to capture many observed hazard functions. If we postulate a more flexible function for the baseline hazard, which cannot be integrated analytically, or wish to incorporate complex time-dependent effects, for example, we then require numerical integration techniques in order to maximise the likelihood.
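To make Equation (6) concrete, the following sketch (a minimal illustration of ours, not the authors' Stata implementation) evaluates the log-likelihood contribution for a Weibull hazard h(t) = λγt^(γ−1), whose cumulative hazard H(t) = λt^γ is available in closed form; the function names and parameter values are our own.

```python
import math

def weibull_hazard(t, lam=0.2, gamma=1.3):
    # h(t) = lam * gamma * t^(gamma - 1)
    return lam * gamma * t ** (gamma - 1)

def weibull_cumhaz(t, lam=0.2, gamma=1.3):
    # H(t) = lam * t^gamma, the analytic integral of the hazard
    return lam * t ** gamma

def loglik_contribution(t, d, t0=0.0, lam=0.2, gamma=1.3):
    # Equation (6): d_i * log h(t_i) - [H(t_i) - H(t_0i)],
    # where the H(t_0i) term accounts for delayed entry at t_0i
    contrib = -(weibull_cumhaz(t, lam, gamma) - weibull_cumhaz(t0, lam, gamma))
    if d == 1:
        contrib += math.log(weibull_hazard(t, lam, gamma))
    return contrib
```

For a censored observation (d = 0), the contribution reduces to minus the cumulative hazard accrued over the at-risk period.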

Numerical integration using Gaussian quadrature
Gaussian quadrature is a method of numerical integration, which provides an approximation to an analytically intractable integral [20]. It turns an integral into a weighted summation of a function evaluated at a set of pre-defined points, called quadrature nodes or abscissae. Consider the integral from Equation (6)

∫_{t_0i}^{t_i} h(u) du

To obtain an approximation of the integral through Gaussian quadrature, we first must undertake a change of interval using

∫_{t_0i}^{t_i} h(u) du = (t_i − t_0i)/2 ∫_{−1}^{1} h( (t_i − t_0i)/2 z + (t_i + t_0i)/2 ) dz

Applying numerical quadrature, in this case Gauss-Legendre, results in

∫_{t_0i}^{t_i} h(u) du ≈ (t_i − t_0i)/2 Σ_{j=1}^{m} v_j h( (t_i − t_0i)/2 z_j + (t_i + t_0i)/2 )

where v = {v_1, …, v_m} and z = {z_1, …, z_m} are sets of weights and node locations, respectively, with m the number of quadrature nodes. Under Gauss-Legendre quadrature, the weight function is equal to 1. We must specify the number of quadrature nodes, m, with the numerical accuracy of the approximation dependent on m.
As with all methods that use numerical integration, the accuracy of the approximation can be assessed by comparing estimates with an increasing number of nodes. We return to the issue of choosing the number of quadrature points in Section 3.
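As an illustration of the quadrature rule, the sketch below (our own minimal Python translation, not the paper's Stata code) approximates the cumulative hazard of a Weibull model after the change of interval, and assesses accuracy by comparing estimates under an increasing number of nodes.

```python
import numpy as np

def cumhaz_gauss_legendre(hazard, t0, t, m):
    """Approximate the integral of hazard(u) over [t0, t] with m-node
    Gauss-Legendre quadrature, after the change of interval to [-1, 1]."""
    z, v = np.polynomial.legendre.leggauss(m)   # node locations and weights
    u = 0.5 * (t - t0) * z + 0.5 * (t + t0)     # mapped node locations
    return 0.5 * (t - t0) * np.sum(v * hazard(u))

# Weibull hazard h(t) = lam*gamma*t^(gamma-1), with H(t) = lam*t^gamma known
lam, gamma = 0.2, 1.3
hazard = lambda u: lam * gamma * u ** (gamma - 1)
exact = lam * 5.0 ** gamma
approx_30 = cumhaz_gauss_legendre(hazard, 0.0, 5.0, 30)
approx_60 = cumhaz_gauss_legendre(hazard, 0.0, 5.0, 60)
# accuracy is checked by comparing the estimates as the number of nodes grows
```

Because the Weibull hazard has a singular derivative at t = 0 here, convergence is slower than for a smooth integrand, mirroring the estimation issues discussed in Section 4.1.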

Excess mortality models
In population-based studies where interest lies in mortality associated with a particular disease, it is not always possible to use cause of death information, either because it is not available or because it is considered too unreliable to use [21,22]. In these situations, it is common to model and estimate excess mortality by comparing the mortality experienced amongst a diseased population with that expected amongst a disease-free population. The methods have most commonly been applied to population-based cancer studies and have also been used in studies of HIV [23] and cardiovascular disease [24]. The total mortality (hazard) rate, h_i(t), is partitioned into the expected mortality rate, h*_i(t), and the excess mortality rate associated with a diagnosis of disease, λ_i(t):

h_i(t) = h*_i(t) + λ_i(t)
Transforming to the survival scale gives

S_i(t) = S*_i(t) R_i(t)

where R_i(t) is known as the relative survival function and S*_i(t) is the expected survival function. The effect of covariates on the excess mortality rate is usually considered to be multiplicative, and so covariates, X_i, are modelled as

λ_i(t) = λ_0(t) exp(X_i β)

where λ_0(t) is the baseline excess hazard function and β is a vector of log excess hazard ratios (also referred to as log excess mortality rate ratios). This model assumes proportional excess hazards, but in population-based cancer studies, this assumption is rarely true, and there has been substantial work on methods to fit models that relax the assumption of proportionality [24, 26-28]. A common model for analysing excess mortality is an extension of Royston-Parmar models [24]. These models are fitted on the log cumulative excess hazard scale. With multiple time-dependent effects, interpretation of hazard ratios can be complicated, and so there are advantages to modelling on the log hazard scale instead. For example, in a model on the log cumulative excess hazard scale where both age group and sex are modelled as time-dependent effects, but with no interaction between the covariates, the estimated hazard ratio for sex would differ in each of the age groups. In a model on the log excess hazard scale, this would not be the case [18]. Previous work by Remontet et al. [29] also used numerical integration, but with quadratic splines limited to only two knots and no restriction on the splines.
The log-likelihood contribution for an excess mortality model is

log L_i = d_i log( h*_i(t_i) + λ_i(t_i) ) − ∫_{t_0i}^{t_i} [ h*_i(u) + λ_i(u) ] du

Because the terms involving the expected cumulative mortality, log S*_i(t_i) − log S*_i(t_0i), do not depend on any model parameters, they can be omitted from the likelihood function for purposes of estimation, leaving

log L_i = d_i log( h*_i(t_i) + λ_i(t_i) ) − ∫_{t_0i}^{t_i} λ_i(u) du

This means that in order to estimate the model parameters, the expected mortality rate at the time of death, h*_i(t_i), is needed only for subjects that experience an event.
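The excess mortality contribution can be sketched as follows (a minimal illustration with hypothetical values; h_star_t stands in for the expected mortality rate drawn from external life tables, and the constant excess hazard is purely illustrative):

```python
import math

def excess_loglik_contribution(t, d, h_star_t, excess_haz, excess_cumhaz, t0=0.0):
    # d_i * log(h*(t_i) + lambda(t_i)) - integral of lambda(u) over [t0, t];
    # expected-mortality terms that involve no model parameters are dropped
    contrib = -(excess_cumhaz(t) - excess_cumhaz(t0))
    if d == 1:
        contrib += math.log(h_star_t + excess_haz(t))
    return contrib

# hypothetical constant excess hazard lambda(t) = 0.05, expected rate 0.01
val = excess_loglik_contribution(3.0, 1, 0.01,
                                 excess_haz=lambda u: 0.05,
                                 excess_cumhaz=lambda u: 0.05 * u)
```

Note that the expected rate enters only through the event term, which is why it is required only for subjects who experience an event.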

Cluster robust standard errors
In standard survival analysis, we generally assume that observations are independent; however, in some circumstances, we can expect observations to be correlated when a group structure exists within the data, for example, in the analysis of recurrent event data, where individual patients can experience an event multiple times, resulting in multiple observations per patient. In this circumstance, we would expect observations to be correlated within groups. Failing to account for this sort of structure can lead to underestimated standard errors.
Given V̂, our standard estimate of the variance-covariance matrix, which is the inverse of the negative Hessian matrix evaluated at the maximum likelihood estimates, we define the robust variance estimate developed by Huber [30] and White [31,32]

V̂_r = V̂ ( Σ_{i=1}^{N} u_i′ u_i ) V̂

where u_i is the contribution of the i th observation to ∂ log L / ∂β, with N the total number of observations. This can be extended to allow for a clustered structure. Suppose the N observations can be classified into M groups, which we denote by G_1, …, G_M, where the groups, rather than the individual observations, are now assumed independent. The robust estimate of variance becomes

V̂_r = V̂ ( Σ_{j=1}^{M} u_j^(G)′ u_j^(G) ) V̂

where u_j^(G) is the contribution of the j th group to ∂ log L / ∂β. More specifically, Rogers [33] noted that if the log-likelihood is additive at the observation level, then

u_j^(G) = Σ_{i ∈ G_j} u_i

We follow the implementation in Stata, which also incorporates a finite sample adjustment of V̂*_r = {M/(M − 1)} V̂_r.
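The sandwich formula can be sketched directly from observation-level score contributions (a minimal numpy illustration of ours under made-up inputs; in practice, V and the scores come from the fitted model):

```python
import numpy as np

def cluster_robust_vcov(V, scores, cluster_ids):
    """Cluster robust variance: V (sum_j g_j' g_j) V scaled by M/(M-1),
    where g_j sums the observation-level score rows within cluster j."""
    clusters = np.unique(cluster_ids)
    M = len(clusters)
    k = V.shape[0]
    meat = np.zeros((k, k))
    for c in clusters:
        g = scores[cluster_ids == c].sum(axis=0)  # cluster-summed score
        meat += np.outer(g, g)
    return (M / (M - 1)) * V @ meat @ V

# toy example: identity "bread" matrix, four observations in two clusters
V = np.eye(2)
scores = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
ids = np.array([0, 0, 1, 1])
Vr = cluster_robust_vcov(V, scores, ids)
```

With each observation as its own cluster and no finite-sample factor, this reduces to the Huber-White estimator.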

Improving the estimation when using restricted cubic splines
The very nature of the modelling framework described earlier implies that we can specify practically any general function for the hazard or log hazard, provided it satisfies h(t) > 0 for all t > 0. To illustrate the framework, we concentrate on a particularly flexible way of modelling survival data, using restricted cubic splines [34].
We begin by assuming a proportional hazards model, modelling the baseline log hazard function using restricted cubic splines

log h_i(t) = s(log(t) | γ, k_0) + X_i β    (18)

where X_i is a vector of baseline covariates with associated log hazard ratios β, and s(log(t) | γ, k_0) is a function of log(t) expanded into a restricted cubic spline basis with knot location vector, k_0, and associated coefficient vector, γ. For example, if we let u = log(t), then with knot vector k_0

s(u | γ, k_0) = γ_0 + γ_1 s_1 + γ_2 s_2 + ⋯ + γ_{m+1} s_{m+1}    (19)

with parameter vector γ and derived variables s_j (known as the basis functions), where s_1 = u and, for j = 2, …, m + 1,

s_j = (u − k_{j−1})³_+ − λ_{j−1} (u − k_min)³_+ − (1 − λ_{j−1})(u − k_max)³_+

where (x)³_+ takes the value x³ if the value is positive and 0 otherwise, and

λ_j = (k_max − k_j) / (k_max − k_min)

with k_min and k_max denoting the boundary knots and k_j the interior knots. In terms of knot locations, for the internal knots, we use by default the centiles of the uncensored log survival times, and for the boundary knots, we use the minimum and maximum observed uncensored log survival times. The restricted nature of the function imposes the constraint that the fitted function is linear beyond the boundary knots, ensuring a sensible functional form in the tails, where data are often sparse. The choice of the number of spline terms (more spline terms allow greater flexibility) is left to the user. A recent extensive simulation study assessed the use of model selection criteria to select the optimum degrees of freedom within the Royston-Parmar model (restricted cubic splines on the log cumulative hazard scale), which showed no bias in terms of hazard ratios, hazard rates and survival functions, with a reasonable number of knots as guided by the AIC/BIC [13].
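The derived variables of Equation (19) can be sketched as follows (our own numpy illustration; the knot values are arbitrary). The key property, linearity beyond the boundary knots, can be verified by checking that second differences of each basis column vanish outside the knot range.

```python
import numpy as np

def rcs_basis(u, knots):
    """Restricted cubic spline basis: the linear term u plus, for each
    interior knot k_j, (u-k_j)^3_+ - lam_j*(u-k_min)^3_+ - (1-lam_j)*(u-k_max)^3_+
    with lam_j = (k_max - k_j)/(k_max - k_min). The restriction forces
    linearity beyond the boundary knots."""
    u = np.asarray(u, dtype=float)
    kmin, kmax = knots[0], knots[-1]
    cube_plus = lambda x: np.maximum(x, 0.0) ** 3  # (x)^3_+ truncated power
    cols = [u]
    for kj in knots[1:-1]:
        lam = (kmax - kj) / (kmax - kmin)
        cols.append(cube_plus(u - kj)
                    - lam * cube_plus(u - kmin)
                    - (1 - lam) * cube_plus(u - kmax))
    return np.column_stack(cols)

# beyond the upper boundary knot, every basis column is linear in u
B = rcs_basis([4.0, 5.0, 6.0], [0.0, 1.0, 2.0, 3.0])
```

Any linear combination of these columns (plus an intercept) therefore extrapolates linearly on the log hazard versus log time scale.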

Complex time-dependent effects
Time-dependent effects, that is, non-proportional hazards, are commonplace in the analysis of survival data, where covariate effects can vary over prolonged follow-up time, for example, in the analysis of registry data [9]. Continuing with the special case of using restricted cubic splines, we can incorporate time-dependent effects into our model framework as follows:

log h_i(t) = s(log(t) | γ, k_0) + X_i β + Σ_{p=1}^{P} x_{ip} s(log(t) | δ_p, k_p)    (23)

where for the p th time-dependent effect, with p = {1, …, P}, we have x_{ip}, the p th covariate, multiplied by some spline function of log time, s(log(t) | δ_p, k_p), with knot location vector, k_p, and coefficient vector, δ_p. Once again, the degrees of freedom, that is, the number of knots, for each time-dependent effect can be guided using model selection criteria, and/or the impact of different knot locations assessed through sensitivity analysis.

Improving estimation
Given that the modelling framework is extremely general, in that the numerical integration can be applied to a wide range of user-defined hazard functions, the application of Gaussian quadrature to estimate the models may not be the most computationally efficient approach. For example, in Crowther and Lambert [14], we compared a Weibull proportional hazards model with the equivalent general hazard model estimated using numerical integration.
In the restricted cubic spline-based models described earlier, the restricted nature of the spline function forces the baseline log hazard function to be linear beyond the boundary knots. In those regions, the cumulative hazard function can be written analytically, as the log hazard is a linear function of log time. Defining our boundary knots to be k_01 and k_0n, we need only conduct numerical integration between k_01 and k_0n, using the analytical form of the cumulative hazard function beyond the boundary knots.
We define λ_0i and λ_1i to be the intercept and slope of the log hazard function, as a function of log time, for the i th patient before the first knot, k_01, and μ_0i and μ_1i to be the intercept and slope for the i th patient after the final knot, k_0n. If there are no time-dependent effects, then {λ_0i, λ_1i, μ_0i, μ_1i} are constant across patients. The cumulative hazard function can then be defined in three components

H_i(t_i) = H_1i(t_i) + H_2i(t_i) + H_3i(t_i)

If we assume t_0i < k_01 and t_i > k_0n, then before the first knot, we have

H_1i(t_i) = ∫_{t_0i}^{k_01} exp(λ_0i + λ_1i log(u)) du = exp(λ_0i) [ k_01^{λ_1i + 1} − t_0i^{λ_1i + 1} ] / (λ_1i + 1)

and after the final knot, we have

H_3i(t_i) = exp(μ_0i) [ t_i^{μ_1i + 1} − k_0n^{μ_1i + 1} ] / (μ_1i + 1)

and H_2i(t_i) becomes

H_2i(t_i) = ∫_{k_01}^{k_0n} h_i(u) du

evaluated using Gauss-Legendre quadrature as described above. The alternative forms of the cumulative hazard function for situations where, for example, t_0i > k_01, are detailed in Appendix A. This combined analytical/numerical approach allows us to use far fewer quadrature nodes, which, given that numerical integration is generally computationally intensive, is a desirable aspect of the estimation routine. We illustrate this in Section 4.1.
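The combined approach can be sketched as follows (a minimal numpy illustration using our own symbols for the tail intercepts and slopes; for the check, we use a log hazard that happens to be log-linear everywhere, so the exact cumulative hazard is known):

```python
import numpy as np

def analytic_piece(a0, a1, lo, hi):
    # integral of exp(a0 + a1*log(u)) = exp(a0)*u^a1 over [lo, hi]
    return np.exp(a0) * (hi ** (a1 + 1) - lo ** (a1 + 1)) / (a1 + 1)

def combined_cumhaz(loghaz, t0, t, k1, kn, a0, a1, b0, b1, m=15):
    """Cumulative hazard for t0 < k1 < kn < t: analytic integrals in the
    tails, where the log hazard is linear in log time with intercept/slope
    (a0, a1) before k1 and (b0, b1) after kn, plus Gauss-Legendre
    quadrature over the interior [k1, kn]."""
    z, v = np.polynomial.legendre.leggauss(m)
    u = 0.5 * (kn - k1) * z + 0.5 * (kn + k1)
    middle = 0.5 * (kn - k1) * np.sum(v * np.exp(loghaz(u)))
    return analytic_piece(a0, a1, t0, k1) + middle + analytic_piece(b0, b1, kn, t)

# check against a Weibull-type hazard, log h(u) = a0 + a1*log(u) everywhere
a0, a1 = np.log(0.26), 0.3
H = combined_cumhaz(lambda u: a0 + a1 * np.log(u), 0.0, 5.0, 1.0, 3.0,
                    a0, a1, a0, a1)
exact = 0.26 / 1.3 * 5.0 ** 1.3   # H(t) = lam*t^gamma with lam*gamma = 0.26
```

Because the interior integrand is smooth away from zero, far fewer nodes are needed here than when integrating numerically from the time origin.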

Improving efficiency
In this section, we conduct a small simulation study to compare the efficiency of the Kaplan-Meier estimate of the survival function with that of a parametric formulation using splines, in particular when data are sparse in the right tail. We simulate survival times from a Weibull distribution with scale and shape values of 0.2 and 1.3, respectively. Censoring times are generated from a U(0,6) distribution, with the observed survival time taken as the minimum of the censoring and event times, and administrative censoring also applied. Results are presented in Table I.
From Table I, we see that at both 4 and 5 years, the mean squared error is lower for the parametric approach than for the Kaplan-Meier estimate. Bias is essentially negligible for all estimates. This indicates a gain in efficiency for the parametric approach in this particular scenario. Of course, this simulation setting is limited to a simple Weibull case; however, note that we did not fit the correct parametric model, and the incorrectly specified but flexible model still outperformed the Kaplan-Meier estimate.

Example applications
We aim to show the versatility of the framework through three different survival modelling applications, utilising splines, whilst providing example code in the appendices to demonstrate the ease of implementation for researchers.

Breast cancer survival
We begin with a dataset of 9721 women aged under 50 years and diagnosed with breast cancer in England and Wales between 1986 and 1990. Our event of interest is death from any cause, where 2847 events were observed, and we have restricted follow-up to 5 years, leading to 6850 censored at 5 years. We are interested in the effect of deprivation status, which was categorised into five levels; however, in this example, we restrict our analyses to comparing the least and most deprived groups. We subsequently have a binary covariate, with 0 for the least deprived group and 1 for the most deprived group.
In this section, we wish to establish the benefit of incorporating the analytic components, described in Section 3.2, compared with the general method of only using numerical integration, described in Section 2. We use the general Stata software package, stgenreg, described previously [14], to fit the full quadrature-based approach, and a newly developed Stata package, strcs, which implements the combined analytic and numeric approach when using splines on the log hazard scale. We apply the spline-based models shown in Equation (18), with five degrees of freedom (six knots), that is, five spline variables to capture the baseline, incorporating the proportional effect of deprivation status, with an increasing number of quadrature nodes, until estimates are found to have converged to three, four and, finally, five decimal places. Table II compares parameter estimates and standard errors under the full numerical approach, across varying numbers of quadrature nodes, and Table III presents the equivalent results for the combined analytic/numeric approach. From Table II, we still observe variation in the estimates and the log-likelihood at five or six decimal places between 500 and 1000 nodes, whilst for the combined approach shown in Table III, the maximum difference between 100 and 1000 nodes is 0.000001. For the combined approach, the log-likelihood does not change to three decimal places between 100 and 1000 nodes, whilst the log-likelihood for the full numerical approach is only the same to one decimal place.
We found that the full numerical approach required 23 nodes and 50 nodes to establish consistent estimates to three and four decimal places, respectively, compared with 18 nodes and 27 nodes under the combined analytic and numerical approach. Final results for the combined approach using 27 nodes are presented in Table IV. From Table IV, we observe a statistically significant hazard ratio of 1.309 (95% CI: 1.212, 1.414), indicating an increased hazard rate in the most deprived group, compared with the least deprived group. Comparing computation time, the general approach with 50 quadrature nodes took 20.5 s on a standard laptop, compared with 17.5 s using the combined approach with 27 nodes. Figure 1 shows the fitted survival functions from the full numerical approach (using stgenreg), the combined analytic/numerical approach (using strcs) and the Cox model. All three models yield essentially identical fitted survival functions, although from a visual inspection all three appear to fit poorly.
We can investigate the presence of a time-dependent effect due to deprivation status, by applying Equation (23). We use five degrees of freedom to capture the baseline and use three degrees of freedom to model the time-dependent effect of deprivation status. Figure 2 shows the time-dependent hazard ratio, illustrating the decrease in the effect of deprivation over time. The improved fit when incorporating the time-dependent effect of deprivation status is illustrated in Figure 3.
Example Stata code to fit time-independent and time-dependent models presented in this section is included in Appendix B.

Excess mortality model
For the excess mortality model, we use the same data source as in Section 4.1. However, we now include women aged over 50 years. Expected mortality is stratified by age, sex, calendar year, region and deprivation quintile [25]. As for the analysis in Section 4.1, we only include the least and most deprived groups for simplicity. Age is categorised into five groups: < 50, 50-59, 60-69, 70-79 and 80+ years. There are 41 645 subjects included in the analysis, with a total of 17 057 events before 5 years post-diagnosis.

Proportional excess hazards model.
We initially fit a model where the excess mortality rate is assumed to be proportional between different covariate patterns. We compare the estimates with a model using restricted cubic splines on the log cumulative hazard scale [24]. In both models, six knots are used, placed evenly according to the distribution of log death times, with results presented in Table V. From Table V, we observe very similar hazard ratios and 95% confidence intervals between the models on the different scales.

Time-dependent effects.
A model is now fitted where the assumption of proportional excess hazards is relaxed for all covariates. This is carried out by incorporating an interaction between each covariate and a restricted cubic spline function of log time with four knots (three degrees of freedom). The knots are placed evenly according to the distribution of log death times. The estimated excess hazard ratio for deprivation group can be seen in Figure 4. If there is no interaction between deprivation group and age group, then this hazard ratio is assumed to apply to each of the five age groups. If the model were fitted on the log cumulative excess hazard scale, then this would not be the case. This is illustrated in Figure 5, where the same linear predictor has been fitted for a model on the log cumulative excess hazard scale and the estimated excess hazard ratio is shown for two age groups, and is shown to differ between them. The impact of the default interior knot locations can be assessed through sensitivity analyses, varying the knot locations. In Figure 6, we compare the default choice (interior knots at 1.024 and 2.660) with three other choices, illustrating some minor variation in the tails of the estimated shape of the time-dependent excess hazard ratio; however, the functional form is generally quite robust to knot location.
Example Stata code to fit time-independent and time-dependent excess mortality models presented in this section is included in Appendix C.

Cluster robust errors
To illustrate the use of cluster robust standard errors, we use a dataset of 85 patients with bladder cancer [35,36]. We fit a model for recurrent event data, where the event of interest is recurrence of bladder cancer. Each patient can experience a total of four events, shown in Table VI. A total of 112 events were observed. Covariates of interest include treatment group (0 for placebo, 1 for thiotepa), initial number of tumours (range 1 to 8, with 8 meaning 8 or more) and initial size of tumours (in centimetres, with range 1 to 7).
To allow for the inherent structure, with events nested within patients, we fit a parametric version of the Prentice-Williams-Peterson model, allowing for cluster robust standard errors. This model uses non-overlapping time intervals; thus, for example, a patient is not at risk of a second recurrence until after the first has occurred. The baseline hazard for each event is allowed to vary; that is, there is a stratification factor by event. We use five knots for a shared baseline between the events but allow departures from this baseline using restricted cubic splines with three knots for each of the subsequent events. For comparison, we also fit a Cox model, stratified by event number, with cluster robust standard errors [37]. Results are presented in Table VII.
From Table VII, we observe similar estimates from the spline-based model and the Cox model with cluster robust standard errors. We can compare estimated baseline hazard rates for each of the four ordered events from the spline-based model, shown in Figure 7. We can see from Figure 7 that patients who go on to experience a third and fourth event have a high initial hazard rate, suggesting that they are likely a more severe subgroup.
Example Stata code to fit the cluster robust spline model is included in Appendix D.

Discussion
We have described a general framework for the parametric analysis of survival data, incorporating any combination of complex baseline hazard functions, time-dependent effects, time-varying covariates, delayed entry (left truncation), robust and cluster robust standard errors, and the extension to relative survival. Modelling the baseline hazard and any time-dependent effects parametrically can offer greater insight into the risk profile over time. Parametric modelling is of particular importance when extrapolating survival data, for example, within an economic decision modelling framework [11]. In this article, we concentrated on the use of restricted cubic splines, which offer great flexibility to capture the observed data, but also a likely sensible extrapolation if required, given the linear restriction beyond the boundary knots. In particular, we described how the general framework can be optimised in special cases with respect to the estimation routine, utilising the restricted nature of the splines to incorporate the analytic parts of the cumulative hazard function, in combination with the numerical integration. This provided a much more efficient estimation process, requiring far fewer quadrature nodes to obtain consistent estimates, providing computational benefits. However, it is important to note that although we have concentrated on the use of splines in this article, essentially any parametric function can be used to model the baseline (log) hazard function and time-dependent effects.
In application to the breast cancer data, we showed that the general numerical approach requires a large number of quadrature nodes, compared with the combined analytic/numeric approach, in order to obtain consistent estimates. This is due to the numerical approach struggling to capture high hazards at the beginning of follow-up time.
Given that hazard ratios are usually only reported to two or three decimal places, the large number of nodes used in Section 4.1 will often not be required. In further examples not shown, where the hazard is low at the beginning of follow-up, fewer than 30 nodes are often sufficient with the full numerical approach.
We have chosen to use restricted cubic spline functions of log time, because in many applications we have found this to provide an equivalent or better fit, compared with using splines of time. However, in studies with age as the timescale, it may be more appropriate to use spline functions of untransformed time.
Other approaches to modelling the baseline hazard and time-dependent effects include using the piecewise exponential framework, through either a Bayesian [38] or classical approach [39]. Han et al. [39] developed a reduced piecewise exponential approach that can be used to identify shifts in the hazard rate over time based on an exact likelihood ratio test, a backward elimination procedure and an optional presumed order restriction on the hazard rate; however, it can be considered more of a descriptive tool, as covariates cannot currently be included. The piecewise approach assumes that the baseline and any time-dependent effects follow a step function. Alternatively, using splines, as described in this article, would produce a more plausible estimated function in continuous time, with particular benefits in terms of prediction both in and out of sample, compared with the piecewise approach.
In this article, we have only looked at fixed effect survival models; however, future work involves the incorporation of frailty distributions. User-friendly Stata software, written by the authors, is provided, which significantly extends the range of available methods for the parametric analysis of survival data [14].
Appendix A

If t_0i < k_01 and k_01 < t_i < k_0n, then we have H_1i(t_i) as defined previously, with H_2i(t_i) evaluated by numerical quadrature over [k_01, t_i] and H_3i(t_i) = 0.