Exploring the Clustering Effect of the Frailty Survival Model by Means of the Brier Score

In this article, the Brier score is used to investigate the importance of clustering for the frailty survival model. For this purpose, two versions of the Brier score are constructed, i.e., a “conditional Brier score” and a “marginal Brier score.” Both versions of the Brier score show how the clustering effects and the covariate effects affect the predictive ability of the frailty model separately. Using a Bayesian and a likelihood approach, point estimates and 95% credible/confidence intervals are computed. The estimation properties of both procedures are evaluated in an extensive simulation study for both versions of the Brier score. Further, a validation strategy is developed to calculate an internally validated point estimate and credible/confidence interval. The ensemble of the developments is applied to a dental dataset.


Introduction
For frailty models, the importance of clustering typically is assessed by means of the variance of the frailty distribution only: when the variance is large, clustering is important and vice versa. In this article, we would like to investigate the importance of clustering in more detail. To this end, we propose to assess how clustering affects the quality of the model predictions.
In the course of the last three decades, many different measures of the predictive ability have been proposed for univariate survival data (O'Quigley et al., 2005;Schemper and Stare, 1996). These measures either measure discrimination, calibration, or a combination thereof, also called the overall performance (Steyerberg et al., 2010). Until now only the concordance probability, which is a discrimination measure, has been extended to the clustered data setting (Van Oirbeek and Lesaffre, 2010). In practice, many researchers have argued that a measure of the overall performance should be evaluated on top of a discrimination measure (Steyerberg et al., 2010). We therefore extended the Brier score in this article, which is a measure of the overall prediction performance of a regression model, to the clustered setting (Graf et al., 1999;Steyerberg et al., 2010). This extension results in two different versions of the Brier score, namely a conditional Brier score, which captures the predictive ability of the covariate and clustering effects jointly, and a marginal Brier score, which captures the predictive ability of the covariate effects only. By comparing both versions of the Brier score, one can evaluate explicitly the additional predictive gain of the clustering effects besides that of the covariate effects. This information cannot be obtained from a marginal model since it does not gauge the intra-cluster correlation explicitly. In order to check whether this predictive gain is important, we developed a 95% credible/confidence interval procedure as well as an internal validation procedure for both versions of the Brier score.
We first review the Brier score and its estimator for univariate survival models (Section 2). In Section 3, two versions of the Brier score are developed for the frailty model using a Bayesian and a likelihood approach. Their properties have been investigated in an extensive simulation study, detailed in the supplemenatry materials (available online) and summarized in Section 4. An internal validation scheme is presented in Section 5. We apply all our developments to a dental dataset in Section 6, followed by a discussion.

Brier Score for Univariate Survival Data
Consider a randomly selected subject i with a true failure time T i , a stochastic covariate vector X i , and the predicted survival probability S(t|X i ) at a given time point t > 0. The time-dependent squared residual BS i (t) for the ith subject at time point t is defined as: where Y i (t) = I {T i > t} corresponds to the event status of subject i at time point t. The Brier Score BS(t) is the population version of this measure and corresponds to the average squared distance between the event status and the predicted survival curve at t, or where the expectation is taken over the failure time distribution and the covariate distribution. This measure ranges between 0 and 1 whereby a value close to zero indicates a well-predicting survival model. BS(t) can be used to construct a time-dependent measure of the explained variance R 2 (t) by scaling the Brier score of the model of interest to the Brier score under a noninformative model BS 0 (t), i.e., R 2 (t) = 1 − BS(t) BS 0 (t) . If we assume that the model of interest has a predictive ability that is superior or equal to one of the null model, then R 2 (t) ranges by definition between 0 and 1. In the likelihood framework the Kaplan-Meier model (Kaplan and Meier, 1958), and in the Bayesian framework the gamma independent increments model (Kalbfleisch, 1978) can be used as a noninformative model. An integrated version of the Brier score (I BS) can be constructed on a time interval [τ 1 , τ 2 ] as I BS = τ 2 τ 1 BS(t)dW (t), with τ 1 mostly equal to 0 and the weight function W (t) mostly equal to t/(τ 2 − τ 1 ). I BS therefore expresses the average behavior of the Brier score over the time interval [τ 1 , τ 2 ] . Other weight functions W (t) can be proposed, for instance when one wishes to focus on the predictive ability of the model in certain parts of the considered time interval (Graf et al., 1999).
For univariate data, a uniformly consistent estimator has been provided for the Brier score, when censoring and failure times are conditionally independent given the covariates (Gerds and Schumacher, 2006). Let t i be the observed failure time of subject i such that t i = min(T i , T c,i ) corresponds to the observed failure time, with T c,i the right censoring time and T i the true failure time. If t i = T i , then t i is a true failure time and the censoring indicator δ i equals 1. If t i = T c,i , then t i is a censoring time and the censoring indicator δ i equals 0. For a sample of size n, Gerds and Schumacher (2006) proposed to estimate the Brier score as: where y i (t) is the observed event status, x i (z i ) the observed covariate vector of the survival (censoring) model, φ i (t,Ĝ(t|z i )) the time-dependent censoring weight, andŜ(t|x i ) (Ĝ(t|z i )) the estimator of the conditional survival function of the survival (censoring) model of subject i. Since the estimated model parameters are directly used in the estimation of the Brier score, they are part of both the selected failure time model and the estimated Brier score. The censoring model is fitted to the data by interchanging the role of event time and censoring time. The censoring model (and its covariates Z) and the survival model (and its covariates X) are usually not the same. For a subject i, the estimated censoring weight φ i (t) equals: with t i − the time point just before t i . BS(t) is a consistent estimator of the Brier score under conditional independence of the censoring mechanism and the failure time model and if the censoring model is correctly specified. Hence, because φ(·) only depends on the censoring model, estimator (1) is robust against misspecification of the failure time model (Gerds and Schumacher, 2006). Since the event status y i (t) is unknown for a censored subject i at a time point t ≥ t i , Graf et al. (1999) defined that φ(·) equals 0 for this subject when t ≥ t i . Consistency of this estimator was proven by Gerds and Schumacher (2006). Based on an extensive simulation study, the latter authors also stated that inverse probability of censoring weighted (IPCW) estimates based on the Kaplan-Meier model (Kaplan and Meier, 1958) can be used, in case that covariates have no or little impact on the censoring process.

The Two Proposals
In the presence of clustering, a frailty model can be used to model the survival times. This model accounts for clustering by introducing a frailty term w j to each separate cluster j of the dataset. The frailty term w j is a positive random variable with a frailty distribution f W (w|ζ ), depending on parameters ζ . A parametric distribution such as the gamma distribution is typically chosen. Two model formulations can be proposed for a frailty model: a conditional frailty model that explicitly contains the frailty terms in its formulation and a marginal frailty model that is constructed by integrating out the frailty terms from the conditional frailty model (Duchateau and Janssen, 2008, chap. 2). Hence, a conditional survival curve S C (t|X, W ) and a marginal survival curve S M (t|X) ≡ S C (t|X, w)f W (w|ζ )dw can be constructed. To exemplify the difference between the two survival functions, take for example the proportional hazards (PH) frailty model. For this model, it holds that S C (τ |X, W ) = 0 (τ ) exp(−W exp(β T X)) with 0 (τ ) the cumulative baseline hazard at time point τ and β a vector of regression coefficients. For PH frailty models of the power variance family, Two versions of BS(t) can therefore be defined for frailty survival models. The conditional Brier score BS C (t) compares for each subject at time point t the binary event status with the conditional survival curve, or: with BS C,w (t) the conditional Brier score of the cluster characterized by w. BS C (t) can be seen as a weighted average cluster conditional Brier score. The marginal Brier score BS M (t) compares at time point t the binary event status with the marginal survival curve, or: In summary, two different versions of the Brier score can be defined for the frailty model: BS C (t) evaluates the predictive ability of the chosen frailty model while accounting for the covariate effects and the clustering effect, while BS M (t) accounts for the covariate effects of the chosen model only. As such, the comparison of BS M (t) with BS C (t) expresses the added (predictive) value of considering the clustering effects explicitly on top of the covariate effects for the chosen model. As proven in Section A of the supplementary material (available online), BS C (t) is always lower than or equal to BS M (t). This means that, on average, the conditional survival curve follows the event status at least as closely as the marginal survival curve.

Estimating B S C (t) and B S M (t) for Frailty Models
BS C (t) and BS M (t) can be estimated by plugging in the estimates of the conditional S C (t|X, w) and the marginal survival curve S M (t|X) respectively in (1). Consider a sample of size n with J clusters of size n 1 , n 2 , . . . , n J and J j =1 n j = n. Both Brier scores can then be estimated as: and, with x ji , z ji , and y ji the covariate vector of the survival model, the covariate vector of the censoring model, and the event status for subject i of cluster j, respectively, and with BS C,j (t) = 1 ) the estimated conditional Brier score of cluster j at time point t. Clearly, BS C (t) is a weighted average of all cluster conditional Brier scores of the dataset. The censoring weight φ i (t) of a given subject i is estimated as in (2). When an analytical expression for the marginal survival curve is lacking for the frailty model,Ŝ M (t|x ji ) can be estimated as 1/K K k=1Ŝ C (t|x ji ,ŵ k ) witĥ w k sampled from f W (w|ζ ) and K large (Molenberghs and Verbeke, 2005, chaps. 7 and 11). Note that BS C (t) and BS M (t) are consistently estimated for the considered survival model only if the censoring model is correctly specified (Gerds and Schumacher, 2006).

Point and Interval Estimates of the Brier Scores
Both Brier scores can be computed using a Bayesian or a likelihood procedure, combined with the percentile nonparametric bootstrap method to construct a credible interval or a confidence interval, respectively. In these procedures, the estimation of the failure time model as well as the estimation of the Brier score, which includes the estimation of the censoring model, are both entirely performed using either Bayesian approaches or likelihood approaches. The bootstrap technique has been adapted to the clustered data setting, i.e., resampling by cluster and always selecting the same number of clusters in each bootstrap sample (Field and Welsh, 2007). The Bayesian procedure also necessitates this bootstrap technique to ensure that the resulting 95% credible intervals also capture the uncertainty of the estimation of the model parameters while using future values of the observed covariates. The description of both procedures using the BC a method instead of the percentile method can be found in Section B of the supplementary material (available online). In the remainder of this section, the measure of interest is denoted as M(t). For the Bayesian procedure, this amounts to: The likelihood procedure corresponds to: (a) Determine the point estimates of the model parameters, which are obtained for the PH frailty model by maximizing the marginal likelihood for the parameters of the baseline hazard function 0 (t) and the covariate effects β after which empirical Bayes estimates are calculated for the frailty terms w (Duchateau and Janssen, 2008, chap. 2). This step needs to be applied to the failure time model and the censoring model separately.

Simulation Study
The estimation properties of the likelihood and Bayesian estimation procedures (as described in Section 3.3) are checked for the parametric and semi-parametric survival model in an extensive simulation study, in which factors such as the censoring percentage and the cluster size are varied. Apart from an overoptimistic empirical bias of IBS C and IBS M when the censoring percentage is very high (80%) and an underoptimistic empirical bias of IBS C when the cluster size is very low (2), the semi-parametric likelihood approach shows a similar behavior as the parametric likelihood approach for the Bayesian method. For the likelihood method, the estimation properties of IBS C and IBS M using either the parametric or semi-parametric survival model are similar under all investigated scenarios. In general, BS M (t) is well estimated by both the Bayesian and likelihood methods. BS C (t), however, suffers from an overoptimistic bias for the likelihood method, especially when the cluster size is small and when the censoring percentage is high, while good estimation properties are observed for the Bayesian method. This overoptimistic bias is caused by the empirical Bayes estimates of the frailty terms: in a small simulation study, we saw that the overoptimistic bias of BS C (t) disappeared once the frailty terms were replaced by their population values. Similar results were found elsewhere Lesaffre, 2010, 2012). The 95% credible/confidence intervals based on the percentile or BC a method possess good coverage properties.
Sensitivity against misspecification of the frailty distribution and the censoring model were investigated as well for the parametric model. The estimation properties of the likelihood procedure are very similar to the corresponding properties of the Bayesian procedure, except for an additional overoptimistic empirical bias for IBS C .
Given that the cluster size is sufficiently large (larger than 10 in our simulation study), BS C (t) is virtually unaffected by misspecification of the frailty distribution. BS M (t), however, is in general strongly affected by misspecification. We observed that the conditional survival curves remain of good quality and that the baseline hazard "compensates" the misspecification of the frailty distribution. As a consequence, the quality of the marginal survival curve may be poor. We therefore recommend to use BS C (t) for model selection and BS M (t) to check how the clustering effect improves upon the predictive ability of the covariates for the model at hand. Note that this improvement in predictive ability does not necessarily correspond to one of the correct model. A functionally misspecified covariate effect or a misspecified censoring model will result in an overoptimistic bias, especially for high censoring percentages (50% and 80% in our simulation study). Note, however, that it is easier to detect misspecifications of the censoring model as the censoring percentage increases. Covariates that have no impact on the censoring process will not influence the quality of the estimation when included in the censoring model. Omitting important variables, however, does result in an overoptimistic bias. As a result, we recommend to use a saturated censoring model and to select the censoring model that results in the lowest estimates for BS C (t) and BS M (t). Note that this agrees with the results of Gerds and Schumacher (2006). Good estimates for the explained variance R(t) (based on either BS C (t) or BS M (t)) are obtained even when the censoring model is misspecified. Indeed, when the Brier score is used for comparative purposes only, it is only necessary that the same censoring model is used for all survival models (Gerds and Schumacher, 2006). For more details on the setup of the simulation study and its results, we refer to Sections C and D of the supplementary material (available online).

Internal Validation
Since external data are often lacking, the same data are usually used for both model building and measuring the predictive ability. As a result, overoptimistic estimates of the predictive ability are often acquired. One can correct for this by applying an adaptation to the clustered setting of the .632 and .632+ bootstrap cross-validation procedures of the Brier score for univariate survival models (Gerds and Schumacher, 2007). Let M(t) be the measure that one wants to validate and M(t) the overoptimistic point estimate of the measure. The procedure then corresponds to: Despite that step (a) is different from the procedure proposed by Gerds and Schumacher (2007), the properties of the upper procedure are the same as the properties of their procedure since for both procedures the training set is supported on 63.2% of the original data points (Efron and Tibshirani, 1997). The internally validated estimate corresponds to (1 − , with ψ(t) = 0.632 for the .632 bootstrap cross-validation estimate M .362 (t) and ψ(t) = 0.632 1−.368 R ror (t) for the .632+ bootstrap cross-validation estimate M .362+ (t) (Gerds and Schumacher, 2007). In the latter expression, R ror (t) is the relative overfitting rate that corresponds to: where M NoInf (t) corresponds to the no-information error rate or the Brier score for which the event status is independent from the covariates x. As such, for each subject i of cluster j, its contribution to M NoInf (t) is computed by averaging over all observed x values of the dataset: When R ror (t) is lower than 0 (higher than 1), R ror (t) is fixed to 0 (1). The .632+ bootstrap cross-validation estimate is found to perform better than the .632 bootstrap cross-validation estimate (Efron and Tibshirani, 1997). In order to compute BS C (t), an estimate of the random effect of each of the clusters of the dataset is needed. However, in the above procedure it might happen that some clusters of the original dataset are not represented in the training set of step (a) such that in the validation set of step (c) the cluster-specific estimate of BS C (t) cannot be calculated for these clusters. In that case we propose to calculate BS C (t) in the validation set of step (c) only for those clusters that were present in the training set of step (a). It is assumed that BS C (t) cannot be calculated for a minority of the clusters of the validation dataset only such that the 63.2/36.8 proportion of the training and validation set is not too strongly disrupted. A too strong disruption of this proportion invalidates the above internal validation scheme.
A credible/confidence interval can be constructed around the internally validated point estimate by combining the above internal validation procedure and the bootstrap procedures of Section 3. More specifically, we start by sampling K bootstrap datasets. For each of these bootstrap datasets, the above procedure is repeated resulting in K validated measures M (k) .632 (t) and M (k) .632+ (t) with k = 1, . . . , K. These validated measures M (k) .632 (t) and M (k) .632+ (t) are then used to construct a 95% credible/confidence interval by applying step (e) of the percentile method (see Section 3).

Introduction
Up until two decades ago, amalgam restorations were commonly used as a material for dental fillings. To this end, a clinical study was set up to investigate the effect of different factors on the longevity of amalgam restorations. The 12 covariates and 15 derived variables of the study are the four cavity wall treatments "CSA," "CWF," "Copalite," and "Silver Suspension," the alloy of the amalgam (levels: "NTD," "Tytin," and "CNG"), the caries status, the type of the restoration (levels: "MO/DO" and "MOD"), the type of tooth (levels: "premolar" and "molar"), the position of the tooth in the mouth (levels: "upper left," "upper right," "lower left," and "lower right"), the operator (levels: "operator 1," "operator 2," and "operator 3"), gender, and age. The amalgam dataset is composed of three clinical trials and in each clinical trial the four cavity wall treatment effects and the three investigated amalgam alloys were combined into four treatment modalities assigning each modality randomly to four (or 8 or 12) teeth in a 2 x 2 (factorial) design within patient (Kreulen et al., 1998). Note that this study was mainly designed to evaluate significance of the four cavity wall treatment effects and the three investigated amalgam alloys. In this article, we will evaluate the predictive potential of all the measured covariates.
Since the early failures were suspected to be of a different nature than the late failures and since there were only two early failures, we restrict our analysis to the late failures only, i.e., failures that occurred at least nine years after enrollment in the study. This results in 174 patients contributing with 1,347 amalgam restorations. The clustered data structure is unbalanced, with 4, 8, and 12 restorations seen in 36, 89, and 33 patients, respectively. Note that 16 patients contributed to the study with a number of teeth differently from 4, 8, or 12. The median follow-up time is 14.48 and the true failure time is observed for 187 amalgam restorations only, leading to a censoring percentage of 86.2%. In Kreulen et al. (1998), it is mentioned that, in addition to these 187 failures (called true failures in their paper), 20 false failures were observed. A competing risk approach might therefore be needed here. However, we limited our analysis to the true failures only, since the 20 false failures were not identified in the dataset, but more importantly the current analysis merely serves to illustrate our proposed measures.
Based on the simulation results described in Section C of the supplementary material (available online), we have chosen to use the Bayesian gamma frailty generalized gamma model as the survival model. Different censoring models for the IPCW weights were tested with the Bayesian generalized gamma model consisting of all 12 covariates resulting in the lowest Brier score estimates. For the computation of IBS C , IBS M , and IBS 0 , BS C (t), BS M (t), and BS 0 (t) were integrated from 9 to 15 years, since, as recommended by Graf et al. (1999), censoring is not too heavy at this time point. The .632+ bootstrap crossvalidation estimators were chosen for the internal validation of BS C (t) and BS M (t). The R code that is used to perform the below analysis can be found in the supplementary material (available online). For reasons of confidentiality, we could not provide the amalgam dataset such that a fictive dataset is supplied with this code.

The Analysis
6.2.1. Model selection. The objective of this analysis was to find the best predicting survival model. A model selection step was therefore performed to find the covariate combination that results in this best predicting model. To this end, the all-possible-regression procedure was applied, meaning that the predictive ability of the corresponding survival model is evaluated for each of the 2 12 − 1 different subsets of the 12 covariates. Based on the simulation results of Section D of the online supplementary material (IBS C is robust against misspecification of the frailty distribution, but IBS M not), we have chosen IBS C as selection criterion. To avoid the effect of overfitting, the IBS C values were internally validated. We chose the all-possible-regression procedure, though other methods could be applied as well. From Fig. 1, one can see that the best predicting model is a model with five covariates, containing "CSA," "CWF," and "Copalite" cavity wall treatments, caries status, and gender as covariates.
6.2.2. Predictive ability of the final model. The predictive ability of the selected model will be investigated in more detail now. The apparent and validated point estimate and 95% percentile credible interval of the Brier scores (B = 100, K = 100) are shown in Table 1. We see that IBS M improves upon the null model and that IBS C improves upon both the null model and IBS M . Indeed, even after internal validation, IBS M attains an explained variance of 17.7% and IBS C an explained variance of 31.2%. This means that the marginal (conditional) survival curves are on average 17.7% (31.2%) closer to the observed event status than the survival curve of the null model. The high predictive ability of the model can thus be attributed to a combination of strong covariate effects and a strong clustering effect. Therefore, if one would succeed in a future study to fully capture the clustering effect in covariates and if also the same covariates were measured as the ones included in this model, the explained variance of IBS M of this future study can potentially attain 31.2%. Note that all the apparent and validated 95% percentile credible intervals show a strong overlap and that the validated 95% percentile credible intervals are wider than the apparent 95% percentile credible intervals. Also, the drop in explained variance before and after internal validation is more pronounced for IBS C than for IBS M . Due to a low mean of the relative overfitting rate for both IBS C and IBS M (33.4% and 3.0% respectively), the .632 estimate of both IBS C and IBS M only differ mildly from their corresponding .632+ estimate ( IBS C,.623 = 0.0191 versus IBS C,.623+ = 0.0194 and IBS M,.623 = 0.0231 versus IBS M,.623+ = 0.0232).
In the left panel of Fig. 2, the apparent and validated time-varying estimates of the Brier scores are shown. Up until 14 years after the start of follow-up, all curves are very similar and barely differ from zero. From 14 years onwards, the apparent and validated BS C (t) are clearly lower than the apparent and validated BS M (t) and the apparent BS 0 (t),   (Williams, 1995).
Ticks are added to visualize the observed failure times. and the apparent and validated BS M (t) are clearly lower than the apparent BS 0 (t). From the right panel of Fig. 2, it is clear that these time patterns are due to the shape of the observed failure time distribution. Indeed, the majority of the observed failures occur after 14 years after the start of follow-up and since the survival curves of the null model and the considered survival model remain close to 1 up until 14 years after the start of follow-up, the Brier scores result in low values.
terms of the clusters of the external dataset are available, such that BS C (t) cannot be calculated for that cluster. This underlines the importance of a good internal validation procedure for this measure. A topic of future research might be to estimate a frailty term of such a new cluster. In this article we extended the Brier score for frailty models with a two-level clustering structure, since this is the most common clustering structure in survival analysis (Duchateau and Janssen, 2008, chaps. 2 and 6). In practice, however, more complicated clustering structures may be encountered. In Van Oirbeek and Lesaffre (2012) for example, it is shown how the concordance probability and the Brier score can be extended for nested two-level and three-level multilevel binary regression models.