Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions

Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modeled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of shape mixtures of log-normal distributions, which covers a wide range of density and hazard functions. Bayesian inference under nonsubjective priors based on the Jeffreys’ rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. In addition, we propose a method for outlier detection based on the mixture structure. A simulation study illustrates the performance of our methods under different scenarios and an application to a real dataset is provided. Supplementary materials for the article, which include R code, are available online.


INTRODUCTION
Frequently, standard survival models do not accommodate all features of real applications. Datasets often exhibit more "rare" or "tail" observations than predicted by the usual models. Hence, models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers (Barros, Paula, and Leiva 2008). A second, related issue is the existence of specific individual factors that result in heterogeneity of the survival times, which cannot be captured by covariates (Marshall and Olkin 2007). Therefore, the typical assumption that the survival times correspond to realizations of random variables T_1, ..., T_n that have the same "thin-tailed" distribution (possibly depending on a set of covariates) can be inappropriate. An example of such a dataset is the Veterans' Administration (VA) lung cancer data presented in Kalbfleisch and Prentice (2002), for which the previous literature found strong evidence of influential observations and unobserved heterogeneity related to outliers (e.g., Barros, Paula, and Leiva 2008; Heritier et al. 2009).
We consider the use of infinite mixtures of lifetime distributions as a solution for these issues. Mixture modeling can be interpreted as the introduction of a random effect on the survival distribution. This idea has been mentioned by previous authors (e.g., Marshall and Olkin 2007), but is not yet much used in applied work.
In particular, this article explores the shape mixtures of log-normal (SMLN) distributions, for which the shape parameter is assigned a mixing distribution. This new class covers a wide range of shapes, in particular cases with fatter tail behavior than the log-normal. It includes the already studied log-Student t, log-Laplace, log-exponential power, and log-logistic distributions, among others. This article puts the earlier literature into a common (more general) framework and develops objective Bayesian inference methods, which should be attractive for practitioners. The proposed priors do not require the elicitation of hyperparameters and can be used in the (frequently encountered) setting in which no reliable prior information is available. They also provide a baseline for comparison when such information is available.

Catalina Vallejos (E-mail: cata.vallejos.m@gmail.com) is Ph.D. student and Mark Steel (E-mail: m.steel@warwick.ac.uk) is Professor, Department of Statistics, University of Warwick, Coventry CV4 7AL, United Kingdom. The authors are grateful to three referees, an associate editor, and Javier Rubio for insightful and constructive comments. Catalina Vallejos acknowledges research support from the University of Warwick and from the Department of Statistics of the Pontificia Universidad Católica de Chile.
Section 2 introduces the use of mixture families of distributions with particular focus on the SMLN family. Covariates are introduced through an accelerated failure times (AFT) model. The interpretation of the regression coefficients is not affected by the mixing distribution. This is an advantage over proportional hazards models with frailty terms, in which the interpretation of the regression parameters is conditional on the random effect, and the proportional hazards property is not preserved after mixture (Wienke 2010). Section 3 analyzes aspects of Bayesian inference for models in the SMLN family. This analysis accommodates the presence of censored observations. Nonsubjective priors based on the Jeffreys' rule are proposed and conditions for the propriety of the posterior distribution are provided. In addition, we highlight that the use of point observations can affect the existence of the posterior distribution, and a solution through set observations is considered. Section 3 also discusses some implementation details and proposes a method for outlier detection that exploits the mixture structure. A simulation study in Section 4 illustrates the performance of the proposed framework under different scenarios. We also show that, even for small sample size or a high proportion of censoring, standard Bayesian model comparison criteria can successfully detect departures from the log-normal model. In Section 5, we apply our models to the VA lung cancer dataset. SMLN models fit these data better than the log-normal model and uncover strong evidence of the presence of heterogeneity that is not accounted for by the available covariates. Finally, Section 6 concludes. All proofs are contained in the Appendix without mention in the text.

MIXTURES OF LIFE DISTRIBUTIONS
Let T_1, ..., T_n be the survival times of n independent individuals. Usually, T_1, ..., T_n are assumed to have the same "thin-tailed" distribution such as a log-normal or a Weibull (possibly depending on a set of covariates). However, this is often not appropriate in the face of unobserved heterogeneity between the survival times. This heterogeneity can be interpreted as unobserved individual effects, and can also be related to the presence of outlying observations. This article considers the use of mixtures of life distributions to account for unobserved heterogeneity and add robustness to the presence of outliers. The distribution of T_i is defined as a mixture of life distributions if and only if its density function is

f(t_i | ψ, θ) = ∫_L f(t_i | ψ, Λ_i = λ_i) dP_Λi(λ_i | θ),    (1)

where f(· | ψ, Λ_i = λ_i) represents the density function associated with a lifetime distribution, which depends on the values of ψ and λ_i (underlying distribution), and λ_i can be understood as a random effect associated with each individual (frailty term). The mixing distribution has cumulative distribution function P_Λi(· | θ) with support L and depends on a parameter θ. If L is a finite set of values, the distribution of T_i is a finite mixture of life distributions. However, here we focus on the case where Λ_i is a continuous positive random variable (usually L = R+), in which case f(· | ψ, θ) can be interpreted as an infinite mixture of densities.
The intuition behind the underlying model carries over to these mixtures. Conditional on the mixing parameters, the base model applies. Therefore, if there are any theoretical or practical reasons underpinning this model (without mixing), the same reasons hold for the mixture model in the presence of unobserved heterogeneity. We specifically focus here on mixtures generated from log-normal distributions. The log-normal distribution arises as the limiting distribution when additive cumulative damage is the cause of death or failure. By the previous argument, mixtures generated from log-normal distributions can be justified in the same context. Besides mixtures of log-normal distributions, a large number of mixture families can be generated using (1). For example, Jewell (1982) explored mixtures of exponential and Weibull distributions. Barros, Paula, and Leiva (2008) and Patriota (2012) considered an extended class of Birnbaum-Saunders distributions that can be represented as in (1). The latter family is motivated by models for crack extensions, where the mixing distribution accounts for dependence between the cracks.

The Family of Shape Mixtures of Log-Normals
Definition 1. A random variable T_i has a distribution in the family of shape mixtures of log-normals (SMLN) if and only if its density can be represented as

f(t_i | μ, σ², θ) = ∫_L f_LN(t_i | μ, σ²/λ_i) dP_Λi(λ_i | θ),    (2)

where f_LN(· | μ, σ²/λ_i) corresponds to the density of a log-normal distribution with parameters μ and σ²/λ_i, and λ_i is a realized value of a random variable Λ_i that has distribution function P_Λi(· | θ) defined on L ⊆ R+ (possibly discrete). Denote T_i ~ SMLN_P(μ, σ², θ).

Alternatively, Equation (2) can be expressed as a hierarchical representation, which corresponds to

T_i | Λ_i = λ_i ~ log-normal(μ, σ²/λ_i),   Λ_i ~ P_Λi(· | θ).    (3)

The SMLN family can be interpreted as a mixture of log-normal distributions with random shape parameter or as the exponential transformation of a random variable distributed as a scale mixture of normals. This family includes a number of distributions that have been proposed in the context of survival analysis; Table 1 lists some of them. In particular, the log-Student t distribution was introduced by Hogg and Klugman (1983) and the log-Laplace appeared in Uppuluri (1981). The log-exponential power was proposed by Vianelli (1983) and used in Martín and Pérez (2009). The log-logistic distribution was introduced by Shah and Dave (1963) and is used regularly in survival analysis, hydrology, and economics. This list can be increased by varying the mixing distribution. For example, all the mixing distributions used for scale mixtures of normals listed in Fernández and Steel (2000) can be used in this context. For identifiability reasons, the mixing distribution must not have separate unknown scale parameters (unknown scale parameters are allowed as long as they are linked to other features of the mixing distribution, for example, Λ_i ~ Gamma(θ, θ)). As illustrated in Figures 1 and 2, the SMLN family allows for a wide variety of shapes for the density and the hazard function.
For example, while the hazard rate of the log-normal distribution has an increasing initial phase, the log-Laplace and log-logistic distributions produce a monotone decreasing hazard rate for some values of σ 2 .
Whereas all positive moments exist for the log-normal, this is not necessarily the case for the shape mixtures: in particular, we can show that no positive moments exist for the log-Student t for any finite value of ν, and the log-Laplace and log-logistic models only allow for moments of order below 1/σ. The log-exponential power distribution with α > 1 does possess all moments.
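The hierarchical representation (3) makes simulation from the family straightforward. The following is a minimal sketch (in Python, with our own function names) for the log-Student t member, whose mixing distribution is Gamma(ν/2, ν/2); note that NumPy's gamma sampler is parameterized by shape and scale, so a rate of ν/2 corresponds to a scale of 2/ν.

```python
import numpy as np

def r_smln_log_t(n, mu, sigma2, nu, rng=None):
    """Simulate from the log-Student t member of the SMLN family via the
    hierarchical representation: Lambda_i ~ Gamma(nu/2, rate nu/2),
    then T_i | lambda_i ~ log-normal(mu, sigma2 / lambda_i)."""
    rng = np.random.default_rng(rng)
    lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)  # rate nu/2 -> scale 2/nu
    z = rng.standard_normal(n)
    return np.exp(mu + np.sqrt(sigma2 / lam) * z)

# Sanity check: log(T) follows a Student t with nu degrees of freedom,
# location mu and scale sqrt(sigma2), so the median of T is exp(mu).
t = r_smln_log_t(200000, mu=1.0, sigma2=0.1, nu=5, rng=1)
print(np.median(np.log(t)))  # close to 1.0
```

The same template covers any SMLN member: only the sampler for the mixing variable changes.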

The AFT-SMLN Model
An important aspect of survival modeling is the inclusion of covariates. Throughout, we will condition on the covariates, which are assumed to be constant in time. An AFT model is introduced. The AFT-SMLN model expresses the dependence between the covariates and the survival time by replacing the parameter μ with x_i'β, so that

T_i ~ SMLN_P(x_i'β, σ², θ),   i = 1, 2, ..., n,    (4)

where x_i is a vector containing the values of k covariates associated with individual i and β ∈ R^k is a vector of parameters. This can also be interpreted as a linear regression model for the logarithm of the survival times with error term distributed as a scale mixture of normals. As the median of T_i in (4) is given by e^{x_i'β}, e^{β_j} is interpreted as the (proportional) marginal change of the median survival time as a consequence of a unit change in covariate j. This interpretation is not affected by the mixing distribution.
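The median interpretation of e^{β_j} can be checked by simulation. A minimal sketch (the log-Laplace member is used as an illustrative heavy-tailed case; the parameter values are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([4.0, 0.5])   # intercept and one covariate (illustrative values)
sigma = np.sqrt(0.1)

def simulate_aft_log_laplace(x, n=400000):
    # log T = x'beta + sigma * Laplace error: a heavy-tailed SMLN member
    eps = rng.laplace(loc=0.0, scale=1.0, size=n)
    return np.exp(beta[0] + beta[1] * x + sigma * eps)

# The ratio of median survival times for a unit change in the covariate
# should equal exp(beta_1), whatever the mixing distribution.
ratio = np.median(simulate_aft_log_laplace(1.0)) / np.median(simulate_aft_log_laplace(0.0))
print(ratio, np.exp(0.5))  # approximately equal
```

Because the error distribution is symmetric around zero for every SMLN member, the median of log(T_i) is always x_i'β, which is what makes the interpretation mixing-free.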

The Prior
Bayesian inference will be conducted using objective priors that are generated by Jeffreys' rule. This is one of the most common choices in the absence of prior information and has interesting invariance and information-theoretical properties. The next theorem presents the Fisher information matrix for the AFT-SMLN model, which is the basis for the Jeffreys'-style priors.
The expressions involved in k_1(θ), k_2(θ), k_3(θ), and k_4(θ) are complicated (see the proof) and thus I(β, σ², θ) is not easily obtained from this theorem for an arbitrary mixing distribution. Indeed, it is usually more efficient to compute I(β, σ², θ) directly from f(· | β, σ², θ). However, this structure facilitates a general representation of the Jeffreys'-style priors.

Corollary 1. Under the same assumptions as in Theorem 1, it follows that the Jeffreys', independence Jeffreys' (which deals separately with the blocks for β and (σ², θ)), and independence I Jeffreys' (which deals separately with β, σ², and θ) priors can all be written in the form

π(β, σ², θ) ∝ (σ²)^{-p} π(θ),    (9)

where π(θ) is the factor of the prior that depends on θ. For the Jeffreys' prior, p = 1 + (k/2), and p = 1 for the other two priors. If θ does not appear (e.g., log-normal, log-Laplace, and log-logistic models), this prior simplifies to π(β, σ²) ∝ (σ²)^{-p}. Note that the result in Corollary 1 also specifies the prior for θ. The implied priors for the special cases of the log-Student t and the log-exponential power (derived directly from the specific likelihood functions) are explicitly presented in the proof of Theorem 3. To obtain meaningful Bayes factors (BFs) between models, priors with an improper component π(θ) for θ are discarded. For the examples explored throughout this article, this argument discards the independence I Jeffreys' prior for the log-Student t model.

The Posterior Distribution
The three priors presented in Corollary 1 do not correspond to proper probability distributions and therefore the propriety of the posterior distribution must be verified. At this stage, we also allow for censoring, which is assumed to be noninformative. In the following, posterior propriety is verified for the AFT-SMLN model under the priors in Corollary 1.
Theorem 2. Let t_1, ..., t_n > 0 be the survival times (possibly censored) of n independent individuals, realizations of random variables distributed as in (4). Assume the prior given in (9), with ∫ π(θ) dθ = 1. Without loss of generality, assume that only the first n_o observations are uncensored. Define X_o = (x_1, ..., x_{n_o})' and suppose that the rank of X_o is k.
(i) For p = 1, a sufficient condition for posterior existence is n_o > k.
(ii) For p = 1 + k/2, a sufficient condition for posterior propriety is n_o > k together with condition (10), an additional integrability condition involving the mixing distribution.

Theorem 3. Under the assumptions in Theorem 2 and provided that n_o > k, it follows that:

(i) For the log-Student t AFT model, the posterior is proper under the independence Jeffreys' prior. However, the posterior does not exist for the Jeffreys' prior.

(ii) For the log-Laplace AFT model, log-exponential power AFT model, and log-logistic AFT model, the propriety of the posterior can be verified with any of the three proposed priors.
Theorem 3 tells us that the log-Student t AFT model does not lead to valid Bayesian inference in combination with the Jeffreys' prior (the independence I Jeffreys' prior was already discarded in Section 3.1). The other models can be combined with all priors considered here; of course, the absence of θ in the log-Laplace and log-logistic models implies that the independence Jeffreys' and independence I Jeffreys' priors coincide in those cases.

The Problem With Using Point Observations
Continuous sampling models assign zero probability to particular (point) values. In spite of this, the standard statistical analysis is based on point observations. This situation can cause problems in the context of Bayesian inference. The assessment of the propriety of the posterior distribution is usually conducted without taking into account events that have zero probability. Hence, the propriety of the posterior on the basis of a specific sample of point observations can be precluded. As argued in Fernández and Steel (1998), this issue introduces the risk of having senseless inference. The following theorem illustrates the problem of the use of point observations in the context of the AFT log-Student t model.
Theorem 4. Adopt the same assumptions as in Theorem 2 and assume that n_o > k. If the mixing distribution is Gamma(ν/2, ν/2) and s (k ≤ s < n_o) is defined as the largest number of uncensored observations that can be represented as an exact linear combination of their covariates (i.e., log(t_i) = x_i'β for some fixed β), a necessary condition for the propriety of the posterior distribution of (β, σ², ν) is given by (11).

This result indicates that it is possible to have samples of point observations for which no Bayesian inference can be conducted, unless π(ν) induces a positive lower bound for ν. For the log-Student t model, we only use the independence Jeffreys' prior, so that p = 1 and (11) is violated whenever s > k. When no covariates are taken into account (k = 1), s coincides with the largest number of (uncensored) tied observations. Theorem 4 highlights the need for considering sets of zero Lebesgue measure when checking the propriety of the posterior distribution based on point observations.
In the context of scale mixtures of normals, Fernández and Steel (1998, 1999) proposed the use of set observations as a solution to this problem. We now extend this to the SMLN family. The use of set observations is based on the fact that, in practice, it is impossible to record observations from continuous random variables with total precision, so every observation can only be considered as a label of a set of positive Lebesgue measure. For instance, if the recorded value for the survival time is t_i, it really means that the actual survival time is between t_i − ε_l and t_i + ε_r, where ε_l and ε_r are determined by the accuracy with which the data were recorded (e.g., if the data are recorded as integers, ε_l = ε_r = 0.5). This is equivalent to considering the observation as interval censored. As explained in Section 3.4, the implementation is done through data augmentation, which does not involve a large increase of the computational cost. This procedure also naturally deals with left or right censored observations by taking, respectively, (ε_l, ε_r) = (t_i, 0) or (ε_l, ε_r) = (0, ∞). The following theorem indicates that the use of set observations can ensure a proper posterior distribution in situations where a particular sample of point observations might not.
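In the sampler, the data augmentation step for a set observation amounts to drawing log(t_i) from a normal distribution truncated to the recorded interval. A minimal sketch, with our own function name and illustrative parameter values (the conditional mean and standard deviation would come from the current MCMC state):

```python
import numpy as np
from scipy.stats import truncnorm

def impute_log_time(t_rec, mean, sd, eps_l=0.5, eps_r=0.5, rng=None):
    """Sample a point value of log(T_i) consistent with the set observation
    (t_rec - eps_l, t_rec + eps_r), given the current conditional mean
    (x_i'beta) and standard deviation (sigma / sqrt(lambda_i)) of log(T_i)."""
    rng = np.random.default_rng(rng)
    lo, hi = np.log(t_rec - eps_l), np.log(t_rec + eps_r)
    a, b = (lo - mean) / sd, (hi - mean) / sd   # standardized truncation bounds
    return truncnorm.rvs(a, b, loc=mean, scale=sd, random_state=rng)

# A survival time recorded as the integer 120 stands for the set (119.5, 120.5)
x = impute_log_time(120, mean=4.5, sd=0.3, rng=1)
print(np.exp(x))  # lies in (119.5, 120.5)
```

Right censoring is handled by the same step with an upper bound of +∞, and left censoring with a lower bound at zero.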
Theorem 5. Adopt the same assumptions as in Theorem 2 and assume that n_o > k. Replace the uncensored point observations by sets of positive Lebesgue measure (set observations). The posterior distribution of (β, σ², θ) given t is proper if and only if the marginal likelihood under point observations is bounded from above for any t_o ∈ E, except for a set of zero Lebesgue measure.

Implementation
Bayesian inference was implemented through Markov chain Monte Carlo (MCMC) using the hierarchical representation (3) of the SMLN family and the data augmentation idea of Tanner and Wong (1987). Throughout, we use the prior presented in (9). An adaptive Metropolis-within-Gibbs sampling scheme with Gaussian random walk proposals is used (Roberts and Rosenthal 2009). Both censored and set observations are accommodated through data augmentation (as in Fernández and Steel 1999, 2000). This introduces an additional step in the sampler in which, given the current values of the parameters and mixing variables, point values of the survival times in line with the set observations are simulated. In this case, this can be easily done by sampling log(t_i) from a truncated normal distribution. Regardless of the mixing distribution, and provided that n > 2 − 2p, the full conditionals for β and σ² are normal and inverted gamma distributions, respectively. However, the full conditionals for Λ_1, ..., Λ_n and θ are generally not of a known form for arbitrary mixing distributions. For the Λ_i's, we have π(λ_1, ..., λ_n | β, σ², θ, t) = ∏_{i=1}^n π(λ_i | β, σ², θ, t), where t = (t_1, ..., t_n)' are the simulated survival times and

π(λ_i | β, σ², θ, t) ∝ λ_i^{1/2} exp{−λ_i (log(t_i) − x_i'β)² / (2σ²)} dP_Λi(λ_i | θ).    (12)

For the log-Student t and log-Laplace models, Equation (12) has a known form. In the first case, it is a Gamma((ν + 1)/2, (1/2)[(log(t_i) − x_i'β)²/σ² + ν]) distribution. In the second, it corresponds to an Inverse Gaussian(σ/|log(t_i) − x_i'β|, 1) distribution. If the mixing density has no closed form (as, e.g., for the log-exponential power and log-logistic distributions), the acceptance probability in these Metropolis-Hastings steps is not easily computable. For the log-logistic model, we implement the rejection sampling algorithm proposed in Holmes and Held (2006, p. 163), using the fact that (2√Λ_i)^{-1} has the asymptotic Kolmogorov-Smirnov distribution.
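The Gibbs update implied by the Gamma full conditional for the log-Student t case can be sketched as follows (function name and the check at the bottom are ours; NumPy's gamma sampler takes a scale, i.e., the inverse of the rate):

```python
import numpy as np

def update_lambda_log_t(log_t, mu_i, sigma2, nu, rng=None):
    """Full conditional for the mixing variables in the log-Student t model:
    Lambda_i | ... ~ Gamma((nu+1)/2, rate = [(log t_i - x_i'beta)^2 / sigma2 + nu] / 2)."""
    rng = np.random.default_rng(rng)
    rate = ((log_t - mu_i) ** 2 / sigma2 + nu) / 2.0
    return rng.gamma(shape=(nu + 1) / 2.0, scale=1.0 / rate)

# Observations far from their fitted mean get small lambda draws
# (i.e., inflated variance), which is what makes the mixture robust to outliers.
lam_close = update_lambda_log_t(np.full(10000, 4.0), 4.0, 0.1, nu=5, rng=0)
lam_far   = update_lambda_log_t(np.full(10000, 6.0), 4.0, 0.1, nu=5, rng=0)
print(lam_close.mean(), lam_far.mean())
```

The same structure (one conjugate or rejection step per Λ_i) is what keeps the augmented sampler cheap even with set observations.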
In the case of the log-exponential power model, we adopt the mixture of uniforms representation used in Martín and Pérez (2009). This represents the exponential power density as a mixture of uniform densities, with mixing variables denoted by U_i. We restrict the range of α to (1, 2), which is consistent with the SMLN representation. The case α = 1 is excluded but is covered by the log-Laplace model. We decompose the posterior distribution π(β, σ², α, u_1, ..., u_n | t) as ∏_{i=1}^n π(u_i | t, β, σ², α) × π(β, σ², α | t), and the full conditionals for the U_i's are given in (13). For π(β, σ², α | t), the full conditionals of β, σ², and α can easily be derived from the marginal likelihood (after integrating out the u_i's) of t given (β, σ², α). None of them has a known form and adaptive Metropolis-Hastings steps are implemented. Additionally, for censored and set observations, log(t_i) is sampled from a truncated uniform distribution. In terms of setting up a sampler, it might be easier to simply start directly from the log-exponential power or log-logistic distribution, rather than its interpretation as a scale mixture. However, we would lose the inference on the mixing variables Λ_i (or U_i), which is particularly important in identifying outlying observations (Lange et al. 1989; Fernández and Steel 1999). This is further discussed in Section 3.6. Also, the use of these mixing representations facilitates dealing with censored and set observations. See supplementary material A for more details and R code.

Model Comparison
We consider several standard model comparison criteria. First, we use BFs, defined as the ratio between the marginal likelihoods of the models. Marginal likelihoods will be computed using the methodology proposed by Chib (1995) and Chib and Jeliazkov (2001). The latter was developed for a nonadaptive scheme. Using the stabilized proposal variances, we estimate the marginal likelihood from shorter nonadaptive chains for which the starting values are defined as the converged parameter values of the original chains. Second, the deviance information criterion (DIC) developed by Spiegelhalter et al. (2002) is also provided. It is given by

DIC = E(D(β, σ², θ; y) | y) + p_D,

with p_D = E(D(β, σ², θ; y) | y) − D(β̂, σ̂², θ̂; y) (effective number of parameters) and D(β, σ², θ; y) = −2 log(f(y | β, σ², θ)) (deviance function), where β̂, σ̂², and θ̂ are the posterior medians of β, σ², and θ, respectively. This is computed using the marginal likelihood (integrating out the mixing parameters).
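Given a matrix of MCMC draws, the DIC computation is mechanical. A minimal sketch with a toy log-normal likelihood and synthetic "draws" standing in for MCMC output (all names and numbers here are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm

# Toy deviance based on a log-normal likelihood, parameterized by (mu, sigma2)
def deviance(log_t, mu, sigma2):
    return -2.0 * np.sum(norm.logpdf(log_t, loc=mu, scale=np.sqrt(sigma2)))

rng = np.random.default_rng(0)
log_t = rng.normal(4.0, 0.3, size=50)              # toy data on the log scale
mu_draws = rng.normal(4.0, 0.05, size=2000)        # stand-ins for MCMC output
s2_draws = np.abs(rng.normal(0.09, 0.01, size=2000))

D = np.array([deviance(log_t, m, s2) for m, s2 in zip(mu_draws, s2_draws)])
D_at_median = deviance(log_t, np.median(mu_draws), np.median(s2_draws))
p_D = D.mean() - D_at_median          # effective number of parameters
DIC = D.mean() + p_D
print(p_D, DIC)
```

For the SMLN models the deviance would be evaluated with the marginal likelihood (mixing variables integrated out), as stated above; the plug-in point is the vector of posterior medians.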
A low DIC suggests a better model. In addition, models are compared by the quality of their predictions. We use the conditional predictive ordinate (CPO) (Geisser and Eddy 1979). For observation i, CPO_i is defined as

CPO_i = f(t_i | t_{−i}) = [E(1 / f(t_i | β, σ², θ))]^{−1},

where the expectation is with respect to π(β, σ², θ | t) and f(· | t_{−i}) is the predictive density given t_{−i}, the sample excluding observation i. The density function f(· | t_{−i}) is replaced by the survival function S(· | t_{−i}) for right censored observations (Hanson 2006; Banerjee et al. 2007). Larger values of CPO_i indicate better predictive accuracy for observation i. Geisser and Eddy (1979) also proposed PsML = ∏_{i=1}^n CPO_i as an estimator of the marginal likelihood (also called the pseudo marginal likelihood). Higher values of PsML indicate a better overall predictive performance of the model. Pseudo Bayes factors (PsBFs) can be easily computed as ratios of PsMLs.
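The expectation in the CPO identity is naturally estimated by a harmonic mean of the likelihood over posterior draws. A minimal sketch on the log-time scale with a plain normal likelihood and synthetic draws (illustrative stand-ins for real MCMC output; for survival times on the original scale the density would carry the log-normal Jacobian):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
log_t = rng.normal(4.0, 0.3, size=30)          # toy data on the log scale
mu_draws = rng.normal(4.0, 0.05, size=4000)    # stand-ins for MCMC draws
sd_draws = np.abs(rng.normal(0.3, 0.02, size=4000))

# CPO_i = [ E( 1 / f(t_i | parameters) | t ) ]^{-1}, estimated by a
# harmonic mean over posterior draws; PsML is the product of the CPO_i.
dens = norm.pdf(log_t[:, None], loc=mu_draws[None, :], scale=sd_draws[None, :])
cpo = 1.0 / np.mean(1.0 / dens, axis=1)
log_psml = np.sum(np.log(cpo))
print(cpo.min(), log_psml)
```

Working with log(PsML) avoids underflow when n is large; PsBFs are then differences of log(PsML) values.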

Detection of Influential Observations and Outliers
A robust model will have no (or few) influential observations. Influential observations can be detected using K_i = KL(π(β, σ², θ | t), π(β, σ², θ | t_{−i})), where KL(·, ·) denotes the Kullback-Leibler divergence function (Peng and Dey 1995; Cho et al. 2009). As suggested in McCulloch (1989), we transform K_i in terms of its calibration index

p_i = (1/2)[1 + √(1 − e^{−2K_i})].

In terms of the Kullback-Leibler divergence, the effect of removing observation i is equivalent to assigning probability p_i to an event that has true probability 0.5. A large value of p_i suggests that observation i is influential.
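Once a Monte Carlo estimate of K_i is available, the calibration is a one-liner; a minimal sketch of McCulloch's (1989) calibration p = (1/2)(1 + √(1 − e^{−2K})):

```python
import numpy as np

def calibration(K):
    """McCulloch's calibration of a Kullback-Leibler divergence K:
    p = 0.5 * (1 + sqrt(1 - exp(-2K))). K = 0 gives p = 0.5 (no influence)
    and large K pushes p toward 1."""
    K = np.asarray(K, dtype=float)
    return 0.5 * (1.0 + np.sqrt(1.0 - np.exp(-2.0 * K)))

print(calibration(0.0))              # 0.5: removing the observation changes nothing
print(calibration([0.1, 1.0, 5.0]))  # increasing influence
```

The calibration maps the unbounded divergence onto (0.5, 1), which makes influence comparable across observations and models.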
In addition, the existence of outlying observations will be assessed using the posterior distribution of the mixing variables. Extreme values (with respect to a reference value, λ_ref) of the mixing variables are associated with outliers (see also West 1984). Formally, evidence of outlying observations will be assessed by contrasting the models M_0: Λ_i = λ_ref versus M_1: Λ_i ≠ λ_ref (with all other Λ_j, j ≠ i, free). Evidence in favor of each of these models will be measured using BFs, which can be computed as the generalized Savage-Dickey density ratio proposed in Verdinelli and Wasserman (1995). The evidence in favor of M_0 versus M_1 (i.e., against observation i being an outlier) is given in (15), where the expectation is with respect to π(θ | t, Λ_i = λ_ref). When the parameter θ does not appear in the model, this simplifies to the original Savage-Dickey density ratio (16), where the expectation is with respect to π(β, σ² | t). The main challenge of this approach is the choice of λ_ref. Intuitively, if there is no unobserved heterogeneity, the posterior distribution of the mixing parameters should not be much affected by the data. Therefore, the mixing distribution (which can be interpreted as a prior distribution for Λ_i) will then be close to the posterior distribution of Λ_i. Following this intuition, we illustrate the performance of the reference values by plotting the inverse of the BF in (15) for the log-Student t and in (16) for the log-Laplace and log-logistic models against a standardized log survival time z (given β, σ², and θ). This is defined as log(t) minus its mean, divided by its standard deviation (i.e., σ√E(1/Λ | θ)). As expected, large values of |z| lead to evidence in favor of an outlier. The log-Student t model with a very large number of degrees of freedom requires exceptionally large |z| values to distinguish it from the log-normal case.

Figure 3. Bayes factor for outlier detection as a function of |z|. The log Bayes factor has been rescaled by 2 to apply the interpretation rule proposed in Kass and Raftery (1995). The dotted horizontal line is the threshold above which observations will be considered outliers.
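As an illustration, the Savage-Dickey computation can be sketched for the log-Student t model with ν held fixed, using the Gamma(ν/2, ν/2) mixing prior and the Gamma conditional posterior for Λ_i given in Section 3.4. Function names, λ_ref = 1, and the synthetic "posterior draws" of the squared standardized residual are our own illustrative choices:

```python
import numpy as np
from scipy.stats import gamma

def log_bf_not_outlier(z2_draws, nu, lam_ref=1.0):
    """Savage-Dickey estimate of the (log) Bayes factor in favor of
    M0: Lambda_i = lam_ref (not an outlier), for the log-Student t model
    with nu fixed. z2_draws are draws of (log t_i - x_i'beta)^2 / sigma^2.
    Prior: Lambda_i ~ Gamma(nu/2, rate nu/2); conditional posterior:
    Gamma((nu+1)/2, rate (z2 + nu)/2). scipy uses scale = 1/rate."""
    post = gamma.pdf(lam_ref, a=(nu + 1) / 2, scale=2.0 / (z2_draws + nu))
    prior = gamma.pdf(lam_ref, a=nu / 2, scale=2.0 / nu)
    return np.log(np.mean(post)) - np.log(prior)

rng = np.random.default_rng(0)
# A well-fitted observation (small z^2 draws) vs an extreme one (z = 6):
bf_typical = log_bf_not_outlier(rng.chisquare(1, 5000), nu=5)
bf_extreme = log_bf_not_outlier(np.full(5000, 36.0), nu=5)
print(bf_typical, bf_extreme)
```

A strongly negative log BF (as for the extreme observation) is evidence that Λ_i ≠ λ_ref, that is, evidence of an outlier.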
The log-exponential power model is a special case. As explained in Section 3.4, Bayesian inference for this model is implemented through a mixture of uniforms representation with mixing parameters denoted by U_i. The models for outlier detection in terms of U_i are M_0: U_i = u_ref versus M_1: U_i ≠ u_ref. The expectation of U_i given α is 1 + 1/α and, according to the intuition presented previously, we might use this value as u_ref. With this rule, u_ref is a function of α, which lies in (1.5, 2). In practice, this choice detected large amounts of outliers (even for datasets generated from the log-normal model). We estimate π(u_ref | t) by averaging π(u_ref | t, β, σ², α) in (13) using an MCMC sample from the posterior distribution of (β, σ², α). Hence, if the value of (β, σ², α) is such that π(u_ref | t, β, σ², α) is equal to zero, the BF in favor of observation i being an outlier is computed as infinity. Simulated datasets indicated that the means and medians of (|log(t_i) − x_i'β|/σ)^α, i = 1, ..., n, are around 0.6, regardless of the model from which the data were generated. We then adjust the reference value to u_ref = 1 + 1/α + 0.6. This choice performed much better with simulated datasets (e.g., using log-normal data no outliers were detected). The resulting BFs as a function of z = (log(t) − x'β)/(σ√(Γ(3/α)/Γ(1/α))) (see Figure 3) are not much affected by the value of α.
For all models, moderate changes to the reference values do not have a large impact on the outlier detection curves in Figure 3.

SIMULATION STUDY
A simulation study is designed to evaluate the performance of the proposed mixture scheme versus a log-normal model (with no unobserved heterogeneity) under different scenarios. Two independent covariates, x_1 ~ Ber(0.5) and x_2 ~ Unif(0, 1), are simulated, and an intercept is added (k = 3). Throughout, we use β = (4, 0.5, −1)' and σ² = 0.1 (which are in the range of usual empirical values). Datasets are simulated from the following models: (i) log-normal, (ii) log-Student t with ν = 5, (iii) log-Student t with ν = 20, (iv) log-Laplace, (v) log-exponential power with α = 1.2, (vi) log-exponential power with α = 1.8, and (vii) log-logistic. Four different scenarios are defined through sample size (n = 100, 500) and percentage of censoring (PC = 10%, 70%). These rather small sample sizes are often observed in survival datasets. For each model, 100 independent datasets are simulated under each scenario. In all cases, survival times are rounded to integers to reflect the usual inaccuracy in the data-recording process. Independent censoring times are sampled from a uniform distribution on (0, C), where the value of C is tuned to control the percentage of censoring. Detailed results of the simulation study are displayed in supplementary material B.
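The data-generating step of the study can be sketched as follows for the log-normal case (function name and the particular value of C are our own; C would be tuned per scenario to hit the target censoring percentage):

```python
import numpy as np

def simulate_dataset(n, C, rng=None):
    """One simulated dataset following the study design: x1 ~ Ber(0.5),
    x2 ~ Unif(0,1) plus an intercept, beta = (4, 0.5, -1), sigma2 = 0.1,
    log-normal errors, censoring times Unif(0, C), times rounded to integers."""
    rng = np.random.default_rng(rng)
    X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n), rng.uniform(0, 1, n)])
    beta = np.array([4.0, 0.5, -1.0])
    t = np.exp(X @ beta + np.sqrt(0.1) * rng.standard_normal(n))
    c = rng.uniform(0, C, n)                     # independent censoring times
    obs = np.round(np.minimum(t, c))             # recorded (rounded) times
    event = (t <= c).astype(int)                 # 1 = uncensored
    return X, obs, event

X, obs, event = simulate_dataset(500, C=400.0, rng=0)
print(1 - event.mean())  # realized proportion of censoring
```

For the other data-generating models, only the error distribution in the linear predictor line changes (Student t, Laplace, exponential power, or logistic errors).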
For AFT models, β is usually the parameter of interest and its interpretation is not affected by our mixing scheme. We compare the performance of different SMLN-AFT models based on the posterior median of β. The choice between one of the three Jeffreys'-rule-based priors suggested in Section 3.1 is not very critical for the estimation of β as all priors produce very similar inference for the regression parameter. Of course, estimation is more accurate when the data provide more information, that is, for n = 500 and PC = 10%. There are no major differences between log-normal datasets and those generated by an SMLN model with weak unobserved heterogeneity (log-Student t with ν = 20 and log-exponential power with α = 1.8). In such cases, the log-normal model correctly estimates β. Crucially, fitting SMLN models to log-normal datasets is harmless. The β estimates are concentrated around the true value, although they are slightly more spread out when using a log-Laplace model (which has a very dispersed mixing distribution). As expected, if the data display strong unobserved heterogeneity, mixture models tend to outperform the log-normal one. For those cases, SMLN models produce more accurate estimates of β in terms of both bias and spread, especially under large amounts of censoring. This is even the case when using a different mixing distribution than the one that generated the data. These differences are largest for the log-Laplace datasets and diminish for milder cases of unobserved heterogeneity, like the log-logistic case.
The Bayesian model comparison criteria described in Section 3.5 are applied to each dataset to assess their effectiveness. Here, we focus on the Jeffreys' and independence Jeffreys' priors (both types of independence Jeffreys' priors lead to similar results). The performance of BF is better (and more in line with the other criteria) under the independence Jeffreys' prior, except for the log-logistic data. Under the Jeffreys' prior and with log-normal data, BF assigns relatively little support to the log-normal model when n = 100 (especially with high PC). For k = 3, the Jeffreys' prior favors small values of σ², much more than the independence Jeffreys' (the difference increases with k). When the dataset provides little information (small n and/or large PC), the prior has a strong influence on posterior inference. We might, thus, underestimate σ² and the fitted log-normal model will have too small a spread to accommodate the data, even though they were generated by the log-normal model. Predictive criteria are less affected by this. Overall, DIC, BF, and PsBF point in the same direction, largely successfully detecting the presence and absence of unobserved heterogeneity. However, very mild forms of unobserved heterogeneity (log-Student t with ν = 20, log-exponential power with α = 1.8) are often indistinguishable from the log-normal model. Stronger unobserved heterogeneity is more easily detected (even when n = 100 and PC = 70%). Jointly, these criteria successfully indicate the existence of unobserved heterogeneity. Even in the worst scenario, the log-normal model is correctly detected more than 60% of the time if we use the independence Jeffreys' prior.
In this case, we correctly classify data generated by the log-Laplace model in at least 60% of the cases when n = 100 and at least 82% of the cases for n = 500. With log-logistic datasets, the right model is detected in at least 70% of the simulations with n = 500 under either prior. The rate of correct detection is lower for the log-Student t and log-exponential power models, for which an extra parameter needs to be estimated. The DIC and PsBF criteria do best overall: under both priors they correctly identify models with moderate or strong heterogeneity on the basis of 500 observations with low censoring in at least 57% of the cases.
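To make the predictive criteria concrete, the CPO and pseudo-Bayes factor computations can be sketched directly from MCMC output. The sketch below (function names and the toy likelihood values are hypothetical) uses the standard identities: CPO_i is the harmonic mean of the likelihood of observation i across the posterior draws, and the log PsBF of model A over model B is the sum over observations of log(CPO_i^A / CPO_i^B).

```python
import math

def cpo(lik_draws):
    """Conditional predictive ordinate for one observation:
    harmonic mean of its likelihood over the MCMC draws."""
    n = len(lik_draws)
    return n / sum(1.0 / lik for lik in lik_draws)

def log_psbf(lik_a, lik_b):
    """Log pseudo-Bayes factor of model A over model B.
    lik_a, lik_b: one list of per-draw likelihood values per observation."""
    return sum(math.log(cpo(a)) - math.log(cpo(b))
               for a, b in zip(lik_a, lik_b))

# Toy illustration: made-up per-draw likelihoods for 3 observations.
lik_model_a = [[0.9, 1.1, 1.0], [0.5, 0.6, 0.55], [0.2, 0.25, 0.22]]
lik_model_b = [[0.4, 0.5, 0.45], [0.5, 0.6, 0.55], [0.1, 0.12, 0.11]]
print(log_psbf(lik_model_a, lik_model_b) > 0)  # True: model A preferred
```

A positive log PsBF favors model A; by construction the criterion is antisymmetric in the two models, so reversing the arguments flips the sign.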

APPLICATION: THE VA LUNG CANCER TRIAL
The dataset (presented in Kalbfleisch and Prentice 2002) relates to a trial in which a therapy (standard or test chemotherapy) was randomly assigned to 137 patients diagnosed with inoperable lung cancer. The survival times of the patients were measured in days since treatment, and the following covariates were recorded: the treatment applied to the patient (0: standard, 1: test); the histological type of the tumor (squamous, small cell, adeno, large cell); a continuous index representing the status of the patient at the moment of treatment (the higher the index, the better the patient's condition); the time between diagnosis and treatment (in months); age (in years); and a binary indicator of prior therapy (0: no, 1: yes). The data contain nine right-censored observations. This dataset has previously been analyzed from a frequentist point of view using traditional models such as the Cox, Weibull, log-normal, and log-logistic regressions (see Lee and Wang 2003; Heritier et al. 2009). These models all suggest that the status of the patient at the moment of treatment and the histological type of the tumor are the relevant explanatory variables for the survival time. Nevertheless, evidence of influential observations has been found. Barros, Paula, and Leiva (2008) showed that the inference produced by a log-Birnbaum-Saunders model changes substantially when observations 77, 85, and 100 are dropped. They proposed a log-Birnbaum-Saunders Student t distribution as a more robust alternative for this dataset because it allows for fatter tails and accommodates heterogeneity in the data. This distribution can also be represented through a mixture family as in (1), so our methodology could be extended to include it. Heritier et al. (2009) detected observations 17 and 44 as influential when fitting a Cox proportional hazards model and proposed the use of an adaptive robust estimator instead.
Table 2 presents a summary of the posterior distribution of (β, σ²) when a log-normal AFT model is fitted. This is based on 10,000 draws, recorded from a total of 400,000 iterations with a burn-in period of 200,000 and a thinning of 20. The use of different starting points and the usual convergence criteria strongly suggest convergence of the chains (see supplementary material C). The Jeffreys' and the independence Jeffreys' priors produced similar results. Bayesian inference was conducted on the basis of point and set observations, using l = r = 0.5 for uncensored observations. For this model, inference on point and set observations is quite similar, and the use of point observations does not cause problems for the log-normal model. However, since set observations avoid potential problems with the inference for other models (see Section 3.3), we focus on the analysis with set observations in the rest of the article. The results suggest that the main covariate effects are due to the tumor type and patient status, and are roughly in line with the maximum likelihood results for the log-Birnbaum-Saunders AFT model in Barros, Paula, and Leiva (2008), although the effect of the test treatment is less clearly negative.
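The retained-sample arithmetic quoted above (10,000 draws from 400,000 iterations, after a burn-in of 200,000 and thinning of 20) can be checked with a generic helper; the function name is illustrative.

```python
def retained_draws(total_iter, burn_in, thin):
    """Indices of the MCMC iterations kept after discarding the burn-in
    and thinning the remainder (keep every `thin`-th iteration)."""
    return list(range(burn_in, total_iter, thin))

kept = retained_draws(400_000, 200_000, 20)
print(len(kept))  # 10000, matching the number of recorded draws
```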
We then use the AFT-SMLN model (4) with the continuous mixing distributions presented in Table 1. Bayesian inference is conducted under the Jeffreys'-type priors introduced in Corollary 1. We adopted the same total number of iterations, burn-in, and thinning as for the log-normal model. We compare the models through BFs, PsBFs, the DIC, and the CPO predictive performance, summarized in Figure 4 and Table 3. Clearly, all these criteria provide evidence in favor of mixture models. For the log-Student t model, this evidence is also supported by the fact that inference on ν favors relatively small values. Similarly, the log-exponential power model suggests values of α far from 2. Plots (unreported) of the prior and posterior distributions show that they clearly differ and that the latter is strongly driven by the data. Overall, the log-logistic model seems the best candidate for fitting this dataset. This is in line with the results in Lee and Wang (2003), where, using a maximum likelihood approach, the log-logistic model is preferred to the log-normal and other standard models.
The choice of prior and mixing distribution is not too critical for the inference about β (only results under the independence Jeffreys' prior are reported). For mixture models, the posterior distribution of β is somewhat different from that for the log-normal model (see Table 4). In particular, the effect of the test treatment is less pronounced. The results on β are relatively close to the classical ones reported in Barros, Paula, and Leiva (2008) using the log-Birnbaum-Saunders Student t model and to those in Lee and Wang (2003) using the log-logistic model. Estimates of σ² cannot be compared across models because σ² has a different interpretation in each. Table 3 also indicates that the number of influential observations is smaller for the SMLN models than for the log-normal model, which is consistent with the superior ability of the SMLN models to accommodate unusual observations. For all priors considered, observations 12, 77, 85, 95, 100, and 106 are detected as influential for the log-normal model (with no mixture) when using the threshold p_i ≥ 0.8 (in fact, p_i ≥ 0.9 for observations 85 and 106). In contrast, for all the mixture models, observations 77, 95, and 100 do not appear as influential under either threshold. For any of the mixtures, observation 106 is always considered influential (p_i ≥ 0.9), while observations 12 and 85 are flagged as influential by some of the mixtures. These results are roughly in line with Barros, Paula, and Leiva (2008), where observations 77, 85, and 100 were also identified as (strongly) influential when fitting a log-Birnbaum-Saunders model; using a log-Birnbaum-Saunders-t model, they also label observations 12, 77, 95, and 106 as (mildly) influential.
The posterior distributions of the mixing parameters (not reported) vary substantially between the patients, suggesting heterogeneity in the data. Figure 5 formalizes this by presenting the BF in favor of being an outlier for each of the 137 observations. There is clear evidence for the existence of outlying observations under the suggested priors for all models. Although all priors give similar results, this evidence is slightly stronger for the Jeffreys' prior. The choice of the mixture model does not greatly affect the conclusions. The analysis suggests that, regardless of the prior, observations 77 and 85 are very clear outliers. Patients 77 and 85 had an uncensored survival time of 1 day (the lowest value observed in the dataset), were under the standard treatment, and had a squamous type of tumor.

Table 4. Summary of the posterior distribution under the independence Jeffreys' prior for various SMLN models on the basis of set observations. β_0: Intercept, β_1: Treat (test), β_2: Type (squamous), β_3: Type (small cell), β_4: Type (adeno), β_5: Status, β_6: Time from diagnosis,

CONCLUDING REMARKS
We recommend the use of mixtures of life distributions as a convenient framework for survival analysis, particularly when standard models such as the Weibull or log-normal are not able to capture some features of the data. This approach intuitively leads to flexible distributions built from a known distribution by mixing over a parameter, and it can also be interpreted in terms of random effects or frailty terms. These mixture families can accommodate unobserved heterogeneity or outlying observations. In particular, we propose the SMLN family, a family of mixtures of life distributions based on the log-normal model that allows us to fit data with a variety of tail behavior. The mixing is applied to the shape parameter of the log-normal distribution, and the resulting distribution is quite flexible and can be adjusted by choosing the mixing distribution. This makes the family applicable in a wide range of situations. Under mixture modeling, we recommend AFT regressions instead of the well-known proportional hazards models. Mixtures of AFT models provide a clearer interpretation of the regression parameters, which does not depend on the mixing distribution. In addition, the estimation of the regression coefficients is not much affected by the choice of mixing distribution. The latter is illustrated with both simulated and real data.
We consider objective Bayesian inference under the AFT-SMLN model. We propose three different Jeffreys'-type priors. These priors are improper and therefore the propriety of the posterior distribution needs to be verified. Section 3.2 provides conditions for the existence of the posterior distribution under an arbitrary mixing distribution, and Theorem 3 provides extra guidance and results for specific mixing distributions. In addition, we explore the problem associated with the use of point observations. Set observations are proposed as a solution, which can easily be implemented in an MCMC sampling scheme, and we recommend their use throughout. Set observations might also be helpful in other contexts. For example, the issues of the Cox proportional hazards model with ties in the data are well known. Heritier et al. (2009) ignored ties when analyzing the real dataset used here, but that strategy might lead to serious loss of information if applied routinely. Different methods have been proposed for dealing with ties in the Cox regression model (see Kalbfleisch and Prentice 2002, p. 104), but they might lead to biased estimates (Scheike and Sun 2007). Set observations are a natural solution that takes into account the imprecision with which the data were recorded.
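As a concrete illustration, for a log-normal AFT model the set-observation likelihood replaces the density of an uncensored time t_i, recorded to the nearest unit (so l = r = 0.5 gives the set [t_i − 0.5, t_i + 0.5]), by the probability assigned to that interval. A minimal stdlib-only sketch, with hypothetical names (the linear predictor x_i'β is assumed precomputed):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def set_obs_loglik(times, linpred, sigma, half_width=0.5):
    """Log-likelihood of uncensored survival times under a log-normal AFT
    model, with each time t replaced by the set [t - 0.5, t + 0.5]:
    the contribution is Phi((log(t+0.5)-mu)/sigma) - Phi((log(t-0.5)-mu)/sigma),
    where mu = x'beta. Assumes t > half_width so the log is defined."""
    ll = 0.0
    for t, mu in zip(times, linpred):
        lo = norm_cdf((math.log(t - half_width) - mu) / sigma)
        hi = norm_cdf((math.log(t + half_width) - mu) / sigma)
        ll += math.log(hi - lo)
    return ll

# Toy numbers: two observations with their linear predictors.
ll = set_obs_loglik([10.0, 25.0], [2.3, 3.2], sigma=0.8)
```

For a right-censored time t_i, the contribution would instead be the survival probability 1 − Φ((log t_i − x_i'β)/σ), exactly as with point observations.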
We also propose a methodology for outlier detection that is based on the mixing structure. Outliers are associated with extreme values of their corresponding mixing variable and the evidence for outlying observations is formalized by means of BFs. We provide recommendations for the (critical) choice of a reference value.
A simulation study shows that standard Bayesian model comparison criteria can fairly easily identify the need of incorporating unobserved heterogeneity into the model, even with rather small sample sizes and a considerable amount of censoring. Ignoring unobserved heterogeneity can lead to biased or less precise inference for the regression parameters, whereas inference with SMLN models works well even in the absence of unobserved heterogeneity. The best results in terms of identifying the correct model are obtained for the independence Jeffreys' prior and the model selection criteria DIC and PsBF. Our methodology was also applied to the VA lung cancer data, for which previous studies have found evidence of influential observations. For these data, we uncover strong evidence of unobserved heterogeneity, which is mostly driven by outlying observations.
The mixing framework proposed here can be used with any proper mixing distribution, whether or not the induced survival time distribution has a closed-form density function. The proposed MCMC inference scheme does not rely on a closed-form expression for the survival density with the mixing variables integrated out, so our Bayesian inference can be applied much more widely than in the examples illustrated here. The challenge is then that the Jeffreys'-type prior for the parameter(s) of the mixing distribution needs to be derived from the expression in Theorem 1 (rather than from the integrated survival density), and this may not be trivial in general.
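To illustrate how the mixture representation operates with a proper mixing distribution, the sketch below draws survival times from the log-Student t member of the family via the well-known scale-mixture-of-normals representation: conditional on Λ ~ Gamma(ν/2, rate ν/2), log T is normal with mean x'β and variance σ²/Λ. The function name and parameter values are illustrative.

```python
import math
import random

def rsmln_logt(n, mu, sigma, nu, seed=1):
    """Draw n survival times from a log-Student t distribution using the
    shape-mixture representation:
      Lambda ~ Gamma(shape = nu/2, rate = nu/2),
      log T | Lambda ~ Normal(mu, sigma^2 / Lambda)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        # gammavariate takes (shape, scale); scale 2/nu gives rate nu/2.
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
        log_t = rng.gauss(mu, sigma / math.sqrt(lam))
        times.append(math.exp(log_t))
    return times

# Smaller nu means heavier tails in log T; nu -> infinity recovers log-normal.
sample = rsmln_logt(1000, mu=2.0, sigma=0.5, nu=4)
print(all(t > 0 for t in sample))  # True: survival times are positive
```

Swapping the Gamma draw for any other proper distribution on Λ yields a different member of the SMLN family, without requiring a closed-form marginal density.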

APPENDIX: PROOFS
Theorem 1. Taking the negative expectation of the second derivatives of the log-likelihood, the expressions k_1(θ), k_2(θ), k_3(θ), and k_4(θ) are given by (A.4).

Corollary 1. The proof follows directly from Theorem 1, using the structure of the determinant of the Fisher information matrix and its submatrices.
Theorem 2. Define t_o = (t_1, . . . , t_{n_o}) and D_o = diag(λ_1, . . . , λ_{n_o}). The contribution of the censored observations to the likelihood function is a factor in [0, 1]. Hence, the marginal likelihood of the complete sample, f_T(t), is bounded above by the marginal likelihood of the uncensored observations, f_{T_o}(t_o). Therefore, a sufficient condition for the existence of the posterior distribution of (β, σ², θ) is f_{T_o}(t_o) < ∞. After some algebraic manipulation, f_{T_o}(t_o) can be written as the integral (A.5), where log(t_o) = (log(t_1), . . . , log(t_{n_o}))′. Provided that t_i ≠ 0 for all i ∈ {1, . . . , n_o}, applying Fubini's theorem to the integral (A.5) and integrating first with respect to β and then with respect to σ² yields a finite bound, so the posterior is proper. In fact, for the log-Laplace model, Λ_i^{−1} is Gamma distributed and all its positive moments are finite. For the log-logistic model, it can be shown that ω_i = (1/(4Λ_i))^{1/2} has an asymptotic Kolmogorov distribution with density function g(ω_i) = 8ω_i Σ_{s=1}^∞ (−1)^{s+1} s² exp(−2s²ω_i²), for ω_i > 0; therefore, the moments of ω_i of order k > −2 are finite. For the log-exponential power model, the Fisher information matrix was derived by Martín and Pérez (2009), and the components depending on α of the Jeffreys', independence Jeffreys', and independence I Jeffreys' priors follow from it.