The generalized odd log-logistic-G regression with interval-censored survival data

The article proposes a new regression based on the generalized odd log-logistic family for interval-censored data. The survival times are not observed exactly for this type of data, and the event of interest occurs within some random interval. This family can be used in interval modeling since it generalizes some popular lifetime distributions, in addition to its ability to present various forms of the risk function. The estimation of the parameters is addressed by the classical and Bayesian methods. We examine the behavior of the estimates for some sample sizes and censoring percentages. Selection criteria, likelihood ratio tests, residual analysis and graphical techniques assess the goodness of fit of the fitted models. The usefulness of the proposed models is shown by means of two real data sets.


Introduction
Some statistical techniques, such as linear or mixed generalized linear models, do not consider the censored observations during the modeling, which causes a loss of relevant information. In this case, survival analysis allows, besides considering all the data, obtaining significant additional information, because the omission of the censored data can lead to biased conclusions. Censoring and missing data are characteristic of survival data with some important peculiarities, among them the classification of censoring as right, left or interval censoring. Interval-censored data occur in situations where the event of interest is known to have occurred in a random interval rather than at a specific time point.
In such studies, periodic evaluation occurs when it is not possible to observe the time variable T exactly for the individual who was examined, but it is known that the value belongs to an interval, i.e. T ∈ (U, V] for U < T ≤ V. According to [26], exact failure times along with right censoring and left censoring are special cases of interval censoring. Table 1 reports a scheme regarding the structure of interval-censored data and its special cases. Several statistical approaches have been proposed in the literature for interval-censored data, especially in parametric, non-parametric and semi-parametric cases. The works of Finkelstein and Wolfe and Finkelstein [13,14] stood out for being among the first published papers about regressions with interval censoring, while Odell et al. [30] introduced a parametric regression with interval censoring based on the log-Weibull distribution. Betensky and Finkelstein [4] proposed a method for interval-censored AIDS data; Zhao et al. [40] addressed methods to estimate the parameters of the proportional hazards models when both survival times and covariates are subject to interval censoring, and Bogaerts et al. [5] discussed how to estimate association measures for interval-censored bivariate data.
In the past decade, various important studies in this area have been published. For example, Hashimoto et al. [19,20] adopted exponentiated and generalized gamma distributions for regression models with interval censoring, and Wang et al. [39] introduced a Bayesian regression for odontological data with this type of censoring. Kim and Kim [22] proposed a regression model in the presence of competing risks. Ma et al. [27] developed a model allowing dependence between the interval-censoring mechanism and the failure times. Mao et al. [28] defined semiparametric regressions for competing risks and interval-censored data using a stable EM algorithm to estimate the parameters. More recently, He et al. [21] described an additive risk model with interval-censored data employing semiparametric estimation with B-splines in hemophilia data, and Basak et al. [3] proposed a robust semiparametric model for clustered interval-censored survival data in the Bayesian context involving ensemble learning. Building on the works above, we present a parametric approach for modeling interval-censored survival data based on the generalized odd log-logistic-G (GOLL-G) family.
Although several probability distributions exist for modeling interval data in the parametric context, new distributions are justified by the fact that the traditional models, such as exponential, Weibull, log-logistic and gamma, frequently cannot provide good fits to real data. In this sense, new distributions are created from some of the previously mentioned models to capture more features from the data. Among these features of the new models, we can highlight the flexibility of the hazard function, which is important in survival data. In this context, we propose the GOLL-G family, which is very flexible to accommodate U-shaped, unimodal and bimodal hazard rates, and also generalizes some well-known models in this area.
Recently, some authors addressed some models in the GOLL-G family. Korkmaz et al. [23] introduced the Topp-Leone generalized odd log-logistic for explaining Otis IQ scores and voltage data. Vigas et al. [38] presented the generalized odd log-logistic Neyman type A long-term for gastric adenocarcinoma data. Rasekhi et al. [33] defined the odd log-logistic Weibull-G family for lifetime and financial returns. Prataviera et al. [32] proposed the generalized odd log-logistic Maxwell for COVID-19 data. For more details, see [1,12,24,37].
Besides the censoring indicator variable, other characteristics can exist that influence the survival time, such as gender, age and treatment regimen, among others. These characteristics are called covariates and should be included in the analysis. So, we propose a GOLL-G family in the presence of covariates via a regression model with interval-censored survival data.
The inferential part is based on likelihood methods. As an alternative to the frequentist method, we also consider the Bayesian method to estimate the parameters of the regression. Frequentist and Bayesian selection criteria are thus presented, along with plots of the estimated survival functions using Turnbull's algorithm [35] to provide indicators of the fit.
Simulation studies are conducted to evaluate the accuracy of the estimates in the GOLL-G family for both methods. Another step addressed after formulating the model is the residual analysis. The paper is organized as follows. In Section 2, we define the GOLL-G regression under the interval-censoring mechanism. We formulate a regression with two systematic components under this mechanism, and calculate the maximum likelihood estimates (MLEs) of the parameters. Various Monte Carlo simulations in Section 3 evaluate the adequacy of the estimates. Residual analysis is investigated in Section 4. Section 5 illustrates the importance of the GOLL-G regression under informative censoring applied to two real data sets. Finally, the major findings of our work are cited in Section 6.

The GOLL-G regression with interval-censored data
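Cordeiro et al. [8] defined the cumulative distribution function (cdf) of the GOLL-G family as

$$F(t) = F(t; \alpha, \theta, \tau) = \frac{G(t; \tau)^{\alpha\theta}}{G(t; \tau)^{\alpha\theta} + [1 - G(t; \tau)^{\theta}]^{\alpha}}, \qquad (1)$$

where α > 0 and θ > 0 are two shape parameters, and G(t; τ) is the baseline cdf with a q × 1 vector τ of unknown parameters. Henceforth, let T ∼ GOLL-G(α, θ, τ) be a random variable (rv) having cdf (1). The odd log-logistic-G (OLL-G) family [16] follows when θ = 1. The exponentiated-G distribution [17] refers to α = 1. Equation (1) reduces to the baseline cdf when α = θ = 1. The hazard rate function (hrf) of T can be expressed as

$$h(t) = \frac{\alpha\theta\, g(t; \tau)\, G(t; \tau)^{\alpha\theta - 1}}{[1 - G(t; \tau)^{\theta}]\,\{G(t; \tau)^{\alpha\theta} + [1 - G(t; \tau)^{\theta}]^{\alpha}\}}, \qquad (2)$$

where g(t; τ) is the baseline density function. By inverting Equation (1), the quantile function (qf) of T has the form

$$Q(u) = Q_G\left(\left\{\frac{u^{1/\alpha}}{u^{1/\alpha} + (1 - u)^{1/\alpha}}\right\}^{1/\theta}\right), \quad u \in (0, 1), \qquad (3)$$

where Q_G(u) = G^{-1}(u; τ) is the baseline qf. By taking as baseline the Weibull (W) cdf

$$G(t; a, b) = 1 - \exp[-(t/b)^{a}], \quad t > 0, \qquad (4)$$

where τ = (a, b)^T, a > 0 is the shape and b > 0 is the scale, we have T ∼ GOLL-W(α, θ, a, b). Figure 1 shows that the hrf of T is quite flexible for survival and reliability analysis.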
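The three functions of the GOLL-W model can be checked numerically. The following is a minimal Python sketch, assuming the standard GOLL-G form of the cdf with a Weibull baseline (shape a, scale b); the function names are hypothetical and the code is only an illustration, not the authors' R implementation.

```python
import math


def weibull_cdf(t, a, b):
    """Baseline Weibull cdf: shape a > 0, scale b > 0."""
    return -math.expm1(-((t / b) ** a))


def weibull_pdf(t, a, b):
    """Baseline Weibull density g(t; a, b)."""
    return (a / b) * (t / b) ** (a - 1.0) * math.exp(-((t / b) ** a))


def goll_w_cdf(t, alpha, theta, a, b):
    """GOLL-W cdf: G^(alpha*theta) / {G^(alpha*theta) + [1 - G^theta]^alpha}."""
    G = weibull_cdf(t, a, b)
    top = G ** (alpha * theta)
    return top / (top + (1.0 - G ** theta) ** alpha)


def goll_w_hrf(t, alpha, theta, a, b):
    """GOLL-W hazard rate, i.e. density divided by survival function."""
    G = weibull_cdf(t, a, b)
    g = weibull_pdf(t, a, b)
    tail = 1.0 - G ** theta
    return (alpha * theta * g * G ** (alpha * theta - 1.0)
            / (tail * (G ** (alpha * theta) + tail ** alpha)))


def goll_w_qf(u, alpha, theta, a, b):
    """GOLL-W quantile function obtained by inverting the cdf."""
    w = u ** (1.0 / alpha)
    p = (w / (w + (1.0 - u) ** (1.0 / alpha))) ** (1.0 / theta)
    return b * (-math.log1p(-p)) ** (1.0 / a)  # Weibull baseline quantile


if __name__ == "__main__":
    t = goll_w_qf(0.3, 1.0, 2.5, 3.0, 4.5)
    print(t, goll_w_cdf(t, 1.0, 2.5, 3.0, 4.5))  # the cdf should return 0.3
```

Setting α = θ = 1 in these functions recovers the Weibull cdf and hazard, which gives a quick sanity check of the special cases mentioned in the text.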

Regression and inference
In many practical situations there are some explanatory variables associated with the survival times. In regression, the lifetime distribution of T depends on a vector of covariates x = (x_1, ..., x_p)^T. We can consider two systematic components linked to the parameters of the selected baseline distribution G(·). The observed interval-censored data consist of random intervals (u_i, v_i) (for i = 1, ..., n) that include t_i with probability one, i.e. P[u_i ≤ t_i ≤ v_i] = 1. If v_i = ∞, then it is a right-censored time for t_i.
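We propose the GOLL-W regression for interval-censored data, where the parameters a and b depend on x. In a similar manner, other regressions can be defined from any baseline distribution in Equation (1).
The survival functions at the extremes of t_i | x_i, for t_i ∈ (u_i, v_i], can be expressed as

$$S(u_i | x_i) = \frac{[1 - G_i(u_i)^{\theta}]^{\alpha}}{G_i(u_i)^{\alpha\theta} + [1 - G_i(u_i)^{\theta}]^{\alpha}} \qquad (5)$$

and

$$S(v_i | x_i) = \frac{[1 - G_i(v_i)^{\theta}]^{\alpha}}{G_i(v_i)^{\alpha\theta} + [1 - G_i(v_i)^{\theta}]^{\alpha}}, \qquad (6)$$

where G_i(t) = 1 − exp[−(t/b_i)^{a_i}], and a_i and b_i denote the values of a and b for the ith individual. Equations (5) and (6) define the GOLL-W regression for interval-censored data. The OLL-W regression corresponds to θ = 1, and the exponentiated-W regression follows if α = 1. Clearly, the Weibull regression is a special case when α = θ = 1.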
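The probability that a failure time falls in an observed interval is the difference of the survival function at the two limits. The sketch below illustrates this in Python under an explicit assumption that is not stated in the text: both systematic components use log links, a_i = exp(x_i^T β_1) and b_i = exp(x_i^T β_2). The function names and the parameter values in the example are hypothetical.

```python
import math


def goll_w_survival(t, alpha, theta, a, b):
    """GOLL-W survival function S(t) = 1 - F(t) with Weibull baseline."""
    if t <= 0.0:
        return 1.0
    if math.isinf(t):
        return 0.0
    G = -math.expm1(-((t / b) ** a))
    tail = (1.0 - G ** theta) ** alpha
    return tail / (G ** (alpha * theta) + tail)


def linear_predictor(x, beta):
    """Inner product x^T beta for plain Python sequences."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def interval_probability(u, v, x, alpha, theta, beta1, beta2):
    """P(u < T <= v | x) = S(u | x) - S(v | x).

    Assumption (not taken from the paper): log links for both components.
    """
    a_i = math.exp(linear_predictor(x, beta1))  # shape for individual i
    b_i = math.exp(linear_predictor(x, beta2))  # scale for individual i
    return (goll_w_survival(u, alpha, theta, a_i, b_i)
            - goll_w_survival(v, alpha, theta, a_i, b_i))


if __name__ == "__main__":
    x = [1.0, 1.0]  # intercept and one binary covariate
    print(interval_probability(1.0, 3.0, x, 1.0, 2.5, [1.0, 0.2], [1.2, 0.3]))
    # A right-censored observation uses v = infinity, so S(v | x) = 0.
    print(interval_probability(3.0, math.inf, x, 1.0, 2.5, [1.0, 0.2], [1.2, 0.3]))
```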

Classical inference
Let (u_1, v_1, x_1), ..., (u_n, v_n, x_n) be a set of n interval-censored observations, where u_i and v_i are the observed data and x_i is a known vector. The total log-likelihood function for the vector ψ = (α, θ, β_1^T, β_2^T)^T has the form

$$l(\psi) = \sum_{i \in F^*} \log[S(u_i | x_i) - S(v_i | x_i)] + \sum_{i \in C^*} \log[S(u_i | x_i)], \qquad (7)$$

where F* denotes the set of individuals with t_i ∈ (u_i, v_i], and C* denotes the set of individuals with t_i ∈ (u_i, +∞). The MLE ψ̂ of the parameter vector ψ can be calculated by maximizing l(ψ). The command optim in the R software using the BFGS algorithm determines ψ̂. Initial values for β_1 and β_2 are taken from the fit of the Weibull regression model (α = θ = 1). Some information criteria in decision theory penalize models with a large number of parameters. Some of these criteria are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) statistics, namely

$$\mathrm{AIC} = -2\hat{l} + 2d \quad \text{and} \quad \mathrm{BIC} = -2\hat{l} + d \log(n),$$

where l̂ denotes the maximized log-likelihood, d the number of model parameters, and n the number of observations. The lower the values of these statistics, the stronger the evidence in favor of the model in question.
We can use the likelihood ratio (LR) statistic for comparing sub-models with the GOLL-G model. We consider the partition ψ = (ψ_1^T, ψ_2^T)^T, where ψ_1 is a subset of parameters of interest, and ψ_2 is a subset of remaining parameters. The LR statistic for testing the null hypothesis H_0: ψ_1 = ψ_1^{(0)} is given by w = 2{l(ψ̂) − l(ψ̃)}, where ψ̃ and ψ̂ are the estimates under the null and alternative hypotheses, respectively. Under H_0, the statistic w is asymptotically distributed as χ²_k, where k is the dimension of the vector ψ_1 of the parameters of interest.
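To make the quantities of this section concrete, the following Python sketch evaluates an interval-censored log-likelihood together with the AIC, BIC and LR statistics. It assumes log links for both systematic components (an assumption, as the link functions are not given in the text), treats a right-censored observation as an interval with v_i = ∞, and uses hypothetical function names and toy data. It only evaluates the criterion; the maximization itself is done in the paper with optim (BFGS) in R, and any general-purpose optimizer could be applied to the negative of this function.

```python
import math


def goll_w_survival(t, alpha, theta, a, b):
    """GOLL-W survival function with Weibull baseline (shape a, scale b)."""
    if t <= 0.0:
        return 1.0
    if math.isinf(t):
        return 0.0
    G = -math.expm1(-((t / b) ** a))
    tail = (1.0 - G ** theta) ** alpha
    return tail / (G ** (alpha * theta) + tail)


def log_likelihood(data, alpha, theta, beta1, beta2):
    """Sum of log[S(u_i|x_i) - S(v_i|x_i)]; v_i = inf gives log S(u_i|x_i).

    data: iterable of (u_i, v_i, x_i). Log links are an assumption here.
    """
    total = 0.0
    for u, v, x in data:
        a_i = math.exp(sum(xj * bj for xj, bj in zip(x, beta1)))
        b_i = math.exp(sum(xj * bj for xj, bj in zip(x, beta2)))
        prob = (goll_w_survival(u, alpha, theta, a_i, b_i)
                - goll_w_survival(v, alpha, theta, a_i, b_i))
        total += math.log(prob)
    return total


def aic(loglik, d):
    """AIC = -2 * maximized log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * d


def bic(loglik, d, n):
    """BIC = -2 * maximized log-likelihood + d * log(n)."""
    return -2.0 * loglik + d * math.log(n)


def lr_statistic(loglik_alternative, loglik_null):
    """w = 2 * {l(alternative) - l(null)}, compared with a chi-square(k)."""
    return 2.0 * (loglik_alternative - loglik_null)


if __name__ == "__main__":
    toy = [(1.0, 2.0, [1.0, 0.0]), (2.0, math.inf, [1.0, 1.0])]  # toy data
    ll = log_likelihood(toy, 1.0, 2.5, [1.0, 0.2], [1.2, 0.3])
    print(ll, aic(ll, 6), bic(ll, 6, len(toy)))
```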

Bayesian inference
As an alternative statistical method, the Bayesian method allows the incorporation of a prior distribution, which represents the knowledge that one has about the parameters before carrying out the experiment. We assume that α, θ, a and β are a priori independent with joint density

$$\pi(\psi) = \pi(\alpha)\, \pi(\theta)\, \pi(a) \prod_{j=0}^{p} \pi(\beta_j),$$

where ψ = (α, θ, a, β^T)^T. The parametric space of α, θ and a is ℝ₊, and that of β_j is ℝ.
Then, we can consider the following prior distributions

$$\alpha \sim \Gamma(\alpha_1, \alpha_2), \quad \theta \sim \Gamma(\theta_1, \theta_2), \quad a \sim \Gamma(a_1, a_2) \quad \text{and} \quad \beta_j \sim N(0, \sigma_j^2),$$

where Γ(c, d) is the gamma distribution with mean c/d and variance c/d², and N(0, e) is the normal distribution with mean 0 and variance e. Here, α_k, θ_k, a_k and σ_j are known hyperparameters for k = 1, 2 and j = 0, ..., p. In order to obtain weakly informative prior distributions, i.e. with large variances, we set α_k = θ_k = a_k = 0.1 and σ_j = 10^3 (see [34]).
The inference about the parameters is based on Markov Chain Monte Carlo (MCMC) simulations through the Just Another Gibbs Sampler (JAGS) program, since the joint posterior density is analytically intractable. We use the rjags package [31] and the R software. We monitor the convergence of the Metropolis-Hastings algorithm by the method given by Geweke [15] as well as by trace plots.
In the Bayesian part, the Expected Akaike Information Criterion (EAIC) [6] and the Expected Bayesian Information Criterion (EBIC) [7] are the criteria adopted to discriminate among models.
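As an illustration of the Bayesian ingredients, the Python sketch below evaluates the log-prior under independent gamma and normal priors and computes EAIC and EBIC from posterior draws of the deviance. It assumes the usual definitions EAIC = E[D(ψ)] + 2d and EBIC = E[D(ψ)] + d log(n), with D(ψ) = −2 l(ψ), which are not spelled out in the text; the gamma prior uses the shape-rate parameterization (mean c/d, variance c/d²). All function names and the numbers in the example are hypothetical, and the sampler itself (JAGS in the paper) is not reproduced.

```python
import math


def log_gamma_pdf(x, shape, rate):
    """Log-density of Gamma(shape, rate): mean shape/rate, variance shape/rate^2."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def log_normal_pdf(x, mean, variance):
    """Log-density of a normal distribution given its mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def log_prior(alpha, theta, a, beta, gamma_hyper=(0.1, 0.1), normal_var=1e3):
    """Independent priors: gamma for alpha, theta, a and normal for each beta_j.

    The default hyperparameters are illustrative values, not the paper's settings.
    """
    c, d = gamma_hyper
    value = sum(log_gamma_pdf(par, c, d) for par in (alpha, theta, a))
    value += sum(log_normal_pdf(bj, 0.0, normal_var) for bj in beta)
    return value


def eaic(deviance_draws, d):
    """Posterior mean deviance plus 2 * number of parameters (assumed definition)."""
    return sum(deviance_draws) / len(deviance_draws) + 2.0 * d


def ebic(deviance_draws, d, n):
    """Posterior mean deviance plus d * log(n) (assumed definition)."""
    return sum(deviance_draws) / len(deviance_draws) + d * math.log(n)


if __name__ == "__main__":
    print(log_prior(2.5, 3.0, 4.0, [-0.6, 0.8]))
    draws = [200.0, 210.0, 205.0]  # hypothetical deviance values -2 * l(psi)
    print(eaic(draws, 5), ebic(draws, 5, 200))
```

The unnormalized log-posterior is then the sum of the log-likelihood and this log-prior, which is the quantity any MCMC sampler explores.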

Simulation studies
Two Monte Carlo simulation studies are conducted, one without and one with the regression structure.

Without a regression structure
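The details of the first simulation study are reported below:
• Consider the GOLL-W cdf for the failure times from Equations (1) and (4), say T ∼ GOLL-W(α, θ, a, b), where α = 1.0, θ = 2.5, a = 3 and b = 4.5;
• The random values of T are generated from Equation (3), where Q_G(u) = b[−log(1 − u)]^{1/a}, for the sample sizes n = 200, n = 400 and n = 800;
• The censoring times C are generated from the exponential distribution Exp(λ), where λ > 0 controls the percentage of censored observations;
• We generate B = 1000 random samples for each n with two percentages (30% and 60%) of right-censored observations. The MLEs of the parameters are determined from the simulated samples using the BFGS algorithm and the optim function of the R software;
• The initial values of the optimization process are chosen close to the true parameters.
Next, we describe the process of data generation with interval-censored survival data.
(1) Generate failure times t_i ∼ GOLL-W(α, θ, a, b);
(2) Generate the censoring time c_i from the exponential distribution Exp(λ) and calculate … ; for δ_i = 0, we construct the times representing right censoring, assuming that U_i = c_i and V_i = ∞;
(5) For δ_i = 1:
(a) Consider D^(g=0) = 0 as the first lower limit of the range;
(b) Generate D^(g=1) ∼ Uniform(d, e), assume that U_i = 0 and V_i = D^(g=1), and check whether t_i ∈ [U_i, V_i];
(c) Otherwise, generate D^(g=2) ∼ Uniform(d, e), assume that U_i = 0 + D^(g=1) and V_i = D^(g=1) + D^(g=2), and check whether t_i ∈ [U_i, V_i];
(d) If item (c) does not occur, generate D^(g=3) ∼ Uniform(d, e), assuming that U_i = D^(g=1) + D^(g=2) and V_i = D^(g=1) + D^(g=2) + D^(g=3), and check whether t_i ∈ [U_i, V_i];
(e) Otherwise, continue to generate uniform values until t_i lies in the range.
In general, U_i = Σ_{g=0}^{W−1} D^(g) and V_i = Σ_{g=0}^{W} D^(g), where W denotes the number of values generated from the uniform distribution.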
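The data-generation steps above can be sketched in Python as follows. Several details are assumptions made for illustration only: an observation is treated as an observed failure (δ_i = 1) when t_i ≤ c_i, λ is taken as the rate of the exponential distribution, and the bounds d and e of the uniform distribution, which are not given in the text, are set to arbitrary values in the example. Function names are hypothetical.

```python
import math
import random


def goll_w_quantile(u, alpha, theta, a, b):
    """GOLL-W quantile function with Weibull baseline (shape a, scale b)."""
    w = u ** (1.0 / alpha)
    p = (w / (w + (1.0 - u) ** (1.0 / alpha))) ** (1.0 / theta)
    return b * (-math.log1p(-p)) ** (1.0 / a)


def generate_interval_censored(n, alpha, theta, a, b, lam, d, e, rng):
    """Return a list of tuples (U_i, V_i, delta_i, t_i).

    delta_i = 0: right-censored, interval (c_i, inf).
    delta_i = 1: interval built by accumulating Uniform(d, e) inspection gaps
                 until it contains the latent failure time t_i.
    """
    sample = []
    for _ in range(n):
        t = goll_w_quantile(rng.random(), alpha, theta, a, b)
        c = rng.expovariate(lam)  # assumption: lam is the exponential rate
        if t > c:  # assumption: censored when the failure occurs after c_i
            sample.append((c, math.inf, 0, t))
            continue
        lower, upper = 0.0, rng.uniform(d, e)
        while t > upper:  # keep adding gaps until t lies in [lower, upper]
            lower, upper = upper, upper + rng.uniform(d, e)
        sample.append((lower, upper, 1, t))
    return sample


if __name__ == "__main__":
    rng = random.Random(2024)
    data = generate_interval_censored(200, 1.0, 2.5, 3.0, 4.5,
                                      lam=0.1, d=0.2, e=0.7, rng=rng)
    censored = sum(1 for _, _, delta, _ in data if delta == 0)
    print("proportion right-censored:", censored / len(data))
```

In practice λ would be tuned by trial so that the proportion of right-censored observations matches the target percentage.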
The average estimates (AEs), biases, root mean squared errors (RMSEs) and coverage probabilities (CPs) from simulations of the previous algorithm are reported in Table 2. The results reveal that the AEs tend to the true parameter values (regardless of the censoring percentage) and the RMSE values decrease when the sample size increases. So, the estimators become less biased as n grows. The AEs, biases and RMSEs have better results for the lower censoring percentage.
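The summary measures reported in the simulation tables can be computed from the B replicate estimates of a parameter as in the Python sketch below. The coverage probability is computed here from 95% Wald intervals (estimate ± 1.96 standard errors), which is an assumption, since the text does not state how the intervals are built; the function name and the numbers in the example are hypothetical.

```python
import math


def monte_carlo_summary(estimates, std_errors, true_value, z=1.959963984540054):
    """Average estimate, bias, RMSE and Wald-interval coverage probability."""
    B = len(estimates)
    ae = sum(estimates) / B
    bias = ae - true_value
    rmse = math.sqrt(sum((est - true_value) ** 2 for est in estimates) / B)
    covered = sum(
        1 for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return {"AE": ae, "bias": bias, "RMSE": rmse, "CP": covered / B}


if __name__ == "__main__":
    # Hypothetical replicate estimates of a parameter whose true value is 1.0.
    print(monte_carlo_summary([0.9, 1.1, 1.0, 1.2], [0.1] * 4, 1.0))
```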

With the regression structure
We consider the baseline log-logistic (LL) cdf

$$G(t; a, b) = 1 - [1 + (t/b)^{a}]^{-1}, \quad t > 0, \qquad (8)$$

where τ = (a, b)^T, a > 0 is the shape and b > 0 is the scale, say T ∼ GOLL-LL(α, θ, a, b).
We examine the accuracy of the estimates of the parameters of the GOLL-LL regression with interval-censored data through simulations. Here, we adopt the frequentist and Bayesian methods to estimate the parameters using the R and JAGS software. Next, we describe the steps for the simulations.
• We consider the GOLL-LL distribution for the failure times with the true parameters α = 2.5, θ = 3, a = 4 and b_i = exp(β_20 + β_21 x_i) for each individual (i = 1, ..., n);
• The values of x_i are generated from a Bernoulli distribution with success probability 0.5, with β_20 = −0.6 and β_21 = 0.8;
• The censoring times C are generated from the exponential distribution Exp(λ), where λ controls the percentage of censored observations;
• The estimates are obtained from these simulated samples using the frequentist and Bayesian methods. For the first approach, the MLEs are determined using the BFGS numerical method and the optim function of the R software;
• For the Bayesian method, we consider independent priors with the hyperparameter values fixed at α_1 = α_2 = θ_1 = θ_2 = a_1 = a_2 = 0.1 and σ_j^2 = 100;
• The choice of only B = 400 replicates is due to the high computational cost of the Bayesian approach. We assume a vague but proper prior in the simulations: although it carries little information, the prior is still informative and the posterior is proper;
• We discard the first iterations of the chain to eliminate the effect of the initial values, and use a posterior MCMC sample of size 35,000 with jumps of 20;
• The generating process is similar to the one discussed in Section 3.1.
Tables 3 and 4 give the simulation results with different sample sizes and censoring percentages for the GOLL-LL regression with interval-censored data. Some conclusions from the simulations are addressed below.
• For the frequentist and Bayesian methods, the AEs converge to the true values of the parameters when n increases.
• Similarly, the RMSEs decrease when n increases.
• The AEs, biases and RMSEs become worse when the censoring percentage increases for both methods.
• The AEs, biases and RMSEs are similar for both frequentist and Bayesian methods, regardless of the sample size and/or the censoring percentage.

Residual analysis
Following the same idea pointed out by Hashimoto et al. [19], the deviance residuals for the GOLL-G regression (in the presence of interval censoring) are expressed in terms of the censoring indicator δ_i and the martingale residuals rm_i, the latter being defined separately for the set F of individuals with t_i ∈ (u_i, v_i] and for the set C of individuals with right censoring, i.e. t_i ∈ (u_i, ∞). Some simulations are carried out to investigate the behavior of the empirical distribution of the deviance residuals in the GOLL-LL regression with interval-censored data following the same procedure discussed in Section 3.2 under the conditions:
• We generate B = 400 random samples for n = 200, 400 and 600 and censoring percentages of 0% and 30%;
• The MLEs are calculated from these simulated samples using the R software;
• The average deviance residuals are calculated from these samples.
Figures 2 and 3 display the QQ-plots of the average residuals versus the expected values of the quantiles of the standard normal distribution. Some conclusions from the simulation results:
• The distribution of the residuals is close to the standard normal distribution, since the points follow a linear pattern when n increases regardless of the censoring percentage;
• The distribution of the deviance residuals is better approximated by the normal distribution when the censoring percentage decreases;
• The normal probability plot with simulated envelope [2] shows the adequacy of the fitted regression.
Under the same conditions of the previous simulations, for each resampling b_i (i = 1, ..., 400), two statistical tests of normality are applied, Shapiro-Wilk (SW) and Shapiro-Francia (SF), in order to verify the behavior of the deviance residuals in the GOLL-LL regression model. The nortest package was used to provide the SW and SF tests. For each resampling, the p-value of each test is compared with the significance level of 5%, and we record the proportion of times (P) that normality of the residuals is not rejected, in addition to a descriptive summary of the p-values. Tables 5 and 6 report the results.
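The mapping from martingale to deviance residuals, and the proportion P used to summarize the normality tests, can be sketched in Python as below. The sketch assumes the standard martingale-based form rd_i = sign(rm_i){−2[rm_i + δ_i log(δ_i − rm_i)]}^{1/2}; the martingale residuals for interval-censored data are taken as given inputs, since their definition in the paper is not reproduced here. Function names and example values are hypothetical.

```python
import math


def deviance_residual(rm, delta):
    """Deviance residual from a martingale residual rm and indicator delta.

    Assumed standard form: sign(rm) * sqrt(-2 * [rm + delta * log(delta - rm)]).
    Requires rm < 1 when delta = 1 and rm <= 0 when delta = 0.
    """
    if delta == 1:
        inner = rm + math.log(1.0 - rm)
    else:
        inner = rm  # the logarithmic term vanishes when delta = 0
    sign = (rm > 0) - (rm < 0)
    return sign * math.sqrt(max(0.0, -2.0 * inner))


def proportion_not_rejected(p_values, level=0.05):
    """Proportion of replicates whose normality test p-value is at least `level`."""
    return sum(1 for p in p_values if p >= level) / len(p_values)


if __name__ == "__main__":
    print(deviance_residual(0.5, 1), deviance_residual(-1.0, 0))
    print(proportion_not_rejected([0.01, 0.20, 0.50, 0.04]))
```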

Applications
We provide two applications using the GOLL-G family for interval-censored data with and without the presence of explanatory variables. The calculations are performed with the R software. The Weibull, log-logistic and gamma distributions are taken as baseline distributions in these applications. The first two cdfs are given in Equations (4) and (8), and the two-parameter gamma density has parameter vector τ = (a, b)^T, where a > 0 is the scale parameter, b > 0 is the shape, and Γ(a) = ∫_0^∞ w^{a−1} e^{−w} dw is the gamma function. For this special case, we write T ∼ GOLL-Ga(α, θ, a, b).

Application 1: oral health data
The first data set refers to a longitudinal prospective oral health study conducted in Flanders (Belgium) of 4430 students (2297 boys and 2133 girls) from 1996 to 2001. The teeth of each child were examined annually by trained dentists. For more details, see [36]. Gómez et al. [18] addressed an analysis restricted to the distribution of the time of emergence of the first left premolar (tooth 24 in European dental notation). Here, 44 cases for which the variable predecessor state of the primary tooth is missing are also excluded. So, we consider the random variable age t_i (in years) until the appearance of permanent tooth 24. Among the 4386 observations, 2775 are interval censored and 1611 are right censored.
The parameters of the fitted distributions are estimated by maximum likelihood. The AIC and BIC measures in Table 7 show that the GOLL-W and GOLL-Ga models are appropriate for the current data. So, they are useful for modeling interval-censored data.
The numbers in Tables 8 and 9 reveal that the estimates and SEs are close under both frequentist and Bayesian methods. The LR statistics given in Tables 10 and 11 indicate that the GOLL-W and GOLL-Ga models are superior to their two sub-models at a significance level of 5%.
Figure 4(a,b) displays the survival curves estimated by Turnbull's method [35] and the estimated survival functions for the GOLL-W and GOLL-Ga distributions, respectively. These plots reveal that the estimated survival functions for both models are in agreement with Turnbull's survival curves. In summary, these models provide good fits to the current data.
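The nonparametric curve used as a benchmark in these plots comes from a self-consistency (EM) iteration. The Python sketch below is a simplified version of that idea, not Turnbull's algorithm as published: it places probability mass on all cells formed by the observed interval endpoints rather than on the innermost intervals only, and it requires u_i < v_i for every observation. Function names and the toy data are hypothetical.

```python
import math


def self_consistent_npmle(intervals, tol=1e-8, max_iter=10000):
    """Simplified self-consistency estimate for intervals (u_i, v_i], u_i < v_i.

    Returns (cells, p): cells are (left, right] pieces of the time axis and
    p[j] is the estimated probability mass in cell j.
    """
    points = sorted({0.0}
                    | {u for u, _ in intervals}
                    | {v for _, v in intervals if not math.isinf(v)})
    cells = list(zip(points, points[1:] + [math.inf]))
    # member[i][j] = 1 when cell j lies inside the observed interval i
    member = [[1 if (u <= left and right <= v) else 0 for left, right in cells]
              for u, v in intervals]
    n, m = len(intervals), len(cells)
    p = [1.0 / m] * m
    for _ in range(max_iter):
        new_p = [0.0] * m
        for row in member:
            denom = sum(a * pj for a, pj in zip(row, p))
            for j in range(m):
                if row[j]:
                    new_p[j] += p[j] / denom  # expected share of observation i
        new_p = [value / n for value in new_p]
        change = max(abs(x - y) for x, y in zip(new_p, p))
        p = new_p
        if change < tol:
            break
    return cells, p


def survival_at(cells, p, t):
    """Estimated S(t): total mass of the cells located entirely after t."""
    return sum(pj for (left, _), pj in zip(cells, p) if left >= t)


if __name__ == "__main__":
    toy = [(0.0, 1.0), (0.0, 1.0), (1.0, 2.0), (1.0, math.inf)]  # toy intervals
    cells, p = self_consistent_npmle(toy)
    print(cells, p, survival_at(cells, p, 1.0))
```

A fitted parametric survival function can then be overlaid on the step estimate at the interval endpoints to judge agreement visually.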

Application 2: HIV-1 infection data
The data set refers to a prospective multicentric study with the objective of evaluating the HIV-1 infection rate among patients suffering from hemophilia (see [10]). These individuals are at risk of contracting HIV-1 from transfusions of products derived from the blood of infected donors. This study involves analysis of survival data with interval censoring because it is not known exactly when a patient contracted HIV-1. The data set under analysis came from 544 patients, 38% with interval censoring. For more details, see [10,25,29]. We consider the following variables (for i = 1, ..., 544):
• t_i: time interval (in months) when the serum conversion occurred;
• x_i1: patient treated with low dose of blood (0 = no; 1 = yes);
• x_i2: patient treated with medium dose of blood (0 = no; 1 = yes);
• x_i3: patient treated with high dose of blood (0 = no; 1 = yes).
Figure 5 displays the estimated survival curves generated with the Kaplan-Meier estimate and Turnbull's algorithm for the three covariates. The first analysis aims to find a distribution that can model the HIV-1 infection data considering only the interval response. In order to select the most suitable model, we adopt frequentist and Bayesian selection criteria. For the frequentist part, we use the AIC and BIC criteria, while for the Bayesian part, we consider the EAIC and EBIC measures. The selection criteria in Table 12 reveal that the GOLL-LL distribution has the lowest values for both the frequentist and the Bayesian criteria. Then, it provides a better fit to the HIV infection data than the other distributions. Additionally, the estimated survival functions for the fitted GOLL-W, GOLL-Ga and GOLL-LL distributions and Turnbull's estimated curve are reported in Figure 6. Again, the GOLL-LL distribution is an adequate model for these data.
The results in Tables 14 and 15 reveal that the estimates and their SEs are close for both frequentist and Bayesian methods. In addition, all covariates are relevant regarding the significance of the parameters.
Figure 7 reports the traceplots (see Note 1) of the parameters from the fitted regression M_2 to the HIV data. According to these plots, the generated chains present satisfactory stability.
Figure 8(a,b) displays the plots of the deviance residuals and the generated envelope for the regression M_2. We conclude the following facts:
• There is better randomness in Figure 8(a). These points indicate that the regression M_2 is the most suitable model for the HIV data.

Goodness of fit:
The empirical survival curve and the estimated survival functions in Equations (5) and (6) for the regression M_2 displayed in Figure 9 reveal that this regression is suitable for the HIV-1 data.
Interpretations of the findings from the M_2 model:
• There is a significant difference (5%) between patients treated with a low dose of blood and those not treated with respect to the time interval in which the serum conversion occurred.
• There is a significant difference (5%) between patients treated with a medium dose of blood and those not treated with respect to the time interval in which the serum conversion occurred.
• There is a significant difference (5%) between patients treated with a high dose of blood and those not treated with respect to the time interval in which the serum conversion occurred.
• We also note that these differences are more pronounced for individuals treated with medium and high doses of blood compared to untreated individuals (see Figure 9(b,c)).

Conclusions
We developed new regressions considering interval censoring based on the generalized odd log-logistic-G (GOLL-G) family using the classical and Bayesian methods. Some simulations investigated the means, biases and mean squared errors of the estimates for some sample sizes and censoring percentages. The means of the estimates converge to the true parameter values and the root mean squared errors decrease when the sample size increases, regardless of the censoring percentage. We defined an extension of the deviance residuals for interval-censored data to evaluate the model assumptions. Some simulations showed that the empirical distribution of these residuals is approximately normal. The usefulness of the GOLL-G family was shown by means of two real data sets. Based on the comments above, some possible future works are:
• A simulation study with other baseline distributions via frequentist and Bayesian analysis using the GOLL-G family and taking the truncated normal distribution at zero as prior for the positive parameters (see [11]);
• Influence measures in a Bayesian approach.
Several future works can be developed considering two systematic components in interval-censored regression models. We can extend the research presented by Hashimoto et al. [22]. In this case, the extension will be related to the regression models with cure fraction (long-term individuals) considering two systematic components for interval-censored data. Other future works can be developed, for example, for the GOLL-G regression models with random effects for interval-censored data with two systematic components to cope with group structure or correlated data. Also, we can consider these regression models for interval-censored data under informative censoring and, finally, the GOLL-G regression model with interval censoring when we have multicollinearity problems.

Notes
1. The traceplot corresponding to one or more Markov chains provides a visual way to inspect sampling behavior and assess mixing across chains.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Plots of the hrf for the GOLL-W model.
), say T ∼ GOLL-W(α, θ, a, b), where α = 1.0, θ = 2.5, a = 3 and b = 4.5; • The random values of T are generated from Equation (3), where Q G (u) = b [− log(1 − u)] 1/a , and sample sizes n = 200, n = 400 and n = 800; • The censoring times C are generated from the exponential distribution Exp(λ), where λ > 0 controls the percentage of the censored observations; • We generate B = 1000 random samples for each n with two percentages of censorship (30% and 60%) of right censored observations.The MLEs of the parameters are determined from the simulated samples using the BFGS algorithm and the numerical resource Optim of R software; • The initial values of the optimization process are chosen close to the true parameters.
we construct the times representing the case of the censor on the right assuming that U i = c i and V = ∞; (5) For δ i = 1: (a) Consider D (g=0) = 0 as the first lower limit of the range; (b) Generate D (g=1) ∼ Uniform (d, e) and assume that U i = 0 and V i = D (g=1) , and check that t i ∈ [U i , V i ]; (c) Otherwise, generate D (g=2) ∼Uniform( d, e), assume that U i = 0 + D (g=1) and V i = D (g=1) + D (g=2) , and check if t i ∈ [U i , V i ]; (d) If item 5.3 does not occur, generate D (g=3) ∼Uniform( d, e) assuming that

•
We consider the GOLL-LL distribution for the failure times with the true parameters α = 2.5, θ = 3, a = 4 and b i = exp(β 20 + β 21 x i ) for each individual (i = 1, . . ., n); • The values of x i are generated from a Bernoulli with success probability 0.5, β 20 = −0.6 and β 21 = 0.8; • The censoring times C are generated from the exponential distribution Exp(λ), where

Figure 2. QQ-plots for the deviance residuals in the GOLL-LL regression with 30% censoring percentage.

Figure 3. QQ-plots for the deviance residuals in the GOLL-LL regression with 0% censoring percentage.

Figure 4. Estimated survival functions for the fitted model and the empirical survival function. (a) GOLL-W distribution and (b) GOLL-Ga distribution.

Figure 5. Estimated survival curves by Turnbull's method for the covariates: (a) low dose, (b) medium dose and (c) high dose.

Figure 6. Estimated survival functions for the fitted GOLL-W, GOLL-Ga and GOLL-LL distributions and the empirical survival function.

Figure 8. Residual plots for the regression M_2. (a) Deviance residuals and (b) normal probability plot for the deviance residuals with envelope.

Figure 9. Estimated survival functions and the empirical survival curve for the covariates: (a) low dose, (b) medium dose and (c) high dose.

Table 1. Interval-censored data and its special cases.

Table 3. Results for the GOLL-LL regression with interval-censored data from frequentist and Bayesian methods and censoring percentage 30%.

Table 4. Results for the GOLL-LL regression with interval-censored data from frequentist and Bayesian methods and censoring percentage 60%.

Table 5. Summary of the SW and SF statistics for censoring percentage of 30%.

Table 6. Summary of the SW and SF statistics for censoring percentage of 0%.

Table 8. Findings from the fitted GOLL-W distribution.

Table 9. Findings from the fitted GOLL-Ga distribution.

Table 10. LR tests for the GOLL-W distribution.

Table 11. LR tests for the GOLL-Ga distribution.

Table 14. Estimation results from the fitted regression M_2 via the frequentist method to the HIV infection data.

Table 15. Estimation results from the fitted regression M_2 via the Bayesian method to the HIV infection data.