The new Neyman type A generalized odd log-logistic-G-family with cure fraction

The work proposes a new family of survival models called the Odd log-logistic generalized Neyman type A long-term. We consider different activation schemes in which the number of factors M has the Neyman type A distribution and the time of occurrence of an event follows the odd log-logistic generalized family. The parameters are estimated by the classical and Bayesian methods. We investigate the mean estimates, biases, and root mean square errors in different activation schemes using Monte Carlo simulations. The residual analysis via the frequentist approach is used to verify the model assumptions. We illustrate the applicability of the proposed model for patients with gastric adenocarcinoma. The choice of the adenocarcinoma data is because the disease is responsible for most cases of stomach tumors. The estimated cured proportion of patients under chemoradiotherapy is higher compared to patients undergoing only surgery. The estimated hazard function for the chemoradiotherapy level tends to decrease when the time increases. More information about the data is addressed in the application section.


Introduction
In the analysis of survival data, the event of interest may not happen for a high portion of individuals, even after a long period. For example, when the event of interest is the recurrence of a determined type of cancer after the administration of a treatment, some of the patients might not suffer a relapse. In this situation, a significant contingent of people cannot be susceptible to the event under study. Empirically, this characteristic is observed in the estimated Kaplan-Meier (KM) survival function, which presents a right tail at an approximately constant level and strictly greater than zero for a considerable period. CONTACT   For example, Martinez et al. [19] analyzed 201 patients with gastric adenocarcinoma (the most common type of stomach cancer) who received one of two types of therapy: chemotherapy or isolated surgery. Figure 1 shows the KM estimates, thus indicating differences in cure proportions.
In this situation, the usual survival models may not be adequate to model this type of data. There are two cure rate model formulations that have received strong attention in the literature. The first is the mixed model introduced by Boag [7] and Berkson and Gage [5], where the random number of causes has values zero and one. Some recent works can be found in Achcar et al. [2] and Ramires et al. [26,27]. The second formulation is the promotion time cure model pioneered by Yakovlev and Tsodikov [30] assuming a random number of causes or occurrences for this event. Some works applying this formulation were developed by Hashimoto et al. [15], Ortega et al. [21], Suzuki et al. [29], Barriga et al. [4] and Cancho et al. [9].
All these studies adopted well-known distributions to model the survival rate such as the Weibull, log-normal, log-logistic, generalized gamma, etc. However, these distributions do not provide good fits in several practical applications where the hazard rates are not constant, monotone or unimodal, but instead having U-shaped or bimodal forms. In this article, we adopt the generalized odd log-logistic-G (GOLL-G) family (Cordeiro et al. [12]), which is very flexible to accommodate hazard rates in the U, unimodal and bimodal shapes, and it can be appropriate to cope with other models.
We propose the GOLL-G family to model survival data with cure fraction in a scenario of competitive causes, where the random number of causes or risks is modeled by the Neyman Type A (NTA) distribution (Neyman [20]), and also considering that there is no information on the factor responsible for the occurrence but only that the lifetime is observed when the Rth risk factor (among M possibilities) is evaluated. Basically, we adopt three latent activations, using the order statistics, minimum, maximum and random. The proportion of cure individuals is related to covariables that explain its variability. We construct a log-linear regression and use the likelihood method to estimate the parameters. However, using first-order asymptotic likelihood theory in small samples, a common situation in practice, or where the regularity conditions are not satisfied, can lead to results that are hard to justify.
Thus, as an alternative to the likelihood method, we also adopt a Bayesian approach. In the next step to fit a model to a data set, it is necessary to verify the model assumptions. If the model is not adequate, its use can lead to mistaken conclusions. Therefore, to check for the existence of failure to satisfy the model assumptions and detect atypical points, we perform global influence and residual analysis.
In summary, our objective is to introduce a new family of distributions to model failure times when there are individuals of long duration (cure fraction) by considering different activation schemes for the latent variable. In this context, when we analyze this type of data we may have a better chance of choosing a more adequate distribution to model failure times.
The sections are discussed as follows. In Section 2, the model is defined with special cases. In Section 3, we discuss likelihood and Bayesian estimation of the parameters. A simulation study in Section 4 verifies some statistical properties of the likelihood estimates. Model selection is discussed in Section 5. In Section 6, our proposals are applied to gastric adenocarcinoma data. Finally, Section 7 provides some conclusions.

The NTA GOLL-G model with cure fraction
The random variable M denotes the unknown number of competing causes associated to the event under study. We can adopt several discrete distributions for M such as Poisson, binomial, etc. We consider the NTA distribution constructed as a mixture of two discrete Poisson distributions. In fact, it can be determined from a Poisson distribution whose parameter represents the product of a fixed parameter φ by another Poisson random variable with rate λ.
According to Bhalerao et al. [6], the NTA distribution can be an alternative to the negative binomial because it can also capture overdispersed data, which is a situation that often happens in counting data. Let M be a random variable following the NTA distribution with parameters λ > 0 and φ > 0, the last one interpreted as an 'index of clumping' (David and Moore [14]). The probability function and generating function (gf) of M can be expressed as If M ∼ NTA(λ, φ), the mean and variance of M are E(M) = λφ and Var(M) = λφ(1 + φ), and then φ works as a kind of dispersion that affects its skewness and kurtosis. For φ small, M is approximately distributed as Poisson with rate λφ.
Let Z j be the time for the jth competing cause (for j = 1, . . . , M) generate the event under study. Consider that Z j represents the activation time that produces this event due to the jth cause (for j = 1, . . . , M). Also, we assume that the Z j s are independent of M. For fixed M, the Z j s are i.i.d. random variables with cumulative distribution function (cdf) F(t) and survival function S(t) = 1 − F(t). Furthermore, competitive causes may have different activation schemes. Then, similarly to Cooner et al. [11], we assume that it is necessary the incidence of R of M causes (R ≤ M) for occurring this event. Thus, the order statistics are Z (1) ≤ · · · ≤ Z (R) ≤ · · · ≤ Z (M) and Z (R) is the Rth order statistic. Then, the time until the occurrence of the current event is T = Z (R) .
The R value can be constant or a function of M, or a discrete random variable with conditional distribution in M (Cancho et al. [8], Suzuki et al. [29]). In addition, if M = 0, the time to the event is assumed infinite (T = ∞) with P(T = ∞ | M = 0) = 1. We consider three possible activations found in the literature to construct survival models according to their configurations. Possible activation schemes for R are first activation, random activation, and last activation.
, which represents the model proposed by Rodrigues et al. [28]. The population survival function of T is and the cure proportion has the form By differentiating S pop (t) in (1) in relation to t, the associated probability density function (pdf) can be expressed as The functions corresponding to Equations (1) and (2) are improper. The population hazard function, which is also improper, is • Random activation scheme (II): here, M ≥ 1 and R is a discrete random variable with a uniform distribution in {1, . . . , M} and probability 1/M. Under this scheme, the NTA distribution of M leads to The cure proportion in the random activation scheme follows as The pdf and the hazard function corresponding to (3) are and respectively.
The cure proportion in this activation scheme is The density and hazard functions of the population corresponding to (5) are and respectively. The cure proportions are equal in the three activation schemes, but the survival, density and hazard functions depend on the configuration. An important inequality of the survival functions in Equations (1), (3) and (5) . The proof of this property is given by Kim et al. [16].
In survival analysis, the classical distributions associated with the time Z are the exponential, Weibull, log-normal, and log-logistic distributions. However, in some practical situations, the hazard function may assume non-monotone forms, i.e. bathtub, unimodal or bimodal functions. In this case, these distributions cannot be used because they do not have these characteristics. For this reason, there is a need to study other continuous distributions capable of describing hazard functions with several forms. Generalizing continuous distributions can provide greater flexibility to hazard rates and accommodate monotonous, unimodal, bimodal and bathtub shapes. Further, generalized distributions can be adopted for model discrimination. Recently, Cordeiro et al. [12] proposed the GOLL-G family to generate flexible models for Z with cdf where α > 0 and θ > 0 are shape parameters. The pdf of Z has the form where g(z; τ ) is the baseline density. Equation (8) includes the exponentiated, odd log-logistic and baseline densities when α = 1, θ = 1 and α = θ = 1, respectively.
Hereafter, let Z ∼ GOLL-G(α, θ , τ ) have density function (8). The hazard rate function (hrf) of Z can be expressed as We obtain the quantile function (qf) of Z by inverting (7) Recently, some authors studied the GOLL-G family. Prataviera et al. [24] presented the generalized odd log-logistic flexible Weibull with two applications in reparable systems. Korkmaz et al. [17] proposed the Topp-Leone generalized odd log-logistic family for Otis IQ Scores and voltage data sets. For more details about the odd log-logistic-G class, see Alizadeh et al. [3] and Korkmaz et al. [18].

Special NTA-GOLL-G models with cure fraction
Substituting the GOLL-G family under the previous activation schemes, we obtain the NTA-GOLL-G models with cure fraction in Table 1 which can be widely applied in many areas. Some special models are discussed below by taking for baseline the Weibull (W), log-logistic (LL) and Flexible Weibull (FW) distributions (all with positive support).
• Neyman type A generalized odd log-logistic Weibull (NTA-GOLL-W) model with cure fraction It is defined from the Weibull cdf G(t; τ ) = 1 − exp{−(t/b) a } with shape a > 0 and scale b > 0. For α = θ = 1, we have the Neyman type A Weibull (NTA-W). For θ = 1, Table 1. Three NTA-GOLL-G models with cure fraction. Table 2. Some NTA-GOLL-G models with cure fraction for three types of activation.
we obtain the Neyman type A odd log-logistic Weibull (NTA-OLL-W) and, for α = 1, it gives the Neyman type A exponentiated-Weibull (NTA-EW). For each previous model, we have three extra models depending on the type of activation as shown in Table 2.
Figures 2, 3 and 4 display the population hazard rates of the NTA-I-GOLL-W, NTA-II-GOLL-W and NTA-III-GOLL-W distributions, respectively. They can be bimodal, monotonically increasing or decreasing and upside-down bathtub and really show the flexibility of these models. redThe programs for the construction of these plots are given as Supplemental Material file.
It is worth mentioning that the introduced family serves to test different distributions to model failure times. Thus, we can test several of them, i.e. we do not have only a single model. In the application of Section 6, we consider three distributions. However, we could have taken several others such as the log-normal, inverse Gaussian, Dagum, Lomax and generalized gamma distributions.

Inference methods
In many health problems, the death time and cure time depend on explanatory variables such as the age, gender, pathological findings, tumor size, lymphovascular invasion, among   others. Let T i be the time-to-event having the GOLL-G cdf F(t) given by (7) and C i be the right censoring time. We observe t i = min{T i , C i }.
We define the systematic component φ i = exp(x T i β) for the agglomeration index φ i in terms of the explanatory vector x i (for i = 1, . . . , n), where β = (β 1 , . . . , β k ) T is the vector of coefficients. We can find the total log-likelihood function under different types of activation for any baseline G in the GOLL-G family.
The total log-likelihood function for the parameters ψ = (λ, α, θ , τ T , β T ) T (under non informative censoring) from n pairs of times and covariables (t 1 , x 1 ), · · · , (t n , x n ) can be expressed as where F and C denote the uncensored and censored sets of observations, respectively, and the functions S pop (·) and f pop (·) depend on the activation system. These functions are given in Equations (1) and (2) for the first activation, (3) and (4) for the random activation and (5) and (6) for the last activation. We can obtain the following expressions for l(ψ): • First activation • Random activation • Last activation where and g(·) and G(·) are baseline functions, and r is the number of failures.

Bayesian approach
The estimation of the parameters can be addressed using Bayesian methods. In this context, we specify prior distributions (by assuming two parameters for the baseline G) for β, λ, α, θ, a and b under different types of activation. In this context, we consider that these parameters have independent priors, namely π(ψ) = π(λ)π(α)π(θ )π(a)π(b) p j=0 π(β j ). (12) We consider the following priors: the gamma distribution (0.01, 0.01) for λ, α, θ , a and b and the normal distribution N(0, 10 3 ) for β j . Combining the likelihood function L(t | ψ) and the prior distribution in (12), the joint posterior distribution for ψ can be expressed as Inference is based on Markov Chain Monte Carlo (MCMC) simulation methods. In this paper, all computational implementations are performed using the JAGS software (Plummer [23]). The convergence of the chains is controlled based on the methods by Cowles and Carlin [13] as well as trace plots. The programs for calculating the estimates are found in Supplemental Material file.     We obtain the Average Estimates (AEs) and Root Mean Square Errors (RMSEs) from two methods. For the Bayesian estimates, we consider independent priors, where the values of the hyperparameters are fixed at λ 1 = λ 2 = α 1 = α 2 = θ 1 = θ 2 = a 1 = a 2 = b 1 = b 2 = 0.01 and σ 2 j = 100. We simulate two chains of size 35,000 for each parameter with a burn sample size of 5000 to eliminate possible effects of the initial values, and to avoid autocorrelation problems. We consider a spacing of size 10, thus obtaining an effective sample of size 6000 upon which the posterior inference is based.

Simulation study
We use a R script to generate the observed times according to the proposed activation scheme (minimum, random or maximum). Tables 3, 4 and 5 list the estimated cure proportions from the classical and Bayesian methods for two levels of x (p (0) 0 and p (0) 1 ) under these schemes for the NTA-GOLL-W model. The estimated proportions tend to the true parameters for both levels of x and the RMSEs decay toward zero when n increases for both methods. Further, the estimated cure proportions using the Bayesian analysis is better under frequentist for the first and last activation schemes. However, the frequentist results are better for the random and last activation schemes in terms of RMSEs. Among the   Tables 3, 4 and 5 provide the results of the different scenarios of the simulations considering the maximum likelihood and Bayesian estimation methods. They indicate that the MLE of the cure proportion has a lower RMSE than the estimate from the Bayesian method. We can also note that the RMSEs tend to decrease in both estimation methods when the sample size increases. Thus, these methods can be adopted to estimate the parameters of the regressions. Additionally, diagnostic plots were used to assess convergence and to observe if there is some problem in the chain of the proposed model for one sample simulated in the same conditions of the previous simulations. In this perspective, the NTA-III-GOLL-W model is used for the generation of the times and estimation of the parameters with a sample of size n = 800. Figure 5 displays the trace plots of the chains simulated data for the NTA-III-GOLL-W model. It can be noted in accordance with these plots that the sample generated presents satisfactory stability. The convergence of the chains is monitored by the methods of Cowles and Carlin [13]. The trace plots for the parameters of the NTA-III-GOLL-W model are displayed in Figure 5, which indicate convergence of the chains. The plots were about one sample generated with n = 800.

Checking model
A variety of techniques can be adopted for choosing the best model among several models fitted to a data set. The models are compared based on the well-known deviance −l = −l(ψ), Akaike information criterion (AIC) and Bayesian information criterion (BIC). The global influence measures of the ith observation on the estimates are GD i (ψ) = (ψ (i) −ψ) T [L(ψ)](ψ (i) −ψ) (Cook, [10]) and LD i (ψ) = 2[l(ψ) − l(ψ (i) )], whereψ (i) is the MLE when the ith observation is omitted from the sample.
The quantile residuals (qrs) for the fitted regressions (defined in Section 3) are • First activation • Random activation • Last activation where −1 (·) is the inverse of the standard normal cdf.

Application
We use the data set from Martinez et al. [19] referring to 201 patients with gastric adenocarcinoma to illustrate the applicability of the proposed regression. Adenocarcinoma is responsible for most cases of stomach tumors. The censored observations correspond to patients who died from other causes (or are still alive) at the end of the study (nearly a total of 53%). For more details, see Ortega et al. [22]. The time (in months) after surgery until death is the response variable and the covariates are: x i1 : therapy type (0 = adjuvant chemoradiotherapy, n = 125; 1=surgery alone, n = 76); x i2 : gender (0 = male, n = 124; 1 = female, n = 75) and x i3 : age (years) registered for each patient (i = 1, . . . , 201). Our interest is to verify the effects of the covariates on the cure fraction based on the proposed regression with different distributions and activation schemes. Figure 6(a,b) display the KM curves for each covariable. It is clear that there is more evidence on the difference in estimated survival curves by the KM method about the covariate type of therapy than gender. The total time test (TTT-Plot) introduced by Aarset [1] for these data in Figure 7(a) reveals a unimodal hazard rate and Figure 7(b) gives the KM-Plot. Based on these plots, we note a cure fraction far above zero. We fit the NTA-GOLL-W regression (with cure fraction) under the activation schemes (listed in Table 2) to these data with the systematic component log[φ(x i )] = β 0 + β 1 x i1 + β 2 x i2 + β 3 x i3 , and the cure fraction given by (for i = 1, . . . , 217) We compare the NTA-GOLL-G regression (with cure fraction) (Section 5) for some special models under three criteria. These statistics reported in Table 6 reveal that the NTA-II-GOLL-W regression provides the lowest values of the statistics. So, this regression can be chosen initially to fit these data. Based on the figures in Table 8 we can again note that   the SEs of the MLEs are smaller than those of the Bayesian method in agreement with the results in the simulations. The likelihood ratio (LR) statistic is adopted to verify if any sub-models fit better than the NTA-II-GOLL-W regression. The values of the LR statistics reported in Table 7 indicate that the NTA-II-GOLL-W regression is favorable in three hypothesis tests. These results are supported by the plots in Figure 8. The qrs in the qqnorm plot are closer to the first angle bisector and in the Worm Plot they are more located inside the region of the two bans with few fluctuations. Table 8 gives the estimates of the parameters for this regression using the classical and Bayesian methods, where it is noted that only the therapy type is significant. The scenarios used for the priors of the model are similar to those for the simulations obtaining positive results in the diagnosis of the chains for the convergence.   To detect possible influential observations after considering the NTA-II-GOLL-W regression as the most appropriate model, a global influence study is performed to calculate GD i (ψ) and LD i (ψ) defined in Section 5. Figure 9 provides the plots of these measures, where there are no influential or discrepant observations. Considering that this regression is more appropriate, we note that there are no influential or discrepant observations and only the treatment is significant. The parameters of the regression are again re-estimated with only this covariate in Table 9.  The estimated cure proportions for both treatment levels arep (x 1 =0) =p 0 = 0.42766 andp (x 1 =1) =p 1 = 0.41009, respectively. Figure 10 gives the estimated hazard function versus the empirical hazard function as well as the estimated survival function of the final regression according to the types of treatment. Findings • The numbers in Table 8 indicate that sex and age are non-significant under a 5% level.
• The type of therapy variable is significant, which means a significant difference between chemoradiotherapy and surgery alone. • We note from Figure 10(a) that the estimated risk function for the chemoradiotherapy level tends to decrease when time increases and presents a good adjustment in relation to the empirical risk function. • We observe from Figure 10(b) that the estimated risk function for the surgery alone level has a unimodal form, i.e. it grows until approximately 18 months and after that the function tends to decrease. • Figure 10(c) reveals a good fit of the proposed mochemoradiotherapydel. Note that chemoradiotherapy patients have a longer survival time than patients who have had surgery alone. • The estimated proportion of cure patients under chemoradiotherapy is 43%, while for patients undergoing only surgery it is 41%.
• The agglomeration index for patients under chemoradiotherapy is φ x 1 =0 = 3.0728 and for patients with surgery alone is φ x 1 =1 = 13.8156. So, the agglomeration rate for patients with surgery alone is approximately four times higher when compared to patients under chemoradiotherapy. In other words, the number of cancer cells ('clones') that the patient kept active is four times more crowded in patients under surgery alone than patients under chemoradiotherapy.

Concluding remarks
We propose the new Neyman type A generalized odd log-logistic-G (NTA-GOLL-G) family with cure fraction. This model is based on the different activation schemes in which the number of factors M has the Neyman type A discrete distribution and the time until the occurrence of the event of interest follows a generalized odd log-logistic-G family. The classical and Bayesian methods are used to estimate the parameters of the models. We also compare the performance of the estimates of the proposed model based on the Mean Estimate (ME) and Root Mean Square Error (RMSE) through a simulation study.
These simulations indicate that the new model can be useful to fit real data. We also discuss the influence measures using the global influence theory in the proposed regression. The approach is applied to a real data set, which indicates the usefulness of the new regression concerning the model selection and residual analysis. Thus, it is expected that this regression will be useful for fitting other data sets. Therefore, through the considerations mentioned above, the possible future works are: • A simulation study with other baseline distributions via frequentist and Bayesian analysis using the GOLL-G family; • A simulation study for the residuals in the long-term context using the GOLL-G family; • Influence measures in a Bayesian approach; • Consider linear and non-linear effects on systematic components (semiparametric regressions).