A Lindley–binomial model for analyzing the proportions with sparseness and excessive zeros

Proportional data arise frequently in a wide variety of fields of study. Such data often exhibit extra variation such as over/under dispersion, sparseness and zero inflation. For example, the hepatitis data present both sparseness and zero inflation with 19 contributing non-zero denominators of 5 or less and with 36 having zero seropositive out of 83 annual age groups. The whitefly data consists of 640 observations with 339 zeros (53%), which demonstrates extra zero inflation. The catheter management data involve excessive zeros with over 60% zeros averagely for outcomes of 193 urinary tract infections, 194 outcomes of catheter blockages and 193 outcomes of catheter displacements. However, the existing models cannot always address such features appropriately. In this paper, a new two-parameter probability distribution called Lindley–binomial (LB) distribution is proposed to analyze the proportional data with such features. The probabilistic properties of the distribution such as moment, moment generating function are derived. The Fisher scoring algorithm and EM algorithm are presented for the computation of estimates of parameters in the proposed LB regression model. The issues on goodness of fit for the LB model are discussed. A limited simulation study is also performed to evaluate the performance of derived EM algorithms for the estimation of parameters in the model with/without covariates. The proposed model is illustrated through three aforementioned proportional datasets.


Introduction
Discrete data in the form of proportions have been used to construe occurrences in many potential fields, including biology, clinical trials, engineering, insurance, public health, engineering, ecology, econometrics, etc.Generally, the binomial models are often used to analyze such kind of discrete data.However, in many practical situations, these data often exhibit extra-variation (over/under dispersion).Other issues arisen from such kind of data include the excessive zeros and sparse observations.When those issues are not properly addressed, the analysis using usual binomial models may not provide a good fit to the proportional data [13] and fail to explain the kinds of variation to the actual data.
Therefore, the statisticians have widely addressed the phenomenon of large variation in the proportional data, and the most popular model used for over-dispersed proportional data is the beta-binomial (BB) model, which was proposed originally by Williams [33].Afterward, Crowder [5,6] analyzed the proportions using the beta-binomial ANOVA.Paul [25] applied the beta-binomial model to the analysis of the proportions of affected foetuses.Recently, Menssen and Schaarschmidt [21] obtained the prediction intervals for overdispersed binomial data.Najera-Zuloaga et al. [24] analyzed health-related quality of life data using the beta-binomial regression model approaches.Furthermore, Deng and Paul [7] proposed score tests for zero inflation and overdispersion based on the zero-inflated beta-binomial models.Luo and Paul [18] considered the estimation for zero-inflated betabinomial regression model with missing response data.Ascari and Migliorati [2] proposed a new beta-binomial model for overdispersed binomial data with outliers and an excess of zeros.Generally, the BB model can account for overdispersion, allowing the binomial probability to vary in terms of a beta distribution.In particular, the BB model enriches (and encompasses) the binomial model with an additional precision/dispersion parameter, which admits an interesting interpretation in terms of intraclass correlation as well [26,27].
However, it is no model that works well for the various proportional data at all time.For example, the features such as sparseness and zero-inflation presented in proportional data cannot be appropriately addressed using the existing models.There are such features in the hepatitis, whitefly and catheter management data sets.The hepatitis data set was given by Keiding [15] and exhibits the sparseness with 19 out of 83 annual age groups contributing non-zero denominators of 5 or less out of 83 groups.Also, there are 36 zero seropositives out of 83 annual age groups in this data set.Therefore, this dataset indicates not only extreme sparseness but also excessive zeros.The whitefly data are obtained from the experiment that is about the efficacy of the pesticide on whiteflies.van Iersel et al. [32] studied the purpose of controlling silver leaf whiteflies by using a subirrigation system.They conducted this study to determine the effectiveness of controlling silver leaf whiteflies on poinsettia with imidacloprid, which was delivered by a subirrigation system.The number of surviving whiteflies combining with the total number of whiteflies in each experiment formed as the proportional data.Whitefly data set consists of 640 observations with 339 zeros (53%).Compared with other observations (each of the other observations has an average of 3.5%), obviously there exist excessive zeros in the whitefly data.The data on catheter management study are collected in the randomized clinical trial, in which the indwelling urinary catheter users were taught the awareness and self-monitoring skills.Each patient was asked up to six times about three binary outcomes and the total number of times asked varies from 1 to 6. Therefore, the outcomes can be considered as the times of urinary tract infections (UTIs), catheter blockages, and catheter displacements during total six asking times and thus three sets of proportional data with binomial denominator six (6) were obtained.Further, there are 83 (43.0%), 127 (65.5%), and 140 (72.5%)zeros in the 193 outcomes of urinary tract infections (UTIs), 194 outcomes of catheter blockages, and 193 outcomes of catheter displacements, respectively.Thus there are very high volumes of zeros in catheter management data set.The results for analyzing these three datasets using the binomial and zero-inflated binomial models have shown that the fits of these models to hepatitis, whitefly and catheter management data are not very appropriate.Therefore, instead of the existing models, an alternative model should be considered to fit the proportional data with extra variation, which the existing models may not be able to handle appropriately.Such model can be obtained by compound binomial distribution with a nonnegative distribution.The target distribution is Lindley distribution, which was first introduced by Lindley [17].This distribution is quite popular for modelling lifetime data and has a wide applicability in survival and reliability because of its closed forms for the survival and hazard functions and also its good flexibility of fit.Furthermore, many researchers have proposed and studied new classes of distributions which compound with a family of Lindley distributions.For example, Sankaran [28] proposed a compound Poisson distribution, which is known as the Poisson-Lindley distribution, by mixing the Poisson distribution with Lindley distribution.He gave some examples of real data sets that the Poisson-Lindley distribution provided a good fit.Zamani and Ismail [35] proposed the negative binomial-Lindley distribution, by mixing the distributions of negative binomial and Lindley, and found that this two-parameter negative binomial-Lindley distribution is particularly suitable in explaining count data with excess zeros, based on the application to accident and insurance claims data.Then Calderin-Ojeda and Gó mez-Déniz [4] extended the negative binomial Lindley distribution from univariate to multivariate, which provides a tractable model with attractive properties that makes it suitable for application in any field where overdispersion is observed in count data.Bhati et al. [3] introduced a new generalized Poisson-Lindley distribution, which is compounding Poisson distribution with two-parameter Lindley distribution.They proved that this new distribution is a good alternative to Poisson distribution and Poisson-Lindley distribution for the righttailed data set.Tajuddin et al. [30] proposed a four-parameter negative binomial-Lindley distribution to model over-and under-dispersed count data with excess zeros.There are many other applications with compound Lindley distributions, which indicate the Lindley distribution is definitely popular and useful.
Nevertheless, to our knowledge, there is no research that focuses on the model, which is derived by compounding the binomial model with a member of Lindley distribution family.The purpose of this paper is to propose such distribution for proportional data, which is called as Lindley-binomial (LB) distribution by compounding the binomial distribution with a Lindley distribution.The distribution that we actually used for compounding with binomial distribution is a two-parameter Lindley distribution, which is also used for compounding with other distributions by many researchers.Therefore, the proposed distribution in this paper is a new generalized two-parameter Lindley-binomial distribution and is a good alternative to BB distribution.This distribution allows to enrich the variance structure so as to account for multiple causes of overdispersion.The great variety of possible shapes of the LB distribution with right/left-tailed behaviors directly demonstrate the flexibility of the corresponding model and address the presence of outliers, sparseness as well as excessive zero observations without requiring ad hoc extra components accounting for them.This is possible because the two-parameter Lindley distribution is the mixture of two gamma distributions and thus the LB model dedicates one of its mixture components to a particular group of observations (e.g.zero values and/or outliers) automatically and provides interesting information about the possible sources of extra variation.
The remainder of this paper is organized as follows.In Section 2, we propose a Lindley binomial distribution, which is inspired by good statistical properties of Poisson Lindley distributions.We then derive the probabilistic properties such as probability mass function, mean, variance and moment generating function for Lindley binomial distribution.The likelihood-based statistical inference about parameters of interest is studied in Section 3.Moreover, the Fisher scoring algorithm and EM algorithm are given to compute the estimates of parameters for the proposed Lindley binomial regression model.Meanwhile, Pearson chi-squared residuals and deviance residuals are presented to assess the goodness of fit for the proposed and existing models.In Section 4, simulation studies are performed to evaluate the performance of proposed EM algorithm for the computation of estimates of parameters in the proposed LB model with/without covariates.Hepatitis data, whitefly data and Catheter management study data are analyzed as an application of the proposed methodology in Section 5 with the concluding remarks in Section 6.

Lindley-binomial distribution and its properties
In this section, to define the Lindley binomial distribution, we briefly review the twoparameter Lindley distribution.As first given by Shanker [29], a random variable X follows a two-parameter Lindley distribution, denoted as X ∼ L 2 (α, θ) if the probability density function f (x; α, θ) of X has the following form: The pdf of this two-parameter Lindley distribution can be also shown as a mixture of exponential distribution (θ ) (or gamma distribution (1; θ)) and gamma distribution (2; θ) as follows: with a different mixture proportion π = θ α+θ , f 1 (x; θ) = θ e −θ x and f 2 (x; θ) = θ 2 xe −θ x .It can easily be seen that at α = 1, the two-parameter Lindley distribution reduces to the one parameter Lindley distribution, which was first proposed by Lindley [17].Furthermore, this distribution reduces to exponential distribution with mean θ −1 and gamma distribution with mean 2θ −1 for α = 0 and α = ∞, respectively.Therefore, for two-parameter Lindley distribution, the parameter α can account for the mixture proportions if this distribution is considered as the mixture of two gamma distributions.

The definition of Lindley-binomial distribution
Now in what follows, based on the binomial distribution and two-parameter Lindley distribution, we give the definition for Lindley-binomial distribution as follows.
From Definition 2.1, Lindley-binomial distribution is induced by compounding the binomial distribution with two-parameter Lindley distribution and the probability mass function with corresponding properties are presented in Proposition 2.1.The derivation of Proposition 2.1 is given in Supplemental Material.Proposition 2.1: Let Y be a random variable which follows a two-parameter Lindley binomial distribution LB 2 (m, α, θ).Then the probability mass function of Y has the following form: where y = 0, 1, 2, . . ., m; θ > 0 and θ + α > 0.
Note that for α = 0(π = 1), Lindley binomial distribution becomes a beta-binomial distribution BB(m, θ , 1) with the pmf as where B(β 1 , β 2 ) is the beta function defined as In this sense, Lindley binomial distribution can be considered as the partial generalization of beta binomial distribution.

The probabilistic properties of Lindley-binomial distribution
In this section, we present the probabilistic properties of Lindley binomial distribution.At first, we may see the shapes of this probability mass functions for various values of parameters.
The probability mass function of LB 2 (m, α, θ) distribution with different values of parameters are given in Figure 1 and Figure S1 of the supplemental file.These figures are plotted to explore the effect of one parameter to the probability value given that other parameter is fixed.Based on Figure 1, the pmf of LB 2 (m, α, θ) distribution has the lowest mass at zero and the probability is significantly small at zero when θ is large such as θ = 10 or 100.Moreover, LB 2 (m, α, θ) distribution has the ability in fitting data with large frequency at the right endpoint.However, when θ is quite small, the pmf of LB 2 (m, α, θ) is right-tailed and has the highest mass at zero.Thus this proposed distribution is an alternative model to adequately fit the proportional data with the large frequency at left endpoint or at right endpoint.Also when the value of α is fixed, the maximum point of pmf changes from zero to the binomial denominator m as the value of θ changes from small to large.From Figure S1 in the supplemental file, when the value of θ is fixed, the shapes of pmf almost keep same even the value of α changes from small to large, which means θ is a shape parameter in LB 2 (α, θ) distribution.
Next, we discuss the probabilistic properties.We first give the following proposition.

Proposition 2.2:
Let Y be a random variable which follows a two-parameter Lindley binomial distribution LB 2 (m, α, θ).Then the rth factorial moment of Y is given by Now from Proposition 2, we have the expectation and variance of Lindley binomial random variable Y as follows: Further, the moment generating function of Y can be derived using the law of total expectation: and the probability generating function and characteristic function of Y can be obtained in the same way.
Since the proposed Lindley binomial distribution will be used to fit the extra dispersed proportional data, the index of dispersion for this distribution should be addressed.The index of dispersion, which is also called variance-to-mean ratio, is a very useful tool to indicate whether a set of observations is clustered or dispersed compared to a standard statistical model.The index of dispersion for the LB 2 (m, α, θ) distribution comes out to be The index of dispersion given in (3) cannot be directly used for finding out whether LB 2 can adequately fit under-dispersed data, over-dispersed data or both of them.Thus, values of are plotted as a function of parameters α and θ in Figure 2 to show the ability of the LB 2 distribution in fitting the data with either over-dispersion or under-dispersion.
Based on the plot in the left panel of Figure 2, it is obvious that the index of dispersion for the LB 2 distribution can be either greater than one or less than one depending on the choice of the parameters.The plot in the right panel of Figure 2 indicates that as the value of θ increases, the value of decreases and approaches to zero.Therefore, from these plots one could conclude that the LB 2 distribution can adequately fit over-dispersed data or under-dispersed data by choosing different values of parameters θ and α.

Likelihood based inferences for Lindley-binomial regression model
In Section 2, Lindley binomial distribution is defined based on the two-parameter Lindley distribution L 2 (α, θ) given in (1).However, the probability mass function has a little complicated form and thus results in the complexity of statistical inference for Lindley binomial model.To simplify the procedure of inferences, by reparameterizing to Lindley binomial distribution based on the mixture model ( 2) and setting π = θ/(θ + α), φ = 1/θ , we can have a following expression for the probability mass function of Lindley-binomial random variable Y: where y = 0, 1, 2, . . ., m; 0 < π < 1 and φ > 0. For convenience, we denote above form of Lindley binomial distribution as LB(m, π, φ).Furthermore, in terms of new parameters π and φ, we have the expressions for the factorial moment, mean, variance and moment generating function for Lindley binomial random variable Y as follows: and . ., n are unknown parameters.Suppose that y i is the realization of random variable Y i , then the observed data and associated binomial denominators would be represented by y obs = {y 1 , . . ., y n } and m obs = {m 1 , . . ., m n }.Based on the probability mass function of Y given by ( 4), the likelihood function for the parameters (π , φ) = (π 1 , . . ., π n , φ 1 , . . ., φ n ) can be obtained as and thus the log-likelihood function is

MLEs of parameters for LB regression model
Based on the discussion above, we now derive the maximum likelihood estimates of parameters for LB regression model.

The formulation of LB regression model
Let Y 1 , . . ., Y n be independent random variables from the Lindley binomial distribution and Y i follows the Lindley binomial distribution LB(m i , π i , φ i ), where for i = 1, . . ., n, m i are the known binomial denominators, π i and φ i are the unknown parameters.Further, let w i and x i be the covariates associated with the proportional parameters π i and scale parameter φ i , respectively.Now suppose y i is the realization of the random variable Y i , then the observed data and associated binomial denominators would be represented by y obs = {y 1 , . . ., y n } and m obs = {m 1 , . . ., m n }.To investigate the relationship between the parameters in LB model and covariates, the following regression model can be used to establish the association of parameters π i and φ i with covariates w i and where w i = (1, w 1i , . . ., w pi ) and x i = (1, x 1i , . . ., x qi ) are not necessarily identical covariate vectors associated with the subject i; (i = 1, . . ., n), and α = (α 0 , α 1 , . . ., α p ) and β = (β 0 , β 1 , . . ., β q ) are the vectors of the regression coefficients associated with π i and φ i (i = 1, 2, . . ., n), respectively.Therefore, based on ( 5) and ( 6), the log-likelihood function for the regression coefficients θ = (α, β) has the following form: The primary purpose of the following sections is to estimate the parameter vectors α and β.

MLEs of parameters via Fisher scoring algorithm
In this section, the Fisher scoring algorithm is derived to calculate the MLEs of the parameters α and β.Now, based on Equation ( 7), the first partial derivatives of log-likelihood with respect to α and β are ∂ (α, β|y obs , m obs ) ∂α The maximum likelihood estimates α and β are the solutions of Equations ( 8) and ( 9) equaling to zero.However, there are no closed forms for these estimators and these non-linear equations do not seem to be solved directly.Their computations should be performed numerically using nonlinear optimization algorithms.Generally, Fisher scoring algorithm is a commonly used method to calculate maximum likelihood estimation and has a good stability even in multiparameter cases.The expected Fisher information matrix should always be positively definite, when the model is not over-parameterized Lauritzen [16].Therefore, in what follows, we discuss Fisher scoring algorithm for computing the MLEs of parameters.The Newton-Raphson algorithm can be derived in similar way.To apply Fisher scoring algorithm, the Hessian matrix should be obtained first as follows: Then the Fisher information matrix J(α, β) = −E∇ 2 (α, β|y obs , m obs ) is Now let (α (0) , β (0) ) be the initial values of the MLEs ( α, β).If (α (t) , β (t) ) denote the tth approximation of ( α, β), then the (t + 1)th approximation can be computed by ∂ (α (t) , β (t) |y obs , m obs ) ∂α ∂ (α (t) , β (t) |y obs , m obs ) ∂β and the MLEs of (α, β) could be (α 1) || is less than a threshold value.It should be pointed out that Fisher information matrix in LB model is intractable.However, it could be replaced with the observed information matrix in above Fisher scoring algorithm.

MLEs of parameters via the EM algorithm embedded with Fisher scoring algorithms at each M-step
In this section, we will develop the EM algorithm embedded with Fisher scoring algorithms at each M-step for MLEs of parameters in the proposed LB regression model.Note that the Fisher scoring algorithm possesses quadratic convergence and is sensitive to initial values.When the initial value (α (0) , β (0) ) of Fisher scoring algorithm is sufficiently near ( α, β), it converges very fast.However when the chosen initial value of (α (0) , β (0) ) is far from the true value of (α, β), it might not converge.Furthermore, as we mentioned before, Fisher scoring algorithm does not work because Fisher information matrix is intractable from the likelihood function for the observed sample in LB model.Therefore in terms of the mixture property of two-parameter Lindley distribution, the EM algorithm can be developed to compute the estimates of parameters for Lindley binomial model.In fact, the EM algorithm for maximum likelihood estimation uses the data likelihood as the objective function for choosing parameters.Sometimes this algorithm may not work well in all cases and the penalized methods may be used to modify the objective function.The typical penalized methods include the ridge penalty, the Bayes-inspired penalty and the logistic regression penalty (Hoerl and Kennard [14]; Hastie, Tibshirani and Friedman [11]; Morris [23]; Moreno and Lele [22]).Since the ridge penalty term only includes no-intercept parameters in the model and the logistic regression penalty involves the absolute values of parameters, we select the Bayes-inspired penalty for the estimation of parameters in the proposed EM algorithm.The modified objective function has the following form: where τ is a tuning parameter trading off the likelihood and penalty terms.Therefore we will derive the EM algorithm for computing the MLEs of parameters in the proposed Lindley binomial model with Bayes-inspired penalty.As we know, the EM algorithm is a popular tool for estimating maximum likelihood estimation in joint statistical models by iterating between E-step and M-step.The E-step represents the expectation of the log-likelihood.The M-step computes parameters maximizing the expected loglikelihood found on the E-step.Then the unobserved latent variable is determined by these estimated parameters in the next E-step.
To establish the EM algorithm for the computation of MLEs for the parameters in LB regression model, we first set up the stochastic representation for the two-parameter Lindley distribution.Based on the reparameterization for the LB model, the pdf of two-parameter Lindley random variable has the form: where f 1 (λ; φ) and f 2 (λ; φ) are the pdfs of gamma(1, φ) random variable U and gamma(2, φ) random variable V, respectively.Based on (11), the random variable can be stochastically represented as where Z follows the Bernoulli distribution with success probability π , that is This latent variable Z specifies to which mixture component each observation belongs.Therefore for a given , there is an associated latent variable Z and the distribution function of can be rewritten as Let p(z; π) represent the probability mass function of Z. Then the joint probability function is From the above expression, the full conditional distribution of Z is given by where from Evin et al. [9], For observed sample y i with i = 1, 2, . . .n from LB(m i , π i , φ i ) distribution, based on (12) we introduce independent latent variables Z i and i : We denote the latent/missing data by Y mis = {z i , λ i } n i=1 and the complete data by Y com = {Y obs , Y mis } = Y mis , where z i , λ i are the realizations of Z i and i , respectively.Thus the complete-data likelihood function is given by and the complete-data log-likelihood function is proportional to Hence, based on the regression model ( 6) and ( 14), the complete-data log-likelihood for regression parameters α and β is proportional to (15) and the log-likelihood function with penalty has the form as Now, the first and negative second partial derivatives of the complete-data penalized loglikelihood function ( 16) are given by where , I p+1 and I q+1 are the (p + 1) × (p + 1) and (q + 1) × (q + 1) identity matrices, respectively.Note that J (τ ) com (α) is actually the complete-data Fisher information matrix associated only with the parameter vector α and the covariate matrix w, since it depends on neither the observed responses nor the latent/missing data.However, J (τ )  com (β) is associated not only with the parameter vector β and the covariate matrix x but also with the latent variables λ = (λ 1 , . . ., λ n ).
Now, the M-step is to separately calculate the MLEs of α and β via two Fisher scoring algorithms as follows: The E-step is to replace the latent variables z, λ in ( 17) and ( 18) by their conditional expectations: and where and where The derivation of ( 21) and ( 22) is given in Supplemental Material.Now, let α and β are the estimates of the parameters α, β, respectively.Then the asymptotic covariance matrices for α and β can be obtained as ĉov( α) = J −1 com ( α), ĉov( β) = J −1 com ( β) and thus the corresponding confidence intervals for the components of α and β can be constructed by using the Wald-type method.For the value of τ , the EM algorithm can be carried out for each generated data set in simulation and τ can be chosen via maximizing the likelihood for the simulated data.
Remark: For the LB model without covariate, the MLEs of parameters π and φ can be easily obtained from aforementioned Fisher scoring algorithm and EM algorithm based on the reduced model and setting π = e α0 /(1 + e α0 ) and φ = e βo .The details for these algorithms are omitted.

The issues for hypothesis testing and goodness-of-fit testing in LB model
In this section, we first discuss the hypothesis testing for Lindley binomial model.Since Lindley binomial distribution is derived via compounding the binomial model with the two parameter Lindley binomial distribution, which is the mixture of gamma distribution (1, θ) and gamma distribution (2, θ), one may like to know if Lindley binomial model is obtained via compounding the binomial distribution with a single gamma distribution.Therefore, the hypotheses we are interested in are H (1)  0 : π = 0 versus H (1)  1 : and The hypothesis ( 23) is to test if the LB model is derived via the binomial model compounding with the gamma distribution (2, θ)) and ( 24) is to test if the model follows the specific beta binomial model BetaBin(m, θ , 1).Now, based on the likelihood method of model (4), the LRT statistics for testing the hypotheses ( 23) and ( 24) have the following forms, respectively: where ( π, φ) are the unconstrained MLEs of (π , φ), which can be obtained via the Fisher scoring algorithm or EM algorithm given in Sections 3.1.1and 3.1.2for the LB regression model with π = exp(w α)/(1 + exp(w α)) and φ = exp(x β).φ1 , and φ2 are the MLEs of φ under the null hypotheses H (1)  0 and H (2)  0 , respectively, which can be derived in analogous algorithms as the unconstrained MLEs of (π , φ).
Under H (1)  0 and H (2)  0 , the LRT statistics T 1 and T 2 approximately follow the chi-squared distribution with one degree of freedom and the corresponding p-value can be computed as where t 1 and t 2 are the observed values of T 1 and T 2 , respectively.Further, the LRT method can also be used to test the general null hypothesis H 0 : π = π 0 .
LB(m, π, φ) , Next we consider to assess the goodness of fit (GOF) for the LB model.There are numerous methods for testing the GOF of a model for proportional data.The most standard tests are residual deviance and Pearson's χ 2 -test.We first discuss the Pearson residual for assessing the GOF in the commonly used models with proportional data.These models are binomial BIN(m, π), beta binomial BB(m, α, β), zero-inflated binomial ZIB(m, ω, π) and Lindley binomial LB(m, π, φ) models.The Pearson residual, defined as the raw residual normalized by the estimated standard deviation of the response variable, can be expressed as , where μi is the fitted value for y i and V(y i ) is the estimated value of variance for y i .
The following table presents the Pearson residuals for aforementioned proportional models.
Note that all considered models are assumed to involve no covariates.For the case that the covariates are involved in the models, the corresponding parameters would depend on the covariates.Further, based on Pearson residuals given in Table 1, Pearson chi-squared statistic for GOF test is defined as Under a correctly specified model, X 2 1 follows an approximate chi-square distribution χ 2 n−p , where n is the sample size, and p is the number of parameters.
Next we consider the deviance residuals for the proportional models.The general deviance statistic, which is defined as twice the difference between the log-likelihood for the saturated and fitted models, can be expressed as follows: where p(y i | θs ) is the log-likelihood function for the saturated model and θs is the parameter estimates for the saturated model, in which there are as many estimated parameters as data points [1,19].By definition, a saturated model leads to a perfect fit to the data and has the highest log-likelihood among all models.log[p(y i | θ)] represents the log-likelihood function of the fitted model and θ denotes the parameter estimate for the fitted model.Deviance residual represents the contribution of individual observation to the deviance D(y, μ), which is defined as the signed square root of the corresponding component for D(y, μ) and can be written as where . Also, we list the deviance residuals for the commonly used proportional models in the following table.Same as the Pearson Chi-squared statistic, the deviance statistic lows an approximate chi-squared distribution χ 2 n−p under the correctly specified models.Therefore the goodness of fit in the proposed LB model can be assessed using Pearson chi-squared statistic and deviance statistic.
Moreover, for the proportional data with equal number of binomial denominators, other chi-squared statistic can be used to test the goodness of fit for the aforementioned models.This statistic has the following form: where m is the value of binomial denominator for each experiment; O y is the number of observations with y successes in all experiments; E y is the expected number with y successes from the fitted model, which can be calculated as E y = np y where n is the total number of observations and py is the expected probability that y successes are observed in m Bernoulli trials.Furthermore, py can be obtained as py = P(Y = y) = p(y| θ), y = 0, 1, . . ., m and p(y| θ)(y = 0, 1, . . ., m) is the expected probability calculated from the fitted model.Moreover, X 2 2 follows the approximate chi-squared distribution χ 2 m−p under the correctly specified models, where p is the number of parameters in the fitted model.
On the other hand, the likelihood ratio or maximum likelihood statistical significance test G is increasingly being used in situations where chi-squared test X 2  2 is previously recommended [20].The formula for G has the following form: where O y and E y are the same as that in chi-squared X 2 2 .Note that this G-test statistic is twice Kullback-Leibler divergence of the theoretical distribution from the empirical distribution.We may call it as Kullback-Leibler divergence statistic for the goodness of fit test.Also, the test statistic G has the same approximate chi-squared distribution as X 2  2 .For diagnosing the models with proportional data, we can compare the residuals and calculate the values of Pearson chi-squared statistic X 2  1 , deviance statistic D 2 and statistic X 2  2 , Kullback-Leibler divergence statistic G (with equal number of binomial denominators) among all proportional models and thus find the best fitted model.Meanwhile, Akaike information criterion (AIC) and Bayesian Information Criterion (BIC) can serve as the diagnosis tools for the selection of proportional models.

Simulation study
In this section, we carry out a limited simulation study to evaluate the performance of the proposed statistical methods in Section 3 for the LB model.We first examine the accuracy of EM algorithm for computing the MLEs for different parameter settings in the proposed LB models without covariates via simulation studies.Then we investigate its accuracy for the computation of regression parameters in the LB model with covariates.
From each generated sample the MLEs of parameters π and φ are calculated via EM algorithm ( 17)-( 20) and the corresponding standard errors are obtained by using the asymptotic variances v ar( π) = π(1 − π)/(n + τ π(1 − π)) and v ar( φ) = φ2 /(2n − n π + τ φ2 ).Next with repeating times G = 1000 for the parameters π and φ, the 1000 samples are independently generated and the corresponding 1000 EM MLEs and 1000 standard errors for parameters π and φ 1 are obtained.Further, in Table 3 and Table S1 of the supplemental file, MLE is the average of the 1000 estimates via the EM algorithm ( 17)- (20); MSE is the average of 1000 standard errors.As seen in Table 3 and Table S1 of the supplemental file, in most of scenarios the biases are small and the MLEs are very close to the corresponding true values of parameters in most cases, although the biases of EM estimates for some scenarios are a little large with sample size n = 50, 75 and 100.However, by comparing the MLE and MSE, there is no significantly difference between the true values and the estimated values of parameters, which demonstrates that the proposed EM algorithm has very good performance.
From Table 4, Tables S2 and S3 of the supplemental file, one can see that the simulation results for LB regression model are consistent with that for LB model without covariate.The estimates for the parameters in LB regression model obtained via the EM algorithm are accurate although the biases are a little large for the scenarios that π = 0.75 and n = 50, 75.

Real-data application
In this section, we apply the proposed Lindley binomial model to analyze three proportional data sets with extreme sparseness and excessive zeros

Incidence of hepatitis A in Bulgaria
The data set used in this section is the incidence of hepatitis A in Bulgaria by age.This data set was given by Keiding [15] and exhibits the sparseness with 19 out of 83 annual age groups contributing non-zero denominators of 5 or less out of 83 groups.Farrington [10] presented an analysis of this data set fitting to generalized linear models.They used the number of seronegatives as response variable with binomial errors.For the illustration of our proposed LB model, we consider the number of seropositives as binomial response Here, the mixture parameter π in LB model and the zero-inflated parameter ω are assumed to be fixed for different subjects.The analysis of the data using these two regression models is given in Table 6, from which LB model obviously demonstrates the superiority to ZIB model.Therefore based on the results given in Tables 5 and 6, we may conclude that the LB model has a better performance than binomial, BB and ZIB models for fitting hepatitis data.

Whitefly data
This whitefly data set is the result of the experiment that is about the efficacy of the pesticide on whiteflies given by [32].In the whitefly data, the purpose of controlling silver leaf whiteflies was studied by using a subirrigation system.The study was designed to determine the effectiveness of controlling silver leaf whiteflies on poinsettia with imidacloprid, which was delivered by a subirrigation system.Imidacloprid is a resilient and powerful chemical that has low toxicity to mammals and is used to control silver leaf whiteflies on poinsettia.At the first week of this experiment, researchers placed m adult whiteflies (here m is considered as the binomial denominators with range 6-15, mean = 9.5 and SD = 1.7) in clip-on leaf cages attached to one leaf per plant and then recorded the number of surviving whiteflies 2 days later, which is considered as the response variable.To measure reproductive inhibition, the fly cages were removed after obtaining the survival count but the position of each cage was marked.In the coming week, m adult whiteflies were placed in clip-on leaf cages attached to one leaf on the same plant and the number of surviving whiteflies were recorded.Therefore the number of surviving whiteflies combining with the total number of whiteflies in each experiment formed as the proportional data.There are 640 observations in the final data set with 339 zeros(53%).Compared with other observations (each of the other observations has an average of 3.5%), obviously there exist excessive zeros in the whitefly data.Therefore, this data set is an appropriate data set to be used to test the ability of the proposed LB model fitting data with zero inflation.
To consider the effects of covariates, we apply our proposed LB regression model and ZIB regression model to analyze the whitefly data.To compare our proposed model with ZIB mode, we consider the following covariates: Here, the mixture parameter π in LB model and the zero-inflated parameter ω in ZIB model are assumed to be fixed for different subjects.The computational procedures of EM algorithm presented in Section 3.1.3are used to calculate the MLEs of the regression coefficients.The calculation results based on LB regression model and ZIB regression model are summarized in Table 7.One can see that all indices of model diagnosis demonstrate that the proposed LB regression model outperforms the ZIB regression model.

Catheter management study
The data on catheter management study was used as an example in [12].The purpose of this randomized clinical trial was to teach indwelling urinary catheter users the awareness and self-monitoring skills.It was conducted in New York state, and 202 subjects were recruited and randomized to the intervention and control groups.The primary outcomes of interest are whether the subjects experienced urinary tract infections (UTIs), catheter blockages, and catheter displacements during the last 2 months, as well as the corresponding counts of these experiences.Thus each patient was asked up to six times about three binary outcomes.Due to the death or dropout, some patients were asked for less than six times.So the total number of times asked varies from 1 to 6.However, the outcomes with the asking show that there are structural zeros in the three outcomes.For the purpose of illustration, we analyze these three proportional data sets by using our proposed LB model and compare it with ZIB and BB models.Table 8, Tables S4 and S5 of the supplemental file present the results from the analysis of three proportional datasets based on the ZIB, BB and LB models without covariates.In the analysis of these proportional data, the binomial denominators for all outcomes of urinary tract infections (UTIs), catheter blockages, and catheter displacements are same.Instead of Pearson chi-squared test X 2 1 and deviance test D 2 , the chi-squared test X 2  2 and Kullback-Leibler divergence G are used to assess the GOF for all three models.
From the results given in Table 8, Tables S4, S5 and Figure S2 of the supplemental file, one can see that in terms of likelihood, AIC, BIC, chi-squared X 2  2 and Kullback-Leibler divergence G, Lindley binomial model has the best performance among three models for fitting of outcomes of urinary tract infections (UTIs), catheter blockages and beta binomial model shows the superiority to other two models for fitting of outcomes from catheter displacements.These results also show that although there exist zero-inflations in three proportional data sets, the zero-inflated binomial model may not fit the data very well and the existence of zero inflation in the data does not means there exist the structural zeros.

Concluding remarks
In this paper, a new model for proportional data, called 'Lindley binomial model' has been proposed.The model is defined by compounding the binomial distribution with Lindley distribution.It can also be regarded as an extension of binomial model and can be used to fit the proportional data with extra variation such as the over/under dispersion, sparseness and zero inflation and thus is more flexible model for analyzing the proportional data.Specially, this model may be more appropriate to the binomial data with big probability at zero or at the binomial denominator.The Fisher scoring algorithm and EM algorithms are derived for the computation of the estimates of parameters in the proposed regression model.The simulation results demonstrate that the proposed EM algorithm has an excellent performance for the computation of MLEs of parameters in the proposed Lindley-binomial model with/without covariates.Hepatitis data, whitefly data and catheter management study data are used to demonstrate the proposed model and inferential methods in the proposed Lindley-binomial model.The results show the Lindley-binomial model has the advantage for the analysis of proportional data with sparseness and excessive zeros.
However, there is no model that can fit all kinds of proportional data.For example, the proportional data may display large frequencies of both zeros and right endpoints.Our proposed LB model can only address the single endpoint inflation (zero inflation or rightendpoint inflation).Therefore the proposed model has the limitation for the application.Deng and Zhang [8] and Tian et al. [31] proposed the endpoint inflated binomial model with the statistical properties to fit such data.We may consider the different way to account for the endpoint inflation.In fact, we are thinking about to extend the proposed Lindley binomial model by compounding the binomial model to an analogue of Lindley models.In current research, we actually compound the binomial distribution with the mixture of two gamma distributions with same rate parameter θ .Our idea for this new distribution is to compound the binomial distribution with mixture of two gamma distributions with different rate parameters like gamma distribution (1, θ 1 ) and gamma distribution (2, θ 2 ).If this is working for the bimodal data, it will be another alternative to fit the proportional data.We will be doing such research in the future.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1 .
Figure 1.Pmf plots of LB 2 for different θ values with fixed α values.

Figure 2 .
Figure 2. The plots for the index of dispersion with different values of α and θ.

x = ( 1 ,
plant, block, trt(i.e.treatment), week, trt × block, trt × week) The ZIB regression model and LB regression model are as follows: LB model: Y i ∼ LB (m i , π, φ i ) and log φ i = x β LB and ZIB model:

Table 1 .
Pearson residuals for commonly used models for proportional data.

Table 2 .
Deviance residuals for commonly used models for proportional data.
i − m i π) 2y i log

Table 5 .
Results of model fitting for the Hepatitis in Bulgaria dataset., for which, there are 36 zero seropositives out of 83 annual age groups.Therefore, this data set exhibits not only extreme sparseness but also excessive zeros.For the purpose of comparison, we perform the statistical analysis for this data set using the Lindley binomial (LB) model as well as the simple binomial model, beta-binomial (BB) model and zero-inflated binomial (ZIB) model and compare the proposed LB model with these existing proportional models via the GOF test statistics given in Section 3.2.Table5presents the values of log-likelihood, AIC, BIC, Pearson chi-squared statistic X 2 1 and deviance statistic D 2 with the values of estimated parameters for the model fitting with the aforementioned four models.The estimates of parameters in BB model are computed using 'bb.mle'package in R programming environment.However, it should be pointed out that the estimated values of parameters heavily depend on the initial values of parameters in 'bb.mle' package for beta binomial model.Based on the results in Table 5, although the Pearson chi-squares statistic for BB model is smallest among the four models, the LB model gives the largest log-likelihood −154.9336, and smallest AIC 313.8672, smallest BIC 318.7048 and smallest deviance statistic 189.9486.Therefore, the proposed Lindley-binomial model shows the variable

Table 6 .
Results for the analysis of Hepatitis data using LB and ZIB regression models.theproportionaldata with sparseness and excessive zeros.Next, we consider assessing the mixture parameter π in LB model for this dataset.The likelihood ratio test (LRT) can be used to test if this parameter equals zero or one.For hepatitis data, the value of LRT is 2[ LB ( π, φ) − LB (0, φ)] = 2(−154.9336−(−155.4260))=0.9848,which strongly support the null hypothesis H 0 : π = 0.This also shows that the data may come from the distribution which compounds the binomial with the single gamma(2, 1/φ) distribution.Furthermore, since the estimate of mixture parameter π in LB is 0.0384 and φ = 1.1375, the estimated values of corresponding original parameters are α = (1 − π)/ π φ = 22.0166 and θ = 1/ φ = 0.8791.On the other hand, via LRT, we have 2[ ZIB ( π, φ) − BIN (p)] = 2(−191.8077−(−240.4100))=97.2046,whichstrongly support the existence of zero-inflation in hepatitis data.From the above results, one can see the LB model can also be used to account for the zero-inflation in proportional data, which can be demonstrated in top two panels of FigureS1in the supplemental file.Now we further use the regression models to demonstrate the superiority of our proposed model.Since the data involve excessive zeros, we only compare the LB regression model with ZIB regression model.The LB regression model considered here has the form as Y i ∼ LB (m i , π, φ i ) and log φ i = β 0 + log(age i )β 1 and the ZIB regression model has the form as

Table 7 .
[34]lts for the analysis of whitefly data using of ZIB and LB regression models.timelessthansix are discarded.Therefore, the outcomes can be considered as the times of urinary tract infections (UTIs), catheter blockages, and catheter displacements during total six asking times and thus three sets of proportional data with binomial denominator six (6) were obtained.Further, there are 83 (43.0%), 127 (65.5%) and 140 (72.5%)zeros in the 193 outcomes of urinary tract infections (UTIs), 194 outcomes of catheter blockages, and 193 outcomes of catheter displacements, respectively.These high volumes of zeros suggest that there may be the zero-inflation issue and zero-inflated binomial model can be used to fit such data.Ye et al.[34]combined the repeated binary outcomes and applied Wald, LRT, score and a new statistic to test zero inflation for binomial responses.All testing results

Table 8 .
Results of model fitting for catheter blockages in Example 2.