A multivariate zero-inflated binomial model for the analysis of correlated proportional data

In this paper, a new multivariate zero-inflated binomial (MZIB) distribution is proposed to analyse the correlated proportional data with excessive zeros. The distributional properties of purposed model are studied. The Fisher scoring algorithm and EM algorithm are given for the computation of estimates of parameters in the proposed MZIB model with/without covariates. The score tests and the likelihood ratio tests are derived for assessing both the zero-inflation and the equality of multiple binomial probabilities in correlated proportional data. A limited simulation study is performed to evaluate the performance of derived EM algorithms for the estimation of parameters in the model with/without covariates and to compare the nominal levels and powers of both score tests and likelihood ratio tests. The whitefly data is used to illustrate the proposed methodologies.


Introduction
Count and proportional data have been used in a wide variety of fields of study, including education, sociology, psychology, biology, toxicology, epidemiology, insurance, public health, engineering, ecology, econometrics, agriculture, manufacturing and horticulture. When analysing such data, generalized linear models are extensively used. However, these data often present a larger number of zero observations than what would normally arise from the standard count and proportional distributions. When those issues are not properly addressed, the analysis using usual GLMs such as binomial and Poisson models even the over-dispersed GLMs may not provide a good fit and fail to explain the kinds of variation to the actual data. Therefore, statisticians proposed so-called zero-inflated models to fit such data.
The work of zero-inflated models has a long history that could be traced back to at least the 1960s when Cohen [5] and Johnson and Kotz [13] discussed zero-inflated Poisson (ZIP) models without covariates for count data. Later the ZIP models with covariates were studied by Lambert [15] for application to defects in manufacturing. The zero-inflated negative binomial (ZINB) models were studied by Deng and Paul [7] for the count data with both zero-inflation and over-dispersion. Hall [9] and Vieira et al. [23] proposed zeroinflated binomial (ZIB) distributions for modelling proportional data with extra zeros. The zero-inflated beta-binomial (ZIBB) models were also applied by Deng and Paul [7]. Moreover, the score tests for zero-inflation in a generalized linear model were studied by Broek [3] and Deng and Paul [6,7]. Hall and Berenhaut [10] developed the score test for heterogeneity and over-dispersion in zero-inflated Poisson and binomial regression models. Jansakul and Hinde [12], Ridout et al. [21], and Min and Gzado [19] also compared the assessing power for testing zero-inflation among the likelihood ratio test, the Wald test and the score test. Most recently, Song [22] established simultaneous statistical modelling of excess zeros, over/underdispersion, and multimodality. Alevizakos and Koukouvinos [1] used the zero-inflated binomial processes with a double exponentially weighted moving average statistic to monitor quality characteristics of high-yield processes. In such processes where a large number of zero observations exists in proportional data, the ZIB models are more appropriate than the ordinary binomial models. Furthermore, Alqawba and Diawara [2] proposed a Markov zero-inflated count time series models based on a joint distribution through copula functions.
As previously stated, most available studies in the area of zero-inflation are concentrated on univariate distributions. As the more complex data frequently arose in many subjects, statisticians have extended univariate distribution to their multivariate analogues (e.g. Fang et al. [8]). Johnson and Kotz [13] introduced multivariate Poisson distribution for modelling several types of defects. Li et al. [17] studied several possible ways to construct multivariate zero-inflated Poisson (MZIP) distribution. Liu and Tian [18] purposed Type I MZIP distribution with comparison to the MZIP distribution in Li et al. [17]. On the other hand, the multivariate binomial distribution was studied by Krishnamoorthy [14]. Chandrasekar and Balakrishnan [4] obtained some properties and a characterization of multivariate binomial distribution. Furthermore, similar to the univariate case, excessive zeros are not unusual to be expected in multivariate correlated proportional data and the univariate ZIB model is typically not sufficient for modelling such data. In order to fix the over-dispersion problem, fit the multivariate proportional data well, as well as have more accurate results, a new distribution called the 'multivariate zero-inflated binomial (MZIB) distribution' is proposed in this paper. Such a distribution is developed along the approach of symmetric multivariate distributions by Fang et al. [8] and based on the stochastic representation of the univariate ZIB random variable. The random variable with this new multivariate zero-inflated binomial distribution is assumed to be a q-dimensional response vector and is generated by a mixture of a common degenerate distribution with a unit mass point at zero in R q and q independent binomial distributions. Further, the correlations among the components of multivariate zero-inflated binomial variable are intuitively addressed although these binomial components are independent. Moreover, different from the random effects ZIB model, our proposed model can give the explicit expression for the correlation coefficients among the components of multivariate zero-inflated binomial variable.
The remainder of this paper is organized as follows. In Section 2, we propose a multivariate zero-inflated binomial distribution, which is inspired by good distributional properties of Type I MZIP distribution of Liu and Tian [18] and driven by the stochastic representation of univariate ZIB random variable. We then obtain joint probability mass function, joint cumulative distribution function, and mixed moments of the MZIB distribution. The likelihood-based statistical inference about parameters of interest is performed in Section 3. Moreover, the Fisher scoring algorithm and EM algorithm are given for the computation of estimates of parameters on the proposed model with/without covariates. The score tests and the likelihood ratio tests are also developed for assessing the zero-inflation and the equality of all binomial probabilities in this section. In Section 4, simulation studies are performed to evaluate the performance of proposed score tests and the likelihood ratio tests in terms of nominal levels and powers, and of EM algorithm for the computation of estimates of parameters for the proposed MZIB model with/without covariates. The whitefly data is analysed as an application of the proposed methodology in Section 5 with the discussion in Section 6.

A multivariate zero-inflated binomial distribution
. By virtue of the stochastic representation of the univariate ZIP random variable, it could be naturally extended to a multivariate version. In what follows, we give the definition of the multivariate ZIB distribution, which has the vector form of correlation structure with a common Bernoulli variable Z.
bounded by a given upper vector m = (m 1 , . . . , m q ) is said to follow a multivariate ZIB distribution with parameters ω ∈ [0, 1) and π = (π 1 , . . . , π q ) where Z ∼ Bernoulli(1 − ω), X = (X 1 , . . . , X q ) , X r ∼ Binomial(m r , π r ) for r = 1, . . . , q, and (Z, X 1 , . . . , X q ) are mutually independent. The multivariate ZIB distribution is denoted by From the stochastic representation (1), the joint probability mass function of Y ∼ ZIB q (ω, m, π) can be expressed as where ξ ξ ξ has the degenerate distribution at mass 0. The corresponding joint cumulative distribution function is given by where y = (y 1 , . . . , y q ) is a non-negative real vector in R q , y r is the 'floor' function of y r , denoting the largest integer less than or equal to y r and is the regularized incomplete beta function. Note that although the definition of MZIB distribution based on the stochastic representation has the advantage in the derivation of its properties (see below), the limitation of this definition is that the zero-inflated parameter ω should be in the interval [0, 1]. However, it is not necessary to assume ω ∈ [0, 1]. Therefore, we can define the multivariate ZIB distribution based on the probability mass function as follows.
From this definition, one can find that it is possible to take ω less than zero, provided that with equality for zero-truncation. It is zero-deflation if ω is negative. Also, the value of ω should be less than one. (This distribution would degenerate to zero if ω = 1.) It should be pointed out that when ω < 0, the stochastic representation given in (1) does not hold. Since zero-deflation (ω < 0) seldom happens in proportional data and the current paper concentrates on assessing the zero-inflation in multivariate proportional data, we mainly consider the case ω ≥ 0, where (1) can always be used to investigate some properties of MZIB distribution and to make the statistical inference for MZIB model. We now derive the expressions for the moments using this representation. We note that the resulting formulas continue to hold for the case ω < 0, which can be checked using the probabilistic representation (2) for MZIB distribution. Now from (1), the mixed moments for Y ∼ MZIB q (ω, m, π ) can be obtained as follows: where t 1 , . . . , t q 0. Also, by setting mπ = (m 1 π 1 , . . . , m q π q ) , we have Therefore, ωm r π r m s π s and the correlation coefficient for Y r and Y s is In particular, when m r = m s = m and π r = π s = π , we obtain , r = s.
It is worthy of note that our proposed model can address the correlations among all components of multivariate zero-inflated binomial variable and give an explicit expressions of correlation coefficients (see (3)). Furthermore, from the expression of the correlation coefficient, one can see that there exist positive (negative) correlations among the components of multivariate zero-inflated binomial variable Y if the parameter ω is greater (less) than zero although the components of base variable X are independent. The correlation is induced by the imposition of same zero mass probability.

Likelihood based inferences for MZIB model
In this section, we consider the statistical inferences for MZIB model. Since the current research focuses on the statistical inference for the zero-inflation of multivariate proportional data with excessive zeros, it is assumed that the value of zero-inflation parameter ω be in the unit interval [0, 1) and all statistical inferences on MZIB model be based on the stochastic representation (1) in the following sequels. Let Y 1 , . . . , Y n be independent random vectors and . . , m iq ) are the known vectors of binomial denominators, π i = (π i1 , . . . , π iq ) are the unknown vectors of binomial probabilities and ω i are the unknown zero-inflated parameters. Now suppose y i = (y i1 , . . . , y iq ) is the realization of the random vector Y i , then the observed data and associated binomial denominators would be represented by y obs = {y 1 , . . . , y n } and m obs = {m 1 , . . . , m n }. Furthermore, for convenience, let y ·r = n i=1 y ir , m ·r = n i=1 m ir for r = 1, . . . , q and y i· = q r=1 y ir , m i· = q r=1 m ir for i = 1, . . . , n. Based on the joint probability mass function of Y given by (2), the likelihood function for the parameters (ω, π ) = (ω 1 , . . . , ω n , π 1 , . . . , π n ) can be obtained as By reparameterization, let Then the likelihood function for (γ , π ) = (γ 1 , . . . , γ n , π 1 , . . . , π n ) is so that the log-likelihood function is ln m ir y ir + y ir ln π ir 1 − π ir + m ir ln(1 − π ir ) I(y i = 0).

MLEs of parameter for MZIB model without covariates
Based on the discussion above, we first derive the maximum likelihood estimates of parameters for MZIB model without covariates. In this case, the zero-inflated parameters γ i and the probabilities π i are held fixed as γ and π . Hence the log-likelihood (5) can be simplified as ln m ir y ir + y ir ln π r 1 − π r + m ir ln(1 − π r ) I(y i = 0).
The Fisher scoring algorithm is a common method to calculate maximum likelihood estimation. Comparing with EM algorithm, it could have better stability even in multiparameter cases. In addition, the expected Fisher information matrix should always be positively definite, when the model is not over-parameterized (Lauritzen [16]). However, the Fisher scoring algorithm requires more complex calculation than EM algorithm for deriving the expected Fisher information matrix. Moreover, the expected Fisher information matrix could not be tractable for the complicated models. Since the estimation under multivariate zero-inflated binomial distribution is multi-parameter case, the Fisher scoring algorithm should be studied. Now, based on the equation (6), the score vector is for r = 1, . . . , q. In order to apply Fisher scoring algorithm, the Hessian matrix should be obtained first as follows: Then, the Fisher information matrix J(θ ) = −E∇ 2 (γ , π | y obs ) is The derivation for the formulas of J γ γ , J γ π , J πγ and J ππ is given in the supplemental file. Now let θ (0) be the initial values of the MLEsθ of θ = (γ , π ) . If θ (t) denote the tth approximation ofθ, then the (t + 1)th approximation can be computed by is

MLEs of parameters via the EM algorithm
In this subsection, we will develop the EM algorithm to compute the MLEs of parameters in the proposed MZIB model. Although the Fisher scoring algorithm possesses quadratic convergence, it may not guarantee the MLEs of ω and π r , r = 1, 2, . . . , q to be included in the unit interval [0, 1). When the initial value (γ (0) , π (0) ) of Fisher scoring algorithm is sufficiently near (γ ,π ), they converge very fast. However, it is sensitive to initial values under MZIB distribution. When the chosen initial value of (γ (0) , π (0) ) is far from (γ ,π ), they might not converge. Therefore, the expectation-maximization (EM) algorithm is given for the calculation of MLEs in the MZIB model. The EM algorithm is a popular tool for estimating maximum likelihood estimation in joint statistical models by iterating between E-step and M-step. The E-step represents the expectation of the log-likelihood. The M-step computes parameters maximizing the expected log-likelihood found on the E-step. Then, the unobserved latent variable is determined by these estimated parameters in the next E-step. For for r = 1, 2, . . . , q. We denote the latent/missing data by are the realizations of Z i and X ir , respectively. Thus, the complete-data likelihood function is given by and the complete-data log-likelihood function is proportional to The M-step is to calculate the complete-data MLEs, which are given by (12) and (13) with their conditional expectations: and where J = {i : y i = 0} and The detail for deriving (14) and (15) is given in A.2 of supplemental file. Note that the latent variables (11) are independent Bernoulli random variables and binomial random variables, respectively. Thus, the left-hand side of (14) must be less than or equal to n, and the left-hand side in (15) must be between 0 and n i=1 m ir = m ·r . In other words, the EM algorithm (12)-(15) can guarantee that the MLEs of {ω, {π r } q r=1 } fall within the unit interval [0, 1), resulting in a clear statistical interpretation for these parameters in the distribution. This is advantage of the EM algorithm over the Fisher scoring algorithm. However, it is worthy of note that the EM algorithm is based on the stochastic representation (1), which intuitively assumes that the zero-inflated parameter ω is in the unite interval [0, 1). It does not work for the case of zero-deflation. Now let (ω,π 1 , . . . ,π q ) denote the MLEs of (ω, π 1 , . . . , π q ) obtained via the EM algorithm (12)- (15). Actually, based on the square root of the diagonal elements of the estimated inverse Fisher information matrix [J(ω,π 1 , . . . ,π q )] −1 , the Wald-type confidence intervals for the parameters can be obtained. However, the zero-inflated parameter ω and the binomial probabilities π 1 , π 2 , . . . , π q should be restricted within the unit interval and thus some upper (or lower) limits of these confidence intervals may be larger (or less) than 1 (or 0), resulting in useless confidence intervals. Instead of the Wald-type methods, the bootstrap approach can be used to compute the bootstrap confidence interval for any component of (ω, π 1 , . . . , π q ). At first, the independent sample y * 1 , y * 2 , . . . , y * n from the distribution MZIB q (ω, m,π ) can be generated, whereω andπ are the MLEs of ω, and π based on the original sample. Then, based on the generated sample y * 1 , y * 2 , . . . , y * n , the MLE (ω * ,π * 1 , . . . ,π * p ) of (ω, π 1 , . . . , π p ) can be calculated. Independently repeating this procedure G times, the G MLE's {(ω * g ,π * 1g , . . . ,π * pg )} G g=1 of (ω, π 1 , . . . , π p ) can be obtained and thus the confidence intervals of (ω, θ 1 , . . . , θ p ) can be constructed by [ω L , ω U ], [π 1L , π 1U ], . . . , [π qL , π qU ], where ω L , π 1L , . . . , π qL and ω U , π 1U , . . . , π qU are the 100(α/2)% and 100(1 − α/2)% percentiles of {(ω * g ,π * 1g , . . . ,π * pg )} G g=1 , respectively.

The formulation of MZIB model with covariates
Again, let Y 1 , . . . , Y n be independent random vectors and are the known vectors of binomial denominators, π i = (π i1 , . . . , π iq ) are the unknown vectors of binomial probabilities and ω i are the unknown zero-inflated parameters. Further, let v i and u i be the covariates associated with the zeroinflated parameters ω i and binomial probabilities π i = (π i1 , . . . , π iq ) (i = 1, 2, . . . , n), respectively. Now suppose y i = (y i1 , . . . , y iq ) is the realization of the random vector Y i , then the observed data and associated binomial denominators would be represented by y obs = {y 1 , . . . , y n } and m obs = {m 1 , . . . , m n }. To investigate the relationship between the parameters (ω, π ) and covariates v i and u i , we consider the following regression model: are not necessarily identical covariate vectors associated with the subject i; α = (α 0 , α 1 , . . . , α p ) and β r = (β r0 , β r1 , . . . , β rp ) are corresponding regression coefficients. The primary purpose of this section is to estimate the parameter vector θ = (α , β 1 , . . . , β q ) .

MLEs via the EM algorithm embedded with Fisher scoring algorithms at each M-step
Now, the complete-data likelihood function in Section 3.1.2 now becomes and the complete-data log-likelihood function is proportional to The first and negative second partial derivatives of the complete-data log-likelihood function are given by π 1r ), . . . , m nr π nr (1 − π nr )]. Note that J com (α) is actually the complete-data Fisher information matrix associated only with the parameter vector α and the covariate matrix v and J com (β r ) is actually the complete-data Fisher information matrix associated only with the parameter vector β r (r = 1, 2, . . . , q) and the covariate matrix u, respectively, since they depend on neither the observed responses nor the latent/missing data. Now, the M-step is to separately calculate the MLEs of α and β via two Fisher scoring algorithms as follows: The E-step is to replace the latent variables z, {x r } q r=1 in (16) and (17) by their conditional expectations: and where y r = (y 1r , . . . , y nr ) , Note that (18) and (19) can be derived in the same way as (14) and (15). Now, letα andβ r , r = 1, . . . , q are the estimates of the parameters α, β r , r = 1, 2, . . . , q, respectively. Then the asymptotic covariance matrices forα andβ r , r = 1, . . . , q can be obtained aŝ cov(α) = J −1 com (α),ĉov(β r ) = J −1 com (β r ), r = 1, 2, . . . , q and thus the corresponding confidence intervals for the components of θ can be constructed by using the Wald-type method.

Hypothesis testing in MZIB model without covariates
In what follows, based on the likelihood methods we derive score test statistics S 1 and S 3 and likelihood ratio test statistics S 2 and S 4 . S 1 and S 2 are used to test the presence of zero-inflation in multivariate binomial model and S 3 and S 4 are used to test the equality of probabilities for all components in the multivariate zero-inflated binomial model. Under the marginal model, the corresponding two-sided hypotheses are (i) H 0 : γ = 0 versus H a : γ = 0; (ii) H 0 : π 1 = · · · = π q versus H a : π r = π s for at least one pair r = s.

Tests for zero-inflation in MZIB model
We should first test the presence of zero-inflation for the multivariate binomial data before the multivariate zero-inflated binomial model is used to fit such data. Based on the score test method, the test statistic for testing the hypotheses H 0 : γ = 0 versus H a : γ = 0 is given by The details of the derivation for S 1 are given in A.3 of supplemental file. Under the null hypothesis H 0 : γ = 0, the test statistic S 1 has an approximately χ 2 distribution with one degree of freedom. The corresponding p-value is given by where s 1 is the observed value for S 1 . When p 1 < α, we reject the null hypothesis H 0 at the α level of significance. Otherwise, we fail to reject H 0 . Now for the purpose of comparison, we also give the likelihood ratio test (LRT) for testing the zero-inflation in MZIB model. The LRT statistic has the following form: . . , y ·q /m ·q ) are the MLE's of π under the null hypothesis and (γ ,π ) are the unconstrained MLE of (γ , π), which can be obtained via the Fisher scoring algorithm or EM algorithm given in Sections 3.1.1 and 3.1.2.
Under H 0 , the LRT statistic S 2 has an approximately chi-squared distribution with one degree of freedom and the corresponding p-value can be computed as where s 2 is the observed value for S 2 .
Note that the advantage of the score test is that the parameters are estimated only under the null hypothesis not the alternative hypothesis and thus the score test statistic S 1 has a closed form, which results in easy computation and application. However, as one can see from the simulation results in Section 4, the score test exhibits some limitation for the application even if the dimension of binomial random vector is moderate. Therefore, the score test is recommended for testing the zero-inflation in MZIB model only for the lower dimensions.

Tests for the equality of probabilities in MZIB model
From Section 2, we know that there exists a correlation between any two components for the multivariate zero-inflated binomial model. Another question of interest for the multivariate binomial model is whether all or part components of multivariate binomial distribution share the same binomial probability. Therefore, we develop an approach to testing the equality of probabilities for all/part components in the multivariate zero-inflated binomial model. The null and alternative hypotheses to be tested are H 0 : π 1 = · · · = π q versus H a : π r = π s for at least one pair r = s.
Same as in Section 3.3.1, the score test statistic can be developed for testing the equality of probabilities for all components in the multivariate binomial variable. It has the following form: whereγ andπ are the MLEs of γ and π, respectively, under the null hypothesis H 0 : π 1 = · · · = π q .γ andπ can be obtained via the Fisher scoring algorithm from the following maximum likelihood equations under the null hypothesis: or via the EM algorithm which can be derived in the similar way given in Section 3.1.2. Furthermore, the score function U(γ , π) of (γ , π) has the form for r = 1, . . . , q. The expected information matrix I(γ , π ) with parameters γ and π = (π 1 , . . . , π q ) is where under the null hypothesis π 1 = π 2 = · · · = π q = π , and U(γ ,π ) and I(γ ,π) are the estimated values of the score function U(γ , π ) and the expected information matrix I(γ , π) at γ =γ and π =π =π · 1 q , respectively.
Under the null hypothesis, the score statistic S 3 has an approximately χ 2 distribution with q−1 degrees of freedom and the corresponding p-value can be computed as where s 3 is the observed value for S 3 . Similar to Section 3.3.1, the likelihood ratio method can also be used to test the equality of parameters in MZIB model. The LRT statistic is whereγ 0 ,π 0 are the estimators of parameters γ , π under the null hypothesis andγ ,π are the estimates of parameters γ , π under the alternative hypothesis. Although the MLEs for the parameters γ and π under both hypotheses do not have closed forms,γ 0 ,π 0 can be computed in the same way given in the derivation of S 3 andγ ,π can be calculated via the Fisher scoring algorithm or EM algorithms given in Sections 3.1.1 and 3.1.2 under the alternative hypothesis. Further, under the null hypothesis, the LRT statistic S 4 in (25) follows an approximately χ 2 distribution with q−1 degrees of freedom. The corresponding p-value is given by where s 4 is the observed value for S 4 . Moreover, if the null hypothesis H 0 in (21) is rejected, then the following hypotheses could be tested: for the k 1 th, k 2 th,. . . ,k q * th components (q * < q). The likelihood ratio test statistic for testing the hypotheses in (26) is given by where (γ * ,π * ) are the maximum likelihood estimates of (γ , π) under H * 0 : π k 1 = π k 2 = · · · = π k q * and can be computed via the Fisher scoring algorithm or EM algorithm. The parameters (γ ,π ) are the unconstrained maximum likelihood estimates of (γ , π ) and can be computed via the same algorithms. Moreover, under the null hypothesis H * 0 the test statistic S * 4 has an approximately χ 2 distribution with q * − 1 degrees of freedom. The corresponding p-value is where s * 4 is the observed value for S * 4 .

Simulation study
In this section, a limited simulation study is carried out to evaluate the performance of the proposed statistical methods in Section 3 for the multivariate ZIB distribution. We first examine the accuracy of point estimates and confidence interval estimates for different parameter settings in the proposed multivariate ZIB models with/without covariates via simulation studies. Next, we establish the validity of four proposed test statistics under a finite sample situation. In terms of the nominal levels and powers, the performance of score test statistics and LRT statistics for the presence of zero-inflation and the equality of probabilities for all components in the multivariate ZIB models are investigated. All simulation studies on test methods are based on the multivariate zero-inflated binomial distribution without regressors to keep the model simple and the study more focused.

Accuracy of point estimates and interval estimates for MZIB model without covariates
Note that the proposed q-dimensional multivariate ZIB distribution has q + 1 parameters. We expect that the proposed distribution can yield better data fitting without sacrificing statistical accuracy too much. To evaluate the accuracy of point estimates and confidence intervals for zero-inflated parameter ω and the probability parameters π 1 , . . . , π q in the multivariate ZIB model without covariates, we consider five cases for the dimension: q = 2, 3, 4, 5 and 6. The sample size is chosen as n = 50, 100. Parameter configurations can be found in Table 1. First, the procedure for generating the random number ω, m, π) is given as follows:   ω, m, π ) for i = 1, . . . , n.
Calculate the MLEs from the generated sample via EM algorithm (12)- (15) and the 95% bootstrap confidence intervals with repeating times G = 1000 for the parameters ω, π 1 , π 2 , . . . , π q . Next, the 1000 samples are independently generated and the corresponding 1000 EM MLEs and 1000 bootstrap confidence intervals of ω, π 1 , π 2 , . . . , π q are obtained. Further, in Table 2, MLE is the average of the 1000 estimates via the EM algorithm (12)-(15); width and CP of the confidence intervals are the average width and coverage proportion of 1000 bootstrap confidence intervals. As seen in Table 2, the bias are small and the MLE's are very close to the corresponding true values of parameters and the coverage probabilities are all around 0.95, although the coverage probabilities of zeroinflated parameter ω is a little less than the nominal level for q = 5 and 6 with n = 50. We also conducted the simulation study for moderate number of dimension (e.g. q = 10)(the results do not be reported here). The obtained estimates of parameters, width and CP of confidence intervals are consistently close to the true values of parameters and nominal confidence coefficient, which demonstrates the proposed EM algorithm has very good performance even for a moderate dimensional number of multivariate binomial data.
The sample size is selected as n = 100, 300 and 500. Then via EM algorithm (16)- (19) and Wald-type methods the MLEs for the parameters α and β 1 , . . . , β q and its MSEs, the 95% confidence intervals and its widths can be calculated from the generated samples. Further, similar to the case in Table 2, with repeating times G = 1000, the average values of biases and MSEs of MLEs, widths and coverage probabilities of confidence intervals for the parameters α and β 1 , . . . , β q are given in Table 3 with q = 2, 3. The simulation results with q = 4, 5 can be seen in Table A1 of supplemental file.
From the results given in Table 3 and in Table A1 of the supplemental file, one can see that the proposed EM algorithm for the estimation of regression parameters has a good performance. The biases are small and the MLEs are very close to the corresponding true values of parameters and the coverage probabilities are all around 0.95, although the algorithm shows a little liberty for the estimated parameter α with q = 2 and n = 100.

Tests for zero-inflation in MZIB model
In this subsection, the performance of both the score test and the likelihood ratio test for testing zero inflation in the multivariate ZIB distribution is conducted by a simulation study. All the simulations in this subsection are performed with G = 10,000 replications and covariates are not considered for the purpose of simplicity. The dimensions q = 2, 3, 4 are considered for multivariate zero-inflated binomial distribution with sample sizes of n = 50, 100, 200, 300, 400, 500. For assessing the powers of proposed test statistics, the zero-inflation parameter is designed to be ω = 0.0, 0.01, 0.05, 0.1, 0.2. Since we are not interested in the inference of binomial probabilities π and binomial denominators m, we first generate two vectors π and m randomly. For a given set of parameters (n, ω, m, π ), the samples with multivariate ZIB distribution ZIB q (ω, m, π ) could be generated in the same procedure as that in Section 4.1 and the estimates of parameters (ω, π ) for multivariate zero-inflated binomial distribution under LRT alternative hypothesis are computed by EM algorithm.
The simulation results are summarized in Table 4 and in Table A2 of the supplemental file. The results in Table 4 are computed by using the binomial probabilities π and m, which are randomly generated from 0.05 to 0.20, and 5 to 15, respectively. The results in Table A2 of the supplemental file are computed by using the binomial probabilities π and m, which are randomly generated from 0.01 to 0.15, and 5 to 20, respectively. These results show the comparison between score test statistic S 1 and LRT statistic S 2 side by side for both controlling nominal level and powers. Both test statistics hold the nominal level at α = 0.05 well. The empirical powers of both tests for detecting zero-inflation increase as the dimensions in multivariate ZIB distribution and the zero-inflated parameter increase. We also perform the simulation study for multivariate zero-inflated binomial distribution with q = 5+ dimensions. However, not like the EM algorithm for the estimates of parameters, even for a very small zero-inflation parameter ω (ω = 0.01) both tests show great  powers for testing zero-inflation, but both tests do not hold nominal level well for 5+ dimensional multivariate zero-inflated proportional data. Sometimes both the score test and the likelihood ratio test require a fair size of sample for calculating the inverse of the expected information matrix. Overall, there is not much difference in power between the score test and likelihood ratio test.

Test for equality of probabilities in MZIB model
In this subsection, the performance of both score test and likelihood ratio test for testing equality of multivariate zero-inflated binomial distribution parameters π 1 = · · · = π q is conducted by a simulation study. All the simulations in this subsection are performed with G = 10,000 replications and covariates are not considered for the sake of simplicity. Only one zero-inflation parameter  parameters (n, ω, m, π), the multivariate zero-inflated binomial data could be generated by a similar procedure given in Section 4.1.
The simulation results are summarized in Figure 1 and in Table A3 of the supplemental file. Figure 1 displays the comparisons of performance for empirical levels and empirical powers at the nominal level α = 0.05 between the score test (solid line) and the likelihood ratio test (dotted line) for testing the equality of parameters π based on two-dimensional and three-dimensional zero-inflated binomial models. Similar conclusions can also be obtained from Table A3 in the supplemental file.
From the simulation results, both the score test and the likelihood ratio test maintain the nominal level α = 0.05 well. The powers of both tests for detecting equality of multivariate zero-inflated binomial parameter π are very close in both two-dimensional and three-dimensional simulations with no influence of sample size. Also, both tests show great detecting power even with a very small difference in π . Overall, there is no difference between the score test and likelihood ratio test in testing the equality of π in the multivariate zero-inflated binomial distribution.

A real example
In this section, we illustrate the application of the proposed multivariate zero-inflated binomial model to whitefly data. van Iersel et al. [11] studied the purpose of controlling silver leaf whiteflies by using a subirrigation system. The study was designed to determine the effectiveness of controlling silver leaf whiteflies on poinsettia with imidacloprid, which was delivered by a subirrigation system. Imidacloprid is a resilient and powerful chemical (e.g. Natwick et al. [20]), that has low toxicity to mammals, and is used to control silver leaf whiteflies on poinsettia. At the first week of this experiment, researchers placed m adult whiteflies (here, m is considered as the binomial denominators with range 6-15, mean = 9.5 and SD = 1.7) in clip-on leaf cages attached to one leaf per plant and then recorded the number of surviving whiteflies 2 days later, which is considered as the response variable. To measure reproductive inhibition, they removed the fly cages after obtaining the survival count but marked the position of each cage. In the coming week, they placed m adult whiteflies in clip-on leaf cages attached to one leaf on the same plant and recorded the number of surviving whiteflies. This study lasted for consecutive 12 weeks on 54 plants. Therefore, the data can be considered to consist of the 12-dimensional binomial variables, that is, the observed data can be expressed as    Table A4 of the supplemental file. It can be seen that the percentage of zeros in this data set is greater than 50%. Also, Figure A1 in the supplemental file shows the frequency y ij of alive whiteflies in 3D image for whitefly data. Now the proposed multivariate zero-inflated binomial model can be used to analyse the whitefly data. At first, the bivariate proportional dataset can be generated from week r and week s (r, s = 1, 2, . . . , 12 and r = s), denoted by D rs = {(y ir , y is ; m ir , m is ), i = 1, 2, . . . , n}. Next the generated data set D rs can be analysed using bivariate zero-inflated binomial model and the MLEsω,π r andπ s of parameters ω, π r and π s can be computed via EM algorithm. Then, the estimated correlation coefficientρ (i) rs for each observation (y ir , y i,s ; m ir , m is ) in bivariate binomial data can be computed from (3) with ω =ω, π r =π r and π s =π s : rs =ω m irπr m isπs (ωm irπr + (1 −π r ))(ωm isπs + (1 −π s )) , r = s, i = 1, 2, . . . , n.
At last, the estimated correlation coefficient for bivariate zero-inflated binomial data set D rs isρ which is the mean of estimated correlation coefficients for all observations in bivariate zero-inflated binomial data. The results for the correlation coefficients among the components of 12-dimensional binomial variables are presented in Table 5. From the results in Table 5, there certainly exist the positive correlations between any two components of the 12 binomial variables, which are induced from the existence of zero-inflation for the bivariate binomial variables. Next, since the proposed test statistics work well only for the dimension q ≤ 4, for the purpose of illustrating the proposed model, the vector of 12 variables is partitioned into four vectors of three variables and thus obtain four small multivariate proportional data sets denoted weeks 1-3, weeks 4-6, weeks 7-9 and weeks 10-12 by D 1 , D 2 , D 3 and D 4 , respectively. Due to the unbalanced design, the observations on the plants with losing are omitted and the data sets D 1 D 2 , D 3 and D 4 can be expressed as 10 , y i, 11 , y i, 12 ; m i,10 , m i, 11 , m i,12 ), i = 1, 2, . . . , 51}.
The following three-dimensional zero-inflated binomial model is used to analyse the data: Both the score test and the likelihood ratio test are applied to test the existence of zero inflation and the equality of the binomial probabilities for the three components in each dataset. If there exists the zero-inflation, the MLEs of zero-inflated parameters and binomial probabilities are also computed using the EM algorithm for four data sets D 1 , D 2 , D 3 and D 4 . The results of data analysis for these data sets are given in Table 6.
From Table 6, one can see that the values of score test statistics and LRT statistics for testing the presence of zero-inflation are very large and thus the p-values for these test are very small (near to zero), which show that there exist the zero-inflation in four threedimensional proportional data sets and thus a positive correlation among the components of these multivariate proportional variables in each data set. Now, by using the bootstrap method given in Section 3.1.2, the 95% confidence intervals of zero-inflation parameter ω for four data sets are computed as (0.0196, 0.1765), (0.2264, 0.4906), (0.1666, 0.4259) and (0.0588, 0.2545), respectively. These confidence intervals also confirm the existence of zero-inflation and positive correlation among the components in four induced data sets. Further, there exists enough evidence to show that binomial probabilities of three components are not equal for all data sets D 1 , D 2 , D 3 and D 4 at the level of significance α = 0.05 since all p-values of 8 statistics for testing the equality of binomial probabilities are less than 0.05. However, the p-values of score tests and LRTs test in weeks 7-9 are 0.03372 and 0.03377, respectively, and thus the proportional probabilities in the components of weeks 7-9 could be equal at the significant level α = 0.025. Also, the values of score test and LRT for testing the equality of probabilities are almost same in each of the four data sets. However, the score test is more sensitive to the zero-inflation than the likelihood ratio test for testing the existence of zero-inflation since the values of score test are much larger than that of the likelihood ratio test for testing zero-inflation. For the purpose of illustration, the original data are again partitioned into three data sets D 5 for weeks 1-4, D 6 for weeks 5-8, D 7 for weeks 9-12. Hence, each data set consists of four-dimensional binomial variables with four-dimensional binomial denominators. The four-dimensional ZIB model with the corresponding test statistics and EM algorithm is applied to conduct the same analysis for these three four-dimensional binomial data sets as that for four three-dimensional binomial data sets. The results are given in Table 7. The results in Table 7 are similar to that in Table 6. Since all values of statistics for testing the existence of zero-inflation are very large and thus the corresponding p-values are near to zero, the data sets D 5 , D 6 and D 7 present the zero-inflation and there exist positive correlations among the components in each of three induced four-dimensional binomial data sets. Also, the p-values of score test S 2 and LRT S 4 for data sets D 5 and D 7 are less than 0.01, which indicates that binomial probabilities of four components could be not equal even at the significant level α = 0.01 in D 5 and D 7 . However, the binomial probabilities for weeks 5-8 are probably equal since the p-values of score test S 2 and LRT S 4 are 0.1157 and 0.1186, respectively. Further the 95% bootstrap confidence intervals for zero-inflation parameter ω in three induced data sets D 5 , D 6 and D 7 are (0.0200, 0.2000), (0.1667, 0.3889) and (0.0392, 0.2157), respectively. This also shows that there exist the zero-inflation and positive correlation among the components in three induced four-dimensional proportional data sets D 5 , D 6 and D 7 .
We also computed the values of score tests and likelihood tests for all possible data sets induced from the original data, such as two-dimensional data sets 1-2 weeks, 2-3 weeks,. . . , 11-12 weeks, three-dimensional data sets 1-3 weeks, 2-4 weeks,. . . , 10-12 weeks,. . . , 11dimensional data sets 1-11 weeks and 2-12 weeks, and 12-dimensional data set 1-12 weeks (whole data). Although all tests show the strong existence of zero-inflation for these data sets, the test results may not be applicable because the proposed test statistics are not reliable for the multivariate proportional data with 5+ dimensions. However, we may use the bootstrap confidence interval to test the zero-inflation in multivariate binomial data with the moderate dimension number. Now we partition the original data into two sixdimensional data sets D 8 and D 9 and then analyse them by using the proposed MZIB model. The results are presented in Table 8.
From the results in Table 8, the values of score test statistic for testing the presence of zero-inflation are very large but unreliable. Note that since the EM algorithm is based on the stochastic representation (1) of Definition 2.1 and thus the value of zero-inflation parameter is implicitly assumed to be in the unit interval [0, 1), the smallest values of lower limits in the confidence intervals are at least zero. Further based on Definition 2.1, the hypotheses for testing the presence of zero-inflation should be upper-tailed hypothesis H 0 : ω = 0 versus H 1 : ω > 0. From EM algorithm, the 95% bootstrap confidence intervals for zero-inflation parameter ω in two induced six-dimensional data sets D 8 and D 9 are (0.0000, 0.1200) and (0.0000, 0.0980) and the 90% bootstrap confidence intervals are (0.0200, 0.1200) and (0.0000, 0.0784), respectively. By comparing zero with the lower limit of confidence intervals, the null hypothesis is rejected in data set D 8 at the level α = 0.05 but is not rejected at the level α = 0.025. However, the null hypothesis is not rejected at both levels α = 0.05 and 0.025 in data set D 9 . This means that there is no evidence for the presence of zero-inflation in data set D 9 at α = 0.05. Furthermore, by using the bootstrap method, the p-values are approximately 0.045 in D 8 and 0.126 in D 9 . Based on these p-values, we can also get the same conclusions as above.

Concluding remarks
A new model for multivariate proportional data, called 'multivariate zero-inflated binomial model' has been proposed. The model introduced a common zero-inflated parameter for all components of multivariate binomial variable, which automatically address the correlation among the components. This model can also be regarded as an extension of the widely discussed univariate zero-inflated binomial model in proportional data. The Fisher Scoring algorithm and EM algorithm are derived for the computation of the estimates of parameters in the proposed multivariate model with/without covariates. Score tests and likelihood ratio tests are also proposed to detect the existence of zero inflation and the equality of the binomial probabilities in the multivariate binomial model. The simulation results demonstrate that the proposed EM algorithm has excellent performance for the computation of MLEs of parameters even for the moderately large dimension number in the MZIB model with/without covariates, and four test statistics maintain the nominal level well for the small dimensional numbers. However, the proposed test statistics do not work well if the dimension numbers of multivariate binomial variables are greater than 5, which is the limitation of these tests. The whitefly data is used to demonstrate the proposed model and inferential methods for the multivariate binomial data. The results show the existence of correlation and zero inflation and the equality of the binomial probabilities in the subsets of whitefly data. However, it is very unlikely that all components of the binomial random vector are zero-inflated in the same way and/or by the same amount as measured by zero-inflated parameter, even for moderate number of dimension. Therefore, the proposed model has the limitation for the application. The solution to this question is to introduce more zeroinflated parameters for the components of multivariate zero-inflated binomial variable. Such model has been proposed for multivariate count data with excessive zeros. This model can obviously be extended to analyse the multivariate proportional data with excessive zeros and the corresponding research on such model can be done in the future. Moreover, the proposed score test method had a shortcoming with a large number of dimensions for multivariate proportional data. The main reason is that the denominator of score test statistics is very small, which results in the large variability of score test statistics. In fact, due to the same reason, the score test does not work well even for univariate binomial variable if the binomial denominator is much large. We are considering a modification of score test. For example, the modified score test method may work well for the moderate dimension of multivariate binomial variables. Furthermore, in addition to zero-inflation, there often exists the over-dispersion or under-dispersion in the count/proportional data. Such dispersion should be investigated and the multivariate zero-inflated beta-binomial (MZIBB) model could be used to fit such data with over-dispersion or under-dispersion. We will be doing such research in the future.