An R package AZIAD for analysing zero-inflated and zero-altered data

We introduce the newly developed R package AZIAD for analysing zero-inflated or zero-altered data. Compared with existing R packages, AZIAD covers a much larger class of zero-inflated and hurdle models, including both discrete and continuous cases. It provides more accurate parameter estimates, along with the corresponding Fisher information matrix and confidence intervals, and it achieves significantly larger power for model identification and selection. To facilitate potential users, in this paper we provide detailed formulae and theoretical justifications for AZIAD, as well as new theoretical results on zero-inflated and zero-altered models. We use simulation studies to demonstrate the advantages of AZIAD functions over existing R packages, and provide real data examples and executable R code to illustrate how to use our package for sparse data analysis and model selection.


Introduction
Sparse or zero-inflated data arise frequently in a rich variety of scientific disciplines including microbiome [1,2], gene expression [3], health care [4], insurance claims [5], security [6], and more. Modeling sparse data is very challenging due to the high proportion of zero values and the skewness of the distribution. Zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), Poisson hurdle (PH), and negative binomial hurdle (NBH) models have been widely used to model sparse data [2,7,8].
On one hand, more and more zero-inflated models have been proposed under different circumstances. On the other hand, it becomes more and more difficult for researchers and practitioners to choose the most appropriate model for their sparse data. Some partial comparisons among existing models have been done for certain sparse data. For example, [9] recommended ZINB and NBH models for microbiome data after comparing the performance of Poisson, ZIP, PH, NB (negative binomial), ZINB, and NBH models, while [7] indicated that two newer models, the zero-inflated beta negative binomial (ZIBNB) and beta negative binomial hurdle (BNBH) models, are more appropriate for microbiome data examples.
The existing literature and packages on analysing zero-inflated data, including the very recent work [7,23], are still limited for three reasons. First, only a small number of zero-inflated models were under consideration at the same time; discrete and continuous baseline distributions were typically considered separately. Secondly, the accuracy of the parameter estimates and the power of the tests still have room for improvement. Thirdly, the confidence intervals of the parameter values were seldom provided along with their estimates, even though they are critical for model diagnostics, such as testing whether inflation or deflation of zeros exists in the data.
In this paper, we introduce our newly developed R package named AZIAD, for Analyzing Zero-Inflated and Zero-Altered Data, available from the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/package=AZIAD). Compared with the existing R packages for similar purposes, our AZIAD achieves the following significant improvements: (1) we cover not only discrete baseline distributions including Poisson, geometric, NB, BB, and BNB, but also commonly used continuous baseline distributions including normal (or Gaussian), log-normal, half-normal, and exponential distributions, along with their zero-inflated and zero-altered (also known as hurdle) models; (2) by more precise specifications of solution forms under different situations and of the lower/upper bounds needed for numerical optimization, our R functions provide more accurate maximum likelihood estimates (MLEs) even under extreme circumstances, which in turn yields more power when identifying the most appropriate zero-inflated or hurdle model for a given data set; (3) following [24], we provide not only MLEs for parameters, but also the Fisher information matrix and the corresponding confidence intervals for estimated parameters, which will help potential users from biological sciences, insurance, health studies, security, ecology, etc., to identify the most appropriate probabilistic model for their dataset and make robust statistical inference based on it.
The rest of this paper is organized as follows. In Section 2, we not only review relevant theoretical results from [24], but also provide new formulae for the Fisher information matrices of the zero-inflated (ZI) and zero-altered (ZA, or hurdle) models with geometric, BB, BNB, normal, log-normal, half-normal and exponential baseline distributions, as well as ZINB. In Section 3, we summarize and use numerical examples to illustrate the improvements of our package over existing R packages, as well as how to identify the most appropriate models for real data. We interpret and discuss indistinguishable pairs of distributions in Section 4.

MLE and Fisher information for zero-altered or hurdle models
Zero-altered models, or hurdle models, have been widely used for modelling data with an excess or deficit of zeros (see, for example, [2], for a good review). A general hurdle model consists of a baseline distribution and a component generating the zeros. The baseline distribution can be fairly general with distribution function f_θ(y) and its zero-truncated version f_tr(y | θ) = [1 − p_0(θ)]^{−1} f_θ(y), y ≠ 0, where θ is the model parameter(s), and p_0(θ) = P_θ(Y = 0) is the probability that Y = 0 under the baseline distribution. Following [24], the distribution function of the corresponding hurdle model can be written as

f_ZA(y | φ, θ) = φ · 1{y = 0} + (1 − φ) f_tr(y | θ) · 1{y ≠ 0},   (1)

where φ ∈ [0, 1] is the weight parameter of zeros. If the baseline distribution is discrete with a probability mass function (pmf) f_θ(y), such as the Poisson, negative binomial (NB), geometric (Ge), beta binomial (BB) and beta negative binomial (BNB) distributions, then both f_tr(y | θ) and f_ZA(y | φ, θ) are pmfs as well. The corresponding zero-altered or hurdle models may be called the zero-altered Poisson (ZAP) or Poisson hurdle (PH), zero-altered negative binomial (ZANB) or negative binomial hurdle (NBH), zero-altered geometric (ZAGe) or geometric hurdle (GeH), zero-altered beta binomial (ZABB) or beta binomial hurdle (BBH), and zero-altered beta negative binomial (ZABNB) or beta negative binomial hurdle (BNBH) models, respectively.
If the baseline distribution is continuous with a probability density function (pdf) f_θ(y), such as the Gaussian (or normal), log-normal, half-normal and exponential distributions, then p_0(θ) = 0 and f_tr(y | θ) = f_θ(y). In this case, f_ZA(y | φ, θ) in (1) represents a mixture distribution consisting of a discrete component at 0 with probability φ and a continuous component with density function (1 − φ)f_θ(y). We will revisit these continuous cases in Section 2.3. In this section, we focus on discrete cases.
The parameters of the hurdle model (1) include both φ and θ. In this paper, we adopt the maximum likelihood estimate (MLE) for estimating the parameters (see, for example, Section 1.3.1 in [25], for justifications on adopting MLE). Let Y_1, . . ., Y_n be a random sample from model (1). Then the likelihood function of (φ, θ) is

L(φ, θ) = φ^{n−m} (1 − φ)^m ∏_{i: Y_i ≠ 0} f_tr(Y_i | θ),   (2)

where m = #{i : Y_i ≠ 0} is the number of nonzero observations. According to Theorem 1 in [24], the maximum likelihood estimate (MLE) of (φ, θ) that maximizes (2) is

φ̂ = (n − m)/n,   θ̂ = argmax_θ ∏_{i: Y_i ≠ 0} f_tr(Y_i | θ).   (3)

That is, θ̂ is simply the MLE for the truncated model with distribution function f_tr(y | θ). According to Theorem 3 in [24], under some regularity conditions, the Fisher information matrix of the zero-altered distribution is

F_ZA(φ, θ) = diag( [φ(1 − φ)]^{−1},  (1 − φ)[1 − p_0(θ)]^{−1} E[ 1{Y ≠ 0} (∂ log f_tr(Y | θ)/∂θ)(∂ log f_tr(Y | θ)/∂θ)^T ] ),   (4)

where Y follows the baseline distribution f_θ(y). Note that the sample size n does not show up in (4) since the Fisher information here is of the distribution, not of the sample.
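The two-step structure of (3) separates the zero weight from the truncated fit. As an illustration only (the package itself is written in R, and the function name below is hypothetical), here is a minimal Python sketch for the Poisson hurdle case, where the zero-truncated likelihood is maximized numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def hurdle_poisson_mle(y):
    """MLE of (phi, lambda) for a Poisson hurdle model, following the split in (3):
    phi-hat = (n - m)/n; lambda-hat maximizes the zero-truncated Poisson likelihood."""
    y = np.asarray(y)
    n, m = len(y), int(np.count_nonzero(y))
    phi_hat = (n - m) / n                           # weight of the zero component
    yz = y[y != 0]

    def neg_trunc_loglik(lam):
        # zero-truncated Poisson log-likelihood: divide pmf by 1 - exp(-lam)
        return -np.sum(yz * np.log(lam) - lam - gammaln(yz + 1)
                       - np.log1p(-np.exp(-lam)))

    lam_hat = minimize_scalar(neg_trunc_loglik, bounds=(1e-6, 1e4),
                              method="bounded").x
    return phi_hat, lam_hat
```

The bounded search range plays the same role as the lowerbound/upperbound arguments used by the package's optimizers.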
Major advantages of adopting the MLE (3) and calculating the Fisher information matrix (4) include: (i) φ̂ and θ̂ are consistent estimators of φ and θ, respectively (Theorem 2 in [24]); (ii) √n ((φ̂, θ̂) − (φ, θ)) is asymptotically normal with mean zero and variance-covariance matrix F_ZA(φ, θ)^{−1}, which provides the formulae for building approximate confidence intervals and relevant hypothesis tests for φ and θ (see, for example, Sections 1.3.3 and 1.3.4 in [25]). For example, an approximate 100(1 − α)% confidence interval for φ is

φ̂ ± z_{α/2} √( φ̂(1 − φ̂)/n ),   (5)

where z_{α/2} is the upper (α/2)th quantile of the standard normal distribution, and α ∈ (0, 1) is the desired significance level, such as 0.05. If p_0(θ̂) does not belong to the confidence interval (5), then there is significant evidence of zero-inflation or zero-deflation.
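Since φ̂ = (n − m)/n behaves like a binomial proportion, the interval for φ and the zero-inflation check can be sketched in a few lines of Python. This is illustrative only (in AZIAD the same job is done through the Fisher information matrix), and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm

def phi_confint(y, p0_hat, alpha=0.05):
    """Wald-type interval for phi in a hurdle model; flags inflation/deflation
    according to whether p0(theta-hat) falls outside the interval."""
    y = np.asarray(y)
    n = len(y)
    phi_hat = float(np.mean(y == 0))                # MLE of phi
    se = np.sqrt(phi_hat * (1 - phi_hat) / n)       # from the phi-block of F_ZA
    z = norm.ppf(1 - alpha / 2)
    lo, hi = phi_hat - z * se, phi_hat + z * se
    verdict = ("zero-inflated" if p0_hat < lo else
               "zero-deflated" if p0_hat > hi else
               "no significant inflation/deflation")
    return (lo, hi), verdict
```

When p_0(θ̂) sits below the interval, the data contain more zeros than the baseline predicts, hence the "zero-inflated" verdict.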
Therefore, calculating MLE and Fisher information accurately and efficiently is the basis of sound and thorough statistical inference.
Explicit formulae for the gradients (∂ log f_θ(y)/∂θ and ∂ log p_0(θ)/∂θ, needed for finding MLEs) and the Fisher information matrices of the zero-altered Poisson (ZAP) or Poisson hurdle (PH) model and the zero-altered negative binomial (ZANB) or negative binomial hurdle (NBH) model have been provided in Examples 2 and 3 of [24], respectively. In this section, we provide explicit formulae for the zero-altered geometric, beta binomial, and beta negative binomial models.
According to Theorem 3 in [24], the Fisher information matrix of the ZABNB or BNBH distribution takes the form (4), with the expectations computed under the BNB baseline distribution. As for ψ_1, the trigamma function, and the relevant calculations, please see the arguments right after Example 2.2.

MLE and Fisher information for zero-inflated models
When data are sparse, a zero-inflated (ZI) model is more commonly used in practice, which assumes an excess of zeros (see, for example, [2] for a good review). Similar to the zero-altered (ZA) models in Section 2.1, there is a baseline distribution with distribution function f_θ(y) and parameter(s) θ, and we again denote p_0(θ) = P_θ(Y = 0). Different from ZA models, a zero-weighting parameter φ ∈ [0, 1] adds additional probability of zeros to the ZI model. Following [24], we write the distribution function of the corresponding ZI model as

f_ZI(y | φ, θ) = φ · 1{y = 0} + (1 − φ) f_θ(y).   (6)

When the baseline distribution is either continuous with a pdf f_θ(y) or discrete with p_0(θ) = 0, the corresponding zero-inflated model (6) is essentially the same as the corresponding zero-altered model (1). Examples include the Gaussian (or normal), log-normal, half-normal, and exponential distributions. We will revisit them in Section 2.3.
Given a random sample Y_1, . . ., Y_n from the zero-inflated model f_ZI(y | φ, θ), the likelihood function of (φ, θ) can be written as

L(φ, θ) = [φ + (1 − φ) p_0(θ)]^{n−m} (1 − φ)^m ∏_{i: Y_i ≠ 0} f_θ(Y_i),   (7)

where m = #{i : Y_i ≠ 0} is the number of nonzero observations. According to Theorem 4 in [24], the maximum likelihood estimate (φ̂, θ̂) maximizing (7) can be obtained by solving two maximization problems. The following first-order derivatives may be needed by, for example, quasi-Newton algorithms: ∂ log f_θ(y)/∂θ and ∂ log p_0(θ)/∂θ. Thus for different models, we only need to prepare specific formulae for ∂ log f_θ(y)/∂θ and ∂ log p_0(θ)/∂θ for the numerical calculations.
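AZIAD obtains the ZI MLEs by numerical optimization with the gradients above. As an alternative illustration only (not the package's algorithm, and in Python rather than R), the ZIP MLE can also be reached by a short EM iteration that treats "structural zero" as the latent indicator:

```python
import numpy as np

def zip_mle_em(y, iters=500):
    """EM sketch for the zero-inflated Poisson MLE.
    Latent variable: whether a zero comes from the phi (structural) component."""
    y = np.asarray(y)
    n, n0, s = len(y), int(np.sum(y == 0)), y.sum()
    phi, lam = 0.5 * n0 / n, y[y > 0].mean()        # crude starting values
    for _ in range(iters):
        w = phi / (phi + (1 - phi) * np.exp(-lam))  # E-step: P(structural | Y = 0)
        phi = n0 * w / n                            # M-step for phi
        lam = s / (n - n0 * w)                      # M-step: Poisson MLE on the rest
    return phi, lam
```

Each M-step divides the total count by the expected number of Poisson-generated observations, so the iteration stays within the parameter space.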
According to Theorem 5 in [24], under some regularity conditions, the Fisher information matrix of a general zero-inflated distribution is

F_ZI(φ, θ) =
( [1 − p_0(θ)] {(1 − φ)[φ + (1 − φ)p_0(θ)]}^{−1}      [φ + (1 − φ)p_0(θ)]^{−1} ṗ_0(θ)^T
  [φ + (1 − φ)p_0(θ)]^{−1} ṗ_0(θ)      (1 − φ)F(θ) − φ(1 − φ){p_0(θ)[φ + (1 − φ)p_0(θ)]}^{−1} ṗ_0(θ) ṗ_0(θ)^T ),   (8)

where ṗ_0(θ) = ∂p_0(θ)/∂θ and Y follows the baseline distribution f_θ(y). We denote F(θ) = E[(∂ log f_θ(Y)/∂θ)(∂ log f_θ(Y)/∂θ)^T], which is essentially the Fisher information matrix of the baseline distribution.
The Fisher information matrix (8) is more complicated than (4) for zero-altered models. To obtain the asymptotic distributions of MLEs, it is more convenient to apply the following theorem, obtained by applying the formulae for block matrices (see, for example, Formulae 4.33 and 14.13(b) in [27]) and the Sherman-Morrison formula (see, for example, Section 2.1.4 in [28]).
The proof of Theorem 2.1 is relegated to the Supplementary Materials (Section S.1).
With Theorem 2.1, the relevant confidence intervals and hypothesis tests can be performed similarly as for zero-altered models.
As mentioned in Section 2.1, we need to calculate the MLE and Fisher information accurately and efficiently. Explicit formulae for the Fisher information matrix of the zero-inflated Poisson (ZIP) model have been provided in Example 5 of [24]. In this section, we provide explicit formulae for the gradients and Fisher information matrices of the zero-inflated geometric (ZIGe) and zero-inflated negative binomial (ZINB) models. We relegate the corresponding formulae for the zero-inflated beta binomial (ZIBB) and zero-inflated beta negative binomial (ZIBNB) models to the Supplementary Materials (Section S.2).

Zero-altered and zero-inflated models with continuous baseline distributions
As mentioned in Section 2.2, if the baseline distribution f_θ(y) satisfies p_0(θ) = P_θ(Y = 0) = 0 given Y ∼ f_θ(y), then the zero-altered model (1) and the zero-inflated model (6) are the same. We call such models zero-altered-zero-inflated (ZAZI) models.
Examples include all continuous baseline distributions, as well as discrete or mixture baseline distributions satisfying p_0(θ) = 0. For a ZAZI model with baseline distribution f_θ(y), its distribution function can be written as

f_ZAZI(y | φ, θ) = φ · 1{y = 0} + (1 − φ) f_θ(y) · 1{y ≠ 0},   (9)

and the likelihood function of (φ, θ) given a random sample Y_1, . . ., Y_n is

L(φ, θ) = φ^{n−m} (1 − φ)^m ∏_{i: Y_i ≠ 0} f_θ(Y_i),   (10)

where m = #{i : Y_i ≠ 0}. Then the MLEs maximizing (10) are φ̂ = 1 − m/n and θ̂ = argmax_θ ∏_{i: Y_i ≠ 0} f_θ(Y_i), which are similar to (3) for zero-altered models. As a special case of Theorem 3 in [24], we have the following formulae for the Fisher information matrix of ZAZI distributions.
Theorem 2.2: Under regularity conditions, the Fisher information matrix of the ZAZI distribution (9) is

F_ZAZI(φ, θ) = diag( [φ(1 − φ)]^{−1},  (1 − φ) E[(∂ log f_θ(Y)/∂θ)(∂ log f_θ(Y)/∂θ)^T] ),

where Y follows the baseline distribution f_θ(y).
In this section, we provide explicit formulae for the gradients and Fisher information matrices of commonly used ZAZI models with continuous baseline distributions, including the Gaussian (or normal), log-normal, half-normal, and exponential distributions.

Example 2.6 (Zero-altered-zero-inflated Gaussian model (ZAZIG)):
This model has been used by, for example, [30] for analysing longitudinal microbiome data, where it is known as a zero-inflated Gaussian (ZIG) model. The pdf of the baseline distribution with parameters θ = (μ, σ)^T, μ ∈ (−∞, ∞), σ ∈ (0, ∞), is f_θ(y) = (2πσ²)^{−1/2} exp{−(y − μ)²/(2σ²)}. According to Theorem 2.2, the Fisher information matrix of the ZAZIG distribution is

F_ZAZIG(φ, θ) = diag( [φ(1 − φ)]^{−1}, (1 − φ)σ^{−2}, 2(1 − φ)σ^{−2} ).

Example 2.7 (Zero-altered-zero-inflated log-normal model (ZAZILN)):
This model, also known as the zero-inflated log-normal (ZILN) model, has been used by, for example, [31] to study the effects of a prospective DUR intervention programme for randomized clinical trials. The pdf of the baseline distribution with parameters θ = (μ, σ)^T is f_θ(y) = (2π)^{−1/2}(σy)^{−1} exp{−(log y − μ)²/(2σ²)}, y > 0. According to Theorem 2.2, the Fisher information matrix of the ZAZILN distribution is

F_ZAZILN(φ, θ) = diag( [φ(1 − φ)]^{−1}, (1 − φ)σ^{−2}, 2(1 − φ)σ^{−2} ),

which is exactly the same as the one in Example 2.6.

Example 2.8 (Zero-altered-zero-inflated half-normal model (ZAZIHN)):
Also known as the zero-inflated half-normal (ZIHN) model, this model has been used by, for example, [32] as a candidate distribution for modelling animal movement distances in the wild. In our notation, the pdf of its baseline distribution with parameter θ = σ ∈ (0, ∞), known as a standard half-normal distribution, is given by f_θ(y) = (2/π)^{1/2} σ^{−1} exp{−y²/(2σ²)}, y > 0. According to Theorem 2.2, the Fisher information matrix of the ZAZIHN distribution is

F_ZAZIHN(φ, θ) = diag( [φ(1 − φ)]^{−1}, 2(1 − φ)σ^{−2} ).

It should be noted that other forms of ZIHN models have also been used in the literature. For example, [33] utilized a more general ZIHN distribution as a prior for a hierarchical Bayesian model. The continuous component of their ZIHN is a general normal distribution N(μ, σ²) truncated below by zero. In that case, θ = (μ, σ).
Example 2.9 (Zero-altered-zero-inflated exponential model (ZAZIE)): Also known as the zero-inflated exponential (ZIE) model, this model has been used by, for example, [34] for modelling casualty rates in ship collisions. The pdf of its baseline distribution with parameter θ = λ ∈ (0, ∞) can be written as f_θ(y) = λe^{−λy}, y > 0. Then ∂ log f_θ(y)/∂λ = λ^{−1} − y. According to Theorem 2.2, the Fisher information matrix of the ZAZIE distribution is

F_ZAZIE(φ, λ) = diag( [φ(1 − φ)]^{−1}, (1 − φ)λ^{−2} ).

Note that the MLE of λ has an explicit form, λ̂ = m / Σ_{i: Y_i ≠ 0} Y_i, where m = #{i : Y_i ≠ 0}.
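The closed-form MLE for the ZAZIE model makes it a convenient sanity check: φ̂ is the fraction of zeros, and λ̂ is the reciprocal of the mean of the nonzero values. A short Python sketch (illustrative, outside the package; the function name is ours):

```python
import numpy as np

def zazie_mle(y):
    """Closed-form MLEs for the ZAZI-exponential model:
    phi-hat = fraction of zeros; lambda-hat = m / (sum of nonzero observations)."""
    y = np.asarray(y, dtype=float)
    nz = y[y != 0]
    phi_hat = 1.0 - len(nz) / len(y)
    lam_hat = len(nz) / nz.sum()     # reciprocal of the mean of the nonzero values
    return phi_hat, lam_hat
```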

Model selection based on KS and likelihood ratio tests
Given so many zero-altered or zero-inflated models listed in Sections 2.1-2.3, a critical question is which model is the most appropriate one for a given dataset.
The Kolmogorov-Smirnov (KS) test has been commonly used for testing whether a random sample {Y_1, . . ., Y_n} comes from a continuous cumulative distribution function F_θ(y) with specified model parameter(s) θ [35]. It is based on the KS statistic D_n = sup_y |F_n(y) − F_θ(y)|, where F_n(y) = n^{−1} #{i : Y_i ≤ y} is known as the empirical distribution function. Dimitrova et al. [36] extended the KS test to general distributions F_θ(y) with known parameter(s) θ, including discrete and mixed ones. In typical applications, θ is unknown and an estimate θ̂ is plugged in when calculating D_n, which tends to overestimate the corresponding p-value [7,37,38]. To overcome the biasedness of the estimated p-value due to the plugged-in estimated parameters, [7] proposed a bootstrapped Monte Carlo estimate #{b : D_n^{(b)} ≥ D_n}/B of the p-value in their Algorithm 1, where D_n^{(b)} is the KS statistic computed from the bth bootstrap sample Y^{(b)}. In this paper, we propose a nested bootstrap estimate of the KS test p-value, which re-estimates the model parameter(s) from each bootstrap sample Y^{(b)} when computing D_n^{(b)}. To make a distinction, we refer to [7]'s Algorithm 1 as Algorithm 1A and to our Algorithm 1 as Algorithm 1B. The corresponding R functions are named kstest.A and kstest.B, respectively. According to our simulation studies in Section 3.4, we recommend kstest.B for small sample sizes such as n = 30, 50. For larger sample sizes, since the difference in terms of test power is negligible, we recommend kstest.A for its lower computational cost.
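The bootstrap logic can be sketched in Python for a simple continuous baseline. The version below refits the parameter on each bootstrap sample; the exponential baseline and the single bootstrap level are simplifying assumptions for illustration, and the function name is ours rather than the package's:

```python
import numpy as np
from scipy.stats import kstest, expon

def bootstrap_ks_pvalue(y, B=200, rng=None):
    """Parametric-bootstrap KS p-value with the parameter re-estimated on each
    bootstrap sample. Exponential baseline; one bootstrap level for brevity."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    lam_hat = 1.0 / y.mean()                              # exponential MLE
    d_obs = kstest(y, expon(scale=1.0 / lam_hat).cdf).statistic
    exceed = 0
    for _ in range(B):
        yb = rng.exponential(1.0 / lam_hat, size=len(y))  # draw from the fitted model
        lam_b = 1.0 / yb.mean()                           # refit on the bootstrap sample
        db = kstest(yb, expon(scale=1.0 / lam_b).cdf).statistic
        exceed += db >= d_obs
    return exceed / B
```

A large estimated p-value means the observed D_n is typical of samples genuinely drawn from the fitted model.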
In practice, it is not uncommon that two or more distributions pass the KS test for the same dataset, especially when the sample size is moderate or small (see, for example, [7]). In this situation, likelihood ratio tests may be used for pairwise comparisons. More specifically, for testing H_0: Y_1, . . ., Y_n iid ∼ f(y; θ) with unknown parameter(s) θ against H_1: Y_1, . . ., Y_n iid ∼ g(y; δ) with unknown parameter(s) δ, the likelihood ratio test statistic on the log scale (see, for example, [7]) can be defined as

λ_n = Σ_{i=1}^n log f(Y_i; θ̂) − Σ_{i=1}^n log g(Y_i; δ̂),

where θ̂ and δ̂ are the corresponding maximum likelihood estimates. Smaller values are in favour of the alternative distribution g(y; δ).
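As a toy illustration of the log-scale likelihood ratio statistic (in Python, outside the package), take H_0 Poisson against H_1 geometric, both of which admit closed-form MLEs:

```python
import numpy as np
from scipy.stats import poisson, geom

def lrt_stat(y):
    """Log-scale likelihood ratio l0(theta-hat) - l1(delta-hat).
    H0: Poisson; H1: geometric on {0, 1, 2, ...}. Smaller values favour H1."""
    y = np.asarray(y)
    lam_hat = y.mean()                        # Poisson MLE
    p_hat = 1.0 / (1.0 + y.mean())            # MLE for the geometric on {0, 1, ...}
    l0 = poisson(lam_hat).logpmf(y).sum()
    l1 = geom(p_hat, loc=-1).logpmf(y).sum()  # shift scipy's geom from {1,...} to {0,...}
    return l0 - l1
```

Because a geometric distribution with the same mean is far more dispersed than a Poisson, the sign of the statistic separates the two data-generating mechanisms sharply.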
To overcome the possible biasedness due to plugged-in estimated parameters, we adopt the bootstrapped estimate of the p-value described by Algorithm 2 of [7]. More specifically, we (i) draw bootstrap samples Y^{(b)} = {Y_1^{(b)}, . . ., Y_n^{(b)}} from the fitted null distribution f(y; θ̂); (ii) compute the likelihood ratio statistic for each bootstrap sample; and (iii) estimate the p-value by the proportion of bootstrap statistics that are no larger than the observed one. If the p-value is less than 0.05, we claim that H_1 is significantly better; otherwise, we stick with H_0.

Zero-altered model versus zero-inflated model
Clearly a zero-altered model (1) and its corresponding zero-inflated model (6) are connected by sharing the same baseline distribution f_θ(y). Under some conditions, they are actually equivalent, due to the following theorem.

Theorem 2.3: f_ZA(y | φ_ZA, θ) = f_ZI(y | φ_ZI, θ) for all y if and only if φ_ZA = φ_ZI + (1 − φ_ZI) p_0(θ); in particular, such a φ_ZI ∈ [0, 1] exists if and only if φ_ZA ≥ p_0(θ).

The proof of Theorem 2.3 is relegated to the Supplementary Materials (Section S.1). Theorem 2.3 implies that if φ_ZA ≥ p_0(θ), which indicates that the data are zero-inflated, then a ZA model is equivalent to the corresponding ZI model. Given a random sample from a zero-inflated model f_ZI(y | φ_ZI, θ), a KS test will conclude that some zero-altered model f_ZA(y | φ_ZA, θ) seems fine for the data as well. The other direction is slightly different: if the true model is a zero-altered one, only when φ_ZA is at least as large as p_0(θ) will a KS test conclude that some zero-inflated model seems true as well. In Section 3.5, we provide a numerical example in which both ZIBNB and BNBH fit the data well.
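The equivalence can be verified numerically: matching the zero weights via φ_ZA = φ_ZI + (1 − φ_ZI)p_0(θ) makes the two pmfs agree pointwise. A Python sketch with a Poisson baseline (illustrative only; the function names are ours):

```python
import numpy as np
from scipy.stats import poisson

def zi_pmf(y, phi, lam):
    """Zero-inflated Poisson pmf: phi extra mass at zero, baseline elsewhere."""
    base = poisson(lam)
    return np.where(y == 0, phi + (1 - phi) * base.pmf(0), (1 - phi) * base.pmf(y))

def za_pmf(y, phi, lam):
    """Zero-altered (hurdle) Poisson pmf: phi at zero, truncated baseline elsewhere."""
    base = poisson(lam)
    f_tr = base.pmf(y) / (1.0 - base.pmf(0))   # zero-truncated Poisson pmf
    return np.where(y == 0, phi, (1 - phi) * f_tr)
```

Choosing φ_ZA this way forces the two models to place identical mass at zero, and the factor (1 − φ_ZA)/(1 − p_0) = 1 − φ_ZI makes the nonzero parts match as well.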

Software implementation and numerical analysis
In this section, we introduce the R implementation of the proposed package AZIAD and present its numerical analysis. Real data examples are used to illustrate how our package can be used in practice.
Compared with existing R packages, each covering only a limited number of baseline distributions, our package covers many more discrete and continuous distributions, including the Poisson, geometric, negative binomial, beta binomial, beta negative binomial, normal (or Gaussian), log-normal, half-normal, and exponential distributions along with their zero-altered and zero-inflated models, which helps potential users choose the most appropriate model from a large class of candidates for their dataset. For all the models mentioned above, we provide the corresponding Fisher information matrix and confidence intervals for estimated parameters, which allows users to run hypothesis tests and make further inference.

Improvements over existing packages on finding MLEs
From Section 2.4, we can see that an accurate MLE is a critical component for both the KS test and the likelihood ratio test. In this section, we first summarize the improvements of the proposed AZIAD over existing R packages on finding MLEs for zero-altered and zero-inflated models.
(1) Compared with other packages, AZIAD provides MLEs well even for extreme cases including φ = 0 or φ = 1 (see the comparison analysis in Section 3.3). (2) Some packages encounter the error message "L-BFGS-B needs finite values of 'fn'" for some zero-inflated data, for example, the function dis.kstest in package iZID. This issue is solved in AZIAD by specifying appropriate lower and upper bounds for the optimizations (see Section 3.3). (3) Compared with package iZID, which also covers MLEs for the ZIBNB, BNBH, ZIBB, and BBH models, our results are more reliable (see Example 3.2 for a comparison) by applying Theorem 4 in [24]. More specifically, we separate the two situations, (i) m/n ≤ 1 − p_0(θ*) and (ii) m/n > 1 − p_0(θ*), and calculate the MLEs in each case. (4) Some parameters are originally defined as positive integers, such as n in Examples 2.2 and S.1 and r in Examples 2.3 and 2.5, while in practice they can be extended to positive real numbers. In AZIAD, we seek real-valued MLEs by default. In the meantime, we keep the option of integer-valued n or r in response to users' requests.
In the rest of this section, we use examples to illustrate how to use our package to find MLEs and the improved accuracy achieved by using our package.

Example 3.1:
To find the MLE for the parameter(s) of zero-inflated or hurdle models, the main function built into AZIAD is

zih.mle(x, r, p, alpha1, alpha2, n, lambda, mean, sigma, type = c("zi", "h"), dist, lowerbound = 0.01, upperbound = 10000)

where x is a sequence of numbers, which could be integers for discrete cases or real numbers for continuous cases; the arguments r, p, alpha1, alpha2, n, lambda, mean, and sigma are initial values of the corresponding parameters; dist can be chosen as poisson.zihmle, geometric.zihmle, nb.zihmle, nb1.zihmle, bb.zihmle, bb1.zihmle, bnb.zihmle, bnb1.zihmle, normal.zihmle, halfnorm.zihmle, lognorm.zihmle, or exp.zihmle, which correspond to zero-inflated or zero-altered Poisson, geometric, negative binomial, negative binomial with integer-valued r, beta binomial, beta binomial with integer-valued n, beta negative binomial, beta negative binomial with integer-valued r, normal, log-normal, half-normal, and exponential distributions, respectively; the option type indicates whether the type of distribution is zero-inflated (zi) or hurdle/zero-altered (h); lowerbound and upperbound specify the search range of the parameters when maximizing the likelihood function. For instance, to calculate the MLE of a zero-inflated geometric distribution (see Example 2.4), one may call zih.mle with type = "zi" and dist = "geometric.zihmle".

Fisher information, confidence interval, and test on zero-inflation
As mentioned in Sections 2.1-2.3, the Fisher information matrix F_ZA, F_ZI or F_ZAZI can be calculated by AZIAD for zero-altered/hurdle, zero-inflated, or ZAZI models, respectively. Its inverse matrix is approximately the variance-covariance matrix of the parameter estimates (φ̂, θ̂). For example, for zero-altered or hurdle models (see Section 2.1), the variance-covariance matrix of (φ̂, θ̂) is approximately n^{−1} F_ZA^{−1}(φ̂, θ̂) for large n. Approximate confidence intervals can then be constructed for φ and θ (for example, expression (5) for φ).
In our package AZIAD, we use the function

FI.ZI(x, dist = "poisson", r = NULL, p = NULL, alpha1 = NULL, alpha2 = NULL, n = NULL, lambda = NULL, mean = NULL, sigma = NULL, lowerbound = 0.01, upperbound = 10000)

to calculate (the inverse of) the Fisher information matrix at the MLE and 95% approximate confidence intervals for all parameters, where x is the data in its vector form; dist can be poisson, geometric, ..., or bnbh for the Poisson, geometric, negative binomial, beta binomial, beta negative binomial, normal/Gaussian, log-normal, half-normal, and exponential distributions, their zero-inflated versions, and their hurdle versions, respectively; r, p, alpha1, alpha2, n, lambda, mean, and sigma provide initial values of the distribution parameters; lowerbound and upperbound are predetermined ranges for the MLEs, which can be extended if attained by any of the estimated parameter values. It turns out that the 95% approximate confidence interval (0.4281, 0.4899) of φ based on the PH model (dist = "ph") covers the baseline zero probability p_0(λ) = 0.4493, which indicates that neither zero-inflation nor zero-deflation is significant at the 5% level. Actually, the 95% approximate confidence interval (0.7619, 0.8741) of λ based on the regular Poisson model (dist = "poisson") covers the true value 0.8 roughly at the centre, which is better than (0.7937, 0.9920) based on the PH model.

Comparison study in terms of type I error
In this section and Section 3.4, we use simulation studies to compare the performance of our algorithms for model identification with existing R functions for the same purposes. In this section we focus on type I errors. That is, given the true distribution, say ZINB, what is the chance that the test erroneously concludes that the distribution is not ZINB due to a p-value less than 0.05? Such an error is known as a type I error [39]. Ideally, such a chance is no more than 0.05, known as the size of the test. The R functions under comparison in this section include the basic function ks.test, the function disc_ks_test in package KSgeneral [36], the function dis.kstest in package iZID [23], and two of our functions in AZIAD: kstest.A, based on [7]'s Algorithm 1 with (#b + 1)/(B + 1) replaced by #b/B, and kstest.B, based on our Algorithm 1B described in Section 2.4.
For illustration purposes, we consider four zero-inflated models: ZIP, ZINB, ZIBNB, and ZIBB. For each model, we consider five different sample sizes, N = 30, 50, 100, 200, 500, and simulate B = 1000 independent datasets for each sample size. For each simulated dataset, we run the five R functions under comparison individually, targeting the corresponding true model. If the p-value of the test is less than 0.05, we count it as a type I error. The ratios of type I errors (that is, the number of type I errors divided by B = 1000) are listed in Tables 3-6. For readers' reference, we provide the R code for generating Table 4 in the Supplementary Materials (Section S.3).
According to these tables, disc_ks_test (package KSgeneral) seems to have larger type I error rates than the nominal level 0.05. One concern about function

Comparison study in terms of test power
In this section, we continue the simulation studies of Section 3.3 and compare the power of the tests based on four different R functions. More specifically, given the B = 1000 datasets simulated from, for example, the ZIP distribution, we run a KS test on whether the data come from a ZINB distribution, and denote the test as ZIP versus ZINB. If we reject the ZINB distribution 800 times, then the empirical power of the KS test at ZINB is 800/1000 = 0.80. Given that the type I error rate does not go beyond the nominal level 0.05, the bigger the power, the better the test performs (see, for example, [39]). We remove the function dis.kstest of package iZID from the power analysis since it generates too many NA's for ZINB (see Table 4) or has a type I error rate much higher than 0.05 for ZIBNB (see Table 5) and ZIBB (see Table 6). We recommend kstest.B for small sample sizes (see Table S.2) and kstest.A, which is faster, for large sample sizes such as N = 100, 200, 500.

Example 3.5 (DebTrivedi Data):
In this example, we use the real dataset DebTrivedi from the R package MixAll to illustrate how to use our package AZIAD to identify the most appropriate zero-inflated model. The DebTrivedi data were obtained from the US National Medical Expenditure Survey [40,41] and contain 19 variables from 4,406 individuals aged 66 and over. Since there are seven distributions passing the KS test, we further run our likelihood ratio test function lrt.A to compare each pair of the candidate distributions. For example, if d1 represents the geometric distribution and d2 represents the NB distribution, we run lrt.A(d1, d2). A p-value less than 0.05 indicates that d2 is significantly better than d1. The relevant R code is relegated to the Supplementary Materials (Section S.3).
Table 8 shows the results of pairwise comparisons of the seven candidate distributions based on lrt.A(H0, H1), with row names indicating H_0 and column names indicating H_1. A small p-value implies that the column distribution is significantly better than the row distribution. It should be noted that in general the p-values of lrt.A(d1, d2) and lrt.A(d2, d1) are not equal. Based on the p-values in Table 8, we conclude that ZIBNB and BNBH are significantly better than geometric, NB, NB1, ZIBB and BBH, while there is no significant difference between ZIBNB and BNBH, which confirms our conclusion in Section 2.5. Note that neither ZIBNB nor BNBH was considered by Deb and Trivedi [40] or Zeileis et al. [41]. Both distributions were recommended by Aldirawi et al. [7] and Aldirawi and Yang [24] for microbiome data analysis.
We run the same procedure based on our function lrt.B as well, which is based on kstest.B. The results are consistent with those of lrt.A. Similarly as in Section 3.4, we recommend lrt.B for cases with smaller sample sizes and lrt.A for cases with a sample size larger than 100.

Example 3.6 (Omic Data):
In this example, we analyse the Omic data from [42], a list of 229 bacterial and fungal OTUs, to identify appropriate models. More specifically, for each of the following distributions, Poisson, geometric, negative binomial, beta binomial, beta negative binomial, and their zero-inflated and hurdle versions, we record the percentage of OTUs for which the distribution passes the KS test. We conclude that Poisson, geometric, negative binomial (NB), zero-inflated Poisson (ZIP), Poisson hurdle (PH) and geometric hurdle (GeH) are not appropriate for sparse microbial features due to their low percentages (no more than 2%), while beta binomial (BB), beta negative binomial (BNB) and their zero-inflated and hurdle versions are much more popular (at least 64%). Among them, BNBH (or ZABNB, see Example 2.3) is the most appropriate model, with percentage 87%. Compared with Table 1 in [7] or Table 2 in [24], our results are more reliable although the patterns are similar.

Discussion
A major goal of this paper is to identify the underlying distribution F_0 given a random sample {Y_1, . . ., Y_n} from it. Given the data {Y_1, . . ., Y_n}, ideally our tests should accomplish two tasks: (i) do not reject F_0 itself; (ii) reject any F_1 that is not F_0.
According to the Glivenko-Cantelli theorem (see, for example, Theorem 19.1 in [43]), the empirical distribution function F_n converges to F_0 uniformly, and thus Task (i) can be achieved by proper KS tests up to a type I error, which is confirmed by our simulation studies in Section 3.3.
Task (ii) is much more complicated. First of all, in Section 2.5, we show the equivalence of a zero-inflated model and its corresponding hurdle model given that φ_ZA ≥ p_0(θ). Therefore, one cannot distinguish between a ZI model and its corresponding ZA or hurdle model when zero-inflation exists. We have such a real data example in Section 3.5.
Secondly, as shown by the simulation studies in Section 3.4, one may not be able to reject F_1 if F_1 is a more flexible model than F_0, especially when F_1 with fitted parameters can approximate F_0 well. In this section, we use ZIP versus ZINB as an example to illustrate why one may not be able to reject F_1 given that F_0 is the true model.
It is known that Poisson(λ) can be approximated by NB(r, p) with a large r and a p close to 1 such that λ = r(1 − p) (see, for example, [44]). To illustrate how well ZINB can approximate a ZIP distribution, we simulate random samples {Y_1, . . ., Y_N} from ZIP(φ = 0.3, λ = 10) with various sample sizes N, then fit a ZINB model. In Table 10, we list the maximum difference between the cumulative distribution function (CDF) F_ZIP(y) of ZIP and the CDF F_ZINB(y) of the fitted ZINB. As the sample size N increases, the maximum distance sup_y |F_ZIP(y) − F_ZINB(y)| decreases, which indicates that ZINB approximates ZIP better and better.
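The approximation can also be checked directly at the distribution level, without simulation: fix λ and let r grow while matching the mean. A Python sketch (note that scipy's NB parameterization has mean r(1 − p)/p, so we set p = r/(r + λ); the function name is ours):

```python
import numpy as np
from scipy.stats import poisson, nbinom

def max_cdf_gap(r, lam=10.0, phi=0.3):
    """sup_y |F_ZIP(y) - F_ZINB(y)| when NB(r, p) is chosen to match the Poisson
    mean lam. The gap shrinks as r grows, so ZINB can mimic ZIP arbitrarily well."""
    p = r / (r + lam)                          # scipy nbinom mean: r(1 - p)/p = lam
    ys = np.arange(0, 101)                     # Poisson(10) mass beyond 100 is negligible
    f_zip = phi + (1 - phi) * poisson(lam).cdf(ys)
    f_zinb = phi + (1 - phi) * nbinom(r, p).cdf(ys)
    return float(np.max(np.abs(f_zip - f_zinb)))
```

Both ZI CDFs share the same jump φ at zero, so the gap is driven entirely by how close the NB baseline is to the Poisson baseline.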
Another example is the KS test of ZIBNB versus ZIBB, whose empirical power shows a decreasing pattern as the sample size N increases (see Table S.2 in the Supplementary Materials). We perform a similar numerical study for ZIBNB versus ZIBB. More specifically, we simulate random samples {Y_1, . . ., Y_N} from ZIBNB(φ = 0.3, r = 15, α1 = 19, α2 = 10) with various sample sizes N, and then fit a ZIBB model. Table 11 shows a decreasing pattern of the maximum difference between the two CDFs as the sample size N increases, which indicates that ZIBB approximates ZIBNB better and better.
Here B is a predetermined large number, typically B = 1000. If the estimated p-value is larger than 0.05, we say that the specified distribution passes the KS test for the given dataset. Since D_n = sup_y |F_n(y) − F_θ̂(y)| when θ is unknown, where both F_n and the estimate θ̂ are based on the same data {Y_1, . . ., Y_n}, a more reasonable bootstrapped version of D_n is D_n^(b) = sup_y |F_n^(b)(y) − F_θ̂^(b)(y)|, in which the empirical distribution F_n^(b) and the estimate θ̂^(b) are both recomputed from the bth bootstrap sample.
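As a language-neutral illustration of this parametric-bootstrap procedure (a minimal Python sketch with our own helper names, not the AZIAD implementation), the p-value for a Poisson baseline can be estimated as follows: fit θ from the data, compute D_n, then recompute both the empirical CDF and the estimate on each of B bootstrap samples drawn from the fitted model.

```python
import math, random

def poisson_cdf(y, lam):
    # F(y) = sum_{k <= y} e^{-lam} lam^k / k!
    p, total = math.exp(-lam), 0.0
    for k in range(int(y) + 1):
        total += p
        p *= lam / (k + 1)
    return total

def poisson_draw(rng, lam):
    # inversion sampling along the cumulative probabilities
    u, k, p = rng.random(), 0, math.exp(-lam)
    total = p
    while total < u:
        k += 1
        p *= lam / k
        total += p
    return k

def ks_stat(sample, lam):
    # D_n = sup_y |F_n(y) - F_lam(y)|; for a discrete model the sup is
    # attained on the support, so scanning y = 0, ..., max(sample) suffices
    n, d = len(sample), 0.0
    for y in range(max(sample) + 1):
        F_n = sum(1 for x in sample if x <= y) / n
        d = max(d, abs(F_n - poisson_cdf(y, lam)))
    return d

def boot_pvalue(sample, B=200, seed=1):
    rng = random.Random(seed)
    n = len(sample)
    lam_hat = sum(sample) / n                  # Poisson MLE on the data
    d_obs = ks_stat(sample, lam_hat)
    exceed = 0
    for _ in range(B):
        boot = [poisson_draw(rng, lam_hat) for _ in range(n)]
        lam_b = sum(boot) / n                  # refit on the b-th sample
        exceed += ks_stat(boot, lam_b) >= d_obs
    return exceed / B
```

With data actually drawn from a Poisson model, the estimated p-value typically exceeds 0.05, so the model passes the test.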

With sample size N = 1000, the derived 95% confidence intervals cover the true parameter values fairly well.

Example 3.4 (Test zero-inflation in Poisson data): As an illustration, we simulate ZIP data with N = 1000, φ = 0 and λ = 0.8, which is actually regular Poisson data with λ = 0.8 and no zero-inflation. There are 459 zeros, which roughly matches the probability of zero p_0(λ) = 0.449. Then we use the FI.ZI function with the Poisson hurdle (PH) distribution to allow both zero-inflation and deflation.
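The quoted probability of zero is simply p_0(λ) = e^{−λ}; a one-line numerical check (Python, standard library only) confirms the value and the expected zero count:

```python
import math

p0 = math.exp(-0.8)         # P(Y = 0) under Poisson(lambda = 0.8)
expected_zeros = 1000 * p0  # about 449, close to the 459 zeros observed
```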

Here ψ1(x) = d² log Γ(x)/dx² is known as the trigamma function, and Y follows the baseline distribution f_θ(y).
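As a quick sanity check on the trigamma function (a Python sketch approximating the second derivative of math.lgamma by central differences; the well-known value ψ1(1) = π²/6 serves as the reference):

```python
import math

def trigamma_approx(x, h=1e-3):
    # central second difference of log-gamma:
    # (f(x + h) - 2 f(x) + f(x - h)) / h^2
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / h**2

val = trigamma_approx(1.0)  # close to pi^2 / 6 = 1.6449...
```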

Table 3 .
Type I error rates of KS tests on whether the data comes from ZIP, based on B = 1000 simulated ZIP datasets with parameters φ = 0.3, λ = 10 for each sample size N.

Table 4 .
Type I error rates of KS tests on whether the data comes from ZINB, based on B = 1000 simulated ZINB datasets with parameters φ = 0.3, r = 5, p = 0.2 for each sample size N.

Table 5 .
Type I error rates of KS tests on whether the data comes from ZIBNB, based on B = 1000 simulated ZIBNB datasets with parameters φ = 0.3, r = 3, α1 = 3, α2 = 5 for each sample size N.

Table 6 .
Type I error rates of KS tests on whether the data comes from ZIBB, based on B = 1000 simulated ZIBB datasets with parameters φ = 0.3, n = 5, α1 = 8, α2 = 3 for each sample size N.

A problem with dis.kstest (from package iZID) for the ZINB, ZIBNB and ZIBB distributions is its significant portion of NA's due to errors. Our two functions, kstest.A and kstest.B, and the base R function ks.test have type I error rates nearly zero, which is satisfactory.

Table 7
shows that our functions, kstest.A and kstest.B, have larger power at ZIP versus ZIBNB. All tests fail at ZIP versus ZINB and at ZIP versus ZIBB. This often happens for a test of a simpler model versus a more flexible model, especially when the flexible model can approximate the simpler model well (see Section 4 for a discussion). More simulation studies show that our functions also have the largest powers at ZIBNB versus ZIP and at ZIBB versus ZIBNB (see Tables S.1-S.3 in the Supplementary Materials, Section S.4). Note that both ks.test and disc_ks_test here rely on the MLEs for ZIBB and ZIBNB provided by our package AZIAD. Overall, our two functions kstest.A and kstest.B are the most reliable for KS tests involving zero-inflated models. In practice, we recommend kstest.B for small sample sizes such as N = 30, 50 (see ZIP versus ZIBNB in Table 7, ZINB versus ZIP in Table S.2, and ZIBNB versus ZIP in Table

Table 7 .
Empirical power of KS tests on whether the data comes from ZINB, ZIBNB or ZIBB, based on B = 1000 simulated ZIP datasets with parameters φ = 0.3, λ = 10 for each sample size N.

For illustration purposes, we select the variable ofp, which is the number of physician office visits. To analyse the data, we first use KS tests to check each distribution covered by our package. Since the sample size 4,406 is fairly large, we use our function kstest.A instead of kstest.B. Out of the 19 distributions under test, 7 models have a p-value larger than 0.05: geometric, NB (with real-valued r), NB1 (with integer-valued r), ZIBB, ZIBNB, BBH, and BNBH.

Table 10 .
Maximum difference between the CDF of ZIP and the CDF of the fitted ZINB, based on random samples from ZIP(φ = 0.3, λ = 10) with various sample sizes N.