A Zero-Inflated Logistic Normal Multinomial Model for Extracting Microbial Compositions

Abstract High-throughput sequencing data collected to study the microbiome provide information in the form of relative abundances and should be treated as compositions. Although many approaches, including scaling and rarefaction, have been proposed for converting raw count data into microbial compositions, most of these methods simply return zero values for zero counts. However, zeros can distort downstream analyses, and they can also pose problems for composition-aware methods. This problem is exacerbated with microbiome abundance data because they are sparse with excessive zeros. In addition to data sparsity, microbial composition estimation depends on other data characteristics such as high dimensionality, over-dispersion, and complex co-occurrence relationships. To address these challenges, we introduce a zero-inflated probabilistic PCA (ZIPPCA) model that accounts for the compositional nature of microbiome data, and propose an empirical Bayes approach to estimate microbial compositions. An efficient iterative algorithm, called classification variational approximation, is developed for carrying out maximum likelihood estimation. Moreover, we study the consistency and asymptotic normality of the variational approximation estimator from the perspective of profile M-estimation. Extensive simulations and an application to a dataset from the Human Microbiome Project are presented to compare the performance of the proposed method with that of existing methods. The method is implemented in R and available at https://github.com/YanyZeng/ZIPPCAlnm. Supplementary materials for this article are available online.


Introduction
The overarching goal of human microbiome studies is to generate research resources to facilitate characterization of the human microbiota to further our understanding of the structure and function of the microbiome in both healthy and diseased states (Gilbert et al. 2018). Advances in the throughput and accuracy of DNA sequencing have vastly increased the number of microbiome surveys being performed, which in turn have produced massive amounts of data to be analyzed. For example, in the Human Microbiome Project, marker-gene profiling of the human microbiome across many subjects and body habitats provided insights into inter- and intra-individual variations in the normal microbiota of healthy adults, and improved our ability to decipher the relationship between the human microbiome and disease (Huttenhower et al. 2012; Proctor et al. 2019).
After quality control and data preprocessing, a typical microbiome dataset consists of a matrix that relates observed counts of operational taxonomic units (OTUs) to samples, and metadata that provide information about the samples (Knight et al. 2018). However, the read counts obtained for a sample inform only the relative abundances of OTUs in the microbial ecosystem to which the sample belongs. This is implicitly acknowledged when microbiome data are converted to compositions, which are constrained to sum to one and contain a high proportion of zeros. Usually, the degree of sparsity observed in marker-gene surveys is higher than that seen in other high-throughput sequencing experiments (Paulson et al. 2013). The sparsity can be explained by both biological and technical reasons, such as undersampling and dropout events. For example, some taxa are very lowly represented and cannot be detected due to an insufficient sequencing depth, a low capture rate, or other technical reasons, while others are rare and absent in the majority of samples.
Many methods have been proposed to estimate microbial compositions while accounting for the excess of zeros. One approach is to replace zeros by a small nonzero value, known as a pseudo count (Aitchison 1986). The choice of the positive constant is not statistically rigorous, but rather arbitrary. A Bayesian treatment assumes a multinomial distribution for counts and a uniform Dirichlet prior for the underlying proportions, and uses posterior Bayesian estimates to replace zeros (Martín-Fernández et al. 2015). To our knowledge, most replacement methods are applied to single data points, but the samples may have much in common, and these similarities can be used to learn from the experience of others. Liu, Zhao, and Wang (2020) proposed an empirical Bayes method by fixing the unknown hyperparameters of the Dirichlet prior to specific values. They estimated the posterior Bayesian estimator empirically from the data by maximizing the evidence, that is, the marginal Dirichlet-multinomial likelihood. Under a Poisson-multinomial model, Cao, Zhang, and Li (2020) proposed regularized maximum likelihood estimation of microbial compositions that borrows strength across samples and across taxa. This approach imposed a low-rank structure on the underlying compositional matrix, and connected to the growing interest in matrix denoising (Donoho and Gavish 2014) and matrix completion (Cao and Xie 2016).
The above methods assume implicitly that all microbes are present in the ecosystems and the zeros come from a sampling process, for example, the Poisson or multinomial distribution. However, not all zeros are the same, and in general sequence count data can exhibit zero-inflation. In such cases, zero-inflated models could be used, the most popular of which are zero-inflated univariate distributions for differential abundance analysis (Paulson et al. 2013; Chen and Li 2016). Unfortunately, methods proposed at the individual taxon level are typically not useful for estimating community-level microbial compositions. Zero-inflated multivariate distributions are in high demand but less developed in the literature. Tang and Chen (2019) proposed a zero-inflated generalized Dirichlet multinomial model for linking microbial counts with clinical or environmental factors.
In addition to zero-inflation, the ubiquity of over-dispersion and complex dependence structures among microbial taxa makes the analysis of multivariate abundance data a challenging task. While helpful for addressing over-dispersion, the Dirichlet multinomial assumes the counts are negatively correlated. Due to similar habitats or symbiotic interactions, microbes may display positive associations (Weiss et al. 2016). The generalized Dirichlet multinomial relaxes this restriction of pairwise negative correlation, but has a relatively narrow range of possible positive correlation (Wang and Zhao 2017). To allow for a general correlation structure among taxon counts, Xia et al. (2013) proposed an additive logistic normal multinomial model. The underlying distribution for multivariate count data results from compounding the conditional multinomial distribution (given compositions) with Aitchison's logistic normal distribution for compositions (Aitchison 1982). In a hierarchical Bayesian framework, Billheimer, Guttorp, and Fagan (2001) combined the multinomial likelihood and the logistic normal prior for making inferences about species compositions. Since the posterior distribution cannot be expressed in closed form, exact inference is analytically intractable. They proposed approximate inference using Markov chain Monte Carlo. However, this method is computationally intensive and scales only to a small number of taxa. Recently, Zhang and Lin (2019) developed a scalable algorithm based on stochastic approximation EM and Hamiltonian Monte Carlo. Nevertheless, as pointed out by Billheimer, Guttorp, and Fagan (2001), the assumption that all zeros are sampling zeros may be a severe limitation in applications where one or more taxa are absent, or where inference of absence is of interest.
In this article, we propose an empirical Bayes approach for microbial composition estimation. Motivated by the observation of an approximate low-rank structure in the compositional matrix (Faust et al. 2012; Cao, Zhang, and Li 2020), we introduce a zero-inflated probabilistic PCA (ZIPPCA) model, which can be viewed both as a low-rank version of the additive logistic normal multinomial model that accounts for zero-inflation, and as an extension of probabilistic PCA (Tipping and Bishop 1999) from the Gaussian setting to multivariate abundance data. To find the maximum likelihood estimates of model parameters, we develop an efficient iterative algorithm based on variational approximation (Ormerod and Wand 2010; Blei, Kucukelbir, and McAuliffe 2017). To some extent, we alleviate problems in the estimation of microbial compositions from raw sequence count data, including high dimensionality, data sparsity, over-dispersion, and complex co-occurrence relationships. The rest of the article is organized as follows. Section 2 reviews some commonly used methods for estimating microbial compositions. Section 3 introduces the ZIPPCA model. In Section 4, maximum likelihood estimation is addressed and a variational approximation algorithm is proposed. Section 5 characterizes theoretical properties of the variational approximation estimator. Section 6 is devoted to simulation comparison, and Section 7 is concerned with data application. Finally, Section 8 includes a discussion.

Empirical Bayes Estimation of Microbial Compositions
Suppose we have a microbiome dataset with n samples and p taxa. Let $x_{ij}$ denote the measured count for taxon j in sample i, and $M_i = \sum_{j=1}^{p} x_{ij}$ the total count of all taxa for the ith sample, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Write $x_i = (x_{i1}, \ldots, x_{ip})^\top$. It is natural to assume that $x_i$ follows a multinomial distribution with index $M_i$ and vector of probabilities $\rho_i = (\rho_{i1}, \ldots, \rho_{ip})^\top$, where $0 < \rho_{ij} < 1$ and $\sum_{j=1}^{p} \rho_{ij} = 1$. The method of maximum likelihood then yields relative abundances, or raw proportions, as the naive estimates of microbial compositions.
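As a concrete illustration of the naive estimator, the sketch below (in Python rather than the paper's R, with a made-up toy count matrix) computes the raw proportions $x_{ij}/M_i$:

```python
import numpy as np

# Toy count matrix: n = 3 samples (rows), p = 4 taxa (columns).
x = np.array([[10, 0, 5, 5],
              [2, 8, 0, 10],
              [0, 0, 12, 8]], dtype=float)

# Total count M_i per sample, and the naive maximum likelihood
# estimate of the compositions: raw proportions x_ij / M_i.
M = x.sum(axis=1, keepdims=True)
rho_naive = x / M

# Each row sums to one, and zero counts yield zero proportions.
```

Note that every zero count maps directly to a zero proportion, which is exactly the behavior the replacement methods below try to improve on.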
To account for over-dispersion that is typical of data in microbiome studies, it is often reasonable to assume that $\rho_i$ is random with a suitable distribution. A popular choice is the logistic normal (LN) distribution (Aitchison 1982). Let $S^{p-1}$ be the (p − 1)-dimensional simplex and $R^{p-1}$ the (p − 1)-dimensional Euclidean space. Define the log-ratio transformation $\mu_{ij} = \log(\rho_{ij}/\rho_{ip})$, $j = 1, \ldots, p-1$, which maps $S^{p-1}$ onto $R^{p-1}$. The inverse transformation has the form $\rho_{ij} = \exp(\mu_{ij})/\{1 + \sum_{l=1}^{p-1} \exp(\mu_{il})\}$ for $j < p$, and $\rho_{ip} = 1/\{1 + \sum_{l=1}^{p-1} \exp(\mu_{il})\}$. Assuming multivariate normality of the transformed data $\mu_i = (\mu_{i1}, \ldots, \mu_{i,p-1})^\top$, Billheimer, Guttorp, and Fagan (2001) proposed a fully Bayesian treatment to make inferences about microbial compositions. First, they introduced a prior over the mean vector and covariance matrix in the LN distribution. Second, they computed the posterior by marginalizing with respect to these hyperparameters as well as the latent microbial composition. Since the complete marginalization over all of these variables is analytically intractable, they used Markov chain Monte Carlo for inference, which is computationally intensive even for a small p.
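A minimal sketch of the additive log-ratio transform and its inverse, using the last taxon as the reference (the function names are ours for illustration, not from the ZIPPCAlnm package):

```python
import numpy as np

def alr(rho):
    """Additive log-ratio transform from the simplex S^{p-1} to R^{p-1},
    using the last component as the reference: y_j = log(rho_j / rho_p)."""
    rho = np.asarray(rho, dtype=float)
    return np.log(rho[:-1] / rho[-1])

def alr_inv(y):
    """Inverse transform back to the simplex:
    rho_j = exp(y_j) / (1 + sum_l exp(y_l)), rho_p = 1 / (1 + sum_l exp(y_l))."""
    e = np.exp(np.asarray(y, dtype=float))
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

rho = np.array([0.2, 0.3, 0.5])
y = alr(rho)             # unconstrained coordinates in R^{p-1}
rho_back = alr_inv(y)    # recovers the original composition
```

The round trip recovers the original composition, which is why multivariate normality can be assumed on the unconstrained scale.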
In this article, we achieve a compromise between the frequentist method and the fully Bayesian procedure by assuming that the microbial composition $\rho_i$ follows an LN distribution whose hyperparameters are set to specific values determined by maximizing the marginal likelihood function. This approach is known as empirical Bayes in the statistics literature (Efron and Hastie 2016) or evidence approximation in the machine learning literature (Bishop 2006).
To derive the marginal likelihood, we first calculate the joint distribution of $x_i$ and $\mu_i$, and then integrate out $\mu_i$. The resulting distribution for $x_i$ is called the logistic normal multinomial (LNM) distribution (Xia et al. 2013). Plugging the estimated parameters into the posterior distribution of $\rho_i$ given $x_i$, we can make predictions for microbial compositions using either the posterior mean or mode. Besides extra-multinomial variability, the LNM distribution inherits the rich dependence structure of the LN distribution to describe the relationships among microbial taxa. Because the LNM likelihood does not have a closed-form expression, exact maximum likelihood estimation and inference are infeasible, and Monte Carlo EM algorithms are typically used instead (Xia et al. 2013; Zhang and Lin 2019).

A Zero-Inflated Probabilistic PCA Model
Microbiome datasets are high-dimensional and sparse. Furthermore, it has been empirically observed that microbial compositions across samples often exhibit a low-rank structure (Faust et al. 2012; Martino et al. 2019; Cao, Zhang, and Li 2020). Therefore, we consider a zero-inflated probabilistic PCA (ZIPPCA) model of the form $z_{ij} \stackrel{\mathrm{ind}}{\sim} \mathrm{Bern}(\eta_j)$, $f_i \stackrel{\mathrm{ind}}{\sim} N(0, I_k)$, and $x_i \mid \rho_i \stackrel{\mathrm{ind}}{\sim} \mathrm{MN}(M_i, \rho_i)$ with $\rho_{ij} \propto (1 - z_{ij}) \exp(\beta_{0j} + f_i^\top \beta_j)$, where ind stands for independently distributed, and Bern, N, and MN denote the Bernoulli, normal, and multinomial distributions, respectively. The $z_{ij}$ are latent indicators for excess zeros, and the $\eta_j$ are the corresponding probabilities of zero-inflation. Here, we assume implicitly that $x_{ij} = 0$ if $\rho_{ij} = 0$. The latent vectors $f_i$ represent unobserved environmental factors, the coefficient vectors $\beta_j$ denote the corresponding factor loadings, and the terms $f_i^\top \beta_j$ capture the correlations across taxa.
Let $S_{i0} = \{j : \rho_{ij} = 0\}$ and $S_{i1} = \{j : \rho_{ij} > 0\}$. Given $\rho_i$, this model assumes that some of the microbial counts in a sample are structural zeros, $x_{ij} = 0$, $j \in S_{i0}$, and all the others in this sample, $x_{i1} = \{x_{ij}, j \in S_{i1}\}$, jointly follow a multinomial distribution with index $M_i$ and vector of probabilities $\rho_{i1} = \{\rho_{ij}, j \in S_{i1}\}$. Furthermore, marginally $x_{i1}$ has an LNM distribution. Nevertheless, this marginal distribution is restricted in the sense that $\rho_{i1}$ is related to a k-vector of latent or unobserved variables $f_i = (f_{i1}, \ldots, f_{ik})^\top$ via the log-ratio transformation with intercepts $\beta_{0j}$ and coefficient or loading vectors $\beta_j = (\beta_{j1}, \ldots, \beta_{jk})^\top$, where we assume without loss of generality that, for some reference $j \in S_{i1}$, $\beta_{0j} = 0$ and $\beta_j$ is a zero vector. Typically, $k \ll p$, and so the specification allows a more parsimonious explanation of the dependences between microbial taxa. We call this model ZIPPCA-LNM. The underlying compositions are then given by $\rho_{ij} = (1 - z_{ij}) \exp(\beta_{0j} + f_i^\top \beta_j) / \sum_{l \in S_{i1}} \exp(\beta_{0l} + f_i^\top \beta_l)$.
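To make the generative mechanism concrete, the following sketch simulates counts from a ZIPPCA-LNM-type model (Python rather than the paper's R; the specific choices of beta0, B, and eta are our own illustrative values, not the paper's simulation design):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, M = 50, 30, 2, 1000   # samples, taxa, latent rank, sequencing depth

# Model parameters (illustrative values, not estimates).
beta0 = rng.normal(0.0, 1.0, p)      # intercepts
B = rng.uniform(-1.0, 1.0, (p, k))   # factor loadings
eta = np.full(p, 0.2)                # zero-inflation probabilities

X = np.zeros((n, p), dtype=int)
for i in range(n):
    z = rng.binomial(1, eta)         # structural-zero indicators z_ij
    f = rng.normal(0.0, 1.0, k)      # latent factors f_i ~ N(0, I_k)
    w = np.exp(beta0 + B @ f) * (1 - z)
    rho = w / w.sum()                # composition supported on non-structural-zero taxa
    X[i] = rng.multinomial(M, rho)

# Structural zeros force x_ij = 0; sampling zeros can also occur
# when rho_ij is small relative to the sequencing depth M.
```

The resulting count matrix mixes the two kinds of zeros that the classification VA algorithm later tries to distinguish.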

Classification Variational Approximation
Let $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})^\top$, $B = (\beta_1, \ldots, \beta_p)^\top$, $\eta = (\eta_1, \ldots, \eta_p)^\top$, and $\Theta = \{\beta_0, B, \eta\}$. Let $l(\Theta)$ denote the log-likelihood function for ZIPPCA-LNM. The maximum likelihood estimate of $\Theta$ is $\hat{\Theta} = \arg\max_\Theta l(\Theta)$. Maximizing the log-likelihood is hindered by the fact that the integrals involved in $l(\Theta)$ do not have a closed form. To address this, we may use the EM algorithm to obtain a tractable approximation to the log-likelihood. However, because the conditional distribution of the latent variables $(f_i, z_i)$ given the observed data $x_i$ cannot be derived analytically, a standard EM algorithm is not possible, and one must resort to computationally intensive variants such as Monte Carlo EM (Xia et al. 2013; Wang, Yang, and Zhao 2019).
We use instead a variational approximation (VA) approach (Ormerod and Wand 2010; Blei, Kucukelbir, and McAuliffe 2017). Treating $(f_i, z_i)$ as missing data, the complete-data log-likelihood takes the form of a sum of independent contributions over samples. Here and throughout the article, all quantities constant with respect to the parameters have been omitted. Let $q(f_i, z_i)$ be an arbitrary density function of the latent variables $(f_i, z_i)$. The central idea of VA is to utilize the fact that the Kullback-Leibler (KL) divergence between $q(f_i, z_i)$ and the posterior $p(f_i, z_i \mid x_i)$ is nonnegative, and hence, $\log p(x_i; \Theta) \ge E_q\{\log p(x_i, f_i, z_i; \Theta) - \log q(f_i, z_i)\}$. The lower bound can also be derived using Jensen's inequality. The method of VA is then concerned with the problem of optimizing the log-likelihood bound over a chosen family of densities, which in turn amounts to minimizing the KL divergence between $q(f_i, z_i)$ and $p(f_i, z_i \mid x_i)$ over this family.
Restrictions are often imposed on $q(f_i, z_i)$ to achieve computational tractability. In this article, we consider the mean-field-type variational family by assuming that $q(f_i, z_i) = q(f_i) \prod_{j=1}^{p} q(z_{ij})$, where $q(f_i)$ is the $N(m_i, \Sigma_i)$ density and $q(z_{ij})$ is the Bernoulli density with probability $\pi_{ij}$. Denote the resulting lower bound for sample i by $l_i(\Theta, \Lambda)$, where $\Lambda$ collects the variational parameters. We maximize $\sum_{i=1}^{n} l_i(\Theta, \Lambda)$ over $\Lambda$ to approximate the log-likelihood, and then over $\Theta$ to obtain a VA solution for maximum likelihood estimation. Let $\hat{\Lambda} = \{\hat{\pi}_{ij}, \hat{m}_i, \hat{\Sigma}_i\}$ and $\hat{\Theta} = \{\hat{\beta}_0, \hat{B}, \hat{\eta}\}$ denote the maximizer of $\sum_{i=1}^{n} l_i(\Theta, \Lambda)$. Our empirical Bayes estimates of the underlying compositions are obtained by plugging $\hat{\Theta}$ and $\hat{\Lambda}$ into the expression for $\rho_i$. Direct maximization of the variational lower bound is difficult numerically, because of the interaction between model parameters and variational parameters and because of the sum of terms inside the logarithm. We propose maximizing $\sum_{i=1}^{n} l_i(\Theta, \Lambda)$ in an alternating fashion by updating $\{\pi_{ij}\}$, $\{m_i, \Sigma_i\}$, and $\Theta$ in turn, and iterating the three steps until convergence. This is outlined in Algorithm 1. More details on the algorithm are given in the Appendix, supplementary materials.
To improve speed and convergence, we use two tricks in the optimization. First, we exploit the connection between the multinomial and Poisson distributions (Baker 1994; Efron and Hastie 2016), which replaces the multinomial likelihood by independent Poisson likelihoods with sample-specific size parameters. [Algorithm 1 (classification variational approximation): initialize the variational and model parameters and set s = 0; while the variational lower bound has not converged, update $\{\pi_{ij}\}$, $\{m_i, \Sigma_i\}$, and the model parameters in turn; see the Appendix, supplementary materials for details.] Second, motivated by the resemblance between the last term in $l_i(\Theta, \Lambda)$ and the log-likelihood in a mixture model, and inspired by classification approaches to clustering (Celeux and Govaert 1992), we add a classification step on the basis of $\hat{\pi}_{ij}$. For a prespecified threshold $\pi_0$, we set $\hat{z}_{ij} = 1$ if $\hat{\pi}_{ij} > \pi_0$ and $\hat{z}_{ij} = 0$ otherwise. This is a form of hard assignment. When updating the other parameters, rather than including all zero counts or discarding all of them, we extract information from zeros that are most likely to be sampling zeros. Our software uses 0.5 as a default threshold.
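The classification step can be illustrated in a few lines (Python, with made-up posterior probabilities; in the actual algorithm the $\pi_{ij}$ come from the variational updates):

```python
import numpy as np

# Variational posterior probabilities that each zero count is a structural
# zero (illustrative values, standing in for the current pi_ij).
pi_hat = np.array([0.05, 0.40, 0.55, 0.90])

# Classification step: hard-assign z_ij by thresholding at pi_0 = 0.5,
# so only zeros classified as sampling zeros inform the other updates.
pi0 = 0.5
z_hat = (pi_hat > pi0).astype(int)
print(z_hat)   # [0 0 1 1]
```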
By construction, the sequence of values of the variational lower bound is nondecreasing. Consequently, the variational lower bound converges to some limiting value. However, since the optimization problem is not concave, there will generally be multiple local maxima, and the algorithm is not guaranteed to find the largest of these local maxima. Our limited experience shows that the algorithm often converges within a reasonable time. Compared with variational approximation, the Laplace approximation and penalized quasi-likelihood are known to suffer from convergence problems (Hui et al. 2017; Hui, Müller, and Welsh 2018).

Theoretical Properties
To our knowledge, there is surprisingly little existing theory either on multivariate zero-inflated models or on variational approximation estimation. Inspired by the recent work of Westling and McCormick (2019), we connect our VA estimator to an M-estimator, and study the consistency and asymptotic normality from the perspective of M-estimation (van der Vaart 2000), with detailed proofs and remarks given in the Appendix, supplementary materials. The following result is due to a connection between the Poisson and multinomial distributions.
Lemma 5.1. The VA optimization problem under ZIPPCA-LNM amounts to maximizing an objective $v(\Theta, \Lambda; x)$ in which the multinomial terms are replaced by independent Poisson terms with sample-specific size parameters $\alpha_{i0}$; the explicit form is given in the Appendix, supplementary materials. As is clear in the proof, the $\alpha_{i0}$ are sample-specific parameters of the underlying Poisson distributions.

Consistency
Let $\hat{\Theta} = \{\hat{\beta}_0, \hat{B}, \hat{\eta}\}$ denote the global solution of the variational lower bound. One way to see the objective for the VA estimate is as the VA lower bound profiled over $\{m_i, \Sigma_i, \alpha_{i0}\}$, that is, $\mu(\Theta; x) = v(\Theta, \delta(\Theta, x); x)$, where $\delta(\Theta, x)$ denotes the optimal $\{m_i, \Sigma_i, \alpha_{i0}\}$ as a function of $\Theta$. This allows us to use the M-estimation framework for the VA estimator. The implicit function theorem implies that for each x, the map $\Theta \mapsto \mu(\Theta; x)$ is continuous, and that for every ball B, $\sup_{\Theta \in B} \mu(\Theta; x)$ is measurable.
As is noted in the Appendix, supplementary materials, condition (C1) is satisfied in the absence of zero-inflation, that is, $\eta_j = 0$ for all j. A sufficient condition for (C1) is that $(1 - w_j)\eta_j + (1 - \eta_j)\exp(-w_j) \ge 0$ for all j, where $w_j = \exp(\alpha_0 + \beta_{0j} + m^\top \beta_j + \beta_j^\top \Sigma \beta_j / 2)$. In general, this condition implies an interaction between the underlying compositions and the zero-inflation.

Asymptotic Normality
Let $D^2_{\bullet}$ denote the second derivative operator with respect to $\bullet$ and let $\Theta^*$ be a point of maximum of $E_0\{\mu(\Theta; X)\}$. We make three additional assumptions, (C2)-(C4), stated in full in the Appendix, supplementary materials; (C2) involves a constant $C \in (0, 1)$ and holds for all j, and (C4) involves a square-integrable envelope function $b_3(x)$ and a remainder term $r_x$ defined in the Appendix, supplementary materials.
Conditions (C2) and (C3) hold also in the absence of zeroinflation.For a discussion on condition (C4), see Westling and McCormick (2019).More discussions on (C1)-(C4), as well as on identifiability issues, can be found in the Appendix, supplementary materials.
Theorem 5.2. Assume that conditions (C1)-(C4) hold. Then $\sqrt{n}(\hat{\Theta} - \Theta^*)$ converges in distribution to a mean-zero normal limit with the sandwich covariance matrix $A(\Theta^*)^{-1} B(\Theta^*) A(\Theta^*)^{-1}$, where $A(\Theta^*) = E_0\{D^2_\Theta \mu(\Theta^*; X)\}$ and $B(\Theta^*) = E_0[D_\Theta \mu(\Theta^*; X)\{D_\Theta \mu(\Theta^*; X)\}^\top]$. The sandwich formula for constructing an asymptotic covariance matrix for the variational estimator, and a numerical illustration of the coverage of variational confidence intervals, are discussed in the Appendix, supplementary materials. Although our theoretical results shed light on the properties of the VA estimator, much work needs to be done from a possibly different perspective than M-estimation, so as to fully understand the proposed methodology. Recently, Wang and Blei (2019) established frequentist properties for variational Bayes rather than maximum likelihood under the empirical Bayes framework considered here. Their results rest on Assumption 1.3, the local asymptotic normality expansion of the variational log-likelihood. The authors used model-specific analysis to verify this condition, and discussed sufficient conditions and general proof strategies. It is known that maximum likelihood estimators in smooth parametric models are asymptotically normal, and the convergence of the local experiments to a normal limit experiment gives an insightful explanation of this fact; see sec. 7.4 of van der Vaart (2000). It would be interesting to apply the conditions and proof strategies of Wang and Blei (2019) to obtain asymptotic results.

Simulation Study
We conducted a simulation study to investigate the effectiveness of our proposed method (denoted by zlnm). Four simulation settings were considered, where count data were sampled from either a zero-inflated model or the multinomial model without zero-inflation. We also used these settings to compare the performance of zlnm to that of five currently available methods for composition estimation: (a) a smoothed version of raw proportions obtained by replacing zeros with 0.5 and then renormalizing each sample to sum to one. Zero replacement (zr) is simple and widely used (Aitchison 1986), but it is applied to individual sample points; (b) singular value thresholding (svt), which is a version of matrix denoising (Donoho and Gavish 2014); (c) a Poisson-multinomial regularized (pmr) estimator (Cao, Zhang, and Li 2020), which is a reduced-rank version of raw proportions that borrows strength among samples; (d) a Bayesian method based on Dirichlet multinomial mixtures (dmm) without zero-inflation (Holmes, Harris, and Quince 2012); and (e) a recent method called zero-inflated Poisson factor analysis (zipfa) (Xu, Demmer, and Li 2021).
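Method (a) is simple enough to sketch directly (Python rather than R; zero_replace is our own illustrative name):

```python
import numpy as np

def zero_replace(x, pseudo=0.5):
    """Smoothed raw proportions: replace zero counts with a pseudo count
    and renormalize each sample (row) to sum to one."""
    x = np.asarray(x, dtype=float)
    x = np.where(x == 0, pseudo, x)
    return x / x.sum(axis=1, keepdims=True)

x = np.array([[10, 0, 5, 5],
              [0, 0, 12, 8]])
rho_zr = zero_replace(x)   # every entry is now strictly positive
```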
We evaluated the finite-sample performance based on computation time, and on the difference between the true underlying and estimated compositions as calculated using the following four criteria: (1) squared Frobenius norm error; (2) average Kullback-Leibler divergence; (3) Shannon's index mean squared error; and (4) Simpson's index mean squared error. These measures are commonly used for comparing different methods of composition estimation (Cao, Zhang, and Li 2020).
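The four criteria can be sketched as follows (Python; the function names are ours, and the small epsilon guarding the logarithms is an implementation choice, not part of the paper's definitions):

```python
import numpy as np

def frobenius_error(rho_true, rho_hat):
    """Squared Frobenius norm of the estimation error matrix."""
    return np.sum((rho_true - rho_hat) ** 2)

def avg_kl(rho_true, rho_hat, eps=1e-12):
    """Average (over samples) Kullback-Leibler divergence KL(true || estimate)."""
    r, h = rho_true + eps, rho_hat + eps
    return np.mean(np.sum(r * np.log(r / h), axis=1))

def shannon_mse(rho_true, rho_hat, eps=1e-12):
    """Mean squared error of Shannon's diversity index across samples."""
    H = lambda r: -np.sum((r + eps) * np.log(r + eps), axis=1)
    return np.mean((H(rho_true) - H(rho_hat)) ** 2)

def simpson_mse(rho_true, rho_hat):
    """Mean squared error of Simpson's diversity index across samples."""
    S = lambda r: 1.0 - np.sum(r ** 2, axis=1)
    return np.mean((S(rho_true) - S(rho_hat)) ** 2)

# Tiny worked example with two samples and three taxa.
rho_true = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
rho_hat  = np.array([[0.4, 0.4, 0.2], [0.2, 0.1, 0.7]])
```

All four criteria are zero when the estimate equals the truth, and larger values indicate worse recovery of the compositions.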

Data Generated from the ZIPPCA-LNM Model
To first obtain insights into the operating characteristics of the proposed method, we used a simulation setting where count data were generated from the ZIPPCA-LNM model (1), with k = 2 latent variables. Intercepts $\beta_{0j}$ were set to 2, and elements of loading vectors $\beta_j$ were sampled from U(−3.5, 3.5).
To investigate the effects of sample size (n), dimension (p), zero-inflation ($\eta_j$), and sequencing depth ($M_i$) on the performance, we considered two different combinations of (n, p), namely (50, 100) and (100, 50), each with three instances of ($\eta_j$, $M_i$). Details on the simulation setup are listed in the upper panel of Table 1.
[Table 1. Specification of parameters in the zero-inflated models examined in Sections 6.1 and 6.2. The proportions of zero counts were about 50% across the configurations, and the percentages of structural zeros among zero counts in cases (a)-(c) were around 20%, 50%, and 80%, respectively. $M_i/\alpha_{i0}$ refers to "$M_i$" or "$\alpha_{i0}$," with $M_i$ denoting the total count (size parameter) of ZIPPCA-LNM and $\alpha_{i0}$ the sample-specific parameter of ZIPPCA-NB or ZIPPCA-Poi.]
The simulation results over 100 replications are summarized in Table 2. Overall, our proposed method zlnm performed the best in estimating the underlying compositions, followed by zr, svt, and dmm. In the presence of zero-inflation, pmr performed poorly, and it also took much longer to fit the data. Failing to deal with zero-inflation and over-dispersion is the main reason for the loss of efficiency of pmr. We also see that svt slightly outperformed zr, and that these two algorithm-based methods were much faster than their model-based counterparts. Finally, except for zipfa, the performance of all methods got worse as the level of zero-inflation increased, especially for dmm, underscoring the importance of addressing zero-inflation. On the other hand, the poor performance of zipfa is largely due to the fact that the zipfa model is inherently different from the ZIPPCA model. In this simulation setting, the proportions of zero counts were close to 50% across different cases. In the Appendix, supplementary materials, we considered a setting in which the proportions were about 70%. The conclusions were qualitatively similar.

Data Generated from Other Zero-Inflated Models
In this section, we simulated count data from the following zero-inflated model: $z_{ij} \sim \mathrm{Bern}(\eta_j)$, $f_i \sim N(0, I_k)$, and, given $z_{ij} = 0$, $x_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi_j)$ with $\log \mu_{ij} = \alpha_{i0} + \beta_{0j} + f_i^\top \beta_j$, while $x_{ij} = 0$ if $z_{ij} = 1$, where NB denotes the negative binomial distribution, and $\phi_j$ is the dispersion parameter. We call this model ZIPPCA-NB. Set k = 2 and $\phi_j = 0.1$. We also considered the use of the Poisson distribution as the underlying count distribution, and refer to it as ZIPPCA-Poi. This amounts to taking $\phi_j = 0$. As in the previous simulation study, two combinations of (n, p) were investigated, each with three instances of ($\alpha_{i0}$, $\eta_j$). Further details on the simulation setup are given in the middle and lower panels of Table 1. In either ZIPPCA-Poi or ZIPPCA-NB, we define the underlying compositions as $\rho_{ij} = (1 - z_{ij})\mu_{ij} / \sum_{l=1}^{p} (1 - z_{il})\mu_{il}$. This definition exploits the Poisson-multinomial transformation.
Tables 3 and 4 summarize the results from 100 replications. The observations were similar to those in the previous setting, except that pmr was competitive with zr and svt. We also studied the impact of zero proportion on the performance. Additional simulation results in the Appendix, supplementary materials support the superiority of zlnm. In many common studies, samples are drawn from distinct groups (e.g., antibiotics versus no antibiotics) and the underlying compositions may be very different. Also, zero values may arise through complex dependency structures between taxa. For example, after antibiotics are given, large communities of microbes may be killed, leaving only those that have developed some measure of antibiotic resistance. It thus may be common that $\eta_j$ varies greatly from OTU to OTU. So far, metadata (that is, information about samples) have not been utilized in the analysis. Nevertheless, the flexibility of our regression-type framework allows us to incorporate covariates or confounding factors. We have updated our R package so that users can choose to use such information. In the Appendix, supplementary materials, we conducted a simulation study to investigate the performance of our method, either without covariate adjustment (zlnm) or with covariate adjustment (zlnm-cov). To be more realistic, we generated $\eta_j$ randomly and uniformly on (0, 0.6). Simulation results therein show that zlnm and zlnm-cov performed on average better than the other methods, and, as expected, zlnm-cov outperformed zlnm in the presence of confounding factors, while behaving similarly in the case with no covariates.

Data Generated from the Simulation Setting of Cao, Zhang, and Li (2020)
For a fair comparison of our method with others, we adopted the simulation setting from Cao, Zhang, and Li (2020), in which count data were drawn from a multinomial model with no zero-inflation, where $r_i$, $u_{ir}$, $v_{jr}$, and the noise terms were sampled from U(1, 10), N(0, 1), I(j = r)Bern(0.3) + I(j = r), and N(0, 10^{-6}), respectively. The data-generating process was repeated until $z_{ij}$, and hence $\rho_{0ij}$, were positive. We set k = 20, $\gamma \in \{1, 5\}$, and examined two combinations of sample size and dimension, (n, p) = (100, 50) and (n, p) = (100, 200).
From Table 5, we can see that dmm performed the best, and pmr performed well as expected, although it was computationally the slowest. Furthermore, zlnm was competitive with pmr and dmm in terms of squared Frobenius norm error and average Kullback-Leibler divergence, and it was superior to pmr in terms of the other two measures of estimation accuracy. These results, together with those in the previous settings, demonstrate the overall superiority and robustness of zlnm.

Rank Selection
The rank or number of factors, k, was so far assumed to be known. When there is no prior information about k, as is usually the case, we need to select it in a data-driven fashion. There are in principle at least three ways to address this problem. The first is to use cross-validation, the second is based on using a test in a sequential manner, and the third is to use an information criterion such as the Bayesian information criterion (BIC).
In this section, we examined the performance of a BIC-type criterion. For a candidate rank set W and $w \in W$, the criterion takes the form $\mathrm{BIC}(w) = -2 l_w + \log(n \times p) g(w)$, where $l_w$ is the log-likelihood function, and $g(w)$ is the number of parameters. In our case, as a function of w, $g(w) = wp - w^2 + 2wn + 2p + np$. Note that the log-likelihood does not have a closed-form expression. We approximated it using the variational lower bound returned by the classification VA algorithm.
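A sketch of this rank-selection rule (Python; the candidate lower-bound values are illustrative numbers, not output of the classification VA algorithm):

```python
import numpy as np

def zippca_bic(loglik_w, w, n, p):
    """BIC-type criterion BIC(w) = -2 l_w + log(n*p) g(w), with the
    parameter count g(w) = w*p - w**2 + 2*w*n + 2*p + n*p as in the text.
    l_w is approximated by the variational lower bound at rank w."""
    g = w * p - w ** 2 + 2 * w * n + 2 * p + n * p
    return -2.0 * loglik_w + np.log(n * p) * g

# Pick the rank minimizing BIC over a candidate set W.
n, p = 100, 50
W = [1, 2, 3, 4]
lower_bounds = {1: -23000.0, 2: -21000.0, 3: -20990.0, 4: -20985.0}
k_hat = min(W, key=lambda w: zippca_bic(lower_bounds[w], w, n, p))
```

With these illustrative values, the large likelihood gain from rank 1 to 2 outweighs the penalty, while the tiny gains beyond rank 2 do not, so the criterion selects rank 2.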
We applied the BIC criterion to infer the rank in the simulation models in Sections 6.1 and 6.2, where the true rank is k = 2. Figure 1 shows the frequencies of the estimated rank ($\hat{k}$) over 100 data replications. We can see that accurate inference on the number of factors is possible, even under model misspecification, when the degree of zero-inflation was mild or moderate. In the ZIPPCA-NB model, the BIC criterion tended to over-estimate the rank by one, especially in the presence of marked zero-inflation. From a dimension-reduction point of view, over-estimation rather than under-estimation by one is likely tolerable, since the resulting model will still be correct. In the Appendix, supplementary materials, we also used the method of cross-validation (CV), in the same way as in Xu, Demmer, and Li (2021), to estimate the number of factors. The results of using 5-fold CV show that, compared with BIC, CV tended to over-estimate the number of factors.

HMP Data Analysis
In this section, we present results of applying our method and its competitors to data from the Human Microbiome Project (Methé et al. 2012). In this project, specimens were collected from a cohort of healthy individuals at 15 or 18 body habitats within five major body areas, and they were used for 16S rRNA gene analysis via 454 pyrosequencing of the V1-V3 or V3-V5 hypervariable regions across four sequencing centers.
In our study, we used the V3-V5 samples from one participating sequencing center (the J. Craig Venter Institute). For illustration, we selected three habitats within the skin (right antecubital fossa, left antecubital fossa, and left retroauricular crease) and three oral sites (subgingival plaque, saliva, and throat). We then analyzed data at the genus level. After removing samples or features whose total abundances were zero, we were left with 318 genera and 347 samples (167 from females and 180 from males).
Fitting the ZIPPCA-LNM model while controlling for gender resulted in BIC choosing five factors. Figure 2 shows estimated microbial compositions of 318 genera (rows) and 347 samples (columns) from different methods. We see that, as expected, zr performed much like svt. Compared with other methods, pmr failed to extract meaningful patterns and trends. In particular, the variability of microbial composition between skin and oral cavity, and within each of these two regions, was not clearly evident for pmr, and hence the result of pmr was inconsistent with previous findings (Huttenhower et al. 2012). In contrast, the results of zr, svt, zlnm-cov, dmm, and zipfa all reasonably suggest that microbial compositions were more stable within a body area than across body areas. This is especially the case for the right and left antecubital fossae and for saliva and throat. To further compare these methods, we calculated the within-sample alpha diversity using the relative abundances. We examined Shannon's index ($-\sum_{j=1}^{p} \rho_{ij} \log \rho_{ij}$) and Simpson's index ($1 - \sum_{j=1}^{p} \rho_{ij}^2$). The results are summarized in Figure 3. We can see that pmr greatly underestimated the variability of diversity and so was less dependable than other methods. The performance of zr was similar to that of svt. For both methods, the retroauricular crease had a lower estimated alpha diversity than oral communities, but the diversity in the antecubital fossae was alarmingly higher. dmm had similar results, except that the difference between the antecubital fossae and the oral cavity was much reduced. Interestingly, zlnm-cov and zipfa behaved in the opposite way from zr, svt, and dmm. The results of zlnm-cov and zipfa seem more reliable because it was previously observed that the alpha diversity in the skin was lower than that in the oral cavity (Huttenhower et al. 2012). Compared with zipfa, the difference between the skin and the oral cavity revealed by zlnm-cov was more apparent. To see why zr was unreliable, we note that the overall proportion of zeros was about 90%, and hence the ad hoc replacement of zeros by 0.5 distorted the alpha diversity in the skin. In the Appendix, supplementary materials, pairwise comparisons among the six body habitats, in terms of the alpha diversity estimated by zlnm-cov, were performed using t-tests. As expected, there was a significant difference between skin and oral communities, but within each body region the evidence that two habitats were significantly different was not convincing.
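To make the two diversity indices above concrete, and to illustrate why the zr-style replacement of zeros by 0.5 inflates diversity in sparse samples, here is a minimal Python sketch (toy counts, not the HMP data; `zero_replace` is a hypothetical stand-in for the zr procedure):

```python
import math

def shannon(p):
    # Shannon's index: -sum_j p_j * log(p_j), with 0 * log(0) taken as 0
    return -sum(x * math.log(x) for x in p if x > 0)

def simpson(p):
    # Simpson's index: 1 - sum_j p_j^2
    return 1.0 - sum(x * x for x in p)

def zero_replace(counts, pseudo=0.5):
    # zr-style composition: replace zero counts by a pseudo-count, renormalize
    adj = [c if c > 0 else pseudo for c in counts]
    total = sum(adj)
    return [c / total for c in adj]

# Toy sparse sample: one observed taxon among many unobserved ones.
# The true composition (all mass on taxon 1) has Shannon index 0, but
# replacing the nine zeros by 0.5 spreads mass over all ten taxa.
counts = [100, 0, 0, 0, 0, 0, 0, 0, 0, 0]
comp = zero_replace(counts)
```

With 90% of entries being zeros, as in the skin samples, this spurious mass dominates the index, which is consistent with the distortion described above.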
We also followed the common practice of filtering out taxa based on prevalence to reduce the number of zeros, removing any samples or features whose proportions of nonzeros were less than 0.05. After this preprocessing, the overall zero proportion was reduced to 65%. The corresponding results are shown in the Appendix, supplementary materials. Compared with Figure 3, we see that the conclusions remain virtually the same. However, one drawback of data filtering is that it often discards many samples and hence may lead to a loss of efficiency. In our example, the number of skin samples decreased from 134 to 88. We thus chose, and also recommend, not to filter the data, because ZIPPCA deals with zeros directly.
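The prevalence filter described above can be sketched as follows. This is a minimal illustrative implementation, assuming a samples-by-taxa count matrix stored as a list of lists; the 0.05 threshold matches the text:

```python
def prevalence_filter(X, threshold=0.05):
    """Drop taxa (columns) and then samples (rows) whose proportion of
    nonzero entries falls below `threshold`."""
    n, p = len(X), len(X[0])
    # keep taxa observed in at least `threshold` of the samples
    keep_cols = [j for j in range(p)
                 if sum(1 for i in range(n) if X[i][j] > 0) / n >= threshold]
    X = [[row[j] for j in keep_cols] for row in X]
    # keep samples with at least `threshold` of the remaining taxa observed
    keep_rows = [i for i, row in enumerate(X)
                 if sum(1 for c in row if c > 0) / len(row) >= threshold]
    return [X[i] for i in keep_rows]
```

As noted above, such filtering can discard many samples (here, any sample left with too few observed taxa), which is why the analysis ultimately avoids it.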

Discussion
The analysis of microbiome data is challenging due to the inherently compositional nature of the data. Recently, composition-aware methods have been proposed to address this problem of compositionality. However, since microbiome data are sparse, containing an excessive number of zeros, converting raw abundance data into compositions or relative abundances is susceptible to bias and can be problematic. Microbial composition estimation is further complicated by other data characteristics, including the large number of taxa, over-dispersion of sequence read counts, and complicated dependence patterns among microbes. To overcome these challenges, we have introduced a zero-inflated model, ZIPPCA-LNM, that extends probabilistic PCA from the Gaussian setting to multivariate abundance data, and proposed an empirical Bayes approach for inferring microbial compositions. An efficient VA algorithm, classification VA, has been developed for fitting this model. We have demonstrated the superior performance of the proposed method over existing methods using simulated examples and an HMP dataset across six body habitats within two major body regions.
From a theoretical perspective, methods for latent variable models with discrete data usually assume that the number of units of analysis tends to infinity while the number of features is fixed. As with these methods, we adopted the traditional asymptotic reasoning of letting the sample size n → ∞ with the number of taxa p fixed. We used the argument of Westling and McCormick (2019) to build a connection between variational approximation and profile M-estimation. Our theoretical results were built on the consistency and asymptotic normality results of van der Vaart (2000) and Westling and McCormick (2019).
Note that most published papers on latent variable models with discrete data have discussed only the applied and computational aspects. More recently, Jentsch, Lee, and Mammen (2021) studied Poisson reduced-rank models with an application to political text data, and developed asymptotic theory in high-dimensional regimes in which both dimensions diverge to infinity. To our knowledge, this is one of the first papers to develop a rigorous theory suitable for such high-dimensional Poisson regimes. However, there are two differences between our framework and theirs. First, the Poisson reduced-rank model fails to account for the excess of zeros frequently observed in sequencing studies. Second, and more importantly, the factors in their model are unknown parameters, and hence are fixed rather than randomly distributed. As such, the technical tools established for Poisson reduced-rank models are not applicable to our problem. Due to the complicated nature of the problem, developing a general theory as both n and p tend to infinity would be a substantial undertaking. Work along this line is in progress.
Variational inference is a widely used method for posterior approximation in which the inference problem is transformed into an optimization problem. However, this optimization is typically nonconvex and challenging to solve. Numerical algorithms such as coordinate ascent are only able to reach a local maximum, and often there is no assessment of how close this local optimum is to the global one. Nevertheless, some recent progress has been made in the literature. For example, Fazelnia and Paisley (2018) proposed a general approach to variational inference based on convex relaxation and semidefinite programming. Specifically, a semidefinite programming relaxation converts a nonconvex polynomial optimization over vector parameters into a convex optimization over matrix parameters via a lifting technique. For Bayesian linear regression and sparse coding models, theoretical results using graph-theoretic tools guarantee tight relaxation bounds that come nearer to the global optimal solution than coordinate ascent, and experimental results show that convex relaxation significantly improves on the coordinate ascent technique (Blei, Kucukelbir, and McAuliffe 2017). There is, of course, a price to be paid for this improvement: the increased computational burden arising from the positive semidefinite constraint. It would be interesting to examine the behavior of convex relaxation for variational approximation under the proposed ZIPPCA framework.
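As a toy illustration of the coordinate-ascent strategy discussed above (a standard conjugate example, not the classification VA algorithm of this paper), consider mean-field variational inference for a normal model with unknown mean and precision, where both coordinate updates are available in closed form:

```python
def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Coordinate-ascent VI for x_i ~ N(mu, 1/tau) with conjugate priors
    mu ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0), under the
    mean-field factorization q(mu, tau) = q(mu) q(tau)."""
    n = len(x)
    xbar = sum(x) / n
    e_tau = a0 / b0                              # initial E_q[tau]
    a_n = a0 + (n + 1) / 2                       # fixed across iterations
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)  # fixed across iterations
    for _ in range(iters):
        # update q(mu) = N(mu_n, 1/lam_n), given the current E_q[tau]
        lam_n = (lam0 + n) * e_tau
        var_mu = 1.0 / lam_n
        # update q(tau) = Gamma(a_n, b_n), given the current q(mu)
        ss = sum((xi - mu_n) ** 2 for xi in x) + n * var_mu
        b_n = b0 + 0.5 * (ss + lam0 * ((mu_n - mu0) ** 2 + var_mu))
        e_tau = a_n / b_n
    return mu_n, e_tau

# Fixed toy data centered near 2 with spread about 1
x = [1.2, 2.9, 1.7, 2.4, 2.0, 1.5, 2.6, 1.9, 2.2, 1.6]
mu_hat, tau_hat = cavi_normal(x)
```

Each update maximizes the evidence lower bound in one coordinate with the other held fixed, so the iterations increase the ELBO monotonically; in this conjugate example the fixed point is essentially unique, but in general, as noted above, only a local maximum is guaranteed.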
There are several future research directions worth exploring. First, while designed for abundance data in microbiome studies, the modeling framework for extracting compositions is applicable to sparse count data from a wide variety of fields, such as RNA-seq data in single-cell gene expression analysis (Pierson and Yau 2015) and word counts in text analysis (Blei and Lafferty 2007). Second, recent datasets often consist of hundreds of samples and hundreds or thousands of features, so computational scalability is increasingly important. In practice, we may reduce the dimensionality by coalescing OTUs. While useful for numerical analysis, this may not be very informative in practice. We carried out some simulations in the Appendix, supplementary materials, with p ∈ {100, 200, 400} and n = 50. We see that on average our method performed well, and that its computational cost grew as the dimension p increased. Nevertheless, extremely high-dimensional settings are currently infeasible, and one direction is to optimize the proposed algorithm using stochastic gradient descent, or to take advantage of modern machine learning architectures. Finally, as pointed out by a referee, we may potentially do well with an algorithm based on stochastic approximation EM and Hamiltonian Monte Carlo (Zhang and Lin 2019). However, due to the complicated nature of the general Monte Carlo technique, extending it to handle zero-inflated multivariate counts is more involved, and developing a theory seems more challenging than in the framework of variational approximation.
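The dimension-reduction step of coalescing OTUs mentioned above amounts to summing counts within a higher taxonomic group. A minimal sketch, with hypothetical OTU counts and genus labels:

```python
from collections import defaultdict

def coalesce(counts, labels):
    """Sum OTU counts that share a higher-level taxonomic label
    (e.g., collapse OTUs to genus); `counts` and `labels` align by index."""
    agg = defaultdict(int)
    for c, lab in zip(counts, labels):
        agg[lab] += c
    return dict(agg)

# Hypothetical OTU table for one sample, with genus-level labels
otu_counts = [12, 0, 3, 7, 0]
genera = ["Streptococcus", "Streptococcus", "Prevotella", "Prevotella", "Neisseria"]
merged = coalesce(otu_counts, genera)
```

This reduces p at the cost of resolution: within-genus variation is lost, which is why such coalescing, while numerically convenient, may not be very informative in practice.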

Figure 1. Choice of the number of factors by the BIC-type criterion in the simulation models in Sections 6.1 and 6.2, where the true rank k = 2. Shown are the frequencies of the estimated rank k̂ over 100 data replications. (a) mild zero-inflation, (b) moderate zero-inflation, and (c) marked zero-inflation.

Table 2. Means of squared Frobenius norm error (A), average Kullback-Leibler divergence (B), Shannon's index mean squared error (C), and Simpson's index mean squared error (D × 10^−2), and the average computation time in seconds, reported for various methods over 100 datasets from the ZIPPCA-LNM model (1), with two combinations of sample size and dimension, (n, p) = (50, 100) and (n, p) = (100, 50).
NOTE: Three cases, all with a proportion of zero counts of about 50% but at different levels of zero-inflation, were investigated: (a) the percentage of structural zeros among zeros was about 20%; (b) the percentage was about 50%; and (c) the percentage was about 80%. zr: zero replacement and renormalization; svt: singular value thresholding; pmr: Poisson-multinomial reduced-rank regularization; zlnm: the proposed VA method for ZIPPCA-LNM; dmm: Dirichlet multinomial mixtures; zipfa: zero-inflated Poisson factor analysis.

Table 4. Means of squared Frobenius norm error (A), average Kullback-Leibler divergence (B), Shannon's index mean squared error (C), and Simpson's index mean squared error (D × 10^−2), and the average computation time in seconds, reported for various methods over 100 datasets from the ZIPPCA-NB model, with two combinations of sample size and dimension, (n, p) = (50, 100) and (n, p) = (100, 50).
NOTE: Three cases, all with a proportion of zero counts of about 50% but at different levels of zero-inflation, were investigated: (a) the percentage of structural zeros among zeros was about 20%; (b) the percentage was about 50%; and (c) the percentage was about 80%. zr: zero replacement and renormalization; svt: singular value thresholding; pmr: Poisson-multinomial reduced-rank regularization; zlnm: the proposed VA method for ZIPPCA-LNM; dmm: Dirichlet multinomial mixtures; zipfa: zero-inflated Poisson factor analysis.