A review on probabilistic models used in microbiome studies

In this paper, we ﬁrst brieﬂy review the background and signiﬁ-cance of the microbiome, the technologies used for collecting microbiome data, and some public resources for downloading microbiome data. We then review the probabilistic models used in the literature in two categories: (1) for read counts from a speciﬁc feature, including Poisson, negative binomial, zero-inﬂated and hurdle models; (2) for read counts from multiple features, including Dirichlet-multinomial, generalized Dirichlet-multinomial, and zero-inﬂated models, as well as a nonparametric Bayesian model for a ﬂexible number of features. We also review comprehensive comparisons among diﬀerent probabilistic models.


Introduction
The microbiome, a dynamic ecosystem of microorganisms (bacteria, archaea, fungi, and viruses) that live in and on us, plays a vital role in host-immune responses resulting in significant effects on host health. Dysbiosis of the microbiome has been linked to diseases including asthma, obesity, diabetes, transplant rejection, and inflammatory bowel disease [7,35,37,45,53]. These observations suggest that modulation of the microbiome could become an important therapeutic modality for some diseases. For example, fecal transplants have been shown to alleviate diarrhea caused by Clostridium difficile infection and temporarily improve insulin sensitivity [16,54]. Specifically, the gut microbiome, which has been the most extensively studied human microbiome ecosystem, is highly diverse and has been shown to include thousands of different bacterial species [10,59]. This diverse community of bacteria is composed of a few species that are highly abundant and a large number of species that are found in trace amount [27]. The human microbiome can be divided into the core microbiome and the variable microbiome [52]. The core microbiome is the set of taxa or genes that present in a given body location (gut, kidney, skin, oral, etc.) in almost all humans. The variable microbiome arises from various factors such as host physiological status, host environment, host genotype, host lifestyle, and host pathobiology. Moreover, given the strong association between microbiome and various diseases, computational models have been built to predict phenotypes from microbial profiles [9,17,40,41].
The goal of this paper is to briefly introduce the microbiome data and the related probabilistic models to people who are interested in microbiome research and the corresponding analysis. In Section 2, we introduce the two technologies, 16S rRNA and metagenome shotgun sequencing, for obtaining microbial profiles of a biological sample. In Section 3, we introduce the typical format of microbiome data and some public resources for downloading microbiome data. In Section 4, we review the probabilistic models for modeling count data from each microbial feature independently. In Section 5, we review the probabilistic models for microbiome data from multiple features. In Section 6, we review a nonparametric Bayesian model for a flexible number of microbiome features. We conclude in Section 7.

Technologies used to study the microbiome
The number of studies investigating the microbiome has risen exponentially since the technological advances in high-throughput sequencing [19]. Sequencing technologies are able to identify the genetic content of microbial communities in the form of millions to billions of short DNA sequences. These technical advances have been paradigm shifting since the majority (>90%) of microbial species cannot be readily cultured using current laboratory culture techniques [32,39,48]. The most common sequencing technologies to analyze the microbiome are 16S rRNA gene sequencing and metagenome shotgun (MGS) sequencing [38].
In 16S rRNA sequencing, a 16S rRNA gene is amplified by polymerase chain reaction (PCR) with primers that recognize the highly conserved regions of the gene [44]. A limitation of this method is that the annotation is based on a putative association of the 16S rRNA gene with taxa defined as an operational taxonomic unit (OTU). In general, OTUs are annotated at higher levels, such as phylum to genus, and can be less precise at the species level. In 16S rRNA sequencing, other bacterial genes are not directly sequenced, but rather predicted based on the OTUs [22]. Due to horizontal gene transfer and the existence of numerous bacterial strains [18,36], the lack of direct gene identification potentially limits understanding of the microbiome.
An alternative sequencing approach is the metagenome shotgun (MGS) sequencing in which random fragments of a genome are sequenced. Compared to 16S rRNA, MGS is more expensive and requires more sophisticated data analysis methods [19,24,25,47]. A major advantage of the MGS sequencing is that the sequence reads can be more accurately annotated at the species level and the microbial functional profile can be constructed from the sequence reads.

Microbiome data resources and data representation
There are several initiatives to store and manage data from microbiome studies in order to make them available and free for everyone to use. The major public servers are MG-RAST (https://www.mg-rast.org/) and QI-ITA (https://qiita.ucsd.edu/). Also, National Center of Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/) is one of the most comprehensive resources that have a database of curated and updated microbial genomes and taxonomic tree.
To analyze these massive amount of sequence data, metagenomic reads are processed for each microbiome sample to construct taxonomic and/or functional profiles [1,2,20,29,51,56]. The taxonomic profiles, functional profiles, or both for all samples, are then combined into one count table (see Table 1 and Table 2 as an example of a toy taxonomic profile) with a dimension of m × n, where m denotes the number of microbial features F 1 , . . . , F m and n denotes the number of metagenomic samples S 1 , . . . , S n . The entry z ij represents the number of reads from sample j that mapped to microbial feature i, while its capitalized version Z ij represents the corresponding random variable. In the table, N j = m i=1 z ij denotes the total number of reads for the m features in sample S j , and z i. = n j=1 z ij denotes the total number of reads mapped to features F i in all samples. Since metagenomic samples may have different sequencing depths, the aggregated metagenomic counts need to be normalized among samples [3]. There are several methods developed to tackle the normalization problem of a count table, such as centered log-ratio (CLR) transformation [11], cumulative sum scaling [33], median-of-ratios scaling factor [23], and trimmed mean of M values [43].   Table   4. Probabilistic models for single feature One of the objectives of the microbiome studies is to determine whether there is a particular microbial signature (e.g., taxa or genes) associated with a particular disease state and/or phenotype. These biomarkers can play an important role in the development of preventive and therapeutic strategies. Differential abundance tests have been developed to identify those microbial features that are significantly different between two phenotypic groups. In this section, we focus on probabilistic models built for sequence read counts from a single microbiome feature, that is, Z ij for feature F i and subject S j . Assuming Z ij follows a probabilistic model with a few unknown parameters, statistical inference can be made based on estimated parameters from the data. In practice, there are two types of experimental designs for microbiome studies: (1) snapshot studies, where each subject provides only one sample, (2) longitudinal studies, which include multiple samples per subject over time.

Models used in snapshot microbiome studies
4.1.1. Poisson model. Poisson distribution has been widely used for modeling non-negative outcomes as a count. If a random feature count Z ij follows a Poisson distribution with mean θ > 0, it assigns the probability for k = 0, 1, 2, . . .. As the mean count increases, the skewness diminishes, and the Poisson distribution becomes approximately a normal distribution. One property of Poisson distribution is that its variance equals the mean. For non-negative count outcomes, a model with Poisson distribution is usually more appropriate than an ordinary least squares linear model [4].

4.1.2.
Negative binomial model. The negative binomial (NB) distribution is an alternative probabilistic model for count data [4]. It is especially useful when the sample variance exceeds the sample mean, known as overdispersion. Given a sequence of independent Bernoulli trials with probability p of success, Z ij is the number of failures observed before the r th success with the probability where r > 0 and 0 p 1 are two parameters that can be estimated from the data. The negative binomial distribution looks similar to Poisson distribution but with a longer, fatter tail. If the observed outcome is suspected to have variance larger than mean, the negative binomial distribution would be more appropriate than either Poisson or normal distribution.

Zero-inflated models.
For microbiome OTU counts, typically there are much more zeros than expected under the assumption of Poisson or negative binomial distributions. This phenomenon is known as zeroinflation. In order to solve this issue, zero-inflated models are used to model read counts that have an excess of zeros. A zero-inflated model assumes that the observed zeros are of two kinds: "sampling" or "structural". The sampling zeros come from a Poisson, negative binomial, or some other distribution due to chance. Other observed zeros are due to some specific structure in the data [14]. As a result, the combined probability under a zero-inflated model is where φ > 0 is a parameter estimated from the data, P (Z ij = k) stands for the probability determined by a Poisson, negative binomial, or other parametric distribution. Note that the zero-inflated model assigns the probability  [15]. In a ZIBB model, the probability P (Z ij = k) in Equation (1) is formulated by a betabinomial distribution. It has two folds: (1) Given a probability p ij , Z ij follows a binomial distribution with parameters N j and p ij ; (2) In order to make the model flexible, the probability p ij itself is also random, which follows a beta distribution with parameters α 1 > 0, α 2 > 0. As a result, the probability based on the beta-binomial distribution is The probability P ZI (Z ij = k) based on the ZIBB model takes the same form as in Equation (1). A relevant R package ZIBBSeqDiscovery is available at the Comprehensive R Archive Network (CRAN, https://CRAN.R-project.org/package= ZIBBSeqDiscovery). Hu, et al. [15] compared ZIBB with ZINB and a few other models using the Gevers microbiome data and concluded that ZIBB shows the highest number of significantly enriched genera. 4.1.5. Hurdle models. Hurdle models, also known as zero-altered models, provide another way of dealing with the excess zeros in OTU counts [4]. A hurdle model consists of two components, one generating the zeros and one generating the positive values. In contrast to zero-inflated models, a hurdle model assumes that all zeros are from the "structural" source. In order to make the comparison clearly, we define the hurdle models using a similar formula as in Equation (1): is known as a zero-truncated Poisson distribution [58]. The hurdle model P ZA (Z ij = k) collapses to the standard model P (Z ij = k) if φ = P (Z ij = 0). It clearly allows for excess zeros when φ > P (Z ij = 0). Different from zero-inflated models, in principle, hurdle models can also model too few zeros when φ < P (Z ij = 0). In other words, hurdle models are more flexible than zero-inflated models.

Comparisons between probabilistic models used in snap-
shot studies. Xu, et al. [57] classified the competing methods for modeling microbiome data into three categories based on how the excess zeros are treated: standard, zero-inflated (ZI), and hurdle models. Standard models do not consider the excess zeros and model the data using a standard distribution, for examples, Poisson or negative binomial distributions. ZI and hurdle models are reviewed in Sections 4.1.3 and 4.1.5, respectively. Xu, et al. [57] compared the performance of different models, including Poisson, ZIP, PH, NB, ZINB, and NBH, through simulation studies and real microbiome data. Their comparison was from different perspectives, including type I error, power, model selection, and goodness of fit. They concluded that: (1) Poisson regression has inflated type I error and may not be appropriate for data with excess zeros; (2) ZI or hurdle models perform consistently well in all scenarios examined in terms of test power; (3) ZINB is more robust than ZIP; (4) In many situations, hurdle models (PH or NBH) produce identical fitting results as their corresponding ZI models (ZIP or ZINB); (5) In terms of Akaike Information Criterion (AIC), NBH and ZINB models perform the best among all fitted models.

Models used in longitudinal microbiome studies
The recent advances in DNA sequencing technologies and rapid reduction in costs have fostered longitudinal analyses, which include multiple samples per subject over time. These longitudinal studies provide increased insights into the underlying biological mechanisms of the microbiome role in health and disease. In addition to identifying differentially abundant features, detecting the time intervals where these features exhibit changes in their abundance between two phenotypes in longitudinal studies adds insights into disease pathogenesis.
Longitudinal differential abundance is to identify time intervals of differentially abundant microbial features. To date, three methods have been proposed, MetaSplines [34], MetaDprof [26], and MetaLonDA [30,31]. MetaSplines and MetaDprof are both based on the Gaussian Smoothing Spline ANOVA (SS-ANOVA) approach [12,13,55]. MetaSplines has a higher sensitivity of detecting time intervals of differentially abundant features than MetaDprof, while MetaDprof has higher specificity [26,31]. MetaDprof has a major drawback in its implementation since it assumes consistency in longitudinal microbial samples. It is only able to perform the analysis on an equivalent number of subjects per phenotypic group, the same number of samples from each subject, and the same elapsed time between adjacent time points, which are rarely fulfilled in human microbiome longitudinal studies.
MetaLonDA (Metagenomic Longitudinal Differential Abundant) is a recently developed method that can identify significant time intervals of differentially abundant microbial features such as taxonomies, genes, or pathways.
MetaLonDA is flexible such that it can perform differential abundance tests on longitudinal samples with different numbers of subjects per phenotypic group, different numbers of samples per subject, and samples that are not collected at consistent time points. These inconsistencies are often the case for samples collected from human subjects in translational studies. Inconsistencies increase with the complexity of the procedure utilized to obtain the samples.
MetaLonDA relies on two modeling components: the NB distribution for modeling the mapped read counts for each feature and the semi-parametric SS-ANOVA technique for modeling longitudinal profiles associated with different phenotypes. Specific significant time intervals of microbial features can then be utilized to establish targeted timely screening or prevention of individual features and facilitate timely interventions, such as the use of antibiotics or probiotics. Unlike with cross-sectional methods that are incapable of identifying significant time intervals associated with differentially abundant features, significant time intervals of differentially abundant features identified through MetaLonDA may lead to the reconstitution of the microbiome and reestablishment of homeostasis prior to the onset of overt disease. MetaLonDA is publicly available on the CRAN repository (https://CRAN.R-project.org/package=MetaLonDA).

Probabilistic models for a group of features in the count table
Modeling multivariate feature counts is becoming important in the microbiome research community because these models are able to retain more information contained in the data. For example, researchers are interested in testing multivariate hypotheses concerning the effects of treatments or experimental factors on the whole assemblages of bacterial taxa, so that they may know the impact of the microbiome on human health and on characterizing the microbial diversity in general [21]. In addition, they may be able to identify relevant microbes that play essential roles in a microbial network by connecting a variety of key features.

Multinomial and Dirichlet multinomial (DM) models
Suppose we have m bacterial taxa, and their counts Z j = (Z 1j , . . . , Z mj ) T from the j th subject. With a given column sum N j and a probability measure p = (p 1 , . . . , p m ) T , a multinomial distribution assigns the following probability to an observed column z j = (z 1j , . . . , z mj ) T of the OTU table In order to make the model more flexible and thus can fit the data better, the probability vector p = (p 1 , . . . , p m ) T is assumed to follow a Dirichlet distribution with a probability density function where γ i > 0, i = 1, . . . , m are parameters of the Dirichlet distribution.
The aggregated distribution is known as a Dirichlet multinomial (DM) distribution, which assigns the following probability The DM distribution is commonly used to model taxon counts. For example, Rosa, et al. [21] used DM to calculate the powers and sample sizes for experimental designs, perform tests of hypotheses (for example, comparison of microbiomes across groups), and estimate parameters describing microbiome properties. Chen and Li [6] used a DM model for developing a penalized likelihood approach to estimate the regression parameters. Nevertheless, several recent studies showed that a multinomial or DM model might not be appropriate for microbiome data [28,46] since those models assume a negative correlation among all paired OTUs. Actually, Mandal, et al. [28] showed that the correlation between a pair of OTUs could be positive as well.

Generalized Dirichlet multinomial (GDM) and zero-inflated generalized Dirichlet multinomial (ZIGDM) models
Dirichlet distribution has been widely used as a conjugate prior for the parameters of a multinomial distribution since the calculation on its posterior distribution is easier. Nevertheless, if a random vector follows a Dirichlet distribution, all its components must have the same variance and sum to one. Motivated by this limitation, a generalized Dirichlet (GD) distribution was introduced by Connor and Mosimann [8] to allow more general covariance structure. The multinomial model with a GD prior is called a generalized Dirichlet multinomial (GDM) model [49]. Under the GDM model, the j th column Z j of OTUs with m observed features (Z 1j , . . . , Z mj ) is modeled as m + 1 features with one additional unobserved feature Z m+1,j . Let the extended column vector Z + j = (Z 1j , · · · , Z mj , Z m+1,j ) T with m i=1 Z ij = N j being the total number of reads in the original j th column. The corresponding proportions for the m + 1 features are p 1 , . . . , p m , p m+1 with m+1 i=1 p i = 1. The random parameter vector p = (p 1 , . . . , p m ) T follows a GD distribution with the density function where α j > 0, β j > 0, c j = β j − α j+1 − β j+1 for j = 1, . . . , m − 1 and c m = β m − 1 (see [49] for more details). The zero-inflated generalized Dirichlet (ZIGD) distribution can be produced by adding a zero-inflated component to the GD distribution [49], and the zero-inflated generalized Dirichlet (ZIGDM) model is constructed by using the ZIGD as a prior for the multinomial parameters. Tang and Chen [49] used real gut microbiome data to compare three models, DM, GDM, and ZIGDM. Their conclusions include: (1) ZIGDM is more flexible for handling the excess zeros and a complex correlation structure; (2) ZIGDM is more robust when the counts are zero-inflated; (3) GDM fits the data better than DM, while ZIGDM can further improve the goodness of fit for zero-inflated counts; (4) ZIGDM seems more appropriate for handling high-dimensional microbiome taxa.

A nonparametric Bayesian model for microbiome data
In a typical OTU count table, the number of rows m is fixed. However, from the collection procedure of microbiome data, the number of potential features is typically unknown, and the number of observed features by the experiments is not predetermined.
Following a nonparametric Bayesian model [42], we may denote F = {F 1 , F 2 , . . .} as the collection of all potential features in the n biological samples. Different from the models in Section 5, the number of features in F could be infinity. For biological sample j, a discrete probability measure P j = (p 1j , p 2j , . . .) on F determines the distribution of the frequencies (Z 1j , Z 2j , . . .). In other words, P j ({F i }) = p ij , i p ij = 1, and E(Z ij ) ∝ p ij .
The goal of the nonparametric Bayesian model [42] is to model the distribution of P j 's and the variation among them. It assigns the probability mass to any subset A of the potential features in F: is the average abundance of feature F i across all biological samples, Z (i) is the row of OTU table associated with feature F i , Z j = (Z 1j , Z 2j , . . .) T is the j th column, · stands for the usual inner product (see [42] for more details).
Ren, et al. [42] constructed a Bayesian framework with dependent Dirichlet process for microbiome analysis. They applied their model to two microbiome datasets. Based on their Bayesian nonparametric method, they obtained the posterior probabilities of any two biological samples being clustered together. Their result are consistent with Caporaso, et al. [5]'s conclusion.

Conclusion
In this paper, we review three types of probabilistic models based on the number of features under discussion, single, a fixed number, and a flexible number of features. Since the microbiome data is often sparse, it may contain much more than expected zeros, which makes the construction of a suitable probabilistic model challenging. Zero-inflated or hurdle model seems more appropriate than a standard model. Among different models for analyzing the differential abundance of a single OTU, ZINB was recommended by Xu, et al. [57], while ZIBB was preferred by Hu, et al. [15].
Moreover, we may model multiple OTUs or features simultaneously. This kind of models may foster our understanding of interactions between species, or even build a network among species. Quite a few models have been proposed for this purpose, including multinomial, Dirichlet multinomial (DM), generalized Dirichlet multinomial (GDM), and zero-inflated generalized Dirichlet multinomial (ZIGDM) models. Although multinomial and Dirichlet multinomial (DM) were commonly used to model multiple OTUs, several recent studies suggested that those two models may not be appropriate for microbiome data [28,46] due to their negative correlations between all paired OTUs. The comparison study was done by Tang and Chen [49] concluded that ZIGDM is more appropriate than the others.
On the other hand, nonparametric Bayesian approach for modeling microbiome data has not been commonly used yet. Recently, Ren, et al. [42]'s work provided a nonparametric Bayesian framework for modeling a flexible number of OTUs. It requires more complicated probabilistic models, such as dependent Dirichlet process.