Large Covariance Estimation for Compositional Data via Composition-Adjusted Thresholding

High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data, and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We apply the proposed method to the analysis of a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.


Introduction
Compositional data, which represent the proportions or fractions of a whole, arise naturally in a wide range of applications; examples include geochemical compositions of rocks, household patterns of expenditures, species compositions of biological communities, and topic compositions of documents, among many others. This article is particularly motivated by the metagenomic analysis of microbiome data. The human microbiome is the totality of all microbes at various body sites, whose importance in human health and disease has increasingly been recognized. Recent studies have revealed that microbiome composition varies based on diet, health, and the environment (The Human Microbiome Project Consortium 2012), and may play a key role in complex diseases such as obesity, atherosclerosis, and Crohn's disease (Turnbaugh et al. 2009; Koeth et al. 2013; Lewis et al. 2015).
With the development of next-generation sequencing technologies, it is now possible to survey the microbiome composition using direct DNA sequencing of either marker genes or the whole metagenomes. After aligning these sequence reads to the reference microbial genomes, one can quantify the relative abundances of microbial taxa. These sequencing-based microbiome studies, however, only provide a relative, rather than absolute, measure of the abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads or shotgun metagenomic reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed data by the total number of counts. The resulting fractions thus fall into a class of high-dimensional compositional data that we focus on in this article. The high dimensionality refers to the fact that the number of taxa may be comparable to or much larger than the sample size. An important question in metagenomic studies is to understand the co-occurrence and co-exclusion relationships between microbial taxa, which would provide valuable insights into the complex ecology of microbial communities (Faust et al. 2012). Standard correlation analysis from the raw proportions, however, can lead to spurious results due to the unit-sum constraint; the proportions tend to be correlated even if the absolute abundances are independent. Such undesired effects should be removed in an analysis in order to make valid inferences about the underlying biological processes. The compositional effects are further magnified by the low diversity of microbiome data, that is, a few taxa make up the overwhelming majority of the microbiome (Friedman and Alm 2012).
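This unit-sum artifact is easy to reproduce numerically. The following sketch (an illustration, not part of the original study) draws independent log-normal abundances, normalizes them to proportions, and shows that the proportions are markedly negatively correlated even though the absolute abundances are independent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent absolute abundances (the latent "basis") for p = 3 components.
W = rng.lognormal(mean=0.0, sigma=1.0, size=(100_000, 3))

# Normalizing by the total imposes the unit-sum constraint.
X = W / W.sum(axis=1, keepdims=True)

# Spurious correlations: in this exchangeable case each pairwise correlation
# of the proportions is close to -1/2, despite the independent basis.
print(np.corrcoef(X, rowvar=False).round(2))
```

The value −1/2 follows from the constraint alone: each row of the covariance matrix of X must sum to zero, so with equal variances the common correlation is −1/(p − 1).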
Owing to the difficulties arising from the simplex constraint, it has been a long-standing question how to appropriately model, estimate, and interpret the covariance structure of compositional data.
The pioneering work of Aitchison (1982, 2003) introduced several equivalent matrix specifications of compositional covariance structures via the log-ratios of components. Statistical methods based on these covariance models respect the unique features of compositional data and prove useful in a variety of applications such as geochemical analysis. A potential disadvantage of these models, however, is that they lack a direct interpretation in the usual sense of covariances and correlations; as a result, it is unclear how to impose certain structures such as sparsity in high dimensions, which is crucial for our applications to microbiome data analysis.
Covariance matrix estimation is of fundamental importance in high-dimensional data analysis and has attracted much recent interest. It is well known that the sample covariance matrix performs poorly in high dimensions and regularization is thus indispensable. Bickel and Levina (2008) and El Karoui (2008) introduced regularized estimators by hard thresholding for large covariance matrices that satisfy certain notions of sparsity. Rothman, Levina, and Zhu (2009) considered a more general class of thresholding functions, and Cai and Liu (2011) proposed adaptive thresholding that adapts to the variability of individual entries. Exploiting a factor model structure, Fan, Fan, and Lv (2008) proposed a factor-based method for high-dimensional covariance matrix estimation. Fan, Liao, and Mincheva (2013) extended the work by considering a conditional sparsity structure and developed a POET method by thresholding principal orthogonal complements.
In this article, we address the problem of covariance estimation for high-dimensional compositional data. Let W = (W_1, ..., W_p)^T with W_j > 0 for all j be a vector of latent variables, called the basis, that generate the observed data via the normalization

X_j = W_j / Σ_{i=1}^p W_i,  j = 1, ..., p.   (1)

Estimating the covariance structure of W has traditionally been considered infeasible owing to the apparent lack of identifiability. By exploring a decomposition relating the compositional covariance to the basis covariance, we find, however, that the nonidentifiability vanishes asymptotically as the dimensionality grows under certain sparsity assumptions. More specifically, define the basis covariance matrix

Ω_0 = (ω^0_ij)_{p×p} = (Cov(Y_i, Y_j))_{p×p},   (2)

where Y_j = log W_j. Then Ω_0 is approximately identifiable as long as it belongs to a class of large sparse covariance matrices.
The somewhat surprising "blessing of dimensionality" allows us to develop a simple, two-step method by first extracting a rank-2 component from the decomposition and then estimating the sparse component Ω_0 by thresholding the residual matrix. The resulting procedure can equivalently be viewed as thresholding the sample centered log-ratio covariance matrix, and hence is optimization-free and scalable for large covariance matrices. We call our method composition-adjusted thresholding (COAT), which removes the "coat" of compositional effects from the covariance structure. We derive rates of convergence under the spectral norm and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data.
We illustrate our method by analyzing a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
The covariance relationship, which was due to Aitchison (2003, sec. 4.11), has recently been exploited to develop algorithms for inferring correlation networks from metagenomic data (Friedman and Alm 2012; Fang et al. 2015; Ban, An, and Jiang 2015). Our contributions here are to turn the idea into a principled approach to sparse covariance matrix estimation and to provide statistical insights into the issue of identifiability and the impacts of dimensionality. Our method also bears some resemblance to the POET method proposed by Fan, Liao, and Mincheva (2013) in that underlying both methods is a low-rank-plus-sparse matrix decomposition. The rank-2 component in our method, however, arises from the covariance structure of compositional data rather than from a factor model assumption. As a result, it can be obtained by simple algebraic operations without computing the principal components.
The rest of the article is organized as follows. Section 2 reviews a covariance relationship and addresses the issue of identifiability. Section 3 introduces the COAT methodology. Section 4 investigates the theoretical properties of the COAT estimator in terms of convergence rates and support recovery. Simulation studies and an application to human gut microbiome data are presented in Sections 5 and 6, respectively. We conclude the article with some discussion in Section 7 and relegate all proofs to the Appendix.

Identifiability of the Covariance Model
We first introduce some notation. Denote by ||·||_1, ||·||_2, ||·||_F, and ||·||_max the matrix L_1-norm, spectral norm, Frobenius norm, and entrywise L_∞-norm, defined for a matrix A = (a_ij) by ||A||_1 = max_j Σ_i |a_ij|, ||A||_2 = {λ_max(A^T A)}^{1/2}, ||A||_F = (Σ_{i,j} a_ij^2)^{1/2}, and ||A||_max = max_{i,j} |a_ij|, where λ_max(·) denotes the largest eigenvalue.
In the latent variable covariance model (1) and (2), the basis covariance matrix Ω_0 is the parameter of interest. One of the matrix specifications of compositional covariance structures introduced by Aitchison (2003) is the variation matrix T_0 = (τ^0_ij)_{p×p} defined by

τ^0_ij = Var(log(X_i / X_j)).   (3)

In view of the relationship (1), we can decompose τ^0_ij as

τ^0_ij = Var(log(W_i / W_j)) = ω^0_ii + ω^0_jj − 2ω^0_ij,   (4)

or in matrix form,

T_0 = ω_0 1^T + 1ω_0^T − 2Ω_0,   (5)

where ω_0 = (ω^0_11, ..., ω^0_pp)^T and 1 = (1, ..., 1)^T. Corresponding to the many-to-one relationship between bases and compositions, the basis covariance matrix Ω_0 is unidentifiable from the decomposition (5), since ω_0 1^T + 1ω_0^T and Ω_0 are in general not orthogonal to each other (with respect to the usual Euclidean inner product). In fact, using the centered log-ratio covariance matrix Γ_0 = (γ^0_ij)_{p×p} with

γ^0_ij = Cov(log(X_i / g(X)), log(X_j / g(X))),

where g(x) = (Π_{j=1}^p x_j)^{1/p} is the geometric mean of a vector x = (x_1, ..., x_p)^T, we can similarly write

τ^0_ij = γ^0_ii + γ^0_jj − 2γ^0_ij,

or in matrix form,

T_0 = γ_0 1^T + 1γ_0^T − 2Γ_0,   (6)

where γ_0 = (γ^0_11, ..., γ^0_pp)^T and 1 = (1, ..., 1)^T. Unlike (5), the following proposition shows that (6) is an orthogonal decomposition and hence the components γ_0 1^T + 1γ_0^T and Γ_0 are identifiable.
In addition, by comparing the decompositions (5) and (6), we can bound the difference between Ω_0 and its identifiable counterpart Γ_0 as follows.
Proposition 1. The components γ_0 1^T + 1γ_0^T and Γ_0 in the decomposition (6) are orthogonal to each other. Moreover, for the covariance parameters Ω_0 and Γ_0 in the decompositions (5) and (6),

||Γ_0 − Ω_0||_max ≤ 3p^{−1} ||Ω_0||_1.

Proposition 1 entails that the covariance parameter Ω_0 is approximately identifiable as long as ||Ω_0||_1 = o(p). In particular, suppose that Ω_0 belongs to a class of sparse covariance matrices considered by Bickel and Levina (2008),

U(q, s_0(p), M) = {Ω : Ω ≻ 0, max_j ω_jj ≤ M, max_i Σ_j |ω_ij|^q ≤ s_0(p)},

where 0 ≤ q < 1 and Ω ≻ 0 denotes that Ω is positive definite. Then ||Γ_0 − Ω_0||_max ≤ 3M^{1−q} s_0(p)/p, and hence the parameters Ω_0 and Γ_0 are asymptotically indistinguishable when s_0(p) = o(p). This allows us to use Γ_0 as a proxy for Ω_0 and greatly facilitates the development of new methodology and associated theory. The intuition behind the approximate identifiability under the sparsity assumption is that the rank-2 component ω_0 1^T + 1ω_0^T represents a global effect that spreads across all rows and columns, while the sparse component Ω_0 represents a local effect that is confined to individual entries.
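Both parts of Proposition 1 can be checked numerically. The sketch below is illustrative: it uses the identity Γ_0 = GΩ_0G with the centering matrix G = I − p^{−1}11^T (which follows from the clr definition) to verify the orthogonality of the decomposition (6) and the stated bound for a sparse Ω_0.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200

# A sparse basis covariance: identity plus a few symmetric off-diagonal entries.
Omega0 = np.eye(p)
for _ in range(20):
    i, j = rng.choice(p, size=2, replace=False)
    Omega0[i, j] = Omega0[j, i] = 0.3

# clr covariance Gamma0 = G Omega0 G with the centering matrix G = I - (1/p) 11^T.
G = np.eye(p) - np.ones((p, p)) / p
Gamma0 = G @ Omega0 @ G

# Orthogonality of the decomposition (6): <gamma0 1^T + 1 gamma0^T, Gamma0> = 0.
gamma0 = np.diag(Gamma0)
rank2 = np.outer(gamma0, np.ones(p)) + np.outer(np.ones(p), gamma0)
print(abs(np.sum(rank2 * Gamma0)) < 1e-8)   # orthogonality holds up to rounding

# Approximation bound of Proposition 1: max_ij |Gamma0 - Omega0| <= 3 ||Omega0||_1 / p.
err = np.max(np.abs(Gamma0 - Omega0))
bound = 3 * np.max(np.abs(Omega0).sum(axis=0)) / p
print(err <= bound)
```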
Also of interest is the exact identifiability of Ω_0 over L_0-balls, which has been studied by Fang et al. (2015) and Ban, An, and Jiang (2015). The following result provides a necessary and sufficient condition for the exact identifiability of Ω_0 by confining it to an L_0-ball.
Proposition 2. Suppose that Ω_0 belongs to the L_0-ball

{Ω : Ω ≻ 0, #{(i, j) : i < j, ω_ij ≠ 0} ≤ s_e(p)},

where p ≥ 5. Then there exist no two values of Ω_0 that correspond to the same T_0 in (5) if and only if s_e(p) < (p − 1)/2.
A counterexample is provided in the proof of Proposition 2 to show that the sparsity conditions in Fang et al. (2015) and Ban, An, and Jiang (2015), which are both at the order of O(p^2), do not suffice. The identifiability condition in Proposition 2 essentially requires the average degree of the correlation network to be less than 1, which is too restrictive to be useful in practice. This illustrates the importance and necessity of introducing the notion of approximate identifiability.

A Sparse Covariance Estimator for Compositional Data
Suppose that (W_k, X_k), k = 1, ..., n, are independent copies of (W, X), where the compositions X_k = (X_k1, ..., X_kp)^T are observed and the bases W_k = (W_k1, ..., W_kp)^T are latent. In Section 3.1, we rely on the decompositions (5) and (6) and Proposition 1 to develop an estimator of Ω_0, and in Section 3.2 we discuss the selection of the tuning parameter.

Composition-Adjusted Thresholding
In view of Proposition 1, we wish to estimate the covariance parameter Ω_0 via the proxy Γ_0. To this end, we first construct an empirical estimate of Γ_0 and then apply adaptive thresholding to the estimate.
There are two equivalent ways to form the estimate of Γ_0. Motivated by the decomposition (6), one can start with the sample counterpart T̂ = (τ̂_ij)_{p×p} of T_0 defined by

τ̂_ij = n^{−1} Σ_{k=1}^n (τ_kij − τ̄_ij)^2,   (7)

where τ_kij = log(X_ki / X_kj) and τ̄_ij = n^{−1} Σ_{k=1}^n τ_kij. A rank-2 component α̂1^T + 1α̂^T with α̂ = (α̂_1, ..., α̂_p)^T can be extracted from the decomposition (6) by projecting T̂ onto the subspace spanned by matrices of the form α1^T + 1α^T; the residual matrix

Γ̂ = −(T̂ − α̂1^T − 1α̂^T)/2

is then an estimate of Γ_0. Alternatively, Γ̂ = (γ̂_ij)_{p×p} can be obtained directly as the sample counterpart of Γ_0 through the expression

γ̂_ij = n^{−1} Σ_{k=1}^n (γ_ki − γ̄_i)(γ_kj − γ̄_j),   (8)

where γ_kj = log(X_kj / g(X_k)) and γ̄_j = n^{−1} Σ_{k=1}^n γ_kj. Now applying adaptive thresholding to Γ̂, we define the composition-adjusted thresholding (COAT) estimator Ω̂ = (ω̂_ij)_{p×p} with

ω̂_ij = S_{λ_ij}(γ̂_ij),   (9)

where S_λ(·) is a general thresholding function and λ_ij > 0 are entry-dependent thresholds.
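A compact implementation of the procedure might look as follows. This is an illustrative sketch, not the authors' code: the variability estimates for the entry-dependent thresholds follow the Cai–Liu-style construction described below, and the diagonal is left unthresholded.

```python
import numpy as np

def coat(X, lam, soft=True):
    """COAT sketch: adaptively threshold the sample clr covariance matrix.

    X   : (n, p) array of strictly positive compositions (rows sum to 1).
    lam : tuning parameter; entry thresholds are lam * sqrt(theta_hat_ij).
    """
    n, p = X.shape
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)   # gamma_kj = log(X_kj / g(X_k))
    c = clr - clr.mean(axis=0, keepdims=True)
    Gamma = c.T @ c / n                             # sample clr covariance

    # Entrywise variability estimates for the adaptive thresholds.
    prods = np.einsum('ki,kj->kij', c, c)           # entrywise products gamma_kij
    theta = ((prods - Gamma) ** 2).mean(axis=0)
    thr = lam * np.sqrt(theta)

    if soft:                                        # soft thresholding rule
        Omega = np.sign(Gamma) * np.maximum(np.abs(Gamma) - thr, 0.0)
    else:                                           # hard thresholding rule
        Omega = Gamma * (np.abs(Gamma) > thr)
    np.fill_diagonal(Omega, np.diag(Gamma))         # keep the diagonal
    return Omega

# Toy usage: independent log-normal basis, so the true Omega_0 is diagonal.
rng = np.random.default_rng(1)
W = rng.lognormal(size=(100, 30))
X = W / W.sum(axis=1, keepdims=True)
Omega_hat = coat(X, lam=2.0)
```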
In this article, we consider a class of general thresholding functions S_λ(·) that satisfy the following conditions:

(i) S_λ(z) = 0 for |z| ≤ λ;
(ii) |S_λ(z) − z| ≤ λ for all z.

These two conditions were assumed by Rothman, Levina, and Zhu (2009) and Cai and Liu (2011) along with another condition that is not required in our analysis. Examples of thresholding functions belonging to this class include the hard thresholding rule S_λ(z) = zI(|z| ≥ λ), the soft thresholding rule S_λ(z) = sgn(z)(|z| − λ)_+, and the adaptive lasso rule S_λ(z) = z(1 − |λ/z|^η)_+ with η ≥ 1. The performance of the COAT estimator depends critically on the choice of thresholds. Using entry-adaptive thresholds may in general improve the performance over applying a universal threshold. To derive a data-driven choice of λ_ij, define

θ_ij = Var{(Y_i − μ_i)(Y_j − μ_j)},

where μ_j = EY_j. We take λ_ij to be of the form

λ_ij = λ (θ̂_ij)^{1/2},   (10)

where θ̂_ij are estimates of θ_ij, and λ > 0 is a tuning parameter to be chosen, for example, by cross-validation. We rewrite (8) as γ̂_ij = n^{−1} Σ_{k=1}^n γ_kij, where γ_kij = (γ_ki − γ̄_i)(γ_kj − γ̄_j), and form θ̂_ij from the sample variance of the γ_kij.
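The three rules can be written out directly; the sketch below also checks conditions (i) and (ii) on a grid. The default η = 2 is an arbitrary illustrative choice, and the hard rule is coded with a strict inequality so that condition (i) also holds at |z| = λ.

```python
import numpy as np

def hard(z, lam):
    # Hard thresholding: keep z unchanged when it clears the threshold.
    z = np.asarray(z, dtype=float)
    return z * (np.abs(z) > lam)

def soft(z, lam):
    # Soft thresholding: shrink surviving entries toward zero by lam.
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def adaptive_lasso(z, lam, eta=2.0):
    # Adaptive lasso rule S_lam(z) = z (1 - |lam/z|^eta)_+ with eta >= 1.
    z = np.asarray(z, dtype=float)
    safe = np.where(z == 0, 1.0, z)          # avoid division by zero; z = 0 maps to 0
    return z * np.maximum(1.0 - np.abs(lam / safe) ** eta, 0.0)

# Conditions (i) and (ii) hold for all three rules:
z = np.linspace(-3, 3, 601)
for S in (hard, soft, adaptive_lasso):
    out = S(z, 1.0)
    assert np.all(out[np.abs(z) <= 1.0] == 0)        # (i) S_lam(z) = 0 for |z| <= lam
    assert np.max(np.abs(out - z)) <= 1.0 + 1e-12    # (ii) |S_lam(z) - z| <= lam
```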

Tuning Parameter Selection
The thresholds defined by (10) depend on the tuning parameter λ, which can be chosen through V-fold cross-validation. Denote by Ω̂^{(−v)}(λ) the COAT estimate based on the training data excluding the vth fold, and Γ̂^{(v)} the residual matrix (or the sample centered log-ratio covariance matrix) based on the test data including only the vth fold. We choose the optimal value of λ that minimizes the cross-validation error

CV(λ) = V^{−1} Σ_{v=1}^V ||Ω̂^{(−v)}(λ) − Γ̂^{(v)}||_F^2.

With the optimal λ, we then compute the COAT estimate based on the full dataset as our final estimate. When the positive definiteness of the covariance estimate in finite samples is required for interpretation, we follow the approach of Fan, Liao, and Mincheva (2013) and choose λ in the range where the minimum eigenvalue of the COAT estimate is positive.
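The selection loop can be sketched as follows. For brevity this version plugs a universal threshold into COAT instead of the entry-adaptive thresholds in (10), and uses a simple random fold assignment; both are simplifications of the procedure described above, not the authors' exact protocol.

```python
import numpy as np

def clr_cov(X):
    # Sample centered log-ratio covariance matrix (the residual matrix).
    logX = np.log(X)
    c = logX - logX.mean(axis=1, keepdims=True)
    c = c - c.mean(axis=0, keepdims=True)
    return c.T @ c / len(X)

def soft_coat(X, lam):
    # COAT with soft thresholding and a universal threshold (simplified).
    G = clr_cov(X)
    Om = np.sign(G) * np.maximum(np.abs(G) - lam, 0.0)
    np.fill_diagonal(Om, np.diag(G))
    return Om

def coat_cv(X, lams, V=5, seed=0):
    """Choose lambda minimizing sum_v ||Omega_hat^(-v)(lam) - Gamma_hat^(v)||_F^2."""
    n = len(X)
    folds = np.random.default_rng(seed).permutation(n) % V
    errs = []
    for lam in lams:
        err = 0.0
        for v in range(V):
            Om = soft_coat(X[folds != v], lam)   # fit on training folds
            Gv = clr_cov(X[folds == v])          # test residual matrix
            err += np.sum((Om - Gv) ** 2)        # squared Frobenius norm
        errs.append(err)
    return lams[int(np.argmin(errs))]

rng = np.random.default_rng(2)
W = rng.lognormal(size=(100, 20))
X = W / W.sum(axis=1, keepdims=True)
best_lam = coat_cv(X, lams=[0.0, 0.1, 0.2, 0.4, 0.8])
```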

Theoretical Properties
In this section, we investigate the asymptotic properties of the COAT estimator. As a distinguishing feature of our theoretical analysis, we assume neither the exact identifiability of the parameters nor that the degree of (approximate) identifiability is dominated by the statistical error. Instead, the degree of identifiability enters our analysis and shows up in the resulting rate of convergence. Such theoretical analysis is rare in the literature, but is extremely relevant for latent variable models in the presence of nonidentifiability and is of theoretical interest in its own right. We introduce our assumptions in Section 4.1, and present our main results on rates of convergence and support recovery in Section 4.2.

Assumptions
Recall that Y_j = log W_j, μ_j = EY_j, and θ_ij = Var{(Y_i − μ_i)(Y_j − μ_j)}. Without loss of generality, assume μ_j = 0 for all j throughout this section. We need to impose the following moment conditions on the log-basis Y = (Y_1, ..., Y_p)^T.
Condition 4. There exists a sequence s

Conditions 1-3 are similar to those commonly assumed in the covariance estimation literature; see, for example, Cai and Liu (2011). Condition 1 requires that the variables Y_j's be uniformly sub-Gaussian; the definition we use here is among several equivalent ways of defining sub-Gaussianity (Boucheron, Lugosi, and Massart 2013, sec. 2.3), and is most convenient for our technical analysis.
Condition 2 imposes some restrictions on the dimensionality and sparsity of the basis covariance matrix Ω_0. It is worth mentioning that the sparsity level condition s_0 = o(p) is so weak that it suffices to guarantee only approximate identifiability but allows the degree of nonidentifiability to be large relative to the statistical error. Condition 3 is essential for methods based on adaptive thresholding. Condition 4 arises from identifiability considerations in estimating the variances θ_ij. In particular, if Y is multivariate normal, then Condition 4 is implied by the assumptions Ω_0 ∈ U(q, s_0(p), M) and s_0(p) = o(p) in Condition 2, since from Isserlis' theorem (Isserlis 1918) we have θ_ij = ω^0_ii ω^0_jj + (ω^0_ij)^2.

Main Results
We are now in a position to state our main results. The following theorem gives the rate of convergence under the spectral norm for the COAT estimator.
Theorem 1 (Rate of convergence). Under Conditions 1-4, if the tuning parameter λ in (10) is chosen to be

λ = C_1 {(log p)/n}^{1/2} + C_2 s_0(p)/p   (11)

for sufficiently large C_1, C_2 > 0, then the COAT estimator Ω̂ in (9) satisfies

||Ω̂ − Ω_0||_2 = O_P( s_0(p){(log p)/n}^{(1−q)/2} + s_0(p)(s_0(p)/p)^{1−q} ).

The rate of convergence provided by Theorem 1 exhibits an interesting decomposition: the term s_0(p){(log p)/n}^{(1−q)/2} represents the estimation error due to estimating Γ_0, while the term s_0(p)(s_0(p)/p)^{1−q} accounts for the approximation error due to using Γ_0 as a proxy for Ω_0. In particular, if the approximation error is dominated by the estimation error, then the COAT estimator attains the minimax optimal rate under the spectral norm over U(q, s_0(p), M) (Cai and Zhou 2012).
It is important to note that the dimensionality p appears in both terms where it plays opposite roles.
We observe a "curse of dimensionality" in the first term, where the growth of dimensionality contributes a logarithmic factor to the estimation error.In contrast, a "blessing of dimensionality" is reflected by the second term in that a diverging dimensionality shrinks the approximation error toward zero at a power rate.
The insights gained from Theorem 1 have important implications for compositional data analysis. In the analysis of many compositional datasets, the dimensionality often depends on the taxonomic level to be examined. For example, in metagenomic studies, the dimensionality may range from only a few taxa at the phylum level to thousands of taxa at the operational taxonomic unit (OTU) level. Suppose, for simplicity, that the magnitudes of correlation signals are of about the same order across different taxonomic levels. Then Theorem 1 indicates a tradeoff between an accurate estimation of the covariance structure with low dimensionality and a sensible interpretation in terms of the basis components with high dimensionality. This tradeoff thus suggests the need to analyze compositional data at relatively finer taxonomic levels when a latent variable interpretation is desired.
The proof of Theorem 1 relies on a series of concentration inequalities that take the approximation error term into account, which can be found in the Appendix. As a consequence of these inequalities, we obtain the following result regarding the support recovery property of the COAT estimator. Here the support of Ω_0 refers to the set of all indices (i, j) with ω^0_ij ≠ 0.
Theorem 2 (Support recovery). Under Conditions 1-4, if the tuning parameter λ in (10) is chosen as in (11), then the COAT estimator Ω̂ in (9) satisfies

P( ω̂_ij = 0 for all (i, j) ∉ supp(Ω_0) ) → 1.   (12)

Moreover, if in addition

min_{(i,j) ∈ supp(Ω_0)} |ω^0_ij| / (θ_ij)^{1/2} ≥ Cλ   (13)

for some constant C > 3/2, then

P( sgn(ω̂_ij) = sgn(ω^0_ij) for all (i, j) ∈ supp(Ω_0) ) → 1.

Theorem 2 parallels the support recovery results in Rothman, Levina, and Zhu (2009) and Cai and Liu (2011). However, owing to the extra term s_0(p)/p in the expression of λ, the assumption (13) requires in addition that no correlation signals fall below the approximation error. In other words, exact support recovery will break down if any correlation signal is confounded by the compositional effect.

Simulation Studies
We conducted simulation studies to compare the numerical performance of the COAT estimator Ω̂ with that of the oracle thresholding estimator Ω̂_o, which knew the latent basis components and applied the thresholding procedure to the sample covariance matrix of the log-basis Y. We also include in our comparison two naive thresholding estimators Ω̂_c and Ω̂_l, which are based on the sample covariance matrices of the composition X and its logarithm log X, respectively. Note that Ω̂_o is the ideal estimator that the COAT estimator attempts to mimic, whereas both Ω̂_c and Ω̂_l ignore the unique features of compositional data and thus are expected to perform poorly.

Simulation Settings
The data (W_k, X_k), k = 1, ..., n, were generated as follows. We first generated Y_k in two different ways: (i) Y_k are independent from the multivariate normal distribution N_p(μ, Ω_0); (ii) Y_k = μ + FU_k, where FF^T = Ω_0 and the components of U_k are independent gamma variables with shape parameter 10 and scale parameter 1, centered and standardized to have mean 0 and variance 1, so that Var(Y_k) = Ω_0. Here the matrix F is obtained by computing the singular value decomposition Ω_0 = QSQ^T and letting F = QS^{1/2}. Then W_k = (W_k1, ..., W_kp)^T and X_k = (X_k1, ..., X_kp)^T were obtained through the transformations W_kj = e^{Y_kj} and X_kj = W_kj / Σ_{i=1}^p W_ki, j = 1, ..., p. Hence, in Case (i), W_k and X_k follow multivariate log-normal and logistic normal distributions (Aitchison and Shen 1980), respectively; the distributions of W_k and X_k in Case (ii) can similarly be viewed as a type of multivariate log-gamma and logistic-gamma distributions.
In both cases, we took the components of μ randomly from the uniform distribution on [0, 10], in order to reflect the fact that compositional data arising from metagenomic studies are often heterogeneous. The following two models for the covariance matrix Ω_0 were considered:

• Model 1 (Identity covariance): Ω_0 = I_p.
• Model 2 (Sparse covariance).

Model 1 is an extreme but illustrative case intended for comparing the distributions of spurious correlations under different transformations. The setting of Model 2 is typical in the covariance estimation literature and similar to that in Cai and Liu (2011). We set the sample size n = 100 and the dimension p = 50, 100, and 200, and repeated 100 simulations for each setting.
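Case (i) of this design can be sketched in a few lines; for brevity the sketch uses the identity covariance of Model 1, since a different Ω_0 would only change the `Omega0` argument.

```python
import numpy as np

def simulate_composition(n, p, Omega0, seed=0):
    """Case (i): normal log-basis Y, log-normal basis W, logistic-normal X."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0, 10, size=p)              # heterogeneous means on [0, 10]
    L = np.linalg.cholesky(Omega0)
    Y = mu + rng.standard_normal((n, p)) @ L.T   # Y_k ~ N_p(mu, Omega0)
    W = np.exp(Y)                                # basis: W_kj = e^{Y_kj}
    X = W / W.sum(axis=1, keepdims=True)         # composition
    return W, X

W, X = simulate_composition(n=100, p=50, Omega0=np.eye(50))
```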

Spurious Correlations
The boxplots of sample correlations with simulated data under different transformations in Model 1 are shown in Figure 1. Clearly, the sample centered log-ratio (clr) correlations are centered around zero and have a similar distribution to that of the sample correlations of Y; the resemblance tends to increase as the dimension p grows. This trend is consistent with Proposition 1 and provides numerical evidence for the validity of the centered log-ratio covariance matrix Γ_0 as a proxy for Ω_0. In fact, from the proof of Proposition 1 we have, when Ω_0 = I_p, that Γ_0 = I_p − p^{−1}11^T, so the population clr correlations equal −1/(p − 1) and vanish as p grows. In contrast, the phenomenon of spurious correlations is observed on both log X and X. The sample correlations of log X exhibit a severe upward bias, while the sample correlations of X contain many outliers that would be detected as signals by a thresholding procedure with threshold level close to 1. Moreover, the spurious correlations seem to become worse with gamma-related distributions, where the components of the composition have more heterogeneous means.

Performance Comparisons
We applied the COAT method with hard and soft thresholding rules to simulated data in Model 2.
For comparison, we also applied the thresholding procedure to the sample covariance matrices of Y, log X, and X, resulting in the estimators Ω̂_o, Ω̂_l, and Ω̂_c, respectively. The tuning parameter λ in each thresholding estimator was chosen by tenfold cross-validation. Losses under the matrix L_1-norm, spectral norm, and Frobenius norm were used to measure the estimation performance, while the true positive rate and false positive rate were employed to assess the quality of support recovery. The simulation results for Model 2 with normal- and gamma-related distributions are summarized in Tables 1 and 2, respectively. We see that the COAT estimator Ω̂ performs almost equally well as the ideal estimator Ω̂_o, and outperforms the naive thresholding estimators Ω̂_l and Ω̂_c by a large margin. In particular, the estimation losses of Ω̂_l are disastrously large in the gamma setting, in agreement with the severe bias observed in Figure 1. The estimation losses of Ω̂_c do not change much across different thresholding rules and distributions, since all entries of the estimate are very small relative to the true values. Both Ω̂_l and Ω̂_c show inferior performance in terms of true and false positive rates, indicating that they are not model selection consistent. Comparisons between hard and soft thresholding rules suggest that the former is more conservative in selecting false positives and results in a more parsimonious model, whereas the latter strikes a balance between true and false positives due to the shrinkage effect.
To further compare the support recovery performance without selecting a threshold level, we plot the receiver operating characteristic (ROC) curves for all methods in Figure 2. Note that hard and soft thresholding rules lead to the same ROC curve for each method. We observe that the ROC curves for Ω̂ and Ω̂_o are almost indistinguishable and uniformly dominate those for Ω̂_l and Ω̂_c, demonstrating the superiority of the COAT method. Of the two naive thresholding estimators, Ω̂_l tends to outperform Ω̂_c when the threshold level is high, since the former is less influenced by the high spurious correlations as reflected in Figure 1.

Gut Microbiome Data Analysis
The gut microbiome plays a critical role in energy extraction from the diet and interacts with the immune system to exert a profound influence on human health and disease. Despite an emerging interest in characterizing the ecology of human-associated microbial communities, the complex interactions among microbial taxa remain poorly understood (Coyte, Schluter, and Foster 2015).
We now illustrate the proposed method by applying it to a human gut microbiome dataset described by Wu et al. (2011), which was collected from a cross-sectional study of 98 healthy individuals at the University of Pennsylvania. DNA from stool samples of these subjects was analyzed by 454/Roche pyrosequencing of 16S rRNA gene segments, resulting in an average of 9265 reads per sample, with a standard deviation of 3864. Taxonomic assignment yielded 3068 operational taxonomic units, which were further combined into 87 genera that appeared in at least one sample.
Demographic information, including body mass index (BMI), was also collected from the subjects.
We are interested in identifying and comparing the correlation structures among bacterial genera between lean and obese subjects. We therefore divided the dataset into a lean group (BMI < 25, n = 63) and an obese group (BMI ≥ 25, n = 35), and focused on the p = 40 bacterial genera that appeared in at least four samples in each group. The count data were transformed into compositions after zero counts were replaced by 0.5.
We applied the COAT method with the soft thresholding rule to each group, and used tenfold cross-validation to select the tuning parameter. The resulting estimate was represented by a correlation network among the bacterial genera, with each edge representing a nonzero correlation.
To assess the stability of support recovery, we further generated 100 bootstrap samples for each group and repeated the thresholding procedure on each sample. The stability of the correlation network was measured by the average proportion of edges reproduced by each bootstrap replicate.
Finally, we retained only the edges in the correlation network that were reproduced in at least 80 bootstrap replicates. The numbers of positive and negative correlations and the stability of correlation networks are reported in Table 3; the results for the two naive thresholding estimators Ω̂_l and Ω̂_c are also included for comparison. We see that the COAT method achieves the highest stability among the three methods and has the most edges passing the stability test. The correlation network identified by Ω̂_l has substantially fewer negative correlations than the other two methods, which is likely due to the severe upward bias observed in Figure 1. The correlation network identified by Ω̂_c is the least stable.
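The bootstrap stability screening can be sketched as follows. Here `edges` uses soft thresholding with a universal threshold as a stand-in for the cross-validated COAT fit, so this illustrates the bootstrap logic rather than the exact analysis pipeline.

```python
import numpy as np

def clr_cov(X):
    # Sample centered log-ratio covariance matrix.
    logX = np.log(X)
    c = logX - logX.mean(axis=1, keepdims=True)
    c = c - c.mean(axis=0, keepdims=True)
    return c.T @ c / len(X)

def edges(X, lam):
    # Off-diagonal support of the soft-thresholded clr covariance.
    G = clr_cov(X)
    Om = np.sign(G) * np.maximum(np.abs(G) - lam, 0.0)
    np.fill_diagonal(Om, 0.0)
    return Om != 0

def stable_edges(X, lam, B=100, keep=0.8, seed=0):
    """Retain edges reproduced in at least a `keep` fraction of B bootstrap fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((p, p))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # resample subjects with replacement
        counts += edges(X[idx], lam)
    return counts >= keep * B

rng = np.random.default_rng(3)
W = rng.lognormal(size=(60, 15))
X = W / W.sum(axis=1, keepdims=True)
keepers = stable_edges(X, lam=0.3, B=50)
```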
The correlation networks identified by the COAT method for the two groups are displayed in Figure 3. Clearly, the networks for the lean and obese groups show markedly different architecture, indicating that the obese microbiome is less modular, with less complex interactions between the modules. This phenomenon has been demonstrated by previous studies and is possibly due to adaptation of the microbiome to low-diversity environments (Greenblum, Turnbaugh, and Borenstein 2012). Table 3 and Figure 3 also suggest that the gut microbial network tends to contain more competitive (negative) interactions than cooperative (positive) ones, which seems consistent with the recent finding that the ecological stability of the gut microbiome can be attributed to the benefits of limiting positive feedbacks and dampening cooperative networks (Coyte, Schluter, and Foster 2015).
A closer inspection of the correlation networks identifies Bacteroides and Prevotella as two key genera of the gut microbiome. The abundances of these two genera are well known to distinguish two gut microbial enterotypes, which are strongly associated with long-term dietary patterns (Arumugam et al. 2011; Wu et al. 2011). The negative correlations between Bacteroides and Prevotella (−0.404 in the lean group and −0.296 in the obese group) are well explained by the diet-dependent enterotypes and the within-body separation of the two genera (Jordán et al. 2015).
Moreover, recent studies have suggested several keystone species belonging to the genus Bacteroides, through which the structure of gut microbial communities may be influenced by small perturbations (Fisher and Mehta 2014). Also, the Firmicutes-enriched microbiome has been found to hold greater metabolic potential than the Bacteroidetes-enriched microbiome for more efficient energy harvest from the diet (Turnbaugh et al. 2006).

Discussion
Understanding the dependence structure among microbial taxa within a community, including co-occurrence and co-exclusion relationships between microbial taxa, is an important problem in microbiome research. Such structures provide biological insights into the community dynamics and factors that change the community structures. To overcome the difficulties arising from the unit-sum constraint of the observed compositional data, we have developed a COAT method to estimate the sparse covariance matrix of the latent log-basis components. Our method is based on a decomposition of the variation matrix into a rank-2 component and a sparse component. The resulting procedure is equivalent to thresholding the sample centered log-ratio covariance matrix, and thus is optimization-free and scalable for high-dimensional data.
Our simulation results demonstrate that the COAT method performs almost as well as the oracle thresholding estimator that has access to the latent basis components, and outperforms some naive thresholding estimators by a large margin. These improvements are more pronounced when the basis components have a skewed distribution, as is often observed in microbiome studies. In the application to gut microbiome data, the COAT method leads to more stable and biologically more interpretable results for comparing the dependence structures of lean and obese microbiomes.
We have provided conditions for the approximate and exact identifiability of the covariance parameters, and have established rates of convergence and support recovery guarantees for the COAT estimator. The rate of convergence includes an extra term of O_p(s_0(p)(s_0(p)/p)^{1−q}) in addition to the usual minimax optimal rate of convergence for sparse covariance estimation. The extra term represents an approximation error due to using Γ_0 as a proxy for Ω_0, which vanishes under mild assumptions as the dimensionality increases.
The proposed methodology may be extended in several ways. First, it would be possible to develop a joint optimization procedure based on the decomposition (5). For example, one may consider a regularized estimator that minimizes a least-squares criterion derived from (5) over (ω, Ω) with ω = diag(Ω), where P_λ(·) is a sparsity-inducing penalty function applied to the off-diagonal entries of Ω. The COAT estimator can be viewed as a one-step approximation to this regularized estimator with an appropriately chosen penalty function and initial value Ω = 0. Solving the full optimization problem is computationally more expensive but is expected to improve on the performance of the COAT estimator. Another worthwhile extension would be to deal with zero counts directly. One may, in principle, combine the ideas presented here with models that account for sampling and structural zeros. Identifiability and computational feasibility are the major concerns with such extensions.
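The decomposition underlying the method relates the variation matrix to the basis covariance entrywise via var(log(W_i/W_j)) = ω_ii + ω_jj − 2ω_ij, that is, a rank-2 component plus the (sparse) basis covariance. A minimal simulation check of this identity, using a hypothetical basis covariance Omega and simulated log-basis vectors, might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
B = rng.normal(size=(p, p))
Omega = B @ B.T + p * np.eye(p)   # hypothetical basis covariance

# Simulate log-basis vectors Y with covariance Omega.
Y = rng.multivariate_normal(np.zeros(p), Omega, size=100_000)

# Empirical variation matrix: tau_ij = var(log(W_i/W_j)) = var(Y_i - Y_j).
T_emp = np.var(Y[:, :, None] - Y[:, None, :], axis=0)

# Rank-2 plus sparse form: T = omega 1^T + 1 omega^T - 2 Omega.
w = np.diag(Omega)
T_dec = w[:, None] + w[None, :] - 2 * Omega
```

Here `T_emp` agrees with `T_dec` up to Monte Carlo error, which is the sense in which the variation matrix determines the basis covariance only up to the rank-2 component.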

A.1 Proof of Proposition 1
Using the fact that the centered log-ratio covariance matrix Γ_0 is symmetric and has all zero row sums (Aitchison 2003, Property 4.6), we have ⟨γ_0 1^T + 1γ_0^T, Γ_0⟩ = 2γ_0^T Γ_0 1 = 0; that is, the components γ_0 1^T + 1γ_0^T and Γ_0 are orthogonal to each other.
The desired inequality then follows by applying the identity (4.35) of Aitchison (2003) and combining it with the orthogonality established above.
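The two facts used in this proof, that Γ_0 has zero row sums and is therefore orthogonal to any matrix of the form γ1^T + 1γ^T, can be verified numerically. In the sketch below, the clr covariance Γ is built from a hypothetical basis covariance Omega via the centering matrix G = I − (1/p)11^T, so that Γ = GΩG:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
B = rng.normal(size=(p, p))
Omega = B @ B.T                       # hypothetical basis covariance
G = np.eye(p) - np.ones((p, p)) / p   # clr centering matrix
Gamma = G @ Omega @ G                 # implied clr covariance

# Property 4.6: all row sums of Gamma are zero.
row_sums = Gamma @ np.ones(p)

# Orthogonality: <gamma 1^T + 1 gamma^T, Gamma> = 2 gamma^T Gamma 1 = 0.
gamma = rng.normal(size=p)
M = np.outer(gamma, np.ones(p)) + np.outer(np.ones(p), gamma)
inner = np.sum(M * Gamma)
```

The inner product vanishes for every choice of γ because Γ1 = 0, which is exactly the orthogonality invoked in the proof.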

A.2 Proof of Proposition 2
We first claim that if α = (α_1, . . ., α_p)^T ≠ 0 and p ≥ 5, then the matrix A ≡ α1^T + 1α^T has at least p − 1 nonzero upper-triangular entries. To prove this, without loss of generality, assume α_1 ≠ 0 and that the last q entries of the first row of A are zero, where 0 ≤ q ≤ p − 1; that is, α_1 + α_j ≠ 0 for 2 ≤ j ≤ p − q, and α_1 + α_j = 0 for p − q + 1 ≤ j ≤ p. The first row then contains p − 1 − q nonzero upper-triangular entries. Moreover, α_j = −α_1 for p − q + 1 ≤ j ≤ p, so that a_ij = α_i + α_j = −2α_1 ≠ 0 for p − q + 1 ≤ i < j ≤ p, which gives rise to q(q − 1)/2 nonzero entries at positions (i, j) with p − q + 1 ≤ i < j ≤ p. Putting these pieces together, we obtain that the number of nonzero upper-triangular entries in A is at least (p − 1 − q) + q(q − 1)/2 ≥ p − 2, with equality possible only when q = 1 or 2. To show that the lower bound p − 2 is not attainable, note that if there are only p − 2 nonzero upper-triangular entries, then q = 1 or 2, and every upper-triangular entry not counted above must vanish. For q = 1, this requires α_i + α_j = 0 for all 2 ≤ i < j ≤ p, which forces α_2 = · · · = α_p = 0 and contradicts α_p = −α_1 ≠ 0. For q = 2, it requires α_i = −α_{p−1} = −α_p = α_1 for 2 ≤ i ≤ p − 2, so that α_i + α_j = 2α_1 ≠ 0 for 2 ≤ i < j ≤ p − 2, yielding additional nonzero entries and a contradiction since p ≥ 5. Hence the claim holds. Now the difference of any two decompositions of Γ_0 gives Ω_1 − Ω_2 of the form α1^T + 1α^T up to a constant factor, and under the sparsity assumption of the proposition the right-hand side has fewer than p − 1 nonzero upper-triangular entries. It then follows from the above claim that α = 0, and hence Ω_1 = Ω_2.
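The counting claim above can also be checked by brute force. The snippet below is an independent verification (not part of the original proof): it exhaustively enumerates integer vectors α on a small grid for p = 6 and confirms that every nonzero α yields at least p − 1 nonzero upper-triangular entries of α1^T + 1α^T.

```python
from itertools import combinations, product

def n_nonzero_upper(alpha):
    """Number of nonzero upper-triangular entries of A = alpha 1^T + 1 alpha^T,
    using a_ij = alpha_i + alpha_j."""
    return sum(alpha[i] + alpha[j] != 0
               for i, j in combinations(range(len(alpha)), 2))

# Exhaustive check over the grid {-2, ..., 2}^p, excluding alpha = 0.
p = 6
counts = [n_nonzero_upper(a)
          for a in product(range(-2, 3), repeat=p)
          if any(x != 0 for x in a)]
```

Every count in `counts` is at least p − 1 = 5, in agreement with the claim; note that for small p (e.g., p = 4 with α = (1, 1, −1, −1)) only p − 2 nonzero entries occur, which is why the argument needs the dimension to be moderately large.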

A.3 Concentration Inequalities
To prepare for the proofs of Theorems 1 and 2, we first establish some useful concentration inequalities. For notational simplicity, the constants C_1, C_2, . . . below may vary from line to line.
Lemma 1. Under Condition 1, there exist constants C_1, C_2 > 0 such that for sufficiently small t > 0. Moreover, if log p = o(n^{1/5}), then there exists a constant C_3 > 0 such that for every constant ε > 0.
To prove (A.
To bound the term T_1, we further write Consider the event A_1 on which max_{i,j,ℓ,m} Then, on A_1, we have To bound the next term in T_1, we write which, on A_1 and by Condition 4, is bounded by ε_1 + s_1(p)/p. We can similarly bound the other terms in T_1 and obtain, on A_1, |T_1| ≤ 16ε_1 + 15 s_1(p)/p. (A.9) To bound the term T_2, note that To bound the term T_3, we write ≡ T_4 + T_5 + T_6.
Consider the event B_1 on which |γ̂_ij − ω^0_ij| ≤ λ_ij/2 for all i, j. On B_1, we have Combining these pieces yields We can similarly bound the terms T_2 and T_3 on B_1: and hence for all i, j. Now applying (A.13) yields P(sgn(ω̂_ij) ≠ sgn(ω^0_ij), ω^0_ij ≠ 0 for some i, j) = O(p^{−C_3}), which, together with (12), proves the result.

Figure 1: Boxplots of sample correlations with simulated data under different transformations in Model 1.

Figure 2: ROC curves for four methods in Model 2 with normal-related distribution (top panel) and gamma-related distribution (bottom panel).

Figure 3: Correlation networks identified by the COAT method for the lean and obese groups in the gut microbiome data. Positive and negative correlations are displayed in green and red, respectively. The thickness of edges indicates the magnitude of correlations.

Table 1: Means (standard errors) of various performance measures for four methods with hard and soft thresholding rules in Model 2 with normal-related distributions over 100 replications.

Table 3: Numbers of positive and negative correlations and stability of correlation networks for three methods applied to the gut microbiome data.