Constructing Predictive Microbial Signatures at Multiple Taxonomic Levels

ABSTRACT Recent advances in DNA sequencing technology have enabled rapid advances in our understanding of the contribution of the human microbiome to many aspects of normal human physiology and disease. A major goal of human microbiome studies is the identification of important groups of microbes that are predictive of host phenotypes. However, the large number of bacterial taxa and the compositional nature of the data make this goal difficult to achieve using traditional approaches. Furthermore, the microbiome data are structured in the sense that bacterial taxa are not independent of one another and are related evolutionarily by a phylogenetic tree. To deal with these challenges, we introduce the concept of variable fusion for high-dimensional compositional data and propose a novel tree-guided variable fusion method. Our method is based on the linear regression model with tree-guided penalty functions. It incorporates the tree information node-by-node and is capable of building predictive models comprised of bacterial taxa at different taxonomic levels. A gut microbiome data analysis and simulations are presented to illustrate the good performance of the proposed method. Supplementary materials for this article are available online.


Introduction
The human microbiota is the aggregate of microorganisms that live inside and on the human body. They include bacteria, archea, fungi, and viruses, with bacteria alone estimated to outnumber human cells within an individual by an order of magnitude (Savage 1977). The totality of these microbes and the genes they encode is called the human microbiome.
High-throughput DNA sequencing technologies, coupled with bioinformatics developments, have enabled rapid advances in our understanding of the human microbiome (Kuczynski et al. 2011). For example, several studies have revealed that hostassociated microbial communities are likely to play key roles in normal physiology (Dethlefsen and Relman 2011), development (Koenig et al. 2011), and diseases such as inflammatory bowel disease (Sartor 2008). To characterize the microbial community structure and function, marker gene sequencing and wholegenome shotgun or metagenomic sequencing are two widely used approaches (Petrosino et al. 2009). After sequences have been acquired, they are usually clustered into Operational Taxonomic Units (OTUs): groups of sequences that are intended to correspond to taxonomic clades or monophyletic groups. QIIME (Caporaso et al. 2010) and AMPHORA (Wu and Eisen 2008) are two commonly used tools to infer OTUs. The abundance of an OTU is defined as the number of sequences in that OTU. The microbial community is often described by a list of OTUs and their frequencies, and a phylogenetic tree.
One of the goals of human microbiome studies is to understand how microbial communities affect health and correlate with disease (Peterson et al. 2009). In particular, it is important to identify groups of OTUs (species or taxa, which are used interchangeably) that are consistently predictive of host phenotypes . To develop new statistical methods to determine the association between microbial profiles and phenotypes such as obesity, however, it is crucial to take into consideration the challenges inherent in microbiome sequence data.
First, the OTUs are related to one another by a phylogenetic tree. Appropriate use of this hierarchical structure can lead to more informative analyses. For example, to estimate the taxonomic diversity between communities, distance measures that use the tree information, such as UniFrac (Lozupone and Knight 2005), are more effective at revealing ecological patterns than distance measures that use only sets of isolated OTUs. To develop predictive models, on the other hand, using the structure of the tree allows us to borrow statistical strength across phylogenetically close OTUs by encouraging them to have similar effects on the trait of interest. Chen et al. (2013) proposed to use the phylogeny in canonical correlation analysis by treating the tree as a special case of undirected graphs and then imposing a tree-constrained smoothness penalty. More recently, Tanaseichuk, Borneman, and Jiang (2014) adopted a tree-guided group-lasso penalty that incorporates the hierarchy in the microbiome space, and developed a new method for classifying microbial community samples.
Second, a host phenotype may be characterized by the microbial community through several taxa of varying phylogenetic depth . In this case, the goal of microbiome analysis is to determine, for the given phenotype, which taxa matter and at what taxonomic level. Intuitively, treating intermediate nodes of the phylogenetic tree as candidate features will lead to more accurate and more interpretable results. To our knowledge, however, few methods are available for constructing predictive microbial signatures at multiple taxonomic levels (Zhang et al. 2015).
Third, the number of sequencing reads varies greatly across samples. In other words, the abundance of species is measured in relative terms. Consequently, these count data are compositional (Aitchison 1986) and should be treated appropriately. To select the taxa that are associated with a phenotype, Lin et al. (2014) recently proposed an l 1 regularization methodology under a linear log-contrast model that accounts for the natural constraint of compositional data. An alternative approach is to model the counts explicitly. For example, Xia et al. (2013) proposed to use a logistic normal multinomial model to characterize the dependency of the community composition.
The rest of the article is organized as follows. In Section 2.1, we introduce the linear model, the biological assumption underlying variable fusion, and two methods for variable fusion. We then propose two tree-guided fused lasso penalties that use the tree topology as well as branch lengths in Section 2.2. In Section 2.3, we consider the compositional aspect of the microbiome data and the non-identifiability of regression coefficients caused by it, and adopt fusion methods we proposed to deal with the identifiability issue. Computation and tuning are covered in Section 2.4. Application of the proposed methods to a gut microbiome dataset and simulations are presented in Sections 3 and 4, respectively. We conclude with discussion in Section 5.

Methodology
Suppose we have an outcome vector y = (y i ) and a communityby-OTU matrix X = (x i j ), where y i is the phenotype for the ith community sample and x i j denotes the observed abundance of OTU j in sample i, for i = 1, . . . , n and j = 1, . . . , p. We also have a binary tree T that describes the phylogenetic relationships among the OTUs. In this tree, leaves correspond to OTUs and internal nodes represent taxa at different levels. We assume that the tree information is known a priori, although in many applications it can be learned from data using a variety of methods (Nielsen 2005). There may be uncertainties in this tree, and we will discuss the robustness of our method in the presence of tree uncertainties in the following.

Linear Regression Model and Variable Fusion
Since the goal is to understand the role of the structured microbiome space in explaining a phenotype, an effective approach is to pose the problem in a regression framework. In this article, we restrict our discussion to cases where the outcome variable is continuous or quantitative, although the general approach is applicable to cases where the trait of interest is discrete or qualitative. Specifically, we consider the linear model where β 0 is an intercept, β 1 , . . . , β p are regression coefficients, and 1 , . . . , n are independent and identically distributed (iid) errors with zero mean and finite variance.
High-throughput characterizations of microbial communities produce high-dimensional data. The scale of data may remain very large even after sequences have been assigned to OTUs. As such, dimension reduction is necessary. Our method is based on the biological assumption that phylogenetically close taxa may have similar associations with a host phenotype. Evidence supporting this can be found in several studies, see, for example, Ahn et al. (2013), Dey et al. (2013), and Garcia et al. (2014). Under this assumption, one natural way to handle the dimensionality problem is to conduct variable fusion by shrinking some |β j 1 − β j 2 | to zero. For example, we can solve the pairwise fused lasso (She 2010) problem where λ > 0 is the tuning parameter. This method fuses variables to each other and induces a clustering of variables.
Variable fusion is crucial in high-dimensional problems for interpretability and for improved predictive performance. However, the pairwise fused lasso treats all pairs of OTUs equally and thus fails to exploit the phylogeny of the OTUs. The recently proposed structure-constrained canonical correlation analysis ) used a ridge-type penalty λ j 1 < j 2 (β j 1 − β j 2 ) 2 /d 2 j 1 j 2 to smooth the coefficients of two OTUs j 1 and j 2 based on their closeness d j 1 j 2 on the phylogenetic tree. Similarly, we can use a tree-weighted penalty where the w j 1 j 2 are tree-based weights. The assumption that closely related taxa are likely to share a similar effect implies that w j 1 j 2 should be small when OTUs j 1 and j 2 are distantly related. Throughout this article, we set w j 1 j 2 = d α j 1 j 2 , with α ≤ 0 being another tuning parameter to be determined. Then, we need to solve the following problem: The pairwise fused lasso is a generalization of the ordinary fused lasso (Tibshirani et al. 2005) defined by minimize β0,β1,...,βp The latter is intended for situations in which variables have an ordering along which smoothness is expected. Since any ordering of OTUs is arbitrary, the ordinary fused lasso may be misleading.

Tree-Guided Fused Lasso
In a classification context, Tanaseichuk, Borneman, and Jiang (2014) recently proposed a method that uses the tree topology. To be specific, let V denote the collection of all nodes or vertices of the tree T . For each v ∈ V , let L v ⊆ {1, . . . , p} be the set of indices of OTUs that correspond to the leaves of the subtree rooted at v, and let β L v = (β j , j ∈ L v ). They adopted a grouplasso-type penalty λ v∈V β L v 2 based on the grouping of the OTUs along the tree. Here, · 2 denotes the Euclidean norm.
However, their penalty is not able to harness a predictive microbial signature made of a set of multi-level taxa, and as we will see in the next section, it does not take into account the compositional aspect of the microbiome data. To address these problems, we propose a novel tree-guided variable fusion method. The basic idea is to treat the internal nodes as potential features. In the development below, we construct two tree-weighted fused lasso penalties that encode the tree topology.
Denote by N the set of internal nodes of T . For each v ∈ N, let c v 1 and c v 2 be the two child nodes of v. For any set S, we denote by |S| the size of S. Let β = (β 1 , . . . , β p ) . The first tree-guided fused lasso penalty is defined by and 0 otherwise. In other words, the two child nodes c v 1 and c v 2 each take a proportion of the weight w c v 1 c v 2 of the parent node v ∈ N, relative to the sizes of their subtrees. Denote by L the set of leaves of T . Without loss of generality, let L = {1, . . . , p} and N = {p + 1, . . . , |V |}. Note that, for a binary tree with p leaves, |N| = p − 1 and |V | = 2p − 1. To design the second penalty, we use a bottom-up approach by first computing the penalty terms for all internal nodes with size 2 subtrees, then all with size 3 subtrees if any exist, and so on. Specifically, define A to be the level set such that, for each s ∈ A, there is v ∈ N such that s = |L v |. Let s h denote the hth smallest element of A. Let E = (e i j ) be a |V | × p matrix with e j j = 1 for j ∈ L and 0 otherwise. We denote by E j the jth row of E. The second tree-guided fused lasso penalty is defined by We now consider the tree-guided fused lasso problem Since each internal node represents the abundance of a taxonomic lineage, by incorporating the tree information node-by-node, the estimated microbial signature from our tree-guided fused lasso tends to be composed of a few taxonomic units at different depths. Formally, the variables indexed by L u for u ∈ N are fused together, defining a new variable indexed by u, if D v β = 0 for all the internal nodes v of the subtree rooted at u. Note that although in this article we focus on binary trees, the idea here can be naturally extended to K-ary trees.
It turns out that the ordinary fused lasso, pairwise fused lasso, and tree-guided fused lasso are nicely encapsulated by the generalized lasso (Tibshirani and Taylor 2011) formulation (7) where D(α) ∈ R m×p is a tree-weighted penalty matrix and · 1 denotes the l 1 norm. For the pairwise fused lasso, m = p × (p − 1)/2, while for the ordinary fused lasso and tree-guided fused lasso, m = p − 1. Note also that each component of D(α)β denotes a linear contrast of β 1 , . . . , β p , due to the nature of these fusion penalties.
For illustration, Figure 1 shows two binary trees, each with p = 4 leaves. For the pairwise fused lasso, the weighted penalty matrix is Here, the d j 1 , j 2 are calculated from branch lengths t 1 , . . . , t 6 and are tree dependent. Consider the tree in the left panel. The penalty matrices for the two tree-guided fused lasso penalties are respectively. For the tree in the right panel, the penalty matrices for our tree-guided fused lasso penalties are the same (in general they are not),

Compositional Data and Identifiability
For data collected in practical studies, different samples usually have very different numbers of sequences. To deal with this issue of sequencing depth, a typical step is to convert raw counts into proportions. The resulting percentage data, denoted by x ik , are compositional in that the relative abundances must sum up to one, p j=1 z i j = 1. This property is consistent with the fact that when the relative abundance of one OTU increases, the relative abundance of the rest of the community must necessarily decrease. However, it renders standard multivariate statistical methods inappropriate.
To understand this, we assume that {( i , z i1 , . . . , z ip ), i = 1, . . . , n} is an iid random sample on ( , Z 1 , . . . , Z p ) with p j=1 Z j = 1, and y 1 , . . . , y n are generated from the model where E( ) = 0 and E( 2 ) < ∞. Here, we assume that the microbiome's effect on the outcome variable is mediated through its taxonomic composition but not the sampling depth. This working assumption is reasonable since for the microbiome data, the number of sequencing reads can vary substantially across samples, and more importantly, the actual amount of the mixture of components is usually unknown. From (8), we have Furthermore, it is easy to check that for any constant c. That is, the coefficients β j are identifiable only up to a common additive constant. The situation here formally resembles standard analysis of variance, in which the dummy variables that code a multi-level categorical predictor sum up to one. Although the idea of variable fusion, or the biological assumption underlying it, is independent of the unit-sum constraint, variable fusion can handle this identifiability issue, as we show below.
Specifically, to make the p parameters identifiable, we can impose a constraint on them. Note that, for each k ∈ {1, . . . , p}, where β j (k) = β j − β k reflects the difference between β j and β k . The constraint is β k (k) = 0. Without loss of generality, assume that both Y and (Z 1 , . . . , Z p ) are centered, so the intercept is not included in the regression. To overcome the dimensionality problem, an alternative to dimension reduction is variable selection. For example, to get a sparse solution, we can consider the lasso (Tibshirani 1996) If β j (k) is estimated to be zero, then OTUs j and k are fused together. However, the lasso has two drawbacks. First, the solution to (9) depends on the choice of the reference index k. Second, the corresponding fusion pattern, which is the same as the sparsity pattern, is very rough (only those components with zero coefficients are fused together).
To overcome these problems, we note that for all 1 ≤ j 1 , j 2 ≤ p. This in turn means that it is the relative, rather than absolute, value of β j that matters, and that the choice of k has no impact on the differences β j 1 − β j 2 . Thus, to handle the dimensionality problem for compositional data, instead of conducting variable selection by penalizing |β j (k)| directly, we should conduct variable fusion by shrinking some |β j 1 (k) − β j 2 (k)| to zero, as we did in Sections 2.1 and 2.2. Taking the compositional aspect of the microbiome data into account, we now arrive at the following generalized lasso problem: Special instances of D(α) give rise to the ordinary fused lasso, pairwise fused lasso, and tree-guided fused lasso. Since each row of D(α) is a contrast vector, the solution to (10) is independent of k.

Computation and Tuning
We use the dual path algorithm of Tibshirani and Taylor (2011) to optimize (10). The algorithm is efficiently implemented in the genlasso R package. Given D(α), the main function in genlasso computes the solution path for all values of the tuning parameter λ. In addition, it provides an unbiased estimate of the degrees of freedom of the fit (Tibshirani andTaylor 2011, 2012). This is very important, since we can then use different model selection criteria, which employ degrees of freedom to assess risk, to select the tuning parameter. Let us denote the solution of problem (10) by {β j (λ, α; k), j = k}. Define where RSS(λ, α; k) = n i=1 {y i − j =k z i jβ j (λ, α; k)} 2 , κ is a penalty factor, and d f (λ, α; k) is an unbiased estimate of the degrees of freedom of {β j (λ, α; k), j = k}. Two popular criteria are AIC, which uses κ = 2, and BIC, which uses κ = log(n). For each α, we select λ by minimizing IC(λ, α; k) along the path. Throughout the numerical studies of this article, we set k = 1 and explore three values of α: −2, −1, and 0. The chosen α is the one giving the smallest IC value.

Gut Microbiome Data Analysis
The human gut harbors diverse microbes that play fundamental roles in the well-being of their host (Clemente et al. 2012). For example, studies have suggested that the gut microbiome is implicated in the etiopathogenesis of obesity (Turnbaugh et al. 2008;Ley 2010), although the casual nature of this association is somewhat controversial (Finucane et al. 2014;Rosenbaum, Knight, and Leibel 2015).
To demonstrate the use of our method, we applied it to a dataset from a human gut microbiome study carried out at the University of Pennsylvania (Wu et al. 2011). Specifically, for this study, fecal samples were obtained from 98 subjects and bacterial DNA was extracted using a standard protocol (Wu et al. 2010). After multiplexed 454/Roche pyrosequencing, about 900,000 high quality, partial (∼370 bp) 16S rRNA gene sequences were generated. These sequences were analyzed by the QIIME pipeline with default parameter settings (Navas-Molina et al. 2013): more than 17,000 OTUs were identified and their counts quantified at 97% of sequence similarity, the taxonomy was assigned using the RDP classifier (Wang et al. 2007), and the phylogenetic tree was inferred by FastTree 2 (Price, Dehal, and Arkin 2010). To reduce the number of OTUs, we merged them into genera (genus-level OTUs). We further dropped genera that were observed in less than 5 of the samples. Since the sampling depths were very different for different samples, we normalized counts into proportions. In our analysis, the microbiome data were summarized as a matrix Z = (z i j ) ∈ R 98×48 of relative abundances of 48 genus-level OTUs, and a phylogenetic tree T as displayed in Figure 2. The tree was rooted using an outgroup. More information about the tree can be found in Supplementary Materials, Section A. On the other hand, clinical measurements, including Body Mass Index (BMI), were also available for these 98 subjects. We are interested in constructing a predictive microbial signature for BMI that accounts for the phylogeny relating OTUs.
Note that our method requires a phylogenetic tree, but not a taxonomic table. While a phylogenetic tree can always be learned from molecular sequences, classifying the microbes into different taxonomies, from phylum to species level, necessitates the existence of a reference database that is often incomplete, because the vast majority of microbes have not yet been formally described. In addition, the taxonomy may not be consistent with the tree structure (see Figure 2).
We used Monte-Carlo cross-validation (also called sample splitting) to evaluate the prediction performance of various methods, including the lasso defined in (9). More specifically, we sampled 80% of the observations without replacement, fitted the model using that subsample, and used the remaining  observations to assess the predictive accuracy of each method. The results of applying this procedure 200 times are summarized in Table 1 and Figure 3. As we can see, the AIC criterion was inferior to BIC in terms of prediction accuracy. Furthermore, when BIC was used, tree-guided variable fusion helped improve prediction slightly. For the tree-guided fused lasso with BIC, we checked the fusion pattern along the tree. Table 2 reports the subtrees (indexed by the internal nodes of T ) that appeared in the model at least 160 out of 200 times. Here, a subtree is said to be in the model if all its leaves are fused together. We see that there was considerable overlap in the sets of subtrees from two versions of the tree-guided fused lasso.
To assist in the interpretation of results, Table 3 shows the taxonomic memberships for some of the genus-level OTUs that were fused along the tree. In this table, the first column "Internal node" lists the internal nodes whose leaves were all fused together by our method. For example, for the internal node 50, its six leaves (i.e., Lachnospiraceae I. S., Erysipelotrichaceae I. S., Roseburia, Ruminococcaceae I. S., Lachnospira, and Dorea) were fused. We can see that these OTUs were grouped into key phyla in the human gut microbiota: Firmicutes, Bacteroidetes, Actinobacteria, and Proteobacteria. Furthermore, some OTUs were also mapped together at lower taxonomic levels. It has been experimentally shown that in humans on weight-reduction diets, the decrease in Bacteroidetes was accompanied by an increase in Firmicutes (Ley 2010). In addition, a study with human twins manifested the interplay between Bacteroidetes and Actinobacteria in obese individuals, with a reduction of the former and a corresponding increase in the latter (Turnbaugh et al. 2008). Generally, the observed shift in the relative abundances of these phyla is closely related to energy harvest (Kinross, Darzi, and Nicholson 2011).

Simulations
In this section, we present a simulation study to evaluate the performance of the proposed methods. To mimic the kind of data that we might see in metagenomic applications, we use the compositional data matrix Z and the phylogenetic tree T from the gut microbiome data described in the previous section, where n = 98 and p = 48. The continuous outcome is generated from the model . . , n, where β 1 , . . . , β p are specified in Table 4, and 1 , . . . , n are iid N(0, σ 2 ). Three values of σ are considered: 1.1, 1.8, and 2.7. The corresponding signal-to-noise ratios are 4.10, 1.53, and 0.68, respectively. We note that the β j here were learned from the real data by the second version of the treeguided fused lasso. Figure 4 shows a graphical summary of the fusion pattern among the OTUs. As we can see, three OTUs are fused but not along the tree (two OTUs are fused together if they have the same coefficients), hence the fusion pattern is not fully consistent with the tree topology.
For each of 200 simulated datasets, we applied the lasso, ordinary fused lasso, pairwise fused lasso, and two versions of treeguided fused lasso. To evaluate the performance, we computed  = the number of internal nodes whose child nodes are correctly fused the number of internal nodes whose child nodes are truly fused , (d) the false positive rate with respect to internal nodes, FPR N = the number of internal nodes whose child nodes are falsely fused p − 1 − the number of internal nodes whose child nodes are truly fused , (e) the model size after fusion (the true value is 15), and (f) the mean squared error Zβ(k) − Zβ(k) 2 2 /(nσ 2 ). The simulation results are summarized in Table 5 for σ = 1.1, and Tables B1 and B2 in Supplementary Materials, Section B for σ = 1.8 and σ = 2.7. As we can see, the pairwise fused lasso was very aggressive, with a high false positive rate with respect to internal nodes and a small model size after fusion. We can also see that for the lasso and ordinary fused lasso, the two types of true positive rates were relatively low. On the other hand, the two versions of the tree-guided fused lasso behaved similarly and performed reasonably well: they had lower false positive rates than the pairwise fused lasso, and higher true positive rates than the lasso and ordinary fused lasso. Generally, BIC outperformed AIC, and the performance of all methods became worse as the signal-to-noise ratio decreased.
In Supplementary Materials, Section B, we further examined the impact of the reference index k on the performance. The results shown in Table B3 confirmed that the solution of the lasso varied with k. Furthermore, the ordinary fused lasso and treeguided fused lasso were not affected much by the choice of k, while the pairwise fused lasso was numerically very unstable. This is because the penalty matrix for the pairwise fused lasso has a large number of rows, making the resulting optimization much more challenging. In Supplementary Materials, Section C, we ran simulations when the z i j values were generated from a logistic normal distribution rather than fixed, and when p > n. The results reported in Tables C1 and C2 are qualitatively similar, except that the performance of all methods deteriorated as the sample size decreased. Table 6 shows the average CPU seconds for computing the entire path of solutions. All timings were carried out on the Yale Louise high performance cluster (Dell m620 system, 8-core processor, 48G of memory) using R version 3.1.0. As expected, the pairwise fused lasso was much slower than its competitors. We note that the absolute numbers here might be different on different machines, and that it is the relative values that matter.
So far we have assumed that closely-related OTUs in a tree have a similar function because of shared evolutionary ancestry. In reality, this assumption could be violated (see the Discussion section). To further examine the performance of the tree-guided fused lasso, in Supplementary Materials, Section D, we conducted a simulation study in which the tree structure encoded by our penalties was partly or fully inconsistent with the fusion pattern among the true coefficients. The results reported in Figure . The fusion pattern for the simulation study. Black leaves stand for OTUs that are not fused, blue leaves represent OTUs that are fused along the true, orange leaves show OTUs that are fused but not along the tree, and red internal nodes indicate subtrees whose leaves are fused.
Tables D1 and D2 indicate that the tree-guided fused lasso achieved a level of robustness when the tree structure was partly inconsistent with the fusion pattern, and had inferior performance when the tree was fully inconsistent.

Discussion
In this article, we have introduced the concept of variable fusion for dimension reduction in linear regression with compositional covariates (i.e., bacterial species or OTUs) related in a Table . The averages of the true positive rate (TPR L ) and false positive rate (FPR L ) with respect to leaves, true positive rate (TPR N ) and false positive rate (FPR N ) with respect to internal nodes, model size after fusion (MS, with standard deviation in parentheses), and mean squared error (MSE, with standard deviation in parentheses), based on  data replications, are reported for the lasso (LASSO), ordinary fused lasso (OFL), pairwise fused lasso (PFL), and two versions of tree-guided fused lasso (TFL- and TFL-) when σ = 1.1. phylogenetic tree. To encourage hierarchically close species to have similar effects on the phenotype, we have proposed the tree-weighted pairwise fused lasso using branch lengths as weights. To construct a predictive microbial signature composed of taxa at different levels, we have designed two tree-guided fused lasso penalties that use the tree topology as well as branch lengths. Both real data analysis and simulations have shown that the tree-guided fused lasso performs better than other methods, such as the lasso and ordinary fused lasso. There are three possible extensions. First, it is easy to extend tree-guided fused lasso penalties from binary trees to general K-ary trees. Second, we can extend the tree-guided fused lasso from linear models to generalized linear models. Finally, treeguided variable fusion can be extended to other multivariate analysis methods, such as principal component analysis. Below we discuss some issues related to our methodology.
The idea of fusion and the method of fusion along the tree are very natural under the biological assumption that Table . The average CPU timings (seconds) for computing the solution path, based on  data replications, are reported for the lasso (LASSO), ordinary fused lasso (OFL), pairwise fused lasso (PFL), and two versions of tree-guided fused lasso (TFL- and TFL-) when σ = 1.1.

LASSO
. OFL . PFL . TFL- . TFL- . closely-related species in a tree may function in a similar manner. However, there are situations in which this assumption could be violated. Clostridia are a good example, where some species convert dietary fiber into anti-inflammatory short chain fatty acids, while others cause severe colitis (Bartlett et al. 1978;Lopetuso et al. 2013). How to fuse bacterial taxa based on their biological function is an interesting direction for future study. Our method assumes that the tree information is known a priori. Typically, the tree is inferred from molecular sequences, as was the case in the real-data analysis. Since our tree-guided fused lasso penalties use the discrete topology, not just treeassociated distances, accurate methods, especially likelihoodbased methods (Price, Dehal, and Arkin 2010;Stamatakis 2014), are recommended for learning trees from data. Another important issue is tree rooting. We note that rooting is not part of tree inference and rooting error is in addition to tree-estimation error. We conducted a small simulation (Supplementary Materials, Section E) and found that there was a loss in accuracy if the root was wrongly inferred, suggesting that rooting error could have a large impact on tree-based methods.
To use the tree topology, we consider the number of subtree nodes as a measure of phylogenetic diversity. A potentially better measure is the branch length. For example, suppose we have a "bushy" subtree of many closely-related nodes (i.e., short branches) and a subtree with long branches. Using the subtree size would treat them equally, but fusing on a bushy tree might be more justified. Clearly, our way of simply using branch lengths as weights is inefficient. As suggested by one referee, our tree-guided fused lasso penalties are related to the so-called "phylogenetic independent contrasts" (Felsenstein 1985), which have the potential advantage of using branch lengths in a more complex but more biologically relevant way. More generally, it is an interesting topic to investigate possible improvements based on the models used in phylogenetic comparative methods (Garamszegi 2014).
To handle the dimensionality problem, variable selection is an alternative to variable fusion. The reason why we do not consider variable selection in the current framework is that, unlike variable fusion, it does not provide a natural way to deal with the compositional aspect of the microbiome data (i.e., the unit-sum constraint). Furthermore, the sparsity pattern is a very rough fusion pattern and is less biologically meaningful. Nevertheless, identifying differential features is one of the central goals for microbiome studies (Albanese et al. 2015). We are working along this line, under a different framework explored by Lin et al. (2014).

Supplementary Materials
The online supplementary materials include the following: Section A: Additional tree information Section B: Additional simulation results Section C: Additional simulations Section D: Impact of the tree structure Section E: Issue of tree rooting