NOTUNG: A Program for Dating Gene Duplications and Optimizing Gene Family Trees

Large scale gene duplication is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast, and vertebrate genomes. The use of evolutionary trees to analyze the history of gene duplication and estimate duplication times provides a powerful tool for studying this process. Many studies in the molecular evolution literature have used this approach on small data sets, using analyses performed by hand. The rapid growth of genetic sequence data will soon allow similar studies on a genomic scale, but such studies will be limited unless the analysis can be automated. Even existing data sets admit alternative hypotheses that would be too tedious to consider without automation. In this paper, we describe a program called NOTUNG that facilitates large scale analysis, using both rooted and unrooted trees. When tested on trees analyzed in the literature, NOTUNG consistently yielded results that agree with the assessments in the original publications. Thus, NOTUNG provides a basic building block for inferring duplication dates from gene trees automatically and can also be used as an exploratory analysis tool for evaluating alternative hypotheses.

°Every node in the tree represents either a speciation or a duplication event.It is possible to nd the set of duplication nodes by comparing the gene family tree to a species tree such as the cartoon of the Tree of Life shown in Figure 2. Hughes identi ed two duplication nodes (14 and 15).There are two more duplication nodes in the RXRB clade (3 and 6) that he does not mention.°Bounds on the time of duplication, given in terms of major speciation events, can be inferred for each duplication node from the relative positions of speciation and duplication nodes in the tree.According to the topology shown in Figure 1, duplications 14 and 15 are both bounded above by the divergence of vertebrates and insects and bounded below by the divergence of tetrapods and bony shes.The upper FIG. 1.A rooted Neighbor Joining tree for the RXR family reproduced from Hughes (1998).Interior nodes are labeled numerically.Labels in square brackets represent the percentage of bootstrap samples supporting that branch leading from the label to the root.Values µ 50% are not shown.

FIG. 2.
A species tree showing major speciation events in the eukaryote lineage.
bound can be inferred from the clustering of insect genes outside the gene family clades and the lower bound from the presence of a sh gene in each subfamily clade.°When a duplication hypothesis depends on a node with weak support in the sequence data, alternative hypotheses should be considered.Because the bootstrap values associated with the zebra sh branches in Figure 1 are low, topologies in which zebra sh genes do not cluster within the subfamilies should also be considered.For this reason, the divergence of the amphibian lineage may be a more reliable lower bound for duplications 14 and 15.
Our results.Hughes' analysis is typical of many studies in the biology literature of gene duplication using ad hoc analysis of gene family trees (Endo et al., 1997;Hughes, 1998;Hughes, 1999;Kasahara, 1997;Pebusque et al., 1998;Ruvinsky and Silver, 1997).These analyses are based on the assumption that a rooted GFT is a hypothesis concerning the evolution of a gene family.The duplication history of the family can be inferred from a GFT, where we de ne the duplication history to be a list of the duplications that occurred and a time range for each.If the topology of the tree is unambiguous and the tree is rooted, then the inferred history is unique.
However, as Hughes' discussion of weak bootstrap values illustrates, in many instances it is not possible to infer the tree topology unambiguously from the sequences and, in this case, alternate histories must be considered.Alternate histories must also be considered if the tree is unrooted.While a rooted GFT is a hypothesis concerning the evolutionary history of a gene family, an unrooted gene family tree represents a set of such hypotheses, one for each possible rooting.Unlike the RXR example, many of the trees reported in the literature are unrooted because it is frequently not possible to nd a sequence from the gene family in a suitable outgroup species.
These considerations suggest two computational problems.The solution to the rst problem is used as a subroutine in the solution of the second.
1.When the correct rooted gene family tree is known, the problem is to infer the duplication history from the tree.2. When the correct rooted GFT is not known, the problem is to nd the rooted gene family tree that best represents the evolutionary history of the gene family.
The most general approach to the second question is to consider all rooted binary trees with respect to an optimization criterion based on a model of gene family evolution.That is, given G, a set of orthologous and paralogous gene sequences from a gene family, and T S , a binary species tree containing exactly the species in G, nd the rooted binary tree, T G , whose leaf set is G, that optimizes a given optimization criterion.The forces that govern the evolution of gene families involve gene duplication and loss as well as the evolution of the gene sequences themselves, as we discuss at greater length later.The optimization criterion should therefore take all three processes into account.However, it is not obvious how to determine the relative importance of macroscopic mutations like gene duplication and loss with microscopic mutations like point mutations in a single optimization criterion.Instead, our approach is to start with a gene family tree inferred from sequence alone.When the sequence data is not suf cient to reconstruct an unambiguous, rooted GFT, we use optimization criteria based on gene duplication and loss to consider alternate hypotheses.

CHEN ET AL.
We review previous work involving the relationship between gene family trees and species trees in the next section.After that, we present a linear time algorithm for inferring the duplication history of a gene family from a rooted GFT.We then address the problem of nding the optimal alternate hypothesis, when the correct rooted GFT is not known.After introducing two optimization criteria for GFTs based on duplication and loss, we discuss how to use them to select the optimal rooting of an unrooted tree and present an algorithm for nding the optimal gene duplication history given a rooted GFT with weak branches.Although, in the general case, the number of alternate hypotheses is superexponential in the number of taxa, we show that with respect to the criteria introduced it is possible to vastly reduce the search space if the weak branches are sparse by identifying sets of weak branches that can be evaluated independently.
We implemented these algorithms and tested them on gene family trees published in the molecular evolution literature (Hughes, 1998;Pebusque et al., 1998;Ruvinsky and Silver, 1997).As summarized in Section 5, the gene family histories generated by our program are consistent with the assessments presented in the original papers.For unrooted trees and nodes with low bootstrap values, our program generates and scores alternate hypotheses, providing an exploratory analysis tool.In addition, an explicit statement of all hypotheses helps mitigate any biased expectations of the data the user might have.

RELATED WORK
The problem of disagreement between gene trees and species trees was rst raised by Goodman et al. (1979) in the context of inferring a species tree from a gene tree that may contain paralogies.They introduced the notion of a mapping between a gene tree and a species tree and suggested a cost function for evaluating a species tree with respect to a gene tree based on edit distance, gene duplication, and gene loss.
These concepts were further developed and formalized in Guigo et al. (1996), Hallett and Lagergren (2000), Ma et al. (1998), Ma et al. (2000), Mirkin et al. (1995), Page (1994), Page and Charleston (1996), Stege (1999), Zhang (1997).Formally, given a set of rooted gene trees, fT G g, the problem is to nd the species tree, T S , that optimizes an evaluation criterion.Several optimality criteria have been proposed (see Eulenstein et al. [1996], Eulenstein et al. [1998] for a comparative survey), all of which attempt to capture the notion that gene duplication and subsequent loss are rare events.These criteria involve constructing a mapping, M : T G ® !T S , between a gene tree and a species tree, that is used to compute the cost function.Several authors have pointed out that it is dif cult to distinguish true gene loss from genes that have not yet been sequenced, and they discuss approaches to distinguishing true losses from apparent losses in the cost function (Goodman et al., 1979;Mirkin et al., 1995;Page and Charleston, 1996).
When inferring a species tree from a gene tree, the gene tree is assumed to be correct and the true species tree is unknown.We, on the other hand, assume that the true species tree is known and use it to infer the duplication history from a gene tree.While we share some mathematical structure with Guigo et al. (1996), Hallett and Lagergren (2000), Ma et al. (1998), Mirkin et al. (1995), Page (1994), Stege (1999), most notably the mapping M. /, we consider the problem of dating duplication events and generating and evaluating alternate hypotheses.The problem of nding an optimal species tree is NP-hard (Ma et al., 1998;Ma et al., 2000) for the optimality criteria considered so far.In contrast, dating duplication events in rooted and unrooted trees is a computationally tractable problem, which is crucial if we hope to apply this to large data sets.
The methods for inferring species trees from gene trees surveyed here do implicitly generate duplication histories in rooted trees although the time of duplication is generally not considered.In addition, most optimality criteria surveyed here are subject to the constraint that each species may be represented only once in G and hence would not be suitable for our application.A notable exception is the work of Page and Charleston (1997), who have developed two software packages, COMPONENT and GENETREE, that, as well as inferring species trees, will compute and display duplication histories for rooted gene trees.This provides an interactive, exploratory analysis tool, but cannot be used to automate the analysis of large data sets.
The rst analyses of alternate rootings of unrooted gene trees have appeared very recently.Hallett and Lagergren (2000) presented an algorithm to infer the species tree from a set of rooted gene trees in polynomial time under the restriction that the species tree can be reconciled with the gene tree using at most k simultaneous copies of the gene along its branches.In this context, they also developed a polynomial time algorithm that selects a root for each unrooted gene tree such that the total number of duplications required to explain the data is minimized.Their algorithm considers all possible rootings of all trees using a brute force approach.In work developed independently and presented simultaneously, we (Chen et al., 2000) presented an algorithm that, given an unrooted gene tree and a xed species tree, ranks all possible rootings of the gene tree according to an optimization criterion that estimates the plausibility of the resulting duplication histories.By using a data structure that stores intermediate results, we avoid the brute force approach, ranking all rootings in time linear in the size of the gene tree.In the current paper, we apply this approach to a wider range of optimization criteria based on duplication and loss.None of the work surveyed here addresses alternate hypotheses due to weak edges.In Chen et al. (2000), we introduced this problem and proposed a heuristic solution.In the current paper, we develop an exact approach.

INFERRING GENE FAMILY HISTORIES FROM ROOTED TREES
The process of inferring a duplication history from a gene family tree is illustrated in Figure 3 which shows an unrooted GFT for a hypothetical gene family, A, and three possible rootings of that tree.Each rooted tree corresponds to a single duplication history.For example, the rooted tree in Figure 3(b) suggests that gene A was present in the common ancestor of sh and tetrapods and was duplicated after the divergence of sh and before the separation between birds and mammals.In contrast, Figure 3(c) implies that the duplication took place before the divergence of sh and tetrapods.Although there is only one sh sequence, it clusters with the genes in the A 2 family, suggesting either that the sh A 1 gene has been lost due to mutation or that it has not yet been sequenced.Because these trees are rooted, we can observe evidence of the duplication through clustering, even when one of the two paralogs is missing.
The problem of reconstructing a duplication history from a rooted gene family tree can be stated formally as follows: Problem 1.Given a rooted GFT, T G , and T S , a binary species tree containing only the species in T G , identify all duplication nodes and determine lower and upper bounds on the time of each duplication in terms of speciation events.Both the identi cation of duplication nodes and the calculation of duplication dates require constructing a mapping, M. /, from every node in T G to a target node in T S .We construct this map as follows.Let v be a node in T G and let l.v/ and r.v/ be its left and right children, respectively.M. / maps each leaf node in T G to the node in T S representing the species from which the sequence was obtained.(Leaf nodes in T G represent sequences, whereas leaf nodes in T S represent species.)Each internal node in T G is mapped to the least common ancestor (lca) in T S of the target nodes of its children; that is, M.v/ 5 lca.M.l.v//; M.r.v///.For example, in Figure 3(b), the leaf nodes are mapped to sh, chicken, mouse, chicken, mouse, from top to bottom.M.c/ 5 amniote, since the lca of mouse and chicken is amniote in the Tree of Life (Figure 2).Nodes b and d both map to amniote, while M.a/ 5 jawed_vertebrate.
An algorithm for constructing the mapping, M. / and identifying duplication nodes has been developed independently in the context of using multiple gene trees to generate a species tree.By using fast lca queries Bender and Farach-Colton (2000) and JáJá (1999),1 M. / can be computed in O.jT G j/ time.While our goals are different, we share a key algorithmic component with this work.We refer the reader to Mirkin et al. (1995) for a complete description and proofs.
Observe that, under the mapping, a node v in T G is a speciation node if its children are mapped to independent lineages in T S .In Figure 3(c), x is a speciation node since mammals and birds are separate lineages.If the children of M.v/ share a lineage, then v is a duplication node.When this occurs, either both children have the same label or one child's target in T S is an ancestor of the other's and v will be mapped to the same label as the ancestral child.For example, node w is a duplication node in Figure 3(c) because M.y/ 5 jawed_vertebrate is an ancestor of M.x/ 5 amniote.
The mapping, M. / can also be used to compute lower and upper bounds on the time of duplication.Let v be a duplication node in T G .Since copies of the duplicated gene are observed in descendents of both l.v/ and r.v/, the duplication must have been present in their least common ancestor, yielding the lower bound L.v/ 5 M.v/.By a similar argument, the upper bound can be shown to be the target of the nearest ancestor, a v , of v that is a speciation node.Since copies of the duplicated gene are present in only one of the subtrees rooted at children of a v , the duplication must have occurred in a more recent species.If v has an ancestor that is a speciation node, we set the upper bound U .v/ 5 M.a v /.Otherwise, U .v/ is the origin of life.For example, in Figure 3(c), the bounds on the duplication node, w, are L.w/ 5 jawed_vertebrate and U .w/ 5 1, since w is the root node of T G .In Figure 3 Observation 2. The duplication associated with a node, v, in T G , occurred after the speciation event U .v/ and before the speciation event L.v/ 5 M.v/.
Notice that we can compute U in O.jT G j/ time.

EVALUATING ALTERNATE GENE FAMILY HISTORIES
As the example in Figure 3 shows, alternate rootings of an unrooted GFT give alternate duplication histories.Alternate hypotheses such as these can be evaluated in terms of the number of gene duplications and losses implied by the associated rooted trees.For example, Figure 3(a) requires a single gene duplication to explain, while Figure 3(c) involves one gene duplication and one gene loss.In contrast, the tree shown in Figure 3(d) has three subfamilies resulting from two duplications (nodes p and q) and four losses.A duplication and loss model can also be used to evaluate rearrangements of weak branches.
We formalize the intuition provided by this example in the remainder of this section.First, we develop two optimization criteria based on duplication and loss for evaluating alternate hypotheses.We then discuss how to apply these to unrooted trees.Finally, algorithms for evaluating rearrangements of weak edges with respect to these criteria are presented.

Optimization criteria for duplication histories
We consider two criteria for evaluating alternate gene duplication histories.The rst scoring function, C dl ./, calculates gene duplication and loss explicitly and can be stated as where ± represents the number of duplication nodes in T G , ¸represents the number of gene losses, and c ¸and c ± are constants that determine the relative importance of duplications and losses in the scoring function.The best choice of c ¸and c ± is an open problem.In this paper, we use c ¸5 c ± 5 1 consistent with the work cited earlier.
We de ne ¸to be the most parsimonious number of gene losses required to explain the duplication history.That is, if an entire subtree of T S is missing from T G , we assume that the loss occurred at the root of the subtree and count it as a single event.For example, in Figure 4, we assume that B 2 was lost in the common ancestor of mouse and chicken rather than two independent gene losses in the leaf species, resulting in ± 5 1 and ¸5 1.
The number of duplications, ±, is a by-product of the algorithm for inferring duplication histories described in the previous section.The loss can be computed by summing the gene losses over all edges in T G : The loss associated with the edge e 5 .parent;child/ is given by EDGELOSS 5 .DEPTH .M.child// ¡ DEPTH .M.parent// ¡ 1/ 1 ISDUP .parent/,where DEPTH .M.v// returns the depth of the target of node v in T S (the species tree for the species in T G only) and ISDUP .v/ 5 1 if v is a duplication node and zero otherwise.The rst term in this equation counts gene losses associated with speciation nodes.The labels of a speciation node and its child should differ by one if no loss has occurred and will increase by one with each gene loss.In contrast, if no gene loss has occurred on the edge immediately below a duplication node, the labels of a duplication node and its child should be identical.Thus, if parent is a duplication node, the rst term will be off by one.The second term corrects for this case.
We de ne a second cost function, C ub ./, that is an upper bound on the maximum number of missing leaf nodes plus the number of duplication nodes.Let the cost, k.t /, of a node t in T S be the number

CHEN ET AL.
of leaf nodes under t.That is, k.t/ is the maximum number of gene losses that could occur following a duplication in the ancestral species t .If t is a leaf, then k.t / 5 1.The cost of the gene tree, T G , is where D is the set of duplication nodes in T G .Under C ub ./, the cost of the tree in Figure 4 is four, since k.M.q// 5 k.jawed_vertebrate/ 5 3.In comparison, under C dl ./, the cost of the tree is two, when c ¸5 c ± 5 1.
These criteria represent two extremes on the spectrum of gene-loss models.If a gene is missing in species which share a common ancestor, is this due to a single loss in the common ancestor or did several independent losses occur?C dl ./ represents the minimum number of gene losses required to explain the data, while C ub ./ is an upper bound on the maximum number of possible gene losses.We believe that gene loss is a rare event and that in most cases the more parsimonious model is likely to be the correct one.However, by testing our methods on two diametrically opposed models, we can gain insight into how sensitive our methods are to the choice of model.

Unrooted trees
Unlike a rooted tree, which encodes a single evolutionary hypothesis, an unrooted tree with n 5 jEj edges represents up to n different hypotheses, one for each possible rooting.We propose to use the two cost functions presented above to rank these n hypotheses, making it possible to consider the most biologically plausible hypotheses rst.We can state the unrooted tree problem formally as follows: Problem 2.1.Given an unrooted GFT, T G 5 .V ; E/, and an optimization criterion, C. /, determine the duplication history of the tree rooted at e for every edge e 2 E. Rank the histories according to the criterion C. /.
To do this, we must label each node in T G as either a duplication or speciation node under every possible rooting.However, the mapping M. / changes with the position of the root.A simple quadratic time algorithm would be to apply the rooted tree algorithm to every possible rooting.However, we can derive a linear time algorithm to compute all labelings as follows.Notice that, with respect to a node v, we can partition all possible rootings of the tree into three groups: the root must be in one of three directions, depending on which of the edges incident on v is on the path from v to the root.Let e 1 , e 2 , and e 3 be the edges incident on v.The status of v as either a duplication or speciation only depends on which edge points towards the root.This is because if we x which edge is up, the subtree rooted at v is xed and so is the bottom-up lca computation.Now, we need only compute M e 1 .v/,M e 2 .v/,M e 3 .v/,and one M. / value for each possible "up" edge, from which we can compute the labeling under any desired rooting in linear time.
To compute the three values, we simply do the recursive computation at each node in any order.That is, suppose we want to compute M e 2 .v/for some v.This determines which two nodes are down.Call them u and w.Then M e 2 .v/ 5 lca.M fv;ug .u/;M fv;wg .w//.We recursively compute M fv;ug .u/and M fv;wg .w/.In order to keep from recomputing the same value over and over, we simply store all values in a table as we compute them.Thus, after we have computed M fv;ug .u/once recursively, we can look it up in constant time without need for recomputation in the future.Thus, all 3n values can be computed in O.n/ time.Similarly, any cost function that depends on a function of M. /, including C dl ./ and C ub ./, can be computed in O.n/ time.Thus, linear time is suf cient to rank all n rootings and print a constant number of top ranking duplication histories.Printing all duplication histories takes longer.

Rooted tree rearrangements
As in the case of unrooted trees, alternate hypotheses for weak edges can be selected with respect to an optimization criterion.While the following analysis can be applied to any arbitrary set of edges deemed suspect by the user, typically bootstrapping (Efron and Gong, 1983) is used to associate a measure of con dence with every edge in a phylogeny.We use a rearrangement strategy to identify appropriate alternate hypotheses for weak edges as follows.
The removal of any edge, e, in a tree bipartitions the set of leaf nodes.If the bootstrap value of e is low, it suggests that the evidence in the data for that bipartition is weak.It does not re ect on the certainty of the structure of any other part of the tree.In reconstructing the duplication history of a rooted GFT, we consider alternate hypotheses associated with a weak edge, e, by generating Nearest Neighbor Interchanges (NNI's) around e. Examples of NNI rearrangements are shown in Figure 5 (see, for example, Swofford et al. [1996] for a more detailed description).The NNI around an edge e 5 .parent;child/ shown in bold is effected by swapping the subtree rooted at the sibling of child with either its left or its right subtree.This rearrangement generates alternate bipartitions for e while leaving all other bipartitions associated with the tree unchanged.
An NNI in T G will change the mapping, M : T G ® !T S , resulting in a new mapping and in some cases, will also change cost of the tree and the duplication history.For example, Figure 5(a) shows a scenario where, presumably, the frog and sh genes were incorrectly placed with respect to each other due to a weak signal in the sequence data.The two internal nodes in this tree fragment are both labeled vertebrate.NNI a 00 changes the label of the deeper internal node, thereby eliminating a false duplication.The example in Figure 5(b) is more ambiguous.The middle tree represents a duplication before the divergence of sh followed by a loss in the sh lineage.The rearrangement b 00 changes M. / and moves the duplication to the deeper node, resulting in a duplication after the sh-tetrapod split and eliminating the gene loss.Cost functions based on gene loss tend to favor histories with more recent duplications (e.g., b 00 rather than b).This is a disadvantage to the duplication/loss model since both b 00 and b are plausible and which scenario is more likely requires specialized knowledge of the gene family.
We now state the problem of optimizing a tree with weak edges formally: Problem 2.2.Let T G 5 .V ; E/ be a rooted GFT and let W E be a set of weak edges.De ne T G;W to be the set of trees that can be derived from T G by NNI operations across edges in W . Find the tree T ¤ G 2 T G;W such that T ¤ G optimizes a given criterion, C. /.
Note that, when W 5 E, the problem reduces to that of nding the optimal gene family tree counting only duplications and losses and ignoring sequence data.

CHEN ET AL.
It is not known if an ef cient algorithm for nding T ¤ G exists.No doubt the existence of such an algorithm depends on the choice of C. /.In Chen et al. (2000), we considered the following simple heuristic for nding optimal trees: Enumerate the weak edges bottom-up and perform any NNI that reduces the label of the child node of the weak edge under consideration.Once that choice has been made, the weak edge is considered xed and rearrangements of weak edges higher in the tree are considered.While we found that this greedy heuristic works well on the real data sets we applied it to, it is possible to produce biologically plausible examples where the heuristic performs badly.
In the current paper, we therefore consider algorithms that nd the actual optimum.The naïve algorithm for such an optimization is to enumerate all possible trees and to evaluate C. / on each one.This is clearly not feasible in general, since the number of trees is superexponential in the number of taxa.However, in many practical cases, the weak edges are sparse and typically appear in small connected components.We will show that for some choices of C. /, in particular C dl ./ and C ub ./, it is suf cient to optimize these connected components independently rather than exhaustively enumerating all trees in T G;W , reducing the cases to be considered to a feasible number.
Consider the tree in Figure 6, which contains four weak edges, shown in bold.These edges form two connected components: ((a,b),(a,e),(e,f)) and ((k,m)).Given a connected component of weak edges, W i , let T i be the rooted binary tree formed by adding all edges that are immediate descendents of a weak edge to W i .T i has k i 5 jW i j 1 2 leaves.For example, for the connected component ((a,b),(a,e),(e,f)), T i is the tree with root a and leaves c, d, g, h and i. (Note that the leaves of T i may be subtrees of T G .)Thus, each connected component induces a rooted binary tree embedded in the GFT.For each such T i , the set of trees that can be generated by applying an arbitrary sequence of NNIs to edges in W i is equivalent to the set of all rooted binary trees with the same root and leaf set as T i .We call this set T i .The number of trees in i be the tree from T i that optimizes C. / and let TG be the tree obtained by replacing each T i with T ¤ i .The tree, TG , can be generated through independent optimization of the fT i g by the following algorithm: Note that there may be more than one optimal rearrangement T ¤ i of T i .If T i is optimal, we report this as T ¤ i .Otherwise, we pick an arbitrary optimal solution to be T ¤ i .Below, we show that, when C. / is C dl ./ or C ub ./, optimization of each T i independently will yield the globally optimal tree, T ¤ G .Thus, we do not need to consider the product of all possible rearrangements of all T i and this approach will substantially speed up the computation in cases where no T i is too large.Our exhaustive algorithm will, in fact, run in time linear in the number of components and exponential in the size of the largest component.
Theorem 1.Let C. / be any cost function such that the cost of the subtree of TG rooted at any node v is of the form C.T v / 5 C.T l.v/ / 1 C.T r.v/ / 1 f .M.v/; M.r.v//; M.l.v///; (1) that is, the cost of a tree rooted at v depends on the cost of its subtrees and a function, f ./, that depends only on the labels of v and its children.Then TG 5 T ¤ G .
Proof.Let TG be the tree resulting from running Algorithm A on T G 5 .V ; E/.We show that C.T v / is optimal for any vertex, v 2 V , if v is the root of some T i or if v is not in S fT i g.
Case 1: is a root of some weak component, .In this case, C.T v / is optimal by brute force optimization.
Case 2: is outside the .C.T v / is optimal if each of the terms of Equation 1 is optimal.C.T l.v/ / and C.T r.v/ / are optimal by induction.Thus, it is suf cient to show that there is no rearrangement that can reduce the value of f ./ to show that C.T v / is optimal.To see that this is true, note that the label of any node depends only on the set of leaves below it.Although rearrangement can change the structure of the fT i g and hence the labels of their interior nodes, the labels of the roots of the fT i g and nodes outside them are unchanged.
Since the root, r, of T G is either outside of S fT i g or a root of some T i , C.T r / is optimal.Thus, TG 5 T ¤ G .
Proof.Both C d l ./ and C ub ./ satisfy Equation 1.
The time to optimize a single connected component, W i , is the product of the number of possible rearrangements of T i and the time to score each rearrangement.Thus, the time to compute optimal rearrangement tree using Algorithm A is where N is the number of components.The term corresponding to the number of rearrangements of the largest component will dominate and is exponential in the number of edges in the largest component.However, the running time is linear in the number of components.If weak edges are sparse, and in particular, individual components are small, this run-time will not be onerous in practice.In the trees culled from the literature for this study, no connected component had more than three edges.NOTUNG 's running time on these trees never exceeded one second of cpu time per tree on an UltraSparc 10.

EXPERIMENTAL RESULTS
The algorithms described in the previous sections have been implemented in a Java program called NOTUNG.NOTUNG takes a gene family tree, T G , a species tree, T s , and a bootstrap threshold, ¿ , as input.The user may select either of the cost functions described in the previous main section to evaluate alternate hypotheses.Input trees are represented in Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html).For rooted trees, NOTUNG generates a gene duplication history as output; that is, a list of duplication nodes, with bounds on the time of duplication for each one.If the rearrangement option is turned on, NOTUNG will also compute the optimal rearrangement tree according to Algorithm A. Currently, we use the tie-breaking rule described in the previous subsection.In this case, NOTUNG will present the original duplication history and the history for the optimized tree, giving scores for both.In the next release, we plan to report all possible optima to the user.For unrooted trees, NOTUNG considers all possible rootings and computes a duplication history for each.These histories are ranked according to the cost function selected.Note that our algorithms, as designed, work with binary and higher degree trees, but NOTUNG currently only handles binary trees.Similarly, there is no algorithmic reason why the rearrangement option cannot be applied to every possible rooting of an unrooted tree automatically, but this approach has not implemented it.Since most rootings of an unrooted tree are biologically implausible, we leave it to the user to select the rootings of interest and apply the rearrangement algorithm to them explicitly.
One goal of the experimental work presented here is to determine whether these cost functions rank alternative hypotheses effectively.Below we describe NOTUNG's performance on rooted trees, unrooted trees, and trees with low bootstrap values.As test data, we used all "nonpathological" trees from three recent articles on large scale duplication (Hughes, 1998;Pebusque et al., 1998;Ruvinsky and Silver, 1997).We eliminated nonbinary trees and trees based on genes with complicated internal structure, such as mosaic genes or genes with repeated domains, and trees that show evidence of horizontal gene transfer.We analyzed the remaining thirteen trees (summarized in the table below) using NOTUNG and compared the automatically generated results with the verbal analysis presented in the source paper.
NOTUNG compares the input GFT with a species tree to infer the duplication history.Since there are many competing hypotheses concerning the topology of the Tree of Life, our program allows the user to supply a species tree as input.In the experiments described below we tried, to the extent that it was possible to determine from the text, to use the same species tree as did the authors who originally analyzed the tree.Most authors used a tree consistent with that shown in Figure 2. Pebusque et al. (1998) used a variant in which nematodes are included in the protostome clade.The exact tree used is shown in Figure 7.This tree is constructed from information in the University of Arizona Tree of Life project (http://phylogeny.arizona.edu/tree/phylogeny.html)edited by D.R. Maddison and W.P. Maddison, the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html) and (Graur et al., 1996), for the classi cation of rabbits.

Rooted trees
NOTUNG was applied to the rooted trees in Table 1.The scores of the resulting duplication histories are shown in Table 2.
The histories constructed by the program were consistent with the analysis of Hughes (1998) in all cases.Generally, NOTUNG nds a superset of the duplications discussed by Hughes, since he only mentions those duplications that are relevant to the biological question he is addressing.This was true of all the trees reported here; the authors of the original studies did not attempt to describe the entire duplication history.They simply reported the aspects they considered relevant to their research.In contrast, NOTUNG reports the entire history, including variants, and allows the user to triage the information.
As an example, we show the duplication history generated by NOTUNG for the RXR tree shown in Figure 1.Here "jaw" refers to jawed_vertebrate, "pro" to protostomes, and "tet" to tetrapods.Both duplications 14 and 15 occurred after the divergence of protostomes (insects and molluscs) from deuterostomes ( sh and tetrapods), which is consistent with Hughes' analysis.NOTUNG also nds more recent duplications (3 and 6) not discussed by Hughes.(Hughes, 1998) rooted PBX (Hughes, 1998) rooted PSMB (Hughes, 1998) unrooted RXR (Hughes, 1998) rooted TCF (Ruvinsky and Silver, 1997) unrooted TEN (Hughes, 1998) rooted VMAT (Pebusque et al., 1998) unrooted 1 A rat sequence was removed from the NOTCH tree to obtain a binary tree.

Alternate hypotheses for weak branches
Alternate hypotheses were evaluated with respect to both cost functions for every branch with a bootstrap value less than 90% in the six rooted trees of the rst table.In all cases, C ub ./ and C dl ./ selected the same optimal rearrangement tree, although they gave different numerical scores.For three input trees, the original tree was unchanged.The PBX tree had no weak edges at the 90% level and the RXR and TEN trees could not be improved through rearrangement.
The other three trees were modi ed by the rearrangement process.The improvements in score, with respect to both cost functions, are summarized in Table 2. To illustrate NOTUNG's performance on weak edges, we discuss the rearrangement of one GFT in the data set, HSP70, in detail below.Similar behavior was observed in the remaining two trees, the C family tree and the NOTCH tree, which are simpler than HSP70.
The examples in Figure 5 show that some rearrangements correct phylogenetic mistakes, eliminating duplications and resulting in a GFT that is more consistent with the species tree, while others move duplications down in the tree, reducing loss while maintaining the same number of duplications.Both types of rearrangement appear in the HSP70 trees shown in Figure 8.These trees have been simpli ed for the purposes of exposition.Subtrees containing only birds and mammals have been compressed and are shown in capital letters (e.g., AMNIOTE*HSP70).The upper tree shows the original topology before rearrangements were considered.This tree contains ve branches with bootstrap values below the threshold.Two of them are adjacent.Initially, our program inferred a duplication history with nine duplications and eighteen losses:  The structure of this tree, (fungi (insects (plants, vertebrates))), is at odds with the structure of the Tree of Life, (plants (fungi, (insects, vertebrates))).The structure within the plant clade also disagrees with the Tree of Life since petunias and tomatoes are more closely related to each other than either is to corn.
In contrast, the topology after rearrangement had three fewer duplication nodes and seven losses, as shown in the lower tree.The removal of duplications from nodes 6, 16 and 17 can be interpreted as correcting errors in the original topology.The original topology implies that an ancestral HSP gene was duplicated twice early in the eukaryote lineage; subsequently, each of the four resulting copies survived in only one lineage (fungi, insects, plants, and vertebrates, respectively) and was lost in the other three.
In view of the low bootstrap support, it seems more plausible that the yeast and y sequences are placed incorrectly.In the rearranged tree, the branching of plants, yeast, and insects is compatible with the Tree of Life.This second hypothesis is more compelling than the original hypothesis of two early duplications followed by massive gene loss.The exchange of the corn and tomato genes to remove the duplication at node 6 also appears to correct an error in the reconstruction of the tree topology.The rearrangement of the frog sequence that led to the replacement of the duplication at node 14 with one at node 13 results from the tendency of the duplication/loss model to favor more recent duplications and is more controversial.It is open to interpretation whether a duplication in the amniote lineage is more or less likely than a duplication before the divergence of amphibians followed by loss of one copy.This example shows that rearrangement can result in substantially different hypotheses.The number of duplication nodes in the rearranged HSP tree decreased from nine to six.As this illustrates, although it is possible to pick out individual rearrangements of low con dence branches by eye, when a tree contains many weak branches it is helpful to have a tool to integrate all the alternate hypotheses automatically.

Unrooted trees
We tested NOTUNG on the seven unrooted trees in Table 1.For each tree, NOTUNG computed the duplication history for every root and ranked them according to both cost functions.We compared these rankings with the rootings favored by the authors.Although possible rootings are rarely, if ever, mentioned explicitly by the authors whose trees we tested, they frequently imply that only a subset of the possible rootings lead to plausible hypotheses.Consider, for example, the TCF family tree, shown in Figure 9.In their analysis, Ruvinsky and Silver (1997) state that "it is dif cult to conclude whether the split between the TCF1 and TCF2 subfamilies occurred before or after the separation between sh and tetrapods," but "in any case, divergence between the two subfamilies has taken place prior to the amniote-amphibian separation."These conclusions are consistent with a rooting on the bold edges in Figure 9 and no others.
For each tree, we partitioned the set of edges into plausible and implausible rootings from the analysis presented by the original authors.We executed NOTUNG on all trees and compared the rankings induced by each cost function with the author's rankings.The two cost functions always agreed on the ranking of the trees with the best scores, ranging from an identical ranking of the top nine rootings for the EGR family to identical rankings of all rootings for ANK, LHX and VMAT.When NOTUNG 's rankings were FIG. 9.An unrooted tree for the TCF family (Ruvinsky and Silver, 1997).Each edge, e, is labeled with the ranking, with respect to both C d l ./ and C ub ./, of the tree rooted at e. Edges in bold correspond to rootings supported by the analysis in Ruvinsky and Silver (1997).
compared with those of the original authors, NOTUNG ranked all plausible rootings above all implausible rootings for ve out of the seven trees.For the remaining two trees, the costs of all implausible rootings were greater than or equal to the costs of all plausible rootings.For one of these, the PSMB tree, the set of highest ranked edges is a superset of the rootings deemed plausible by Hughes.One of these edges has weak bootstrap support.A rearrangement around this edge improves the score of the tree with respect to both C dl ./ and C ub ./.When the alternate rootings of this rearranged tree were rescored, the set of lowest cost rootings exactly agreed with Hughes' analysis.In the other case, the CRYB tree, there were eight top-ranked rootings of equal cost, while the authors' analysis implied that only one rooting is possible.The duplication histories (i.e., the set of duplication nodes with time ranges) were identical for the eight edges.Only the ordering of the duplication nodes differed.This suggests either that the authors did not consider all alternate scenarios, possibly missing something of interest, or that they had additional information about the gene family, such as the biochemical properties or functional roles of the proteins, that allowed them to rule out other rootings.
Within the set of plausible rootings, the ordering of scores does not always agree with the biologists' assessments.In contrast to the analysis of Ruvinsky and Silver, who ranked the three best edges equally, the edge adjacent to the sh sequence in Figure 9 is ranked higher than the other two because this rooting requires one fewer gene loss.

DISCUSSION
In this study, we analyzed every nonpathological tree in three recent gene duplication studies (Hughes, 1998;Pebusque et al., 1998;Ruvinsky and Silver, 1997).The duplication histories generated for rooted trees were consistent with the analyses of the authors of the original papers for all trees considered.For unrooted trees, alternate histories were computed for each rooting and ranked according to two different cost functions based on duplication and loss.One of these is based on a model of parsimonious gene loss, while the other is an upper bound on the maximum number of gene losses that occurred.Although the rankings for the two cost functions were different, for top ranking rootings the cost functions agreed for every tree in the test set.In particular, they agreed on the hypotheses most likely to be of interest to the user.Both rankings correctly identi ed unlikely hypotheses, providing the user with a way to control the quantity of output to be reviewed.When used to select alternate hypotheses for trees with weak edges, both cost functions were effective in correcting errors in the duplication history stemming from errors in the original tree topology.
The fact that the cost functions agree on both unrooted trees and rearrangements suggests that the problem is robust with respect to the duplication/loss model.Since the cost functions represent models that are opposite extremes of gene loss, our results imply that inferring duplication histories is not highly sensitive to the choice of model and we can have con dence in an exploratory analysis tool based on such scoring functions.This approach does present some limitations.Because both cost functions count gene losses, they favor hypotheses with more recent duplications.As a result, in addition to correcting phylogenetic errors when evaluating alternatives for weak branches, in some cases they select more controversial alternatives.By combining duplication and loss with sequence information, this approach offers a richer source of information for reconstructing gene duplication histories than sequence information alone.Nevertheless, there are other sources of biological information (e.g., biochemical properties and intron/exon structure) that can guide analysis but are not incorporated in this approach.Another limitation of this approach is that it is dif cult to distinguish true gene loss from genes that have not yet been sequenced.
This paper suggests several open problems for future research.Currently, there is no known polynomial algorithm for nding the optimal rearrangement tree with respect to either cost function.Nor has this optimization problem been shown to be NP-hard.The sensitivity of the cost function, C dl ./, to the relative importance of gene duplications and gene losses (i.e., the values of c ¸and c ± ) has not been investigated.Finally, the more general problem of phylogeny reconstruction using a cost function that combines sequence comparison, gene duplication, and gene loss is an outstanding challenge.
Currently, there is a great deal of interest in using gene duplications to study the role of whole genome duplications in genome evolution (Skrabanek and Wolfe, 1998;Wolfe and Shields, 1997).This will require dating all paralogs in a genome.In its current form, NOTUNG can be used to estimate duplication dates of rooted GFT's automatically.With more reliable evaluation methods, NOTUNG can also be adapted to the automatic analysis of unrooted trees and rearrangements of trees with weak edges.

FIG. 3 .
FIG. 3. Gene family trees for the hypothetical gene family, A, with ve known gene sequences (two in mouse, two in chicken and one in sh).(a) An unrooted GFT for A. (b) -(d) Three alternate rootings of the GFT in (a).Duplication nodes are shown in italics and missing genes are shown in parentheses.
FIG.6.A GFT with four weak edges, shown in bold.The triangles represent subtrees whose structure is not made explicit.

FIG. 8 .
FIG. 8.The HSP70 tree before and after NNI rearrangements.The trees have been simpli ed by compressing clades containing only mammals and birds (AMNIOTE*GRP78, AMNIOTE*HSP70, AMNIOTE*HSC70).No rearrangements were accepted in these clades.Internal nodes are given numerical labels.In the upper tree, the bootstrap values of edges with bootstrap support above 50 are labeled with square brackets.

Table 1 .
Gene Family Trees used in Experiments

Table 2 .
Scores of Rooted Trees with Respect to C dl ./, Given in Terms of the Number of Duplications and Losses, and C ub ./.