Hybridization network for the system of woody plant gene pools in the United States

Abstract Hybridization generates similarities among gene pools. This structure can be visualized and analyzed at the systems level using networks. Here we construct a network of the 315 woody plant species native or naturalized in the United States using data compiled by the Forest Service of the US Department of Agriculture (USFS). Each species is represented by a node in the network whose size is proportional to a recent census for live stems in the continental United States. Each of the 416 links between node pairs represents evidence for hybridization compiled from the USFS manual Silvics of North America. The total network resolved into 100 separate connected components or clusters (mean size, 3.15 species), with 44% of species linked to at least one other. Betula had the largest component (18 species) following by the separate Quercus clusters (17 red oaks and 16 white oaks); Q. velutina was the most genetically connected woody plant in the continental US. The number of species held together per component (i.e. size) scaled as a power–law albeit a slightly truncated one. The truncation suggests there are fewer than expected hybridizing species within the large woody genera of plants in the US.


Introduction
Hybridization is an important factor in plant evolution (Soltis and Soltis 2009). While it is safe to say that most hybridization occurs between congeneric species, studies have revealed a heterogeneous distribution across groups. For example, Whitney et al. (2010) found a significant heterogeneity of hybridization propensity across orders of vascular plants. Stace (1975) reviewed the flora of the British Isles and showed that hybridization was especially wellrepresented in forest tree genera.
There are numerous reports on the hybridization dynamics among pairs of forest tree species (e.g. Lexer et al. 2005), and several involving three or more taxa (e.g. Aldrich et al. 2003;Curtu et al. 2007). Particular attention has been paid to genera that are prone to hybrid formation such as the oaks, wherein it has become common to augment traditional evidence from phenological, ecological, and silvicultural sources with molecular marker data (for a review of Quercus, see Aldrich and Cavender-Bares 2011). It is more challenging to survey hybridization at broader taxonomic scales, in part because the researcher may be extending outside his taxonomic area of expertise, and because the evidence becomes quite patchy. A common approach is to consult regional floras and botanical manuals since these have been carefully curated by typically one or two experts conversant in a broad range of taxa. For example, Whitney et al. (2009) recently reviewed evidence of hybridization taken from six floras dating from 1986 to 2005.
Networks are useful for representing relationships among hybridizing taxa (Posada & Crandall 2001, McBreena & Lockharta 2006. Most tree-building algorithms are hierarchical and so tend to perform poorly where reticulate evolution is common since this yields strong cyclic, triangular relationships among taxa. Networks are not so constrained and can be used to represent relationships above, at, or below the species level. Each node corresponds to a taxon and edges or links between nodes represent evidence for hybridization between those taxa. Aldrich and Cavender-Bares (2011) produced hybridization networks for the red and white oaks using the software Pajek (Batagelj & Mrvar 1998), showing that Quercus velutina is the most genetically connected of the oak species in the United States. Chen and Wang (2010) have introduced the software HybridNET specifically for the construction of hybridization networks. To our knowledge, though, no study has described the overall organization of plant gene pools for an entire region using a network model.
Here we examine the network structure of the United States woody plant (tree and shrub) gene pool. Nodes denote species and edges between nodes represent evidence of interspecific hybridization taken from the US Forest Service silvicultural manual Silvics of North America (Burns & Honkala 1990). We analyzed the gene pool network for topological structure based on standard measures of network architecture (Junker & Schreiber 2008), and we interpret these graph theoretic findings in light of the biology of plant hybridization and gene pool dynamics.

Hybridization information
All information regarding the hybridization potential of woody plants was compiled from Burns and Honkala (1990) silvicultural manual Silvics of North America. Both volumes were summarized: (1) conifers and (2) hardwoods. This resource is available online through the US Department of Agriculture -Forest Service website (see reference).
Our study covered 315 woody plant species from 39 families representing both flowering plants and conifers. The species listed as primaries in Burns and Honkala (1990) are mostly native or at least naturalized in the US Represented are 35 angiosperm families (126 species) and four gymnosperm families (64 species). This list of primaries covers most of the tree species with high commercial value for not only the US but also Canada, Mexico, and the Caribbean. The remaining 125 species of the 315 total are taxa not treated as primary species in the manual but which occur in the US and hybridize with at least one of the primary species according to the manual.
All primary species in the manual were reviewed for evidence of hybridization with other taxa. Most species entries include a section on 'Genetics' and subsections on 'Population Differences', 'Races', and 'Hybrids'. These were manually reviewed for information relating to hybridization potential. For example, consider the following entry for Northern Red Oak (Quercus rubra L.):
Northern red oak also hybridizes with blackjack oak (Q. marilandica) and with northern pin oak (Q. ellipsoidalis) (17).'' - Burns and Honkala (1990). This information suggests that the network should contain a node representing Q. rubra with links to the following nodes representing the following pure species: Q. ellipsoidalis, Q. ilicifolia, Q. imbricaria, Q. marilandica, Q. palustris, Q. phellos, Q. shumardii, and Q. velutina. Hybrid and cultivar taxa were not included as nodes in the network, nor were subspecies or varieties; only taxa considered to exist as 'good' species by Burns and Honkala (1990). Naturally occurring hybrids and those resulting from controlled crosses were counted as evidence. Both data types present certain biases; reports of natural hybrids are often lacking in experimental evidence whereas a successful controlled cross does not prove that the hybrid can or does form in nature. Yet both data types do present evidence that is relevant to understanding the capacity or potential of species groups to form hybrids.

Census information
Mostly for the purpose of network visualization, we rendered node sizes as proportional to US census estimates for each of the species in the study. Data on the standing stock of tree species in the continental US were obtained from the Forest Inventory Analysis (FIA) online database maintained by the USDA Forest Service (Forest Inventory Data Online or FIDO: http://fiatools.fs.fed.us/fido/index.html). We obtained 'Tree count reports' on 'Number of live trees !1 inch dbh/drc' on a per county basis for the periods 2006-2008 for all states except Hawaii. We searched the FIA database for all species treated in the hybridization data from Burns and Honkala (1990), primary and non-primary species, and report these data here. Node or vertex sizes in the Pajek networks were made proportional to these standing crop census values. Species not found in FIA were assigned a small, default node size for visualization. Numerical census values are reported in the supplementary materials section, Appendix B.

Network generation and topological analysis
The Burns and Honkala (1990) evidence on hybridization was manually reviewed and transferred to a data file which was then parsed using Perl script to generate an adjacency matrix for the 315 species in the data set. Binary edge weights (0, 1) were used wherein a '1' indicated positive evidence for hybridization. This matrix was used to produce an undirected, unweighted network readable using the program Pajek (Batagelj & Mrvar 1998) and the projection method of Kamada and Kawai (1989).
The following standard measures of network structure were evaluated (see Junker & Schreiber 2008 for a review). Graph size (n) is the number of nodes (species) in a network. The edge count (e) is the number of links between nodes, and represents evidence on hybridization. Density (d) is the proportion of edges present out of the total possible (n (n 7 1)/2) if all nodes were fully connected to all other nodes (discounting self loops): A connected component in graph theory is a maximal set of nodes held together by edges. In a hybridization network it is a maximal set of species held together by hybridizations. If species A-B-C hybridize, D-E hybridize, and F does not hybridize with any species, then there are technically three connected components of sizes three, two, and one, respectively.
Diameter length (L) of a network can be defined as the longest of all the shortest paths between pairs of nodes. For an information network such as the internet, a network with a small diameter conveys information rapidly because there are fairly few steps between any pair of nodes, whereas a large diameter network requires numerous steps. A set of plant gene pools with a large diameter indicates that gene exchange typically involves numerous intermediate steps. For instance, gene exchange between species A and E might require the path A-B-C-D-E. By contrast, a small diameter gene pool might arise if there is little hybridization overall (in which case edge density is also low), or, if the network is large, if there are numerous shortcuts through network. This might arise if one or more species can hybridize with several others. In the above example, if A also could hybridize with B, C, D, and E, then all species would be at most two steps away from any other, and one would expect more widespread gene exchange between species.
The clustering coefficient C i of node i measures the extent to which that node's neighbors connect to one another: where e i is the number of edges between neighbors of node i and k i is the number of neighbors (i.e. nodes connected to node i by an edge). In social networks, the average clustering coefficient measures the extent to which one's friends are friends with one another. In a gene pool network, it represents the extent of triangular hybridizations in the gene pool, whereby species A can cross with both B and C, and B can cross with C. Gene pools with high clustering coefficients would be well-mixed, at least across a portion of the total network. A candidate taxon for high clustering would be the genus Quercus which is noted for its propensity to form hybrids (reviewed in Aldrich and Cavender-Bares 2011).
A small-world network is one in which nodes are globally accessible over short steps from anywhere in the network (small diameter) despite the fact that many of the links are local, forming dense pockets of interlinked nodes (high clustering coefficient). The measure relates to the social science concept of 'six degrees of separation' (Travers & Milgram 1969) which notes that a complex network may be highly subdivided yet allow rapid transmission of information due to occasional long-distance links. The theory was formalized in Watts and Strogatz (1998), and a reliable estimator, the small-world coefficient (s), described in Humphries et al. (2006). Here, where C is the average clustering coefficient for the observed network, C R is the expected clustering coefficient for a random graph, L is the diameter of the observed network, and L R is the expected diameter of a random network. Watts and Strogatz (1998) showed that a small-world network has the short diameter expected of a random network, despite higher than expected clustering; thus, s 4 1 for networks that display a small-world topology. For our plant gene pools, we tested connected components (n ! 5 species) for small-world structure by running 100 randomizations to estimate C R and L R for each component.
Two measures of network centrality were evaluated for the connected components (n ! 5 species). (i) Degree centrality is a local measure of connectedness. The degree (k i ) of node i represents the number of edges incident to that node, which is the same as the number of neighbors of i in the network. In a hybridization network, a species with high degree hybridizes to many other species. (ii) Betweenness centrality is a global measure of connectedness evaluated for a particular node. A node with a high betweenness centrality (b i ) is present on numerous shortest paths between pairs of nodes in the network. In an information network, a node with a high b i is a bottleneck through which information must pass to get from one part of the network to another. In a plant gene pool network, a species with a high b i is also a bottleneck, but for transmission of genetic information between species. It might cross with two otherwise reproductively isolated groups of species.

Comparison of gymnosperms and angiosperms
Comparisons were performed between the gymnosperm and angiosperm groups with respect to the 10 primary topological measures as enumerated in Tables I and II (node number, edge number, etc.). Observational units were the connected components of size n ! 5 as shown in these tables. We used MATLAB (The MathWorks Inc., Natick, MA, USA) to conduct the nonparametric Mann-Whitney U-test which entails a two-sided rank sum test of the null model that the two groups are independent draws from the same distribution with equal median values. The P-value of this test is reported.

Power-law scaling
We examined two distributions from the overall study, degree and connected component, to deter-mine if they conformed to power-law expectations. (i) The degree distribution is the frequency distribution of node degrees. It has been shown (Barabási & Albert 1999) that numerous complex networks display a power-law degree distribution which is characterized by a few highly connected nodes and numerous poorly connected nodes. Such distributions have a heavy tail and are linear on a log-log transformation. A system of plant gene pools whose degree distribution conforms to a power-law would indicate that the hybridizing group is held together largely by one or more super-hybrid species able to cross with several taxa, most of which can only cross to one or few other species. (ii) We also considered the size distribution of the connected components in the overall study to determine if most of the hybridization between taxa occurred within massive groups of species, while the great majority of taxa crossed with fairly few others. For both distributions, we employed the maximum likelihood estimator of Clauset et al. (2009) to test for a significant power-law relationship in the data. We used their MATLAB modules (plfit, plpva, and plplot) which are available online (http://tuvalu.santafe.edu/*aaronc/ powerlaws/).

Topology of the woody plant gene pool
The network of 315 species resolved into 100 connected components (Figure 1). Over half (56%) of the species were isolated, hybridizing to no other taxa. The average size of a connected component was 3.15 species. The largest connected components were Betula (n ¼ 18), Quercus section Lobatae (red oaks, n ¼ 17), and Quercus section Quercus (white oaks, n ¼ 16) (Table I).
Overall the network was sparse with only 416 of the possible 49,455 edges (density d ¼ 0.0084 or roughly 1%; Table I). The red oaks (Quercus section Lobatae) contained the most edges (e ¼ 46), though the highest density (d ¼ 0.500) was observed in Juniperus and Pinus subgenus Strobus. While Betula contained the largest connected component (n ¼ 18), it exhibited the second lowest density (d ¼ 0.196).
The diameters (L) of the connected components ranged from 2 to 5 (Table I). The largest values were observed in Pinus subgenus Pinus.
Clustering coefficients (C) ranged from 0 to 0.788, with the maximum observed in Populus (Table I). Values for Quercus also were high (ranging from C ¼ 0.444 to 0.620). Seven connected components displayed C ¼ 0 which indicates an absence of triangular crossing patterns. Small-worldness (s) was detected in 12 of the 21 (57%) connected components ( Table I). The largest value was seen in Betula (s ¼ 2.875) followed by Populus (2.630) and Pinus subgenus Pinus (2.154).
Mean degree centrality for a connected component ranged from 1.200 to 5.412 (Table II). The high value was observed for the red oaks, Quercus section Lobatae. In particular, Quercus velutina with k ¼ 12 is the most connected red oak species and the most connected woody plant species in the US. Q. stellata is most connected of the white oaks (k ¼ 11) and Betula alleghaniensis for the birches (k ¼ 11). Three species had k ¼ 10: Eucalyptus globulus, Picea rubens, and Populus balsamifera. In the overall dataset, degree was not significantly correlated with stem census (r ¼ 0.112, P ¼ 0.123), suggesting that common and widespread species were not necessarily more likely to form hybrids.
Mean betweenness centrality (b) for a connected component ranged from 0.055 to 0.363, the high value for Betula (Table II). The highest betweenness centrality for a species was the maximum possible of b ¼ 1.000 for Robinia pseudoacacia and Salix nigra, though these values were for rather small connected components (n ¼ 5 and 6, respectively). As expected, the connected components with high density (d) had generally low betweenness centralities (b). The most important brokers of gene exchange in the large species groups were as follows: Betula nigra The gymnosperms and angiosperms were not significantly different in their topological structure for any of the 10 measures shown in Tables I and 2. P-values ranged from 0.42 to 1.0.

Power-law scaling and expected gene pool attributes
The method of Clauset et al. (2009) was used to detect power-law scaling in the woody plant gene pool. Here, a P-value 4 0.1 suggests a power-law hypothesis is a plausible explanation, otherwise the model is rejected. Plots are shown on log-log scales (Figure 2) which should provide a linear relationship if the variable scales in distribution as a power-law. Results showed no significant scaling for node degrees (k; Figure 2A) given that P ¼ 0. However, the size of connected components did exhibit powerlaw attributes with P ¼ 0.2020 ( Figure 2B). Both distributions were truncated at higher values of x.

Discussion
This network view of hybridization potential in the woody plant gene pool offers some system-level insights. Figure 1. One hundred connected components in the US woody plant gene pool. Each of the 315 nodes represents a species, and a link or edge between nodes represents evidence of hybridization between these taxa according to the USDA Forest Service silvics manual by Burns and Honkala (1990). Node size represents stem abundance in the US according to the USDA Forest Service FIA inventory. Black nodes represent gymnosperms, white nodes angiosperms. See supplementary material to locate species in the network: Appendix A to determine the species identity of a node using its alphanumeric coordinate code; Appendix B to locate a particular species in this network, along with its census data.

System-level hybridization is prevalent, despite appearances
There was very little evidence of hybridization (barely 1%) relative to the total amount that could have occurred given the number of species examined. Over half the species occurred as isolated individuals in the network, apparently not crossing to any other species in this study. It is likely that some crosses have gone undetected and methods weighted more by molecular evidence might elevate this 1% estimate. Moreover, there are undoubtedly geographic constraints as well in that some species are potentially interfertile but rarely come in contact.
Despite these appearances at the highest taxonomic level, the hybridization rate is much higher if one assumes that crosses only occur among congeneric species. Given the number of species per each genus, we calculate a maximum of 1106 crossfertilities possible in the woody plant gene pool, of which over a third (416, or 37.6%) were reported as likely by Burns and Honkala (1990).
The gymnosperms and angiosperms were similar in their topological structure. This suggests that whatever factors influence hybridization network topologies, it does not seem overly reliant on the biology distinguishing flowering plants from conifers.

Genus-level topologies
Some tree genera are noted for their ability to form hybrids. Betula has an extensive capacity to hybridize which renders individual species difficult to resolve (eFloras 2008). The genus Quercus is also notable for hybrid formation though only within infrageneric groups (reviewed in Aldrich and Cavender-Bares 2011). Populus species are readily crossed and many natural hybrids have been described, which is one of the features that contributes to the use of poplar as a genetic and genomic model organism for trees (Tuskan et al. 2006), though here as well hybrid formation is typically restricted to taxa within a section (eFloras 2008).
In our study, Betula is notable for having the largest connected component (n ¼ 18). It also had the lowest density of edges, which likely relates to the rarity of several of its species such that they would rarely occur sympatrically. Betula also had the strongest small-world signature, suggesting phylogenetic substructure in the gene pool yet short paths of gene exchange between species on average. An important conduit for gene exchange appears to be Betula alleghaniensis which had among the highest values for degree centrality.
The oaks (Quercus) are notable for forming the second and third largest connected components (n ¼ 17 and 16) in the US woody plant gene pool. The red and white oaks retain their distinctions even in sympatry and are not thought to cross (Aldrich and Cavender-Bares 2011). The oak groups carried the most edges of all the connected components, and had among the highest mean clustering coefficients, meaning that triangular gene exchange is common.
The topology of the poplar (Populus) connected component also reflected its known hybrid behavior. It had a high clustering coefficient suggesting that species which cross with a particular focal species also cross amongst themselves, and it had the second highest small-world coefficient after Betula.

Triangular gene exchange
In triangular gene exchange, taxon A might hybridize with B and C, and B could also hybridize with C. This would elevate levels of biparental inbreeding and of perceived homoplasy in molecular marker analysis. Muir and Schlö tterer (2005) suggested that shared ancestral polymorphisms could explain some of the evidence for active gene exchange between oak species; but triangular gene exchange would allow the ancestral polymorphisms to become shared through the process of hybridization. Oaks had the highest average degree centrality measures in the study, and included the most connected of all the woody species in the US, Q. velutina.
Several non-oak connected components lacked triangles altogether. However, most connected components with enough nodes (n ! 3) and edges (e ! 3) to form triangles, did so. Of the 25 that met the minimal criterion, 11 did not form triangles, the largest being Pinus (8A). There were only a few long and linear sets of hybrids, namely Pinus ( Figure 8A), Abies (7B), Acer (9B), and Fraxinus (10B). These often had larger network diameters. Clustering was highest for Populus, which is known to readily form hybrids.

Small-world gene exchange
Roughly half of the connected components had a small-world character, meaning that they had high clustering but low diameter. This means that species cross in clusters, cycles or loops, or triangles within the group, yet the potential for fairly rapid transfer of genetic information remains across the group. Betula and Pinus displayed this pattern most.

Power-law scaling and expected gene pool attributes
The size of the connected components scaled as a truncated power-law whereas the degree distribution did not. Biologically this suggests conformity to a general rule by which species form hybridizing communities. Most species hybridize with few if any other species, but there are a few groups comprised of numerous hybridizing species (e.g. Betula and Quercus). It is unclear to what extent this is a geographic phenomenon and/or something intrinsic to the taxa. One could speculate that groups that are actively speciating might also retain the capacity to form hybrids, forming a very large connected component. It is widely believed that many oak species are fairly recently evolved, by geological standards, and that at least some of the novelty owes to repeated interspecific hybridizations (Burger 1975;Van Valen 1976;Daghlian & Crepet 1983;Manos et al. 1999;Conte et al. 2007); in our study, the most connected of all species was a red oak, Q. velutina. Over evolutionary time, different genera might undergo rapid and large-scale speciation in which case the specifics would change but the overall topology of the hybridization network would remain a power-law.
That the power-law distribution was truncated is worth consideration. Firstly, it is not surprising to encounter truncated power-law distributions since they are common in several other biological and nonbiological systems (e.g. Burroughs & Tebbens 2001;Guimera & Amaral 2004;Aban et al. 2006;Jovani et al. 2008). In the present case, truncation indicates a rarity of extremely large groups of hybridizing taxa, which could result from several factors. (a) The data from Burns and Honkala (1990) might underestimate the true extent of hybridization in these large groups. Though we did include controlled crosses as evidence of hybridization potential, it is likely that many of the possible crosses have not been tried which would under-estimate the capacity to form hybrids. (b) Broad surveys are likely sensitive to group-specific differences or biases in taxonomic lumping versus splitting. Genera whose members are lumped rather than split would deflate the size of the connected components. (c) Natural hybridization rates are likely underestimated for rare species which do not often occur sympatrically with other potentially interfertile taxa. In support of this explanation, Mossa et al. (2002) found that truncation of powerlaw relationships is readily explained as 'information filtering' whereby links are established between nodes not simply according to rules based on overall network size but also accessibility; a given segment of the gene pool only forms links to those portions that are geographically accessible.

Summary
Our work with hybridization networks demonstrates the utility of system-level approaches to gene pool research. Additional work that incorporates more surveys from flora and manuals for other regions would be interesting, as would more detailed information from molecular evidence if it could be applied uniformly to the taxa under consideration. Future work might also separate naturalized from native taxa since the capacity of species to form hybrids would be influenced by reproductive barriers that might arise between species that evolve in sympatry. Separation of natural and humanmediated crosses would be interesting as well.