pbio.3001365.s002.docx (9.51 MB)

S1 Text

journal contribution
posted on 2021-08-11, 12:22 authored by Alexander K. Tice, David Žihala, Tomáš Pánek, Robert E. Jones, Eric D. Salomaki, Serafim Nenarokov, Fabien Burki, Marek Eliáš, Laura Eme, Andrew J. Roger, Antonis Rokas, Xing-Xing Shen, Jürgen F. H. Strassert, Martin Kolísko, Matthew W. Brown

Table A: Taxonomic composition of the PhyloFisher v. 1.0 dataset.

Fig A: Phylogenomic tree of 304 taxa, 240 orthologs, and 72,632 amino acid sites (gt80 matrix). Supermatrix was processed as described above in the matrix_constructor.py methodology. The tree was built using IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 350 real bootstrap replicates (MLBS). This is the uncollapsed version of the tree shown in Fig 3 of the main text (with the branches and nodes colored in the same way). MLBS values of 100% are not shown; all other values are indicated at their respective node. Data associated with this figure are available in the directory archive FigA.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig B: Violin plot of each gene’s RTC score per method of trimming and untrimmed. Quartiles are drawn on the violin plots, overlaid with box and whisker plots. Data associated with this figure are available in the directory archive FigB.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig C: Box and whisker plot of the pairwise difference between each gene’s RTC score per method of trimming to the untrimmed RTC score. Data associated with this figure are available in the directory archive FigC.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig D: Violin plot of each node in the ML tree’s gene concordance factor assessed through IQ-TREE per method of trimming. Quartiles are drawn on the violin plots, overlaid with box and whisker plots.
Data associated with this figure are available in the directory archive FigD.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig E: Bootstrap values of nodes of interest, inferred in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (MLBS). This analysis highlights that different alignment trimming methods have little effect on the output tree and the bootstrap support values when a site-heterogeneous model is used. Data associated with this figure are available in the directory archive FigE.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig F: Bootstrap values of a few selected nodes of interest displayed in Fig 3, inferred in IQ-TREE under LG+G4+F+C60+PMSF with either an LG+G4+F or an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (MLBS). Conflicting topologies with high support are found in the LG+G4+F input tree analysis, while nodes and topologies do not conflict when inferred with LG+G4+F+C20 as the input tree. Data associated with this figure are available in the directory archive FigF.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig G: Histogram of amino acid sites of the supermatrices generated per trimming method. Data associated with this figure are available in the directory archive FigG.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig H: Fast-site removal from the whole dataset (the gt80 trimAl matrix of 72,632 amino acid sites).
Each step removes 9,000 sites, in a fastest-to-slowest stepwise manner to exhaustion. ML tree was inferred for each dataset in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (UFBOOT). The tree inferred with 9,000 sites removed (9K) is shown in Fig I and represents the tree shown in Fig 3 of the main text. Data associated with this figure are available in the directory archive FigH.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig I: Phylogenomic cartoon tree of 304 taxa, 240 orthologs, and 63,632 amino acid sites, with the 9,000 fastest-evolving sites in the original supermatrix removed (as indicated by fast_site_remover.py; see Fig H). The resulting supermatrix was processed as described above in the matrix_constructor.py methodology. Tree was inferred in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 200 real ML bootstrap replicates (MLBS). Branches and nodes are colored as shown in Fig 3 of the main text. MLBS values of 100% are not shown; all other values are indicated at their respective node. Data associated with this figure are available in the directory archive FigI.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig J: Random subsampling using the random_sample_iteration.py utility (“random_sample_iteration.py -i gt80trimal.fastas/ -f phylip-relaxed -ci 0.95 -ps 20”). Each replicate was inferred in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (MLBS).
The support values of nodes of interest were calculated with the PhyloFisher utility bipartition_examiner.py and plotted in R using the boxplot function in the gplots library. Data associated with this figure are available in the directory archive FigJ.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig K: Hierarchical clustering of amino acid compositions of our supermatrix. Colors depict taxa as labeled in Fig 3 of the main text. Data associated with this figure are available in the directory archive FigK.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig L: Heterotachious site removal from the whole dataset (72,632 amino acid sites). Step 0 and Step 1 have 3,000 sites removed (see rationale below and Fig M), and each subsequent step has 9,000 sites removed, from greatest to least heterotachy ratio, in a stepwise manner to exhaustion. ML tree was inferred for each dataset in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (UFBOOT). Data associated with this figure are available in the directory archive FigL.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig M: Ratio of fast to slow taxa site rates, on a per site basis, estimated from a simulated dataset. This dataset was simulated under the LG+G4+C60+F model of evolution using our output tree under this model with our gt80 dataset. Fast/slow taxa site ratios were estimated using the heterotachy.py utility. The maximum observed ratio was 9.08 in simulated data. This set of ratios was used further as a null distribution of expected fast/slow ratios under this model.
Data associated with this figure are available in the directory archive FigM.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig N: Ratio of fast to slow taxa site rates, on a per site basis, estimated from our gt80 dataset with our output tree from this supermatrix inferred in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies. Fast/slow taxa site ratios were estimated using the heterotachy.py utility. The null distribution as estimated from the LG+G4+C60+F simulation (Fig M) was used to calculate p-values for the top 3,000, 6,000, and 9,000 fast/slow ratios. Data associated with this figure are available in the directory archive FigN.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig O: Heterotachious site removal of 3,000 and 6,000 sites from the whole dataset (72,632 amino acid sites): removal of 3,000 sites (p-value = 0.0001; left) and 6,000 sites (p-value = 0.003; right). From these heterotachious-site-removed starting datasets, 6,000 of the fastest sites were removed using fast_site_remover.py, yielding Het3KFast6K (63,632 sites) and Het6KFast6K (60,632 sites). ML tree was inferred for each dataset in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (UFBOOT). UFBOOT values of 100% are not shown; all other values are indicated at their respective node. Data associated with this figure are available in the directory archive FigO.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.
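The captions for Fig M to Fig O describe using simulated fast/slow ratios as a null distribution for p-values but do not spell out the calculation. One common reading, offered here only as a sketch, is an empirical tail probability: take the smallest ratio among the top N observed sites as a cutoff and report the fraction of null ratios at or above it. The function names and the toy numbers below are hypothetical and are not taken from heterotachy.py or from the data archive.

```python
def cutoff_for_top_n(observed_ratios, n):
    """Return the smallest ratio among the n largest observed ratios."""
    if not 0 < n <= len(observed_ratios):
        raise ValueError("n must be between 1 and the number of sites")
    return sorted(observed_ratios, reverse=True)[n - 1]


def empirical_p_value(null_ratios, threshold):
    """Fraction of null (simulated) ratios at or above the threshold.

    Assumption: an empirical tail probability; the source does not state
    the exact formula that was used.
    """
    if not null_ratios:
        raise ValueError("null distribution is empty")
    hits = sum(1 for ratio in null_ratios if ratio >= threshold)
    return hits / len(null_ratios)


# Toy values for illustration only; not from the PhyloFisher data archive.
null_ratios = [1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 7.0, 9.0]
observed_ratios = [0.9, 1.1, 3.5, 6.0, 8.0, 12.0]

cutoff = cutoff_for_top_n(observed_ratios, 3)
p_value = empirical_p_value(null_ratios, cutoff)
print(cutoff, p_value)
```

Under this reading, a larger removal set lowers the cutoff ratio and so raises the p-value, which is at least consistent with the ordering of the values reported for the 3,000-site and 6,000-site removals.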
Fig P: Coalescent-based species tree from the 240 ortholog trees inferred in RAxML (under the PROTCATLGF model with 100 bootstraps) using the default trimming methodology listed above in the matrix_constructor.py description. Tree was inferred by ASTRAL-III using the PhyloFisher utility astral_runner.py. Values at nodes are ASTRAL bootstrap replicate values (BS). BS values of 100% are not shown; all other values are indicated at their respective node. Data associated with this figure are available in the directory archive FigP.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig Q: Cartoon tree generated from RTC-sorted bins (top 75%, 180 orthologs, 63,750 sites) using rtc_binner.py. Input datasets for the matrix_constructor.py concatenation were the gt80trimal single-ortholog files. This tree was inferred in IQ-TREE under LG+G4+F+C60+PMSF with an LG+G4+F+C20 input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (UFBOOT). UFBOOT values of 100% are not shown; all other values are indicated at their respective node. Data associated with this figure are available in the directory archive FigQ.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig R: Cartoon of the tree generated in IQ-TREE using the model generated by mammal_modeler.py. This tree was inferred in IQ-TREE under LG+G4+F+ESmodel+PMSF (ESmodel = 60 rate classes estimated from the data inferred in MAMMaL) with an LG+G4+F+ESmodel input tree for generation of the PMSF site frequencies inferred in IQ-TREE with 1,000 ultrafast bootstrap replicates (UFBOOT). Note that this is without the -bnni UFBOOT correction due to a bug in IQ-TREE v1.6.12. UFBOOT values of 100% are not shown; all other values are indicated at their respective node.
Data associated with this figure are available in the directory archive FigR.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig S: Phylogenetic tree of ortholog EOG0934062S of the dataset from [5]. Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5]. Problematic paralogs are highlighted with red arrows, and the corrected replacements identified by PhyloFisher are highlighted with blue arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig T: Phylogenetic tree of ortholog EOG093409ME of the dataset from [5]. Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5]. Problematic paralogs are highlighted with red arrows, and the corrected replacements identified by PhyloFisher are highlighted with blue arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig U: Phylogenetic tree of ortholog EOG093407UY of the dataset from [5].
Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5]. Problematic paralogs are highlighted with red arrows, and the corrected replacements identified by PhyloFisher are highlighted with blue arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig V: Phylogenetic tree of ortholog EOG093403TH of the dataset from [5]. Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5]. Problematic paralogs are highlighted with red arrows, and the corrected replacements identified by PhyloFisher are highlighted with blue arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig W: Phylogenetic tree of ortholog EOG09340RBX of the dataset from [5]. Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5].
Problematic paralogs are highlighted with red arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig X: Phylogenetic tree of ortholog EOG093400WO of the dataset from [5]. Tree was inferred using sgt_constructor.py as detailed in the main text. Tree is the output from parasorter. Leaf names in bold are identified by the fisher.py algorithm as suggested orthologs. Leaf names not bolded are from the sequences collected as potential paralogs. The leaves with a colored background are those sequences from the dataset from [5]. Problematic paralogs are highlighted with red arrows, and the corrected replacements identified by PhyloFisher are highlighted with blue arrows. A downloadable figure and data associated with this figure are available in the directory archive FigS-X.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

Fig Y: Phylogenetic reconstruction of the tree of Saccharomycetaceae using the PhyloFisher 208 dataset. ML tree built using the LG+G4+F+C60-PMSF model (with an LG+G4+F+C20 ML tree as the PMSF guide input tree) in IQ-TREE v1.6.7.1 [1]. Sub-clades that make up the Saccharomycetaceae are shown in dark blue, while the outgroup clades of the Saccharomycodaceae and the Phaffomycetaceae are shown in dark green and cyan. Nodes are maximally supported (100 MLBS) unless otherwise shown. Data associated with this figure are available in the directory archive FigY.tgz within the data archive available from https://ir.library.msstate.edu/bitstream/handle/11668/19731/Tice_etal.PhyloFisher.DATA.tar.gz.

ML, maximum likelihood; MLBS, maximum likelihood bootstrap support; PMSF, posterior mean site frequency; RTC, relative tree certainty.
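The site counts quoted for the fast-site removal series (Fig H and Fig I) follow from simple arithmetic on the 72,632-site gt80 matrix. The sketch below reproduces that bookkeeping and shows one way to drop the fastest sites given per-site rates. It is an illustration only and not the fast_site_remover.py implementation: the function names are hypothetical, and the stopping rule (stop once fewer sites remain than one full step) is an assumption.

```python
def remaining_sites_per_step(total_sites, step_size):
    """List the number of sites left after each full removal step.

    Assumption: removal stops once fewer sites remain than one step.
    """
    if total_sites < 0 or step_size <= 0:
        raise ValueError("total_sites must be >= 0 and step_size > 0")
    counts = []
    remaining = total_sites
    while remaining > step_size:
        remaining -= step_size
        counts.append(remaining)
    return counts


def keep_after_removing_fastest(site_rates, n_remove):
    """Return indices of the sites kept, in original column order,
    after dropping the n_remove sites with the highest rates."""
    by_rate = sorted(range(len(site_rates)),
                     key=lambda i: site_rates[i], reverse=True)
    dropped = set(by_rate[:n_remove])
    return [i for i in range(len(site_rates)) if i not in dropped]


# 72,632 sites and 9,000-site steps are the figures given in the captions.
counts = remaining_sites_per_step(72632, 9000)
print(counts[0])  # sites left after the first step (the 9K dataset)

# Toy per-site rates for illustration only.
kept = keep_after_removing_fastest([0.1, 2.0, 0.5, 3.0], 2)
print(kept)
```

The first entry of `counts` matches the 63,632 sites reported for Fig I.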

(DOCX)
