pcbi.1008380.s001.pdf (1.44 MB)

Supplementary material for the manuscript.

journal contribution
posted on 13.01.2022, 13:24 authored by Charles-Elie Rabier, Vincent Berry, Marnus Stoltz, João D. Santos, Wensheng Wang, Jean-Christophe Glaszmann, Fabio Pardi, Celine Scornavacca

Fig A: Probability densities for 5-tip networks, simulated with a prior corresponding to a birth-hybridization process with parameters d = 10, r = 1/2 and τ0 = 0.1, using the SpeciesNetwork package [53]. The figure is obtained from 10,000 replicates. The means are given by the dashed vertical lines.
Fig B: Probability densities for 5-tip networks with at most two reticulations, simulated with a prior corresponding to a birth-hybridization process with parameters d = 10, r = 1/2 and τ0 = 0.1, using the SpeciesNetwork package [53]. Figures are drawn for the 4,377 cases out of 10,000 where the network had at most two reticulations. The means are given by the dashed vertical lines.
Fig C: Probability densities for 5-tip networks with a maximum of 3 reticulations, simulated under the birth-hybridization process (d = 10, r = 1/2, τ0 = 0.1, 5,837 replicates), using the SpeciesNetwork package [53]. The means are given by the dashed vertical lines.
Fig D: Estimated node heights of network B. 10,000 sites are considered and 2 lineages per species. Constant sites are included in the analysis, and the estimated heights are based on the 12 replicates (out of 14 replicates) for which network B was recovered by SnappNet (criterion ESS > 200; θ ∼ Γ(1, 200), , r ∼ Beta(1, 1), for the priors, number of reticulations bounded by 3 when exploring the network space). Heights are measured in units of expected number of mutations per site. True values are given by the dashed horizontal lines. The initials MRCA stand for “Most Recent Common Ancestor”.
Fig E: Estimated population sizes θ for each branch of network B. Same framework as Figure D in S1 Text. True values are given by the dashed horizontal lines. The initials MRCA stand for “Most Recent Common Ancestor”.
Fig F: Same framework as Figure E in S1 Text.
Fig G: Estimated node heights of network C as a function of the number of sites. Same experiment as in Table 2 of the main manuscript: 1 lineage in species O, A and D, and 4 lineages in species B and C. The estimated heights are based on the replicates for which network C was recovered by SnappNet. True values are given by the dashed horizontal lines. The initials MRCA stand for “Most Recent Common Ancestor”.
Fig H: Estimated height and length for network A, as a function of the number of sites. Heights and lengths are measured in units of expected number of mutations per site. True values are given by the dashed horizontal lines. Two lineages per species were simulated. Only polymorphic sites are included in the analysis, and 20 replicates are considered for each simulation setup (criterion ESS > 200 for m = 1,000 and m = 10,000, and criterion ESS > 100 for m = 100,000; θ ∼ Γ(1, 200), , r ∼ Beta(1, 1), for the priors, number of reticulations bounded by 2 when exploring the network space). Same framework as in Fig 10 of the main paper, except that only polymorphic sites are taken into account.
Fig I: Estimated inheritance probability and instantaneous rates for network A, as a function of the number of sites. True values are given by the dashed horizontal lines. Same framework as in Fig 11 of the main paper, except that only polymorphic sites are taken into account.
Fig J: Estimated node heights of network A, as a function of the number of sites. Heights are measured in units of expected number of mutations per site. True values are given by the dashed horizontal lines. Same framework as in Fig 12 of the main paper, except that only polymorphic sites are taken into account. The initials MRCA stand for “Most Recent Common Ancestor”.
Fig K: Estimated population sizes θ for each branch of network A, as a function of the number of sites. True values are given by the dashed horizontal lines. Same framework as in Fig 13 of the main paper, except that only polymorphic sites are taken into account. The initials MRCA stand for “Most Recent Common Ancestor”.
Fig L: Experiments on network A based only on polymorphic sites. Same framework as in Figures H and I in S1 Text, except that the correction factor is not used in the calculations (criterion ESS > 200 in all cases).
Fig M: Summary of rice molecular diversity used for selecting our sample of rice cultivated varieties and wild types. (A) Unweighted neighbour joining (UWNJ) tree reflecting dissimilarities among 899 accessions based on 2.48 million SNPs as described in [73]; the accessions are colored according to their classification into wild population types or cultivar groups. (B, C) UWNJ tree using the same data for the 24 accessions we selected for assessing SnappNet performance, showing their accession numbers (B) and their countries of origin (C); the colors are as in A.
Fig N: Trace plots obtained with the Tracer software when data set 1 was analyzed with SnappNet. (a) and (b) refer to the first sampling of 12 kSNPs along the whole genome, whereas (c) and (d) focus on the second sampling. Two chains were considered for each sampling.
Fig O: Birth-hybridisation model with speciation rate 20 and hybridisation rate 1 (mean number of reticulations close to zero) and a normal prior with mean 0.1 and standard deviation of 0.01 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig P: Birth-hybridisation model with speciation rate 20 and hybridisation rate 2 (mean number of reticulations close to one) and a normal prior with mean 0.1 and standard deviation of 0.01 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig Q: Birth-hybridisation model with speciation rate 20 and hybridisation rate 3 (mean number of reticulations close to two) and a normal prior with mean 0.1 and standard deviation of 0.01 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig R: Birth-hybridisation model with speciation rate 20 and hybridisation rate 1 (mean number of reticulations close to zero) and an exponential prior with mean 0.1 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig S: Birth-hybridisation model with speciation rate 20 and hybridisation rate 2 (mean number of reticulations close to one) and an exponential prior with mean 0.1 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig T: Birth-hybridisation model with speciation rate 20 and hybridisation rate 3 (mean number of reticulations close to two) and an exponential prior with mean 0.1 on the origin height. We plot the simulated networks (orange) against the sampled networks (blue), summarising the networks by: (a) number of reticulations, (b) time until first reticulation, (c) height of the network, (d) length of the network.
Fig U: Summary distributions of all chains with correct population size priors (chain numbers 1,2,9,10,17,18) given data simulated from network A. We summarize the MCMC chains by combining them, that is: Chains 1 and 2 are indicated by the blue line (mean reticulations close to zero); Chains 9 and 10 are indicated by the orange line (mean reticulations close to one); Chains 17 and 18 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Fig V: Summary distributions of all chains with incorrect population size priors Gamma(1,20) (chain numbers 3,4,11,12,19,20) given data simulated from network A. We summarize the MCMC chains by combining them, that is: Chains 3 and 4 are indicated by the blue line (mean reticulations close to zero); Chains 11 and 12 are indicated by the orange line (mean reticulations close to one); Chains 19 and 20 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Fig W: Summary distributions of all chains with correct population size priors (chain numbers 1,2,9,10,17,18) given data simulated from network B. We summarize the MCMC chains by combining them, that is: Chains 1 and 2 are indicated by the blue line (mean reticulations close to zero); Chains 9 and 10 are indicated by the orange line (mean reticulations close to one); Chains 17 and 18 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Fig X: Summary distributions of all chains with incorrect population size priors (chain numbers 3,4,7,8,11,12) given data simulated from network B. We summarize the MCMC chains by combining them, that is: Chains 3 and 4 are indicated by the blue line (mean reticulations close to zero); Chains 7 and 8 are indicated by the orange line (mean reticulations close to one); Chains 11 and 12 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Fig Y: Summary distributions of all chains with incorrect population size priors Gamma(1,20) (chain numbers 5,6,13,14,21,22) given data simulated from network B. We summarize the MCMC chains by combining them, that is: Chains 5 and 6 are indicated by the blue line (mean reticulations close to zero); Chains 13 and 14 are indicated by the orange line (mean reticulations close to one); Chains 21 and 22 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Fig Z: Summary distributions of all chains with incorrect population size priors (chain numbers 7,8,15,16,23,24) given data simulated from network B. We summarize the MCMC chains by combining them, that is: Chains 7 and 8 are indicated by the blue line (mean reticulations close to zero); Chains 15 and 16 are indicated by the orange line (mean reticulations close to one); Chains 23 and 24 are indicated by the green line (mean reticulations close to two). We plot the following distributions: (a) likelihood, (b) prior, (c) network height, (d) network length. Note that the network height and network length used to simulate data are indicated by red lines.
Table A: Table linked to Table 1 of the main manuscript. Trees inferred by SnappNet when m = 1,000 sites were considered.
Table B: Average posterior probability (PP) of the topology of network C obtained by running MCMC_BiMarkers on data simulated from network C. Same as Table 3 of the main manuscript except that 12 × 10^6 iterations are considered, and only one lineage is sampled in hybrid species B and C. is the average ESS over the different replicates, and SE stands for the sampler efficiency.
Table C: Description of the 24 rice varieties considered in our study. These varieties are either representative cultivars spanning the four main rice subpopulations (Indica, Japonica, circum Aus and circum Basmati), or wild types (Or1I, Or1A, Or3).
Table D: Data set 1, which includes only one variety per subpopulation. These varieties were chosen from Table C in S1 Text.
Table E: Data sets 2 and 3, which include two varieties per subpopulation. These varieties were chosen from Table C in S1 Text.
Table F: Information obtained from the Tracer software when data set 1 was analyzed with SnappNet. Two different samplings of 12 kSNPs were considered, with two chains for each sampling.
Table G: BH(birth rate, hybridisation rate) refers to the birth-hybridisation process of Zhang et al. with the specified birth and hybridisation rates. For data simulated with network A, only chains 1,2,3,4,9,10,11,12,17,18,19,20 were run. We indicate the mean number of reticulations for the birth-hybridisation model given an exponential prior with mean 0.1 on the network origin. Note that we only used the exponential prior in the experiment in Section 8.2 of S1 Text.
Table H: MCMC summary statistics for network A (correct population size priors).
Table I: MCMC summary statistics for network A (incorrect priors).
Table J: MCMC summary statistics for network B (correct population size priors).
Table K: MCMC summary statistics for network B (incorrect population size priors Gamma(1,20)).
Table L: MCMC summary statistics for network B (incorrect population size priors Gamma(1,1000)).
Table M: MCMC summary statistics for network B (incorrect population size priors Gamma(1,2000)).
Table N: MCMC acceptance rates for network B (correct population size priors).
Table O: MCMC acceptance rates for network B (incorrect population size priors Γ(1, 1000)).
Table P: MCMC acceptance rates for network B (incorrect population size priors Γ(1, 2000)).
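The captions above name several Gamma priors on the population size θ: Γ(1, 200) in Figs D and H, and Gamma(1,20), Gamma(1,1000) and Gamma(1,2000) as "incorrect" priors. As a rough illustration of how far apart these priors sit, the sketch below computes their means and draws samples. It assumes a shape/rate parameterization (mean = shape / rate), which the captions do not state. Under a shape/scale reading the means would be shape × scale instead. The helper names `gamma_mean` and `sample_theta` are hypothetical and are not part of SnappNet.

```python
import random

# ASSUMPTION: Gamma(a, b) is read as shape a and rate b, so the mean is a / b.
# The captions do not state the parameterization.
PRIORS = {
    "Gamma(1, 200)": (1.0, 200.0),
    "Gamma(1, 20)": (1.0, 20.0),
    "Gamma(1, 1000)": (1.0, 1000.0),
    "Gamma(1, 2000)": (1.0, 2000.0),
}


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution under the shape/rate parameterization."""
    return shape / rate


def sample_theta(shape, rate, n, seed=0):
    """Draw n values of theta from Gamma(shape, rate) with a fixed seed."""
    rng = random.Random(seed)
    # random.gammavariate expects (shape, scale), hence scale = 1 / rate.
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n)]


if __name__ == "__main__":
    for name, (shape, rate) in PRIORS.items():
        draws = sample_theta(shape, rate, n=10_000)
        empirical = sum(draws) / len(draws)
        print(f"{name}: prior mean {gamma_mean(shape, rate):.4g}, "
              f"sample mean {empirical:.4g}")
```

Under this assumed parameterization, the four priors have means spanning two orders of magnitude (from 1/2000 to 1/20), which gives a sense of the scale of prior misspecification explored in Tables K to M.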

(PDF)
