figshare

257 nuclear genes for Rosaceae phylogenomics

dataset
posted on 2014-06-18, 16:53, authored by Aaron Liston

Conserved, putatively orthologous nuclear loci can readily be identified by comparing the apple, peach and strawberry genomes. Using ESTs, Cabrera et al (2009) developed over 600 conserved orthologous loci as genetic markers in Rosaceae; however, these are introns and are thus suboptimal for phylogenetic analyses because they are difficult to align across the family. To design the baits used here, we started with the full set of gene models in the January 2012 versions of each genome (Shulaev et al 2010, Velasco et al 2010, Arús et al 2012), downloaded from GDR: Genome Database for Rosaceae (Jung et al 2008). We identified genes that are putative single copy orthologs in strawberry and peach based on reciprocal nucleotide similarity comparisons conducted with BLAT (Kent 2002). We then extracted the corresponding genes from the apple genome, with the expectation that most would have two or more copies due to its allopolyploid ancestry. In these cases, we arbitrarily selected a single apple gene using the criterion of minimizing the number of ambiguous bases present in the sequence (corresponding to polymorphic sites in a highly heterozygous cultivar).
We further filtered these genes with the goals of maximizing phylogenetic utility and Hyb-Seq success. For phylogenetic utility, we required that each locus contain at least 960 bp of sequence and show >85% sequence similarity in pairwise comparisons among the 3 genomes. The latter criterion was applied to exclude rapidly evolving genes that would be less likely to hybridize across the entire family. To maximize target capture, we removed all exons that were <80 bp, had GC content <30% or >70%, or had >90% sequence similarity to annotated repetitive DNA in their respective genomes. We then compared the set of retained exons to themselves, excluding any with >90% sequence similarity to another target exon in the same genome. No such exons were identified in strawberry and peach, but 49 were found in apple.
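The exon length and GC content thresholds described above can be illustrated with a minimal Python sketch. It is not the code used to build this dataset: the function names are hypothetical, GC content is assumed to be computed over unambiguous A/C/G/T bases only, and the similarity-based filters (repetitive DNA, exon-to-exon, and the >85% pairwise criterion) are left out because they require alignment tools.

```python
def gc_fraction(seq):
    """GC content as a fraction of unambiguous bases (assumed definition)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def passes_capture_filters(seq, min_len=80, gc_min=0.30, gc_max=0.70):
    """True if an exon is at least 80 bp with GC content between 30% and 70%."""
    return len(seq) >= min_len and gc_min <= gc_fraction(seq) <= gc_max


def locus_long_enough(exons, min_total=960):
    """True if the exons of one locus sum to at least 960 bp.

    Whether this check was applied before or after exon filtering is not
    stated in the description; the caller chooses which exons to pass in.
    """
    return sum(len(exon) for exon in exons) >= min_total
```

For example, `passes_capture_filters("ACGT" * 20)` accepts an 80 bp exon with 50% GC, while a 76 bp exon or a pure AT repeat would be rejected.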
These steps resulted in a set of 257 genes, which were sent to MYcroarray (Ann Arbor, MI) for final bait design. For exons of 80-120 bp, a single oligonucleotide bait was used (1X tiling), and for exons >120 bp a 50% overlap (1.5X tiling) was used. Finally, baits were analyzed with a simulation of the hybridization conditions, identifying and filtering those with multiple targets in the genome.
The average targeted locus is 1704 bp long and comprises 1-20 exons (mean = 5.3 exons). Average GC content is 44.1%. The total target length ranges from 422,886 bp in apple to 448,163 bp in strawberry. This comprises approximately 1% of the coding sequence of these genomes. Average pairwise nucleotide divergence among the 257 targeted genes is 3.3-4% lower than genome-wide averages. Thus the above selection criteria resulted in gene targets that are relatively conservative. The targeted loci are dispersed throughout the Fragaria genome. Over 95% (245) of the genes have a putative ortholog in Arabidopsis, and 182 have a known molecular function. The three most abundant annotations are protein kinase (14), pentatricopeptide repeat (PPR) proteins (5), and tetratricopeptide repeat (TPR) proteins (4). Protein kinases are a large and diverse gene family in plants (Bayer et al 2012; Lundquist et al 2012). The PPR proteins are involved in nuclear-organellar signalling, and have been utilized in angiosperm phylogenetics (Yuan et al 2009, 2010). Twenty-two of the genes utilized here are among the 959 single copy genes identified by Duarte et al (2010) and proposed for utilization in phylogenetic analyses of the angiosperms.
The 3 files in fasta format contain the orthologous exons from 257 nuclear genes extracted from the published apple, peach and strawberry genomes.
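The tiling rule for baits can be sketched as follows. This is an illustration only, not the vendor's design procedure: the bait length is not given in the description, so the 120 nt default is an assumption, as are the function name and the choice to anchor a final bait at the exon end when the step size does not divide the exon evenly.

```python
def bait_starts(exon_len, bait_len=120, overlap=0.5):
    """Return 0-based start positions of baits tiled along one exon.

    bait_len=120 is an assumed value; overlap=0.5 is the 50% overlap
    stated for exons longer than 120 bp.
    """
    if exon_len <= bait_len:
        # Short exons (80-120 bp) receive a single bait.
        return [0]
    step = int(bait_len * (1 - overlap))
    last = exon_len - bait_len
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        # Assumed end handling: add a bait flush with the exon end.
        starts.append(last)
    return starts
```

Under these assumptions a 100 bp exon gets one bait, and a 240 bp exon gets three baits starting at positions 0, 60 and 120.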

The strawberry fasta file contains 1419 exon sequences, peach contains 1425 exon sequences, and apple contains 1254 exon sequences.
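After downloading the files, the exon counts above can be checked with a short script. The sketch below assumes standard FASTA formatting (header lines beginning with `>`); the function name and the file name in the usage comment are hypothetical, since the actual file names are not listed here.

```python
def fasta_stats(lines):
    """Count sequence records and total bases in an iterable of FASTA lines."""
    n_records = 0
    total_bp = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
        else:
            total_bp += len(line)
    return n_records, total_bp


# Usage with a hypothetical file name:
# with open("strawberry_exons.fasta") as handle:
#     n_exons, n_bp = fasta_stats(handle)
```

The record count returned for each file should match the number of exon sequences stated for that species.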
