figshare

Data from: The ancestral animal genetic toolkit revealed by diverse choanoflagellate transcriptomes

Version 2 2018-05-28, 19:45
Version 1 2017-12-09, 14:34
Dataset posted on 2017-12-09, 14:34, authored by Daniel Richter, Parinaz Fozouni, Michael Eisen, Nicole King
The changes in gene content that preceded the origin of animals can be reconstructed by comparison with their sister group, the choanoflagellates. However, only two choanoflagellate genomes are currently available, providing poor coverage of their diversity. We sequenced transcriptomes of 19 additional choanoflagellate species to produce a comprehensive reconstruction of the gains and losses that shaped the ancestral animal gene repertoire. We find roughly 1,700 gene families with origins on the animal stem lineage, of which only a core set of 36 are conserved across animals. We find that more than 350 gene families previously thought to be animal-specific actually evolved before the animal-choanoflagellate divergence, including Notch and Delta, Toll-like receptors, and glycosaminoglycan hydrolases that regulate animal extracellular matrix (ECM). In the choanoflagellate Salpingoeca helianthica, we show that a glycosaminoglycan hydrolase modulates rosette colony size, suggesting a link between ECM regulation and morphogenesis in choanoflagellates and animals.

File 1. Final sets of contigs from choanoflagellate transcriptome assemblies. There is one FASTA file per sequenced choanoflagellate. We assembled contigs de novo with Trinity, followed by removal of cross-contamination that occurred within multiplexed Illumina sequencing lanes, removal of contigs encoding strictly redundant protein sequences, and elimination of noise contigs with extremely low (FPKM < 0.01) expression levels.

File 2. Final sets of proteins from choanoflagellate transcriptome assemblies. There is one FASTA file per sequenced choanoflagellate. We assembled contigs de novo with Trinity, followed by removal of cross-contamination that occurred within multiplexed Illumina sequencing lanes, removal of strictly redundant protein sequences, and elimination of proteins encoded on noise contigs with extremely low (FPKM < 0.01) expression levels.
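
The final filtering step described for Files 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name and the dictionary input are hypothetical, and only the 0.01 FPKM threshold comes from the description above.

```python
# Threshold taken from the file description: contigs with FPKM < 0.01 are noise.
NOISE_FPKM_THRESHOLD = 0.01


def drop_noise_contigs(fpkm_by_contig, threshold=NOISE_FPKM_THRESHOLD):
    """Return only the contigs whose FPKM is at or above the threshold.

    fpkm_by_contig is a hypothetical mapping of contig ID -> FPKM value.
    """
    return {
        contig: fpkm
        for contig, fpkm in fpkm_by_contig.items()
        if fpkm >= threshold
    }


if __name__ == "__main__":
    # Made-up contig IDs and expression values, for illustration only.
    example = {"contig_1": 12.5, "contig_2": 0.004, "contig_3": 0.01}
    print(drop_noise_contigs(example))
```

A contig sitting exactly at 0.01 is kept here, because the description removes only contigs strictly below the threshold.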

File 3. Expression levels of assembled choanoflagellate contigs. Expression levels are shown in FPKM, as calculated by eXpress. Percentile expression rank is calculated separately for each choanoflagellate.
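
As an illustration of ranking expression separately for each species, here is a minimal Python sketch. The exact percentile definition used in File 3 is not stated, so this sketch assumes one common convention (the percentage of contigs in the same species with FPKM less than or equal to the contig's own). The function names and input layout are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(fpkm_by_contig):
    """Percentile rank of each contig within a single species.

    Assumed definition: percentage of contigs in the species whose
    FPKM is less than or equal to this contig's FPKM.
    """
    values = sorted(fpkm_by_contig.values())
    total = len(values)
    return {
        contig: 100.0 * bisect_right(values, fpkm) / total
        for contig, fpkm in fpkm_by_contig.items()
    }


def rank_per_species(fpkm_by_species):
    """Apply percentile_ranks independently to every species."""
    return {
        species: percentile_ranks(contigs)
        for species, contigs in fpkm_by_species.items()
    }


if __name__ == "__main__":
    # Made-up species labels and values, for illustration only.
    data = {
        "species_A": {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 4.0},
        "species_B": {"c1": 50.0, "c2": 5.0},
    }
    print(rank_per_species(data))
```

Because each species is ranked on its own, a contig with a given FPKM can receive different percentile ranks in different species.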

File 4. Protein sequences for all members of each gene family. This includes sequences from all species within the data set (i.e., it is not limited to the choanoflagellates we sequenced).

File 5. Gene families, group presences, and species probabilities. For each gene family, the protein members are listed. Subsequent columns contain inferred gene family presences in different groups of species, followed by probabilities of presence in individual species in the data set.

File 6. List of gene families present, gained and lost in last common ancestors of interest. A value of 1 indicates that the gene family was present, gained or lost; a value of 0 indicates that it was not. The six last common ancestors are: Ureukaryote, Uropisthokont, Urholozoan, Urchoanozoan, Urchoanoflagellate and Urmetazoan. Gains and losses are not shown for the Ureukaryote, as our data set only contained eukaryote species and was thus not appropriate to quantify changes occurring on the eukaryotic stem lineage.
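
The 1/0 coding described for File 6 lends itself to simple tallies. The Python sketch below counts gene families flagged for a given ancestor and event. The column naming scheme (`<ancestor>_<event>`) is a hypothetical choice for illustration, since the actual column headers are not given here, and the rows are assumed to be already loaded as dictionaries.

```python
ANCESTORS = [
    "Ureukaryote",
    "Uropisthokont",
    "Urholozoan",
    "Urchoanozoan",
    "Urchoanoflagellate",
    "Urmetazoan",
]
EVENTS = ("present", "gained", "lost")


def count_flag(rows, ancestor, event):
    """Count gene families with a value of 1 for the ancestor and event.

    rows: list of dicts, one per gene family, with hypothetical keys
    such as "Urmetazoan_gained" holding 0 or 1 (as int or str).
    Returns None for Ureukaryote gains and losses, which are not shown
    in the file.
    """
    if ancestor not in ANCESTORS or event not in EVENTS:
        raise ValueError(f"unknown ancestor or event: {ancestor}, {event}")
    if ancestor == "Ureukaryote" and event in ("gained", "lost"):
        return None
    key = f"{ancestor}_{event}"
    return sum(1 for row in rows if int(row[key]) == 1)


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        {"Urmetazoan_present": "1", "Urmetazoan_gained": "1"},
        {"Urmetazoan_present": "1", "Urmetazoan_gained": "0"},
    ]
    print(count_flag(rows, "Urmetazoan", "gained"))
```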

File 7. Pfam, transmembrane, signal peptide, PANTHER and Gene Ontology annotations for all proteins. Annotations are listed for all proteins in the data set, including those not part of any gene family. Pfam domains are delimited by a tilde (~) and Gene Ontology terms by a semicolon (;). Transmembrane domains and signal peptides are indicated by the number present in the protein, followed by their coordinates in the protein sequence.
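
The delimiters described for File 7 can be handled with a minimal Python sketch like the one below. Only the delimiters (tilde for Pfam domains, semicolon for Gene Ontology terms) come from the description; the function names and example values are hypothetical. The transmembrane and signal peptide fields are left out because their exact layout is not specified here.

```python
def split_field(value, delimiter):
    """Split a delimited annotation field, dropping empty entries."""
    return [item.strip() for item in value.split(delimiter) if item.strip()]


def parse_pfam(value):
    """Pfam domains are delimited by a tilde (~)."""
    return split_field(value, "~")


def parse_go(value):
    """Gene Ontology terms are delimited by a semicolon (;)."""
    return split_field(value, ";")


if __name__ == "__main__":
    # Made-up field contents, for illustration only.
    print(parse_pfam("domain_A~domain_B"))
    print(parse_go("GO:0000001; GO:0000002"))
```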

File 8. Pfam, transmembrane, signal peptide, PANTHER and Gene Ontology annotations aggregated by gene family. The proportion of proteins within the gene family that were assigned an annotation is followed by the name of the annotation. Multiple annotations are delimited by a semicolon (;).
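
An aggregated field of the kind described for File 8 could be parsed along these lines. This Python sketch assumes that the proportion and the annotation name are separated by whitespace, which is a guess: the description states only the order (proportion, then name) and the semicolon between annotations. The function name and example values are hypothetical.

```python
def parse_aggregated(value):
    """Parse 'proportion name;proportion name' into (float, str) pairs.

    Assumes whitespace between the proportion and the annotation name;
    adjust the split if the actual file uses a different separator.
    """
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        proportion, name = entry.split(None, 1)
        pairs.append((float(proportion), name.strip()))
    return pairs


if __name__ == "__main__":
    # Made-up field contents, for illustration only.
    print(parse_aggregated("0.75 domain_A;0.25 domain_B"))
```

Splitting only on the first run of whitespace keeps annotation names that themselves contain spaces intact.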

Funding

NIH R01 GM089977
