Data from: Universal and taxon-specific trends in protein sequences as a function of age

Dataset (3 GB total download) posted on 2020-03-27, 17:49, authored by Jennifer James, Sara Willis, Paul Nelson, Catherine Weibel, Luke Kosinski, Joanna Masel
This dataset contains the following:

Metrics databases, calculated over all Pfam domain and full gene sequences used in this analysis, stored as tab-delimited flat files with headers.
- EnsemblGenomes_DomainMetrics.txt.gz: All domain (Pfam) metrics calculated from sequences downloaded from Ensembl.
- EnsemblGenomes_ProteinMetrics.txt.gz: All full gene metrics calculated from sequences downloaded from Ensembl.
- NCBIGenomes_DomainMetrics.txt.gz: All domain (Pfam) metrics calculated from sequences downloaded from NCBI.
- NCBIGenomes_ProteinMetrics.txt.gz: All full gene metrics calculated from sequences downloaded from NCBI.
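
Because the metrics files are gzipped, tab-delimited flat files with a header row, they can be read without decompressing them to disk first. The following is a minimal sketch using only the Python standard library. The function name `read_metrics` and its `max_rows` parameter are illustrative and not part of the dataset or its scripts. No column names are assumed, since they are not listed here; the header is taken from the file itself.

```python
import csv
import gzip


def read_metrics(path, max_rows=None):
    """Read a gzipped, tab-delimited flat file that has a header row.

    Returns (columns, rows): the header names, and one dict per data row
    keyed by those names. All values are returned as strings.
    `max_rows` (hypothetical convenience parameter) limits how many data
    rows are read, which is useful for peeking at a large file.
    """
    rows = []
    # "rt" opens the gzip stream in text mode; newline="" is what csv expects.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            rows.append(row)
        # fieldnames is None only if the file is completely empty.
        columns = reader.fieldnames or []
    return columns, rows
```

For example, `read_metrics("EnsemblGenomes_DomainMetrics.txt.gz", max_rows=5)` would return the header and the first five rows, assuming the file has been downloaded to the working directory. The files are large, so reading a limited number of rows first is a reasonable way to inspect the columns.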

-S2_SpeciesList:
Lists the species and their reference species UIDs (unique identifier numbers), which are used in the data tables above.


Homology dictionary datasets for each metric. These were created for ISD, amino acid composition, and hydrophobic clustering. Files are organised into a single zipped folder:
-HomologyDictionaryFiles.zip:

Contents:
Filenames are descriptive, and state: 1) whether the homology groups were calculated over genes or Pfams, 2) the metric, 3) whether the data were transformed, 4) the kingdom over which the homology groups were calculated (if any, this is either 'animal' or 'plant'), 5) whether the genes or Pfams included in the analysis were restricted to transmembrane (TransmembraneOnly, or Trans) or non-transmembrane (NonTransmembrane, or NonTrans) sequences, if this was specified.

-Filenames starting with 'MetricVsTime': Each file consists of two columns: 'Metric' (as specified in the file title: ISD, ISD without cysteine, or clustering) and 'Age', which is the estimated age of the homology group in MY. Values are generated by averaging over all homologous sequences (either genes or Pfams) in a group, such that each set of homologous sequences contributes only a single datapoint to our analyses. For further details, see manuscript.

-Filenames starting with 'AAComp': Summaries of phylostratigraphy slopes for each amino acid.

-Files in AACompFiles folder: Each file consists of two columns: 'Metric' (amino acid composition for the amino acid specified in the file title) and 'Age', which is the estimated age of the homology group in MY. Values are generated by averaging over all homologous sequences (either genes or Pfams) in a group, such that each set of homologous sequences contributes only a single datapoint to our analyses. For further details, see manuscript.
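
As an illustration of how a two-column 'Metric'/'Age' file might be explored, the sketch below reads one such file and fits a plain ordinary least-squares line of Metric against Age. This is not necessarily the model used in the manuscript, which should be consulted for the actual analysis. The function names `read_metric_age` and `ols_slope` are hypothetical, and the tab delimiter is an assumption: the delimiter of the homology dictionary files is not stated here, so pass a different `delimiter` if needed.

```python
import csv


def read_metric_age(path, delimiter="\t"):
    """Read the 'Metric' and 'Age' columns of a two-column file as floats.

    The default tab delimiter is an assumption, not a documented property
    of the homology dictionary files.
    """
    metric, age = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            metric.append(float(row["Metric"]))
            age.append(float(row["Age"]))
    return metric, age


def ols_slope(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With a downloaded file, `metric, age = read_metric_age(path)` followed by `ols_slope(age, metric)` gives the change in the metric per MY of homology group age under this simple model.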

Sequence data was accessed from Ensembl and NCBI. All scripts used to generate the datasets are available at https://github.com/MaselLab/ProteinEvolution

Funding

John Templeton Foundation (60814)

National Institutes of Health (GM-104040)
