EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life

Posted on 2020-07-01, 10:21. Authored by Daniel Richter, Cédric Berney, Jürgen Strassert, Fabien Burki, Colomban de Vargas

Version 2 (30 June, 2020)

Please see DOI: 10.1101/2020.06.30.180687 for a detailed description of the database. Scroll to the end for changes since version 1.

Are we missing anything? Please let us know!

EukProt is a database of published and publicly available predicted protein sets and unannotated genomes selected to represent eukaryotic diversity, including 742 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for studies in phylogenomics, gene family evolution, and other gene-based research across the spectrum of eukaryotic life. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is currently in version 2, and all versions will be permanently stored and made available via FigShare. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

This release contains 5 files:

EukProt_proteins.v02.2020_06_30.tgz: 726 protein data sets, for species with either a genome (239) or single-cell genome (10) with predicted proteins, a transcriptome (453), a single-cell transcriptome (7), or an EST assembly (17).

EukProt_unannotated_genomes.v02.2020_06_30.tgz: 16 genomes lacking predicted protein annotations, for 15 species with single-cell genomes, and 1 species with a genome sequence.

EukProt_assembled_transcriptomes.v02.2020_06_30.tgz: assembled transcriptome contigs, for 53 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

EukProt_included_data_sets.v02.2020_06_30.txt and EukProt_not_included_data_sets.v02.2020_06_30.txt: tables of information on the data sets that are included (742 data sets) or not included (50) in the database. Both files are tab-delimited; multiple entries in the same cell are comma-delimited, and missing data are represented by the value “N/A”. The tables have the following columns:

EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.

Name_to_Use: the name of the species for protein/genome/assembled transcriptome files.

Strain: the strain(s) of the species sequenced.

Previous_Names: any previous names that this species was known by, not including cases where a species was originally assigned to a genus but not identified to the species level (e.g., Goniomonas sp., now identified as Goniomonas avonlea, is not listed as a previous name).

Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).

Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).

Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).

Merged_Strains: whether multiple strains of the same species were merged to create the data set.

Data_Source_URL: the URL(s) from which the data were downloaded.

Data_Source_Name: the name of the data set (as assigned by the data source).

Paper_DOI: the DOI(s) of the paper(s) that published the data set.

Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database, excluding genomes lacking annotations (these are provided as is, with the label ‘translated sequence search’ indicating that proteins of interest can be identified with translated sequence homology search software). Actions taken:

‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/

‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/

‘extractfeat’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/

‘translate mRNA’: TransDecoder v. 5.3.0, http://transdecoder.github.io/

All parameter values were default, unless otherwise specified.

Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).

Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).
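The table layout described above (tab-delimited, comma-delimited multi-value cells, “N/A” for missing data, semicolon-delimited lineage) can be read with a few lines of Python. The following is a minimal sketch, not an official EukProt tool. It assumes that the first line of each table is a header row carrying the column names listed above. The two sample rows, their IDs, names, and DOIs are hypothetical placeholders, and the set of columns treated as multi-valued is a guess based on the column descriptions.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt in the documented layout (NOT real EukProt records).
SAMPLE = (
    "EukProt_ID\tName_to_Use\tStrain\tTaxonomy_UniEuk\tPaper_DOI\tData_Source_Type\n"
    "EP99901\tExample_species_one\tN/A\tEukaryota;Groupus;Subgroupus\t"
    "10.0000/a,10.0000/b\tgenome\n"
    "EP99902\tExample_species_two\tXYZ-1\tEukaryota;Otherus\tN/A\ttranscriptome\n"
)

# Assumption: columns whose descriptions allow several entries per cell.
# Free-text columns such as Notes are deliberately left unsplit.
MULTI_VALUE = {"Strain", "Previous_Names", "Data_Source_URL",
               "Paper_DOI", "Actions_Prior_to_Use"}


def parse_cell(column, value):
    """Convert one raw cell into None, a list, or a plain string."""
    if value is None or value == "N/A":
        return None                      # missing data
    if column == "Taxonomy_UniEuk":
        return value.split(";")          # full lineage, root first
    if column in MULTI_VALUE:
        return [v.strip() for v in value.split(",")]
    return value


def read_table(handle):
    """Read a tab-delimited table into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{col: parse_cell(col, val) for col, val in row.items()}
            for row in reader]


rows = read_table(io.StringIO(SAMPLE))

# Tally data sets by source type, e.g. to compare with the counts per file.
type_counts = Counter(row["Data_Source_Type"] for row in rows)
```

To read a downloaded table instead of the sample, pass an open file handle, for example `open("EukProt_included_data_sets.v02.2020_06_30.txt", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.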

Changes since version 1

This version includes 45 new data sets. Of these, 38 are from species that were previously not represented in the database, 5 are higher-quality replacements (e.g., genomes replacing transcriptomes) for species that were already in the database, and 2 are merges of 5 different strains that were present in version 1. Of these 5 strains, 4 were previously undescribed at the species level. They were subsequently identified and, together with the 1 strain that was already described, represent two different species; the strains of each species were merged independently using CD-HIT.

In total, 11 data sets were removed: the 5 data sets superseded by new, higher-quality data sets for the same species, the 5 data sets from strains that were merged, and 1 data set removed due to its low quality.

21 newly published data sets were added to the list of those that are not included in the database (annotated with the reasons for which they were not included).
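The counts of added and removed data sets given above can be checked with simple arithmetic. The sketch below only restates numbers from this description. The dictionary labels are informal shorthand, and the net change assumes that no data sets were added or removed other than those listed here.

```python
# Counts copied from the change notes above; labels are informal shorthand.
new_total = 45
new_breakdown = {
    "species not previously represented": 38,
    "higher-quality replacements": 5,
    "merged-strain data sets": 2,
}

removed_total = 11
removed_breakdown = {
    "superseded by higher-quality data sets": 5,
    "strains that were merged": 5,
    "low-quality data set": 1,
}

# Each breakdown should add up to its stated total.
assert sum(new_breakdown.values()) == new_total
assert sum(removed_breakdown.values()) == removed_total

# Net change in included data sets relative to version 1
# (assumes no additions or removals beyond those listed above).
net_change = new_total - removed_total
print(f"Net change in included data sets: {net_change:+d}")
```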

Changes to metadata of existing data sets:
- EP00022: species epithet was corrected from Cavenderia fasciculatum to Cavenderia fasciculata.
- EP00032: strain name was added.
- EP00233, EP00234, and EP00235: UniEuk taxonomic lineage was modified to reflect recent phylogenomic results (Li et al. 2020, DOI: 10.1038/s41559-020-1221-7).
- EP00281: species epithet was removed, due to uncertainty regarding the identification of the strain (Hoef-Emden 2018, DOI: 10.1016/j.protis.2018.04.005).
- EP00305: genus name was corrected from Pavlova to Diacronema.