EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotes

posted on 2022-03-23, 14:27 authored by Daniel Richter, Cédric Berney, Jürgen Strassert, Yu-Ping Poh, Emily K. Herman, Sergio A. Muñoz-Gómez, Jeremy G. Wideman, Fabien Burki, Colomban de Vargas

Version 3 (22 November, 2021)

See for a detailed description of the database.

See for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at

See for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).

Scroll to the end of this page for changes since version 2.

Are we missing anything? Please let us know!

EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are also available. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

This release contains 5 files:

EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).

EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity, for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.

EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. The tables are tab-delimited; multiple entries in the same cell are comma-delimited; missing data are represented with the “N/A” value. The tables have the following columns:

EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.

Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.

Strain: the strain(s) of the species sequenced.

Previous_Names: any previous names that this species was known by.

Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).

Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).

Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).

Merged_Strains: whether multiple strains of the same species were merged to create the data set.

Data_Source_URL: the URL(s) from which the data were downloaded.

Data_Source_Name: the name of the data set (as assigned by the data source).

Paper_DOI: the DOI(s) of the paper(s) that published the data set.

Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):
‘assemble mRNA’: Trinity v. 2.8.4,
‘CD-HIT’: v. 4.6,
‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v.,
‘translate mRNA’: Transdecoder v. 5.3.0,
‘gffread’: v. 0.12.3,
‘predict genes’: EukMetaSanity (cloned on 21 September, 2021).
All parameter values were default, unless otherwise specified.

Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).

Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).

Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.

Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.

18S_Sequence_GenBank_ID: GenBank identifier of the 18S sequence for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that none of the available sequences is full-length (all are < 1,500 bp).

18S_Sequence: the 18S sequence for the strain, derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.

18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.

18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.

18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.

18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.
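
The table conventions described above (tab-delimited columns, comma-delimited cells, semicolon-delimited lineages, “N/A” for missing data, hyphenated identifier ranges) can be read with a short Python sketch such as the one below. It is only an illustration: it assumes that the first line of each table is a header row using the column names listed above, and the function names (split_cell, parse_18s_ids, read_table) are hypothetical and not part of any EukProt tooling. Quoting is disabled in the reader because the taxonomy field may contain single quotes that should be kept literally.

```python
import csv

MISSING = "N/A"  # value the tables use for missing data


def split_cell(value, sep=","):
    """Split a multi-valued cell into a list; return [] for missing data."""
    value = value.strip()
    if not value or value == MISSING:
        return []
    return [item.strip() for item in value.split(sep)]


def parse_18s_ids(value):
    """Turn an 18S_Sequence_GenBank_ID cell into (first, last) pairs.

    One pair is produced per comma-separated entry. A single identifier
    yields (id, id); a hyphenated range yields (start, end).
    """
    pairs = []
    for entry in split_cell(value):
        start, _, end = entry.partition("-")
        pairs.append((start, end or start))
    return pairs


def read_table(handle):
    """Yield one dict per data set from an open tab-delimited table.

    Assumption: the first line is a header row with the column names
    described in this document.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["Strain"] = split_cell(row["Strain"])
        row["Taxonomy_UniEuk"] = split_cell(row["Taxonomy_UniEuk"], sep=";")
        row["18S_Sequence_GenBank_ID"] = parse_18s_ids(
            row["18S_Sequence_GenBank_ID"]
        )
        yield row
```

Under these assumptions, opening EukProt_included_data_sets.v03.2021_11_22.txt with open(path, newline="") and passing the handle to read_table would give one dictionary per data set, with the Strain and 18S identifier lists kept in the same order so that they can be paired.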

Changes since version 2

There are 324 new data sets included. 57 of these replace data sets from version 2.

40 newly published data sets were added to the list of data sets that are not included in the database (annotated in the Notes column with the reasons they were not included).

Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).

All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.

In the Taxonomy_UniEuk column, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).

Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.

The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.

In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.

EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:

>[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]

Type abbreviations are P (protein) and T (transcriptome).
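
As an illustration of the header format above, the following Python sketch splits a header line into its components. It rests on several assumptions that are not stated in this document: that the EukProt ID contains no underscore, that Name_to_Use is written with underscores rather than spaces, and that the counter consists only of digits. The names EukProtHeader and parse_header are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed shape of the last token: one type letter (P or T), then digits.
_TYPE_COUNTER = re.compile(r"([PT])(\d+)")


class EukProtHeader(NamedTuple):
    eukprot_id: str
    name_to_use: str
    seq_type: str         # "P" (protein) or "T" (transcriptome)
    counter: int
    previous_header: str  # empty string if nothing followed the identifier


def parse_header(line):
    """Parse one FASTA header line in the standardized format."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    # The new identifier ends at the first space; the rest is the old header.
    identifier, _, previous = line[1:].rstrip("\n").partition(" ")
    # Assumption: the EukProt ID itself contains no underscore.
    eukprot_id, _, rest = identifier.partition("_")
    # Name_to_Use may contain underscores, so take the last token from the right.
    name, _, type_counter = rest.rpartition("_")
    match = _TYPE_COUNTER.fullmatch(type_counter)
    if not (eukprot_id and name and match):
        raise ValueError("unexpected header format: %r" % line)
    return EukProtHeader(
        eukprot_id, name, match.group(1), int(match.group(2)), previous
    )
```

Splitting the type and counter from the right means that a species name containing several underscores is still recovered whole; a header that does not fit the assumed shape raises ValueError rather than returning a partial result.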

All characters not in the following list are removed from nucleic acid sequences:
All characters not in the following list are removed from protein sequences:

Lists of legal characters are from:
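
The kind of filtering described above can be sketched in Python as follows. The two alphabets in this example are assumptions chosen for illustration (IUPAC-style one-letter codes) and are not the database's actual lists of legal characters; the case-insensitive comparison is likewise an assumption, and remove_illegal is a hypothetical helper name.

```python
# Hypothetical alphabets for illustration only (IUPAC-style codes);
# these are NOT the lists of legal characters used by the database.
NUCLEOTIDE_CHARS = frozenset("ACGTUNRYSWKMBDHV")
PROTEIN_CHARS = frozenset("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")


def remove_illegal(sequence, legal_chars):
    """Return the sequence with every character outside legal_chars removed.

    Assumption: membership is checked case-insensitively, and the original
    case of the characters that are kept is preserved.
    """
    return "".join(ch for ch in sequence if ch.upper() in legal_chars)
```

For example, with these assumed alphabets, remove_illegal("ACG-T nX1", NUCLEOTIDE_CHARS) keeps only the characters A, C, G, T and n.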


