Virgo Benchmarking Datasets

dataset
posted on 2025-04-13, 21:34, authored by Christopher Riccardi

1. Global Ocean Eukaryotic Viral (GOEV) Database


Source: Extract from Gaïa, M. et al., Nature (2023)

Components:


591 MAGs from Schulz, F. et al. (2020) [DOI: 10.1038/s41586-020-1957-x]


445 MAGs from Sunagawa, S. et al. (2020) [DOI: 10.1038/s41579-020-0364-5]


218 MAGs from Moniruzzaman, M. et al. (2020) [DOI: 10.1038/s41467-020-15507-2]


158 reference viral assemblies


Accessed: July 20, 2024

Data File: GOEV_DB_CONTIGS.db.zip from Figshare

Selection Criteria: Only contigs labeled at the Order taxonomic level were retained

Sampling Method: Not applicable

Final Sample Size: 1,412 viral contigs
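
As a quick consistency check, the four component counts listed above add up to the reported final sample size. A minimal sketch (the dictionary keys are illustrative labels, not identifiers from the database):

```python
# Component counts as listed in the description above.
components = {
    "Schulz et al. (2020) MAGs": 591,
    "Sunagawa et al. (2020) MAGs": 445,
    "Moniruzzaman et al. (2020) MAGs": 218,
    "reference viral assemblies": 158,
}

total = sum(components.values())
print(total)  # 1412, matching the final sample size
```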


--



2. Known Viral Sequence Clusters (kVSCs)


Source: Extract from Zolfo, M. et al. (2024) [DOI: 10.1101/2024.02.19.580813]

Data Files:


VSC5_rep_fnas_nr99_45k_metaphlanDB.fna.gz


VSCs_groups.csv metadata


Downloaded From: Zenodo, last accessed June 28, 2024

Selection Criteria:


Started from 45,872 representative sequences from MetaPhlAn 4.1


Selected kVSCs (sequences clustering with a RefSeq representative)


Verified RefSeq accessions against ICTV Release #39 for accurate labeling


Sampling Method: RefSeq matching based on metadata

Final Sample Size: 2,232 representative sequences
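
The selection steps above could be sketched roughly as follows. This is an illustration only: the column names (`vsc_id`, `refseq_accession`), the toy accessions, and the helper `select_kvscs` are hypothetical and do not reflect the real layout of VSCs_groups.csv, and treating the ICTV check as a filter is an assumption of this sketch.

```python
import csv
import io

def select_kvscs(metadata_rows, ictv_accessions):
    """Keep rows that carry a RefSeq accession also present in the ICTV set."""
    kept = []
    for row in metadata_rows:
        # Hypothetical column name; the real metadata may differ.
        accession = (row.get("refseq_accession") or "").strip()
        if accession and accession in ictv_accessions:
            kept.append(row)
    return kept

# Toy metadata standing in for VSCs_groups.csv (columns and values are made up).
toy_csv = io.StringIO(
    "vsc_id,refseq_accession\n"
    "VSC_A,NC_000001\n"
    "VSC_B,\n"
    "VSC_C,NC_000003\n"
)
rows = list(csv.DictReader(toy_csv))
ictv = {"NC_000001"}  # toy set of accessions found in the ICTV release

selected = select_kvscs(rows, ictv)
print([r["vsc_id"] for r in selected])  # ['VSC_A']
```

In this toy input, VSC_B is dropped because it has no RefSeq representative, and VSC_C because its accession is not in the ICTV set.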


--



3. ICTV Release #39


Source: International Committee on Taxonomy of Viruses (ICTV) Release #39

Downloaded Using: ICTVdump tool on July 17, 2024

Selection Criteria:


Viruses present in both VMR releases #37 and #39


At least two representatives per family


Sampling Method: Up to 5 genomes randomly sampled per family using pandas DataFrame.sample(); 192 families are represented.

Final Sample Size: 860 viral genomes
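
A minimal sketch of per-family sampling with pandas, under stated assumptions: the column names (`genome`, `family`), the toy family names, the seed, and the helper `sample_per_family` are hypothetical, and the "at least two representatives" criterion is read here as a filter applied before sampling. It is not the exact procedure used to build this dataset.

```python
import pandas as pd

def sample_per_family(df, max_per_family=5, min_per_family=2, seed=0):
    """Drop families with too few members, then keep up to max_per_family random rows each."""
    sizes = df.groupby("family")["family"].transform("size")
    eligible = df[sizes >= min_per_family]
    # Shuffle all rows uniformly at random, then take the first rows of each family.
    shuffled = eligible.sample(frac=1, random_state=seed)
    return shuffled.groupby("family").head(max_per_family).sort_index()

# Toy table: one family with 7 genomes, one with 2, one with a single genome.
toy = pd.DataFrame({
    "genome": [f"g{i}" for i in range(10)],
    "family": ["Aviridae"] * 7 + ["Bviridae"] * 2 + ["Cviridae"],
})
subset = sample_per_family(toy)
print(subset["family"].value_counts().to_dict())  # {'Aviridae': 5, 'Bviridae': 2}
```

Shuffling first and then taking the head of each group avoids the error that a fixed-size sample raises when a family has fewer members than the requested number.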


--



4. RefSeq Viral Dataset (Random Iteration)


Source: NCBI Virus portal, accessed on January 27, 2025

Selection Criteria:


Viruses with an assigned family-level taxonomy, up to 43 viruses per family


Sampling Method: Random uniform sampling

Final Sample Size: 6,778 viral genomes
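
The per-family cap can also be expressed with the Python standard library alone. The sketch below assumes records arrive as (accession, family) pairs; the helper name `cap_per_family`, the seed, and the toy accessions and family names are hypothetical.

```python
import random
from collections import defaultdict

def cap_per_family(records, cap=43, seed=0):
    """Uniformly sample at most `cap` accessions from each family."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for accession, family in records:
        if family:  # skip entries with no family-level assignment
            by_family[family].append(accession)
    sampled = []
    for family in sorted(by_family):
        members = by_family[family]
        k = min(cap, len(members))
        sampled.extend((acc, family) for acc in rng.sample(members, k))
    return sampled

# Toy input: a large family, a small family, and one unassigned record.
toy_records = [(f"acc{i}", "Aviridae") for i in range(100)]
toy_records += [(f"acc{i}", "Bviridae") for i in range(100, 110)]
toy_records += [("acc999", None)]

result = cap_per_family(toy_records)
print(len(result))  # 53: 43 from the large family plus all 10 from the small one
```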


--



5. RefSeq Viral Dataset (Prokaryote-Infecting)


Source: NCBI Virus portal, accessed on January 27, 2025

Selection Criteria:


Viruses with an assigned family-level taxonomy


Prokaryote-infecting viruses only


Sampling Method: Random uniform sampling

Final Sample Size: 3,536 viral genomes


--



6. ICTV Release #39 (Reduction Study Subset)


Source: International Committee on Taxonomy of Viruses (ICTV) Release #39

Downloaded Using: ICTVdump tool on July 17, 2024

Selection Criteria: 1,000 viruses randomly sampled from the full release

Sampling Method: Random uniform sampling

Final Sample Size: 1,000 viral genomes


Notes: This subset is the starting data for the reduction study. We provide the source code for generating the fragmented genomes.
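
For readers who want a feel for what genome fragmentation might look like, here is a purely illustrative sketch. It is not the source code referred to above: the single-contiguous-fragment strategy, the fractions, the seed, and the helper `random_fragment` are all assumptions.

```python
import random

def random_fragment(sequence, fraction, rng):
    """Return one contiguous substring covering `fraction` of `sequence`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    length = max(1, round(len(sequence) * fraction))
    # Choose a start position so the fragment fits inside the sequence.
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]

rng = random.Random(0)
genome = "ACGT" * 250  # toy 1,000 bp sequence, not real data
fragments = {f: random_fragment(genome, f, rng) for f in (0.75, 0.5, 0.25)}
for fraction, fragment in fragments.items():
    print(fraction, len(fragment))
```

With the toy 1,000 bp sequence, the three fragments are 750, 500, and 250 bp long.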
