Two FASTA sequence datasets were downloaded from the GenBank nucleotide and GenBank protein databases using the Galaxy tool "fetch_fasta_from_ncbi" (https://toolshed.g2.bx.psu.edu/) and the query strings given below.
DNA sequences were retrieved on 21-03-2018 using two queries:
"txid10239[Organism] NOT txid131567[Organism] NOT phage[All Fields] NOT patent[All Fields] NOT chimeric[Title] NOT vector[Title] NOT method[Title] NOT X174[All Fields] AND 301:10000[Sequence length]" and "txid10239[Organism] NOT txid131567[Organism] NOT phage[All Fields] NOT patent[All Fields] NOT chimeric[Title] NOT vector[Title] NOT method[Title] NOT X174[All Fields] AND 10001:1300000[Sequence length]".
The sequences of 301 to 10000 nt were then clustered using the Galaxy tool vsearch. The resulting centroids were finally merged with the sequences of 10001 to 1300000 nt to produce vir2_NCBI_21-03-2018.
Protein sequences were retrieved using the query "txid10239[Organism] NOT txid131567[Organism] NOT phage[All Fields] NOT patent[All Fields] NOT chimeric[Title] NOT vector[Title] NOT method[Title] NOT X174[All Fields] AND 30:9000[Sequence length]".
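
The three queries above share the same filter and differ only in the sequence length range. As a minimal sketch of how they relate, the following Python snippet rebuilds them from that shared filter. The names BASE_FILTER, build_query and the dictionary labels are hypothetical and chosen for illustration only; the snippet only assembles the query strings and does not contact NCBI or run the Galaxy tool.

```python
# Shared filter, copied verbatim from the queries quoted above.
BASE_FILTER = (
    "txid10239[Organism] NOT txid131567[Organism] NOT phage[All Fields] "
    "NOT patent[All Fields] NOT chimeric[Title] NOT vector[Title] "
    "NOT method[Title] NOT X174[All Fields]"
)


def build_query(min_len: int, max_len: int) -> str:
    """Append a sequence length range to the shared filter (hypothetical helper)."""
    return f"{BASE_FILTER} AND {min_len}:{max_len}[Sequence length]"


# Hypothetical labels for the three retrievals described in the text.
QUERIES = {
    "nucleotide_short": build_query(301, 10000),     # clustered with vsearch afterwards
    "nucleotide_long": build_query(10001, 1300000),  # merged with the centroids
    "protein": build_query(30, 9000),
}

if __name__ == "__main__":
    for label, query in QUERIES.items():
        print(f"{label}\t{query}")
```

Keeping the filter in one place makes it easier to see that the two nucleotide length ranges are adjacent and do not overlap, so merging the centroids with the long sequences should not introduce records retrieved twice.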