A global dataset of sequence, diversity and biosafety recommendation of arbovirus and arthropod-specific virus

posted on 2023-05-19, 07:31 authored by Ying Huang, Shunlong Wang, Hong Liu, Evans Atoni, Fei Wang, Wei Chen, Zhaolin Li, Sergio Rodriguez, Zhiming Yuan, Zhaoyan Ming, Han Xia

We built a comprehensive dataset of arboviruses and arthropod-specific viruses by curating worldwide available data from the Arbovirus Catalog, Section VIII-F of Biosafety in Microbiological and Biomedical Laboratories (BMBL), 6th edition, the Virus Metadata Resource of the International Committee on Taxonomy of Viruses (ICTV), and GenBank. The dataset includes information on viral taxonomy, biological characteristics, vectors and vertebrate hosts, distribution, recommended biosafety levels, genome segments, and nucleotide/amino acid sequences. It is intended to support researchers working on arboviruses and arthropod-specific viruses in viral vector/host prediction, disease outbreak risk warning, arbovirus/arthropod-specific virus interactions, phylogenetic and evolutionary relationships, and biosafety risk assessment.


This global dataset of viral sequence, diversity, distribution, and biosafety recommendations for arboviruses and arthropod-specific viruses (ASV) contains a viral information file (.xlsx), a nucleic acid sequence file (.fna), and an amino acid sequence file (.faa), all accessible from figshare [26].

The columns of the viral meta information file (.xlsx) are described below (“NAV” in a field indicates that the value is not available):

Taxonomy Information

1. Virus_Group: (customized field) viruses in the database are divided into two groups: arbovirus and ASV. The former has both vertebrate and arthropod hosts; the latter has only arthropod hosts.

2. Name: (source from GenBank) the virus name. Each name represents a distinct virus.

3. Acronym: (source from BMBL) acronym of virus name. 

4. NCBI_Taxonomy_ID: (source from GenBank) taxonomy identifier of virus from NCBI Taxonomy Database. 

5. Isolate: (source from GenBank) isolate of the virus as recorded in NCBI GenBank.

6. Unified_Isolate_Number: (customized field) renumbering of the field Isolate. Each isolate of the same virus is numbered. 

7. Species: (source from ICTV) species that the virus belongs to. The species name of a virus normally differs from its virus name.

8. Genus: (source from ICTV) genus that the virus belongs to. 

9. Family: (source from ICTV) family that the virus belongs to. 

Genome Information

10. Segmented: (customized field) whether the virus genome is unsegmented (recorded as “no”) or segmented (recorded as “yes”). Viruses with an unknown number of segments are recorded as “NAV”.

11. Number_of_Segments: (source from GenBank) the theoretical number of segments of the virus.

12. Molecule_Type: (source from GenBank) molecule type of the virus genome, such as ssRNA(+), ssRNA(-), ssRNA(+/-), dsRNA, RNA, ssDNA(+/-), or dsDNA.

Sequence Information

13. Accession: (source from GenBank) NCBI GenBank Accession of the nucleotide sequence. 

14. Locus: (source from GenBank) the locus name of the nucleotide sequence.

15. SRA_Accession: (source from GenBank) NCBI SRA Accession of the nucleotide sequence.

16. Submitters: (source from GenBank) submitters of the nucleotide sequence. 

17. Sequence_Type: (source from GenBank) whether the nucleotide sequence is a reference sequence (recorded as “RefSeq”) or a non-reference sequence (recorded as “GenBank”). 

18. BioSample: (source from GenBank) NCBI BioSample Accession of the nucleotide sequence. 

19. GenBank_Title: (source from GenBank) the field “DEFINITION” of NCBI GenBank database of the sequence. 

20. Genotype: (source from GenBank) genotype of the nucleotide sequence. 

21. Segment: (source from GenBank) segment identifier of the nucleotide sequence. 

22. Unified_Segment_Number: (customized field) renumbering of the field Segment. Each segment is assigned a new number from 1. Segment of the unsegmented virus is assigned as 1. 

Host Information

23. Host_Species: (customized field) the species of the dead-end host of the virus. 

24. Host_Genus: (customized field) the genus of the dead-end host of the virus. 

25. Host_Family: (customized field) the family of the dead-end host of the virus.

26. Host: (source from GenBank) the host field from the NCBI GenBank database, which may record either a dead-end host or a vector.

Biosafety Information

27. Recommended_BSL: (customized field) recommended biosafety level of the laboratory for research on the virus (recorded as “2”, “3”, “4”, “NAV”).

28. BMBL_Recommended_BSL: (source from BMBL) biosafety level of the laboratory recommended by BMBL for research on the virus (recorded as “2”, “2 with 3 practices”, “2b”, “3”, “3a”, “3b”, “4”, “NAV”).

29. Basis_of_Rating: (source from BMBL) risk assessment of the virus (recorded as “A1”, “A2”, “A3”, “A4”, “A7”, “IE”, “S”, “NAV”).

30. Antigenic_Group: (source from BMBL) the antigenic group of the virus. 

31. Isolated: (customized field) whether the virus has been isolated (“Yes” or “No”). 
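
The recorded values listed for the three biosafety fields above lend themselves to a simple consistency check. The following is a minimal sketch, assuming each row of the .xlsx file has already been read into a dictionary keyed by column name with values stored as text; the helper name `invalid_biosafety_fields` is hypothetical and not part of the dataset.

```python
# Allowed values, copied from the field descriptions above.
ALLOWED = {
    "Recommended_BSL": {"2", "3", "4", "NAV"},
    "BMBL_Recommended_BSL": {
        "2", "2 with 3 practices", "2b", "3", "3a", "3b", "4", "NAV",
    },
    "Basis_of_Rating": {"A1", "A2", "A3", "A4", "A7", "IE", "S", "NAV"},
}


def invalid_biosafety_fields(row):
    """Return the biosafety field names whose value is not in the documented set.

    Assumption: values are text. A spreadsheet reader that returns the
    float 2.0 for a cell would need converting to "2" before this check.
    A missing key is treated as "NAV".
    """
    bad = []
    for field, allowed in ALLOWED.items():
        value = str(row.get(field, "NAV")).strip()
        if value not in allowed:
            bad.append(field)
    return bad
```

An empty list means all three fields hold one of the documented values.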

Source Information

32. Latitude_and_Longitude: (source from GenBank) latitude and longitude of the virus isolation source.

33. State_or_Province: (customized field) state or provincial administrative unit of the virus source. 

34. Geo_Location: (source from GenBank) geographical position of the virus source. 

35. Country_or_Region: (customized field) the country or region of the virus source. 

36. Isolation_Source: (source from GenBank) the organism which the virus was collected from. 

37. Collection_Date: (source from GenBank) the date that the virus was collected. 

38. Submit_Date: (source from GenBank) the date that the virus was submitted.

39. Release_Date: (source from GenBank) the date that the virus was released or last modified.


40. Publications: (customized field) the number of publications covering research on the specific virus.

41. Accession_URL: (customized field) the URL leading directly to the GenBank source.
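
Because unavailable values are stored as the literal text “NAV”, it usually helps to convert them to an explicit missing value before analysis. The following is a minimal sketch using only the Python standard library, assuming rows are held as dictionaries keyed by the column names above; the helper names and the example rows are hypothetical and are not taken from the dataset.

```python
NAV = "NAV"  # placeholder the dataset uses for unavailable values


def clean_row(row):
    """Replace the "NAV" placeholder with None so missing values are explicit."""
    return {key: (None if value == NAV else value) for key, value in row.items()}


def select_group(rows, group):
    """Keep rows whose Virus_Group equals `group`, ignoring case."""
    wanted = group.lower()
    return [r for r in rows if str(r.get("Virus_Group", "")).lower() == wanted]


# Hypothetical rows for illustration only.
rows = [
    {"Virus_Group": "arbovirus", "Name": "Example virus A", "Genotype": "NAV"},
    {"Virus_Group": "ASV", "Name": "Example virus B", "Genotype": "NAV"},
]
cleaned = [clean_row(r) for r in rows]
asv_rows = select_group(cleaned, "ASV")
```

When the spreadsheet is read with pandas instead, passing `na_values=["NAV"]` to `pandas.read_excel` should have a similar effect.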

The nucleotide sequence file and the amino acid sequence file are standard FASTA files. Each sequence record consists of two lines: a header and the sequence content. The header contains two fields, locus and accession, separated by '|'. The content is the nucleic acid or amino acid sequence itself. The fields in the header are defined as follows:

1. Locus: NCBI GenBank LOCUS ID of the nucleotide sequence.

2. Accession: NCBI GenBank Accession of the nucleotide sequence.

3. Protein_ID: a protein sequence identification number (amino acid sequences file only).
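
The header layout described above can be read with a few lines of code. The following is a minimal sketch, assuming the header starts with '>' and lists the locus first and the accession second, with Protein_ID as a further '|'-separated field in the .faa file; the field order, the function name `parse_fasta`, and the example records are assumptions for illustration and should be checked against the actual files.

```python
def parse_fasta(lines):
    """Yield (header_fields, sequence) pairs from an iterable of FASTA lines.

    Header fields are split on '|'. Sequence lines are concatenated, so both
    the two-line layout and line-wrapped FASTA are handled.
    """
    fields, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if fields is not None:
                yield fields, "".join(chunks)
            fields = [part.strip() for part in line[1:].split("|")]
            chunks = []
        elif fields is not None:
            chunks.append(line)
    if fields is not None:
        yield fields, "".join(chunks)


# Hypothetical records; identifiers and sequences are made up.
example = [
    ">LOCUS_A|ACC000001",
    "ATGGCTAGCTAG",
    ">LOCUS_B|ACC000002",
    "ATGCCCGGGTAA",
]
# Index sequences by accession (assumed to be the second header field).
by_accession = {fields[1]: seq for fields, seq in parse_fasta(example)}
```

Indexing by accession makes it possible to look up the sequence for a given row of the .xlsx file through its Accession column. For a real file, an open file handle can be passed in place of the example list.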


Funding

National Natural Science Foundation of China (U22A20363)

Sino-Africa Joint Research Center, Chinese Academy of Sciences (SAJC201605)

National Key Research and Development Program of China (2022YFC2302700)

Scientific Research Foundation of Hangzhou City University (No. X-202212)


