HostNet: Improved Sequence Representation in Deep Neural Network for Virus-Host Prediction

Total size: 549.28 MB
dataset
modified on 2023-11-22, 06:23


Abstract:

Background: The escalation of viruses over the past decade has created a need to determine their hosts, particularly for emergent viruses that threaten human and animal health. Yet the traditional means of determining the host range of a virus, field surveillance and laboratory experiments, are laborious and demanding. Virus-host prediction is itself difficult because the available data are imbalanced and scarce, so there is a pressing demand for computational tools that can predict virus-host associations with high accuracy.

Results: To overcome the challenges of virus-host prediction, we present HostNet, a deep learning framework that uses a Transformer-CNN-BiGRU architecture and two enhanced sequence representation modules. The first module, K2V, constructs a background vector representation of k-mers from a broad range of virus sequences to address the issue of data deficiency. The second module, an adaptive sliding window (ASW), truncates virus sequences of various lengths to create a uniform number of informative and distinct samples for each sequence. We assess HostNet's performance on a benchmark dataset of "Rabies lyssavirus" and an in-house dataset of "Flavivirus." Our results show that HostNet surpasses the state-of-the-art deep learning-based host prediction method in host-prediction accuracy and F1 score. Our proposed sequence representation modules significantly enhance the deep neural network's training generalization, performance in challenging classes, and stability on imbalanced data.

Conclusions: HostNet is a promising framework for predicting virus hosts from genomic sequences, addressing challenges posed by sparse and varying-length virus sequence data. Our results demonstrate its potential as a valuable tool for viral host prediction in various biological contexts.
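To illustrate the idea behind the adaptive sliding window mentioned in the Results, the sketch below cuts sequences of different lengths into the same number of fixed-length windows by adapting the stride to each sequence. This is not the authors' implementation: the function name `adaptive_windows`, the parameters `window` and `n_samples`, the evenly spaced start positions, and the `N` padding for short sequences are all assumptions made for illustration.

```python
def adaptive_windows(seq, window, n_samples):
    """Return n_samples windows of length `window` cut from seq.

    Hypothetical sketch: start positions are spread evenly over the
    sequence, so the effective stride grows with sequence length.
    """
    if window < 1 or n_samples < 1:
        raise ValueError("window and n_samples must be positive")

    # Assumption: sequences no longer than one window are padded with 'N'
    # and repeated, so these samples are not distinct from one another.
    if len(seq) <= window:
        return [seq.ljust(window, "N")] * n_samples

    if n_samples == 1:
        return [seq[:window]]

    # Largest valid start index; the last window ends at the sequence end.
    span = len(seq) - window
    starts = [i * span // (n_samples - 1) for i in range(n_samples)]
    return [seq[s:s + window] for s in starts]


if __name__ == "__main__":
    short = "ACGT" * 25    # 100 nucleotides
    long_ = "ACGT" * 250   # 1000 nucleotides
    # Both sequences yield the same number of equally sized samples.
    print(len(adaptive_windows(short, 20, 5)), len(adaptive_windows(long_, 20, 5)))
```

With a fixed stride, a long genome would contribute far more windows than a short one; adapting the stride is one simple way to keep the number of samples per sequence uniform.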

Flavivirus. We built a nucleotide sequence dataset of the virus genus Flavivirus for invertebrate host prediction and collected the invertebrate host labels from NCBI GenBank. A total of 9626 sequences from 94 flaviviruses were selected and grouped by three invertebrate hosts: two mosquito genera, Culex and Aedes, and one tick genus, Ixodes. In the dataset, 73 viruses had mosquito hosts, 18 viruses had Ixodes as host, and three viruses had both mosquito and tick hosts. There were 2424, 6829, and 373 sequences for Culex, Aedes, and Ixodes, respectively.
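The sequence counts above can be used to quantify the class imbalance in the Flavivirus dataset. The short sketch below computes each host's share of the 9626 sequences, plus inverse-frequency class weights. The weighting formula is a common heuristic included here only as an assumption; the description does not state whether HostNet uses class weights.

```python
# Sequence counts per invertebrate host, taken from the dataset description.
counts = {"Culex": 2424, "Aedes": 6829, "Ixodes": 373}

total = sum(counts.values())

# Fraction of all sequences belonging to each host.
shares = {host: n / total for host, n in counts.items()}

# Assumed heuristic (not from the source): weight = total / (n_classes * count),
# so rarer hosts receive larger weights.
weights = {host: total / (len(counts) * n) for host, n in counts.items()}

if __name__ == "__main__":
    print(f"total sequences: {total}")
    for host in counts:
        print(f"{host}: {counts[host]} sequences, "
              f"{shares[host]:.1%} of total, weight {weights[host]:.2f}")
```

Aedes accounts for roughly 71% of the sequences and Ixodes for under 4%, which is the kind of imbalance the abstract refers to.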

Vir61. For pretraining, namely, training the k-mer to vector representation model, we built a large-scale, wide-ranging, and comprehensive viral gene dataset named Vir61. The virus genomes in the Vir61 dataset are drawn from genomes published in GenBank. The Vir61 dataset contains 103,466 viral genomes for 1377 viruses belonging to 61 viral families, including Rhabdoviridae, Togaviridae, Ascoviridae, Flaviviridae, etc.
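A k-mer to vector model treats overlapping k-mers as the "words" of a genome. The sketch below shows only that tokenization step, under stated assumptions: the function name `kmer_tokens`, the default `k=3`, and the `stride` parameter are hypothetical, and the values used for K2V are not given in this description. Training the embedding itself is not shown.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a nucleotide sequence into overlapping k-mers.

    Hypothetical helper: k and stride are illustrative defaults,
    not the settings used to build K2V.
    """
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # Sequences shorter than k produce an empty token list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # Each genome becomes a "sentence" of k-mer tokens that a
    # word-embedding model could be pretrained on.
    print(kmer_tokens("acgtac", k=3))
```

Pretraining on a broad corpus such as Vir61 gives every k-mer a vector before the much smaller host-labelled datasets are seen.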

Funding

Zhejiang Provincial Natural Science Foundation of China (No. LQ23F020002)

Scientific Research Foundation of Hangzhou City University (No. X-202212)

National Key R&D Program of China (No. 2022YFC2302700)

Zhejiang Provincial Key Research and Development Program of China (2021C01164)

Open Foundation of Key Laboratory of Tropical Translational Medicine of Ministry of Education, Hainan Medical University (2021TTM010)

Key R&D projects in Zibo City (2020kj100011)