pcbi.1011790.g002.tif (543.14 kB)

PandoGen flow.

posted on 2024-01-19, 19:01 authored by Anand Ramachandran, Steven S. Lumetta, Deming Chen

First, a SARS-CoV-2 deep autoregressive (SDA) model is prepared following standard practice for training protein language models (PLMs): it is pretrained on UniProt and then finetuned on GISAID sequences. From the SDA model, a reward model is trained to predict winners in pairwise games between known sequences. Next, the SDA model initializes the PandoGen model, which is improved iteratively. In each iteration, the model generates in silico sequences, which are classified as known (K) or unknown (U). Unknown sequences are scored using the reward model, and the scores are quantized (γ). Each in silico sequence (X) and its label, either K or (U, γ), is added to a data pool. Samples from the data pool are used to finetune the PandoGen model while keeping it close to the SDA model. During generation, PandoGen is conditioned to produce unknown (U), highly infectious sequences, indicated by the highest reward quantile, γ_H. P_PandoGen is the probability distribution of the PandoGen model, P_SDA is the probability distribution of the SDA model, and D_KL is the KL divergence between two distributions.
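The labelling step in the caption (classify each generated sequence as K or U, then score and quantize the unknown ones) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `label_sequences`, `quantile_edges`, the equal-frequency binning scheme, and the default of four bins are hypothetical, and the reward function is a stand-in for the trained reward model. The caption does not specify how PandoGen actually quantizes scores.

```python
from bisect import bisect_right
from typing import Callable, Iterable, List, Set, Tuple


def quantile_edges(scores: Iterable[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 cut points that split the scores into roughly equal-sized bins."""
    ordered = sorted(scores)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def label_sequences(
    generated: List[str],
    known: Set[str],
    reward: Callable[[str], float],  # stand-in for the trained reward model
    n_bins: int = 4,                 # assumed number of reward quantiles
) -> List[Tuple[str, tuple]]:
    """Label each generated sequence X as ("K",) or ("U", gamma) for a data pool."""
    # Only unknown sequences are scored, as in the caption.
    scores = {seq: reward(seq) for seq in generated if seq not in known}
    edges = quantile_edges(scores.values(), n_bins) if scores else []

    pool = []
    for seq in generated:
        if seq in known:
            pool.append((seq, ("K",)))
        else:
            # gamma is the index of the quantile bin; n_bins - 1 plays the role of gamma_H.
            gamma = bisect_right(edges, scores[seq])
            pool.append((seq, ("U", gamma)))
    return pool
```

In this sketch, conditioning generation on the top bin index `n_bins - 1` would correspond to asking for the highest reward quantile, γ_H. The finetuning step and the KL term that keeps the model close to the SDA model are not shown.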
