
Training the reward model.

Posted on 2024-01-19 by Anand Ramachandran, Steven S. Lumetta, and Deming Chen

If two sequences are first reported within a short timeframe of each other, a pairwise game can be played between them. To play the game, all instances of the two sequences within the training period are collected into the game arena, and one instance is randomly sampled; the sequence whose instance is sampled wins the game. To train the reward model, we have it produce a potential, γ, for each sequence involved in the game, and from these potentials we calculate the winning probability of each sequence. The ground-truth distribution of wins and losses is derived from the occurrence counts of the two sequences, and the reward model is trained so that its calculated outcome distribution matches this ground-truth distribution.
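The sketch below illustrates one possible training step for such a pairwise game. It assumes a PyTorch reward model that maps an encoded sequence to a scalar potential γ and uses a Bradley-Terry-style softmax over the two potentials to obtain winning probabilities; the exact link function, as well as the names `reward_model`, `encoded_a`, `encoded_b`, `count_a`, and `count_b`, are assumptions for illustration and not taken from the figure.

```python
import torch
import torch.nn.functional as F

def pairwise_game_loss(reward_model: torch.nn.Module,
                       encoded_a: torch.Tensor,
                       encoded_b: torch.Tensor,
                       count_a: int,
                       count_b: int) -> torch.Tensor:
    """Cross-entropy between predicted and ground-truth game outcomes."""
    # Potential gamma for each sequence in the game (one scalar per sequence).
    gamma_a = reward_model(encoded_a).squeeze()
    gamma_b = reward_model(encoded_b).squeeze()

    # Predicted win probabilities from the potentials
    # (softmax over the pair, a Bradley-Terry-style choice).
    pred = F.softmax(torch.stack([gamma_a, gamma_b]), dim=0)

    # Ground-truth win/loss distribution: sampling an instance uniformly
    # from the arena favours the more frequently observed sequence.
    total = count_a + count_b
    target = torch.tensor([count_a / total, count_b / total],
                          dtype=pred.dtype, device=pred.device)

    # Train the predicted outcome distribution toward the ground truth.
    return -(target * torch.log(pred + 1e-12)).sum()
```

Under these assumptions, each game sampled during training contributes one such loss term, and gradients reach the reward model only through the potentials γ.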
