
Training the reward model.

Posted on 2024-01-19 by Anand Ramachandran, Steven S. Lumetta, and Deming Chen

If two sequences are first reported within a short timeframe of each other, a pairwise game can be played between them. To play the game, all instances of the two sequences within the training period are collected into the game arena, and one instance is randomly sampled; the sequence whose instance is sampled wins the game. To train the reward model, we have it produce a potential, γ, for each sequence involved in the game, and from these potentials we calculate the winning probability of each sequence. The ground-truth distribution of wins and losses is derived from the occurrence counts of the two sequences, and the reward model is trained so that its calculated outcome distribution matches this ground-truth distribution.
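The sketch below illustrates one possible training step for such a pairwise game. It assumes a PyTorch reward model that maps an encoded sequence to a scalar potential γ and uses a Bradley-Terry-style softmax over the two potentials to obtain winning probabilities; the exact link function, as well as the names `reward_model`, `encoded_a`, `encoded_b`, `count_a`, and `count_b`, are assumptions for illustration and not taken from the figure.

```python
import torch
import torch.nn.functional as F

def pairwise_game_loss(reward_model: torch.nn.Module,
                       encoded_a: torch.Tensor,
                       encoded_b: torch.Tensor,
                       count_a: int,
                       count_b: int) -> torch.Tensor:
    """Cross-entropy between predicted and ground-truth game outcomes."""
    # Potential gamma for each sequence in the game (one scalar per sequence).
    gamma_a = reward_model(encoded_a).squeeze()
    gamma_b = reward_model(encoded_b).squeeze()

    # Predicted win probabilities from the potentials
    # (softmax over the pair, a Bradley-Terry-style choice).
    pred = F.softmax(torch.stack([gamma_a, gamma_b]), dim=0)

    # Ground-truth win/loss distribution: sampling an instance uniformly
    # from the arena favours the more frequently observed sequence.
    total = count_a + count_b
    target = torch.tensor([count_a / total, count_b / total],
                          dtype=pred.dtype, device=pred.device)

    # Train the predicted outcome distribution toward the ground truth.
    return -(target * torch.log(pred + 1e-12)).sum()
```

Under these assumptions, each game sampled during training contributes one such loss term, and gradients reach the reward model only through the potentials γ.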
