Posted on 2013-04-22, 00:00. Authored by Robert P. Sheridan.
Cross-validation is a common method to validate a QSAR model. In
cross-validation, some compounds are held out as a test set, while
the remaining compounds form a training set. A model is built from
the training set, and the test set compounds are predicted on that
model. The agreement of the predicted and observed activity values
of the test set (measured by, say, R²) is an estimate of
the self-consistency of the model and is sometimes taken as an indication
of the predictivity of the model. This estimate of predictivity can
be optimistic or pessimistic compared to true prospective prediction,
depending on how compounds in the test set are selected. Here, we show
that time-split selection gives an R² that is more like
that of true prospective prediction than the R² from random
selection (too optimistic) or from our analog of leave-class-out selection
(too pessimistic). Time-split selection should be used in addition
to random selection as a standard for cross-validation in QSAR model
building.
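
The difference between random and time-split selection can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the author's implementation: the record layout (dicts with `id`, `date`, and `activity` keys), the function names, and the 25% test fraction are all hypothetical, and R² is taken here to be the squared Pearson correlation, which the abstract does not specify.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a randomly chosen subset of compounds as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def time_split(records, test_fraction=0.25, date_key="date"):
    """Hold out the most recently tested compounds as the test set.

    Assumes each record carries a sortable date (e.g. an ISO 8601 string).
    """
    ordered = sorted(records, key=lambda r: r[date_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]  # (train, test)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    This definition of R^2 is an assumption made for the sketch.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov * cov / (var_o * var_p)


# Hypothetical data: eight compounds, one tested per month.
records = [
    {"id": f"cmpd{i}", "date": f"2012-{i:02d}-01", "activity": float(i)}
    for i in range(1, 9)
]
train_t, test_t = time_split(records)
train_r, test_r = random_split(records)
```

With a time split, every training compound predates every test compound, which mimics prospective use of a model. A random split mixes early and late compounds in both sets. In either case, a model would be fit on the training set and `r_squared` applied to the observed and predicted activities of the test set.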