ORDerly: Data Sets and Benchmarks for Chemical Reaction Data
Posted on 2024-04-22 - 15:36
Machine learning has the potential to provide tremendous value
to life sciences by providing models that aid in the discovery of
new molecules and reduce the time for new products to come to market.
Chemical reactions play a significant role in these fields, but there
is a lack of high-quality open-source chemical reaction data sets
for training machine learning models. Herein, we present ORDerly,
an open-source Python package for the customizable and reproducible
preparation of reaction data stored in accordance with the increasingly
popular Open Reaction Database (ORD) schema. We use ORDerly to clean
United States patent data stored in ORD and generate data sets for
forward prediction, retrosynthesis, as well as the first benchmark
for reaction condition prediction. We train neural networks on data
sets generated with ORDerly for condition prediction and show that
data sets missing key cleaning steps can lead to silently overinflated
performance metrics. Additionally, we train transformers for forward
and retrosynthesis prediction and demonstrate how non-patent data
can be used to evaluate model generalization. By providing a customizable
open-source solution for cleaning and preparing large chemical reaction
data, ORDerly is poised to push forward the boundaries of machine
learning applications in chemistry.
CITE THIS COLLECTION
Wigh, Daniel S.; Arrowsmith, Joe; Pomberger, Alexander; Felton, Kobi C.; Lapkin, Alexei A. (2024). ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. ACS Publications. Collection. https://doi.org/10.1021/acs.jcim.4c00292