Version 3 2024-02-05, 22:42
Version 2 2023-08-29, 17:03
Version 1 2023-06-12, 16:43
dataset
Posted on 2024-02-05, 22:42, authored by Daniel Wigh, Joe Arrowsmith, Alexander Pomberger, Kobi Felton, Alexei A. Lapkin
Supplementary datasets used in ORDerly (i.e. the non-benchmark datasets)
Condition prediction datasets: Contains parquet files for each of the four flavours of ORDerly-condition datasets that we used in the ORDerly paper.
Condition prediction datasets config: Contains the .log and .json files showing the parameters used in cleaning and the impact on dataset size after each cleaning step.
Transformer datasets: Contains plain-text (.txt) files with the six transformer-ready datasets that were used for training/testing with Molecular Transformer.
Non uspto data: Contains the datasets created with ORDerly from non-USPTO data in ORD. These datasets were used as test sets for forward prediction and retrosynthesis prediction.
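As a rough starting point for inspecting the downloaded files, the sketch below reads a plain-text dataset file and a cleaning-config .json file using only the Python standard library. The helper names `read_lines` and `read_config` are hypothetical, and the one-example-per-line layout of the .txt files is an assumption, not something stated in this description.

```python
import json


def read_lines(path):
    """Return the non-empty lines of a plain-text dataset file.

    Assumption: each line holds one example; check the actual files.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def read_config(path):
    """Load a cleaning-config .json file into a Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The parquet files are not covered by this sketch. They can typically be loaded with `pandas.read_parquet`, which requires a parquet engine such as pyarrow to be installed.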
The ORDerly benchmark datasets can be found here: https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467
Please feel free to contact me, Daniel Wigh, at dsw46@cam.ac.uk in case of any questions.
Funding
UCB Pharma
Engineering and Physical Sciences Research Council via project EP/S024220/1 EPSRC Centre for Doctoral Training in Automated Chemical Synthesis Enabled by Digital Molecular Technologies.
European Regional Development Fund via the project "Innovation Centre in Digital Molecular Technologies"