Collection of reaction SMILES (reactants, reagents, solvents, products), 1.37M lines in total, from patent literature (USPTO, 1976-2024) and from academic literature (2.5% of the total). The data combine a conversion of the existing USPTO dataset [1] with data generated by custom-designed parsing. Data extraction was done with OSCAR (semantic) or ChatGPT (LLM); molecule identification was done with OPSIN and a custom synonym list. All SMILES are RDKit-safe, and duplicate reactions have been removed. Please note that the data were collected in a semi-automated process, so the dataset is certainly not without errors. More information at https://kmt.vander-lingen.nl.
[1] Chemical reactions from US patents (1976-Sep2016), Daniel Lowe. Link.
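
As a rough illustration of how a line of reaction SMILES can be taken apart, the sketch below splits one reaction into its components using only the Python standard library. It assumes the general reaction SMILES convention `reactants>agents>products`, with molecules separated by `.`; the exact file layout of this dataset is not described here, so the one-reaction-per-line assumption, the function name `split_reaction_smiles`, and the example reaction are all hypothetical and not taken from the data.

```python
def split_reaction_smiles(rxn: str) -> dict[str, list[str]]:
    """Split a 'reactants>agents>products' string into lists of molecule SMILES.

    Assumption: the input is a plain reaction SMILES with exactly two '>'
    separators and no trailing annotation columns.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '>' separators, got {len(parts) - 1}"
        )
    # Molecules within each section are separated by '.'; an empty section
    # (for example no agents, written as '>>') yields an empty list.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Illustrative reaction (acid-catalysed esterification), not from the dataset.
example = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
parsed = split_reaction_smiles(example)
print(parsed["reactants"], parsed["agents"], parsed["products"])
```

Reagents and solvents both sit in the middle section under this convention, so telling them apart would need extra information. Because the dataset is stated to contain errors, it may also be worth validating each molecule with a cheminformatics toolkit such as RDKit before use.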