Reactions extracted by text-mining from United States patents published between 1976 and September 2016. The reactions are available as CML or reaction SMILES. Note that the reaction SMILES are derived from the CML. The files can be unzipped using a program like 7-Zip.
The reactions were extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727 with LeadMine (https://www.nextmovesoftware.com/leadmine.html) used for chemical entity recognition.
General tips:
- Duplicate reactions are frequent because the same or highly similar text occurs in multiple patents. This is especially true when combining the application and grant datasets, as many reactions from applications later appear in patent grants.
- Paragraph numbers are only present for 2005+ patent grants and patent applications.
- Multiple reactions can be extracted from the same paragraph.
- Atom maps in the reaction SMILES are derived using EPAM's Indigo toolkit. While typically correct, the atom maps are wrong in many cases and hence should not be relied on entirely.
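As an illustration of the duplicate-handling tip, the following is a minimal sketch that drops repeated reactions from a list of reaction SMILES strings. It compares the strings after removing atom-map numbers, on the assumption that the maps are not reliable enough to distinguish reactions. The function names `strip_atom_maps` and `deduplicate` are hypothetical, and the comparison is purely textual: two different SMILES spellings of the same reaction would still be kept as separate entries unless they are canonicalised with a cheminformatics toolkit first.

```python
import re

# Matches an atom-map suffix such as ":12" directly before a closing bracket.
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_maps(reaction_smiles):
    """Remove atom-map numbers, e.g. '[CH3:1]' becomes '[CH3]'."""
    return _ATOM_MAP.sub("", reaction_smiles)


def deduplicate(reaction_smiles_list):
    """Keep the first occurrence of each reaction, ignoring atom maps.

    Assumption: textual identity after map removal is a good enough
    duplicate key. This is a crude stand-in for real canonicalisation.
    """
    seen = set()
    unique = []
    for smiles in reaction_smiles_list:
        key = strip_atom_maps(smiles)
        if key not in seen:
            seen.add(key)
            unique.append(smiles)
    return unique


# Made-up example strings, not taken from the dataset.
examples = [
    "[CH3:1][OH:2]>>[CH3:1][OH:2]",
    "[CH3:2][OH:1]>>[CH3:2][OH:1]",  # same text apart from the map numbers
    "CC>>CC",
]
unique_examples = deduplicate(examples)
```

Keeping the first occurrence means the order in which the application and grant files are read decides which copy survives.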
The reactions have been filtered to remove common cases of incorrectly extracted reactions:
- All product atoms must be accounted for by the atom-mapping.
- The product(s) must have >8 heavy atoms.
- The product must not be charged if it is a single component.
- The number of products must be <5 and the number of reactants+agents <16.
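The count-based criterion can be sketched on a reaction SMILES string as follows. This is not the code used to build the dataset: it assumes the usual `reactants>agents>products` layout and treats every dot-separated fragment as one component, which may overcount when a single component (for example a salt) is written as several fragments. The heavy-atom and charge criteria need a cheminformatics toolkit and are left out. The names `split_reaction` and `passes_count_filter` are hypothetical.

```python
def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of fragments."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # An empty section (e.g. no agents) yields an empty list.
    return tuple([frag for frag in part.split(".") if frag] for part in parts)


def passes_count_filter(reaction_smiles):
    """Check: number of products < 5 and reactants + agents < 16.

    Assumption: each dot-separated fragment counts as one component.
    """
    reactants, agents, products = split_reaction(reaction_smiles)
    return len(products) < 5 and len(reactants) + len(agents) < 16


# Made-up examples, not taken from the dataset.
ok = passes_count_filter("CCO.CC(=O)O>OS(=O)(=O)O>CCOC(C)=O")
too_many_products = passes_count_filter("C>>C.C.C.C.C")
```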
CML: A schema for the CML is present in cml_xsd.zip
Reaction SMILES: For convenience, the reaction SMILES include tab-delimited columns for PatentNumber, ParagraphNum, Year, TextMinedYield and CalculatedYield. All of this information is also present in the CML (the year is inferred from the folders).
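A minimal sketch of reading such a tab-delimited file with the Python standard library is given below. It assumes that the reaction SMILES itself is the first column, followed by the five columns in the order listed above, and that the file starts with a header line; the column label `ReactionSmiles`, the function name `read_reactions` and the sample row are all made up for illustration and should be checked against the actual files.

```python
import csv
import io

# Assumed column order: reaction SMILES first, then the documented columns.
COLUMNS = [
    "ReactionSmiles",  # hypothetical label for the first column
    "PatentNumber",
    "ParagraphNum",
    "Year",
    "TextMinedYield",
    "CalculatedYield",
]


def read_reactions(handle, has_header=True):
    """Yield one dict per row of a tab-delimited reaction SMILES file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header line
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Pad short rows so that missing trailing fields become empty strings.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


# Made-up sample with an empty paragraph number and no yields.
sample = (
    "ReactionSmiles\tPatentNumber\tParagraphNum\tYear\t"
    "TextMinedYield\tCalculatedYield\n"
    "[CH3:1][OH:2]>>[CH3:1][OH:2]\tUS0000000\t\t1976\t\t\n"
)
rows = list(read_reactions(io.StringIO(sample)))
```

For a real file, the same function can be given an open file object, for example `open(path, newline="", encoding="utf-8")`, where `path` points to an unzipped reaction SMILES file.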