Chemical reactions from US patents (1976-Sep2016)

dataset
posted on 2017-06-13, 16:49, authored by Daniel Lowe
Reactions extracted by text-mining from United States patents published between 1976 and September 2016. The reactions are available as CML or reaction SMILES. Note that the reaction SMILES are derived from the CML. The files can be unzipped using a program like 7-Zip.

The reactions were extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727
with LeadMine (https://www.nextmovesoftware.com/leadmine.html) used for chemical entity recognition.

General tips:
Duplicate reactions are frequent because the same or highly similar text occurs in multiple patents. This is especially true when combining the applications and grants datasets, as many reactions from applications later appear in patent grants.
Paragraph numbers are only present for 2005+ patent grants and patent applications.
Multiple reactions can be extracted from the same paragraph.
Atom maps in the reaction SMILES are derived using EPAM's Indigo toolkit. While typically correct, the atom maps are wrong in many cases and hence should not be entirely relied on.
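As a rough illustration of handling the duplicates mentioned above, the sketch below keeps only the first line seen for each reaction SMILES string. It assumes the reaction SMILES is the first tab-delimited field on each line, and it only catches exact string matches: the same reaction written with a different SMILES, or with different atom maps, would not be recognised as a duplicate. The function name `unique_reactions` is hypothetical and not part of the dataset.

```python
from typing import Iterable, Iterator


def unique_reactions(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line whose reaction SMILES has not been seen before.

    Assumption: the reaction SMILES is the first tab-delimited field.
    Only exact string duplicates are removed.
    """
    seen = set()
    for line in lines:
        # Take everything before the first tab as the reaction SMILES.
        smiles = line.split("\t", 1)[0].strip()
        if smiles and smiles not in seen:
            seen.add(smiles)
            yield line
```

Because the function is a generator, it can be applied to a file opened for reading without loading every line into memory, although the set of seen SMILES still grows with the number of distinct reactions.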

The reactions have been filtered to remove common cases of incorrectly extracted reactions:
All product atoms must be accounted for by the atom-mapping
The product(s) must have >8 heavy atoms
The product must not be charged if it is a single component
The number of products must be <5 and number of reactants+agents <16
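The count-based rule above can be approximated directly on a reaction SMILES string. This is a minimal sketch and not the code used to build the dataset: it assumes the standard `reactants>agents>products` layout, treats every dot-separated fragment as one component (so a salt written as two fragments counts twice), and discards anything after the first whitespace. The heavy-atom, charge, and atom-mapping rules need a cheminformatics toolkit and are not attempted here. The function names are hypothetical.

```python
from typing import Tuple


def _count_components(part: str) -> int:
    """Count non-empty dot-separated fragments in one side of the reaction."""
    return len([fragment for fragment in part.split(".") if fragment])


def component_counts(reaction_smiles: str) -> Tuple[int, int, int]:
    """Return (reactants, agents, products) counts for a reaction SMILES.

    Assumption: any text after the first whitespace is not part of the
    reactants>agents>products string and can be ignored.
    """
    core = reaction_smiles.split(None, 1)[0]
    reactants, agents, products = core.split(">")
    return (
        _count_components(reactants),
        _count_components(agents),
        _count_components(products),
    )


def passes_count_filter(reaction_smiles: str) -> bool:
    """Apply the rule: products < 5 and reactants + agents < 16."""
    n_reactants, n_agents, n_products = component_counts(reaction_smiles)
    return n_products < 5 and (n_reactants + n_agents) < 16
```

Since the published reactions have already been filtered, a check like this is mainly useful as a sanity test on downloaded files or when applying the same rule to other reaction sets.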

CML:
A schema for the CML is present in cml_xsd.zip

Reaction SMILES:
For convenience the reaction SMILES files include tab-delimited columns for:
PatentNumber, ParagraphNum, Year, TextMinedYield, CalculatedYield
All of this information is also present in the CML (year is inferred from the folders)
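A minimal sketch of reading one line of a reaction SMILES file follows. It assumes the reaction SMILES itself is the first field, followed by the five columns in the order listed above, and that missing values appear as empty fields; none of this is checked against the actual files. The record type and function name are hypothetical, and the values in the usage example are invented for illustration.

```python
from typing import NamedTuple, Optional


class ReactionRecord(NamedTuple):
    reaction_smiles: str
    patent_number: Optional[str]
    paragraph_num: Optional[str]
    year: Optional[int]
    text_mined_yield: Optional[str]
    calculated_yield: Optional[str]


def parse_line(line: str) -> ReactionRecord:
    """Split one tab-delimited line into a ReactionRecord.

    Assumption: column order is reaction SMILES, PatentNumber,
    ParagraphNum, Year, TextMinedYield, CalculatedYield.
    Empty or missing fields become None.
    """
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so unpacking always finds six fields.
    fields += [""] * (6 - len(fields))
    smiles, patent, paragraph, year, mined, calculated = fields[:6]
    return ReactionRecord(
        reaction_smiles=smiles,
        patent_number=patent or None,
        paragraph_num=paragraph or None,
        year=int(year) if year.isdigit() else None,
        text_mined_yield=mined or None,
        calculated_yield=calculated or None,
    )
```

For example, `parse_line("CCO>>CC=O\tUS0000000\t\t1976\t\t85%\n")` gives a record with `year` equal to `1976` and `paragraph_num` set to `None`, which matches the note that paragraph numbers are only present for 2005+ documents. Yields are kept as strings because their exact formatting is not specified here.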



