ao7b02045_si_002.xlsx (48.32 MB)
Distributed Representation of Chemical Fragments
dataset
posted on 2018-03-08, 19:16 authored by Suman K. Chakravarti
This article describes an unsupervised machine learning method
for computing distributed vector representations of molecular fragments.
These vectors encode fragment features in a continuous high-dimensional
space and enable similarity computation between individual fragments,
even for small fragments with only two heavy atoms. The method is
based on a word embedding algorithm borrowed from the field of natural
language processing, and approximately 6 million unlabeled PubChem chemicals
were used for training. The resulting dense fragment vectors are in
contrast to the traditional sparse “one-hot” fragment
representation and capture rich relational structure in the fragment
space. The vectors of small linear fragments were averaged to yield
distributed vectors of bigger fragments and molecules, which were
used for different tasks, e.g., clustering, ligand recall, and quantitative
structure–activity relationship modeling. The distributed vectors
were found to be better than standard binary fingerprints at clustering ring
systems and recalling kinase ligands. This work demonstrates
unsupervised learning of fragment chemistry from large sets of unlabeled
chemical structures and subsequent application to supervised training
on relatively small data sets of labeled chemicals.
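
The averaging and similarity steps described above can be illustrated with a minimal sketch. This is not the author's implementation: the fragment names, the three-dimensional vectors, and the helper functions average_vector and cosine_similarity are all hypothetical, chosen only to show how dense vectors allow graded similarity where one-hot vectors do not.

```python
import math

# Hypothetical dense vectors for three small fragments. Real embeddings would
# be learned from data and have many more dimensions; these values are made up.
fragment_vectors = {
    "C-C": [0.9, 0.1, 0.0],
    "C-O": [0.1, 0.8, 0.1],
    "C-N": [0.2, 0.7, 0.3],
}

# One-hot encoding of the same fragments: one dimension per fragment.
one_hot_vectors = {
    "C-C": [1.0, 0.0, 0.0],
    "C-O": [0.0, 1.0, 0.0],
    "C-N": [0.0, 0.0, 1.0],
}


def average_vector(fragments, table):
    """Average the vectors of the given fragments, element by element."""
    vectors = [table[name] for name in fragments]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Similarity between two individual two-atom fragments.
dense_sim = cosine_similarity(fragment_vectors["C-O"], fragment_vectors["C-N"])
one_hot_sim = cosine_similarity(one_hot_vectors["C-O"], one_hot_vectors["C-N"])

# Vectors for two hypothetical larger structures, built by averaging.
structure_a = average_vector(["C-C", "C-O"], fragment_vectors)
structure_b = average_vector(["C-C", "C-N"], fragment_vectors)
structure_sim = cosine_similarity(structure_a, structure_b)

print("dense fragment similarity:", round(dense_sim, 3))
print("one-hot fragment similarity:", one_hot_sim)
print("averaged structure similarity:", round(structure_sim, 3))
```

In this toy setting the one-hot vectors of two distinct fragments are always orthogonal, so their similarity is zero, while the dense vectors can express that two fragments are related.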
Keywords
high-dimensional space, chemical structures, fragment space, vector representation, data sets, vectors encode fragment features, kinase ligands, PubChem chemicals, similarity computation, fragment chemistry, language processing field, word embedding algorithm, Chemical Fragments, ring systems, fragment vectors, unsupervised machine