LINCS Small Molecules Standardization and Annotation to Improve Data Integration, Analysis, and Modeling

poster
posted on 2017-05-09, 14:29 authored by Tanya T. Kelley, Raymond Terryn, Amar Koleti, Caty Chung, John P Turner, Vasileios Stathias, Dušica Vidović, Stephan Schürer

The physical properties of small molecules, in particular “drug-like” molecules, including their ability to interact with and modulate protein function, their cell permeability, and their metabolic stability, make them powerful tools to study biological systems. Large amounts of small molecule biological activity data are publicly available, and small molecules are systematically studied in the diverse profiling assays of the LINCS Consortium. Integrating LINCS data across the various assays and Centers, and with external bioactivity data, requires uniquely identifying each small molecule sample tested in an assay based on its unique “active” component. Typically, this is done based on the unique chemical structure. Although non-trivial, a unique, single-fragment representation of an organic small molecule can, in most cases, be generated by removing salt counterions and addends, considering ionization states and tautomeric forms, and canonicalizing the chemical structure representation, e.g. as a canonical SMILES or InChI. We implemented the chemical structure standardization using chemical informatics tools. Exceptions include small numbers of metal-organic and multi-component compounds, which we handled by manual expert curation. However, a significant challenge in standardizing small molecules lies in the considerable variability of reported chemical structures for the same compound, depending on the source. Typical and frequent errors include inverted or removed stereochemical centers, confusion of relative and absolute stereochemical configuration, incorrect E/Z geometric isomerism of alkenes or imines, loss of aromaticity, changes in oxidation states, and other problems. Further complexities can be introduced by different representations of compound mixtures. Public resources, such as PubChem, report many different chemical structures for the same compound, for example as identified by a common drug name.
The apparent lack of curation of small molecule chemical structures results in error propagation, for example when incorrect chemical structures are submitted to PubChem and are then referenced and potentially added to another resource.
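
As a rough illustration of the first standardization step described above (reducing a salt to a single fragment), the following Python sketch keeps the dot-separated SMILES component with the most heavy atoms. It is not the LINCS pipeline: the function names are hypothetical, the atom counting is a regular-expression heuristic rather than a real SMILES parser, and ionization, tautomer handling, and canonicalization are not attempted. Those steps need a cheminformatics toolkit, and the abstract does not say which one was used.

```python
import re

# One atom token in a SMILES string: a bracket atom, a two-letter
# organic-subset atom (Cl, Br), or a one-letter aliphatic/aromatic atom.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Explicit hydrogen bracket atoms such as [H], [H+] or [2H].
_HYDROGEN = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Count non-hydrogen atoms in one SMILES fragment (heuristic only)."""
    tokens = _ATOM.findall(fragment)
    return sum(1 for token in tokens if not _HYDROGEN.fullmatch(token))


def largest_fragment(smiles: str) -> str:
    """Return the dot-separated component with the most heavy atoms.

    Ties go to the first such component. This is an assumption made for
    the sketch; mixtures and metal-organic compounds need expert review.
    """
    fragments = [f for f in smiles.split(".") if f]
    return max(fragments, key=heavy_atom_count)


# Nicotine hydrochloride written as two disconnected fragments.
parent = largest_fragment("CN1CCC[C@H]1c1cccnc1.Cl")
```

Choosing the largest fragment is a common convention but not a safe default for every record, which is why the multi-component exceptions mentioned above were curated by hand.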

Herein we present the chemical structure standardization and registration pipeline implemented for LINCS small molecules, including manual curation, automated steps, mappings to PubChem, naming, validation, and several QC and review steps. The standardization pipeline considers stereochemical representations, mixtures of stereoisomers, geometric isomers of carbon-carbon and carbon-heteroatom double bonds, regioisomers, non-isomeric mixtures, ionization states, tautomeric forms, and salt forms or other addends. We illustrate typical errors and their propagation, a problem exacerbated by the lack of user-friendly tools that enable biologists to work with complex chemical information. In LINCS, we work to disambiguate compound identity during the registration process using redundant information, including chemical structures, drug names, vendor information, and provided cross-references.
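
One way to picture the use of redundant information is a consistency check that flags drug names reported with more than one standardized structure. The sketch below is only an assumption about how such a check could look, not the registration pipeline itself. The function name, the record layout, and the example keys are hypothetical placeholders.

```python
from collections import defaultdict


def conflicting_names(records):
    """Return drug names that are linked to more than one structure key.

    `records` is an iterable of (name, structure_key) pairs, where the
    key stands for any standardized structure identifier. Names are
    compared case-insensitively after trimming whitespace.
    """
    keys_by_name = defaultdict(set)
    for name, structure_key in records:
        keys_by_name[name.strip().lower()].add(structure_key)
    # Keep only names whose sources disagree on the structure.
    return {
        name: sorted(keys)
        for name, keys in keys_by_name.items()
        if len(keys) > 1
    }


# Placeholder records; the keys are made up for illustration.
example = [
    ("Compound A", "KEY-1"),
    ("compound a ", "KEY-1"),
    ("Compound B", "KEY-2"),
    ("compound b", "KEY-3"),
]
flagged = conflicting_names(example)
```

Names flagged this way would still need review against vendor information and cross-references before a single identity is assigned.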

Standardized LINCS small molecules are mapped to PubChem, ChEMBL, ChEBI, and, via UniChem, to many other resources. These mappings facilitate the curation and integration of diverse external annotations, such as biochemical target information. Compound standardization and mapping make it easy to integrate different LINCS signatures. The LINCS Small Molecule collection has been registered in the MIRIAM Registry and at identifiers.org to provide Persistent URLs (PURLs). The identifiers.org PURL for each LINCS small molecule redirects to the LINCS Data Portal, and the information is accessible via a RESTful API, in coordination with the interoperable smartAPI.

Funding

NIH BD2K and LINCS programs (U54HL127624)
