datasets.tar.gz (2.21 GB)

Well-curated QSAR datasets for diverse protein targets

Posted on 2022-08-22, 22:02; authored by Yunchao Liu, Jens Meiler

High-throughput screening (HTS) is the use of automated equipment to rapidly screen thousands to millions of molecules for a biological activity of interest in the early drug discovery process. However, this brute-force approach has low hit rates, typically around 0.05% to 0.5%. Meanwhile, PubChem is a database supported by the National Institutes of Health (NIH) that contains biological activities for millions of drug-like molecules, often from HTS experiments. However, the raw primary screening data in PubChem have a high false positive rate. A series of secondary experimental screens on the putative actives is used to remove these false positives. Although all relevant screens are linked, the resulting sets of molecules are often not curated into a list of all inactive molecules from the primary HTS and only the confirmed actives from secondary screening. We therefore identified nine high-quality HTS experiments in PubChem covering all important target protein classes for drug discovery, and carefully curated these datasets so that each has lists of inactive and confirmed active molecules.
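
The curation logic described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `split_screen_results`, the compound IDs, and the choice to report unconfirmed primary hits as a separate group are all assumptions made for illustration, since the description does not say how unconfirmed hits were handled.

```python
def split_screen_results(primary_tested, primary_hits, confirmed_actives):
    """Split screened compounds into actives, inactives, and unconfirmed hits.

    Hypothetical helper: all arguments are iterables of compound IDs.
    """
    tested = set(primary_tested)
    # Only compounds that were actually in the primary screen can be hits.
    hits = set(primary_hits) & tested
    # Actives are primary hits that also passed secondary screening.
    actives = set(confirmed_actives) & hits
    # Inactives are compounds tested in the primary screen that were not hits.
    inactives = tested - hits
    # Assumption: primary hits that failed confirmation are kept apart,
    # because the source does not state how they are labeled.
    unconfirmed = hits - actives
    return actives, inactives, unconfirmed


# Made-up compound IDs, for illustration only.
tested = ["cid1", "cid2", "cid3", "cid4", "cid5"]
hits = ["cid2", "cid4"]
confirmed = ["cid4"]

actives, inactives, unconfirmed = split_screen_results(tested, hits, confirmed)
print(sorted(actives), sorted(inactives), sorted(unconfirmed))
```

Keeping the unconfirmed group separate makes the labeling decision explicit rather than silently folding likely false positives into either class.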

We preprocessed the input SMILES strings into Structure-Data Files (SDFs). Each dataset is identified by its PubChem Accession Identifier. Preprocessing of the original data includes converting SMILES strings to SDF files, generating 3D conformations, and filtering. Conversion from SMILES to SDF is done using Open Babel, version 2.4.1. Conformations are generated using Corina, version 4.3. Molecules are further filtered for validity and duplicates using the BioChemical Library (BCL).
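
As a rough illustration of the conversion step, the sketch below builds an Open Babel command line for SMILES-to-SDF conversion and shows a naive duplicate filter. It rests on several assumptions: the file names are hypothetical, the exact options the authors passed to Open Babel are not given in this description, and the exact-string duplicate check is a simple stand-in that does not reproduce how BCL detects duplicates. The Corina and BCL steps are not shown.

```python
import shutil
import subprocess


def build_obabel_command(smi_path, sdf_path):
    """Return an argv list that converts a SMILES file to an SDF file."""
    # -ismi / -osdf set the input and output formats; -O names the output file.
    return ["obabel", "-ismi", smi_path, "-osdf", "-O", sdf_path]


def drop_exact_duplicates(smiles_list):
    """Remove repeated SMILES strings, keeping the first occurrence.

    Stand-in only: this compares raw strings, so two different SMILES
    spellings of the same molecule are NOT recognized as duplicates.
    """
    seen = set()
    unique = []
    for smiles in smiles_list:
        if smiles not in seen:
            seen.add(smiles)
            unique.append(smiles)
    return unique


def convert(smi_path, sdf_path):
    """Run the conversion if the obabel executable is on the PATH."""
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel executable not found on PATH")
    subprocess.run(build_obabel_command(smi_path, sdf_path), check=True)


# Hypothetical file names, for illustration only.
print(" ".join(build_obabel_command("dataset.smi", "dataset.sdf")))
print(drop_exact_duplicates(["CCO", "c1ccccc1", "CCO"]))
```

The command is built as a list rather than a shell string so that paths containing spaces are passed to the executable unchanged.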