figshare
Browse
SRI_3_TB_datasets_combined with descriptors.sdf (12.62 MB)

TB - 3 combined datasets (TB MLSMR, CB2 and Kinase)

Download (0 kB)
dataset
posted on 2013-12-18, 14:53 authored by Sean EkinsSean Ekins

combined dataset from TB paper

J Chem Inf Model. 2013 Nov 25;53(11):3054-63. doi: 10.1021/ci400480s. Epub 2013 Oct 30.
Fusing Dual-Event Data Sets for Mycobacterium tuberculosis Machine Learning Models and Their Evaluation.

Abstract

The search for new tuberculosis treatments continues as we need to find molecules that can act more quickly, be accommodated in multidrug regimens, and overcome ever increasing levels of drug resistance. Multiple large scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external data set. A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine, or Recursive Partitioning model development (based on publicly available Mtb screening data) was used to compare individual data sets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) were used to evaluate the predictive nature of the models. We demonstrate that combining three data sets incorporating antitubercular and cytotoxicity data in Vero cells from our previous screens results in external validation receiver operator curve (ROC) of 0.83 (Bayesian or RP Forest). Models that do not have the highest 5-fold cross-validation ROC scores can outperform other models in a test set dependent manner. We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Data set fusion represents a further useful strategy for machine learning construction as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.

History