ci5003262_si_003.zip (2.42 MB)
Ligand Efficiency-Based Support Vector Regression Models for Predicting Bioactivities of Ligands to Drug Target Proteins
dataset
posted on 2014-10-27, 00:00 authored by Nobuyoshi Sugaya
The concept of ligand efficiency (LE) indices is widely accepted
throughout the drug design community and is frequently used in a retrospective
manner in the process of drug development. For example, LE indices
are used to investigate LE optimization processes of already-approved
drugs and to re-evaluate hit compounds obtained from structure-based
virtual screening methods and/or high-throughput experimental assays.
However, LE indices could also be applied in a prospective manner
to explore drug candidates. Here, we describe the construction of
machine learning-based regression models in which LE indices are adopted
as an end point and show that LE-based regression models can outperform
regression models based on pIC50 values. In addition to
pIC50 values traditionally used in machine learning studies
based on chemogenomics data, three representative LE indices (ligand
lipophilicity efficiency (LLE), binding efficiency index (BEI), and
surface efficiency index (SEI)) were adopted, then used to create
four types of training data. We constructed regression models by applying
a support vector regression (SVR) method to the training data. In
cross-validation tests of the SVR models, the LE-based SVR models
showed higher correlations between the observed and predicted values
than the pIC50-based models. Application tests on new data
showed that, in general, the predictive performance of the SVR models
follows the order SEI > BEI > LLE > pIC50. Close examination
of the distributions of the activity values (pIC50, LLE,
BEI, and SEI) in the training and validation data implied that the
performance order of the SVR models may be ascribed to the much higher
diversity of the LE-based training and validation data. In the application
tests, the LE-based SVR models offered better predictive performance
for compound–protein pairs with a wider range of ligand potencies
than the pIC50-based models. This finding strongly suggests
that LE-based SVR models are better than pIC50-based models
at predicting bioactivities of compounds that could exhibit a much
higher (or lower) potency.
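
The abstract names the three LE indices but does not give their formulas. As a minimal sketch, assuming the commonly cited definitions (LLE = pIC50 - logP, BEI = pIC50 divided by molecular weight in kDa, SEI = pIC50 divided by polar surface area in units of 100 Å²), the four regression end points for one compound–protein pair could be derived as follows. The `Ligand` class, the function names, and the example values are hypothetical and are not taken from the dataset; the exact definitions and descriptor sources used in the study may differ.

```python
from dataclasses import dataclass


@dataclass
class Ligand:
    """Hypothetical record for one compound-protein activity measurement."""
    pic50: float       # -log10(IC50 in mol/L)
    logp: float        # octanol-water partition coefficient (log scale)
    mol_weight: float  # molecular weight in Da
    psa: float         # polar surface area in square angstroms


def lle(lig: Ligand) -> float:
    """Ligand lipophilicity efficiency: potency corrected for lipophilicity."""
    return lig.pic50 - lig.logp


def bei(lig: Ligand) -> float:
    """Binding efficiency index: potency per kDa of molecular weight."""
    return lig.pic50 / (lig.mol_weight / 1000.0)


def sei(lig: Ligand) -> float:
    """Surface efficiency index: potency per 100 square angstroms of PSA."""
    return lig.pic50 / (lig.psa / 100.0)


def regression_targets(lig: Ligand) -> dict:
    """Return the four end points that define the four types of training data."""
    return {
        "pIC50": lig.pic50,
        "LLE": lle(lig),
        "BEI": bei(lig),
        "SEI": sei(lig),
    }


# Made-up example values, for illustration only.
example = Ligand(pic50=8.0, logp=3.0, mol_weight=400.0, psa=80.0)
targets = regression_targets(example)
for name, value in targets.items():
    print(f"{name}: {value:.2f}")
```

Because each index divides or offsets pIC50 by a compound-specific property, two ligands with the same pIC50 can receive different LE values. This is one way the LE-based training data can end up with a wider spread of target values than the pIC50-based data.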