10.1021/jm051245v.s001
Joseph R. Votano
Marc Parham
L. Mark Hall
Lowell H. Hall
Lemont B. Kier
Scott Oloff
Alexander Tropsha
QSAR Modeling of Human Serum Protein Binding with Several Modeling Techniques Utilizing
Structure−Information Representation
American Chemical Society
2006
protein binding values
SVM
r²
compound
MAE
serum protein binding
correlation coefficients
structure descriptor trends
modeling
training
ANN
support vector machines
structure descriptor space
drug design process
QSAR
Human Serum Protein Binding
MLR
data
model
2006-11-30 00:00:00
Dataset
https://acs.figshare.com/articles/dataset/QSAR_Modeling_of_Human_Serum_Protein_Binding_with_Several_Modeling_Techniques_Utilizing_Structure_Information_Representation/3044176
Four modeling techniques, using topological descriptors to represent molecular structure, were employed to
produce models of human serum protein binding (% bound) on a data set of 1008 experimental values,
carefully screened from publicly available sources. To our knowledge, this is the largest data set on human
serum protein binding reported for QSAR modeling. The data were partitioned into a training set of 808
compounds and an external validation test set of 200 compounds. Partitioning was accomplished by clustering
the compounds in a structure descriptor space so that random sampling of 20% of the whole data set produced
an external test set that is representative of the training set with respect to both structure and protein
binding values. The four modeling techniques were multiple linear regression (MLR), artificial neural
networks (ANN), k-nearest neighbors (kNN), and support vector machines (SVM). The ANN, kNN, and SVM
QSARs were ensemble models; the MLR model was not. Training set correlation
coefficients and mean absolute errors ranged from r² = 0.90 and MAE = 7.6 for ANN to r² = 0.61 and
MAE = 16.2 for MLR. Prediction results from the validation set yielded correlation coefficients and mean
absolute errors which ranged from r² = 0.70 and MAE = 14.1 for ANN to a low of r² = 0.59 and MAE
= 18.3 for the SVM model. Structure descriptors that contribute significantly to the models are discussed
and compared with those found in other published models. For the ANN model, structure descriptor trends
with respect to their effects on predicted protein binding can assist the chemist in structure modification
during the drug design process.
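
The partitioning and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure. It assumes each compound has already been assigned a cluster label by some clustering of the structure descriptor space (the abstract does not name the algorithm), draws roughly 20% of each cluster at random for the external test set, and computes MAE and r². The function names, the per-cluster sampling rule, and the reading of r² as the squared Pearson correlation between observed and predicted values are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def cluster_stratified_split(cluster_labels, test_fraction=0.2, seed=0):
    """Split compound indices into (train, test), sampling within each cluster.

    cluster_labels[i] is the (hypothetical, precomputed) cluster of compound i.
    Test set size is approximate because each cluster is rounded separately.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    test_idx = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        n_test = round(len(members) * test_fraction)
        test_idx.extend(rng.sample(members, n_test))

    test_set = set(test_idx)
    train_idx = [i for i in range(len(cluster_labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted|, in the units of the data (% bound)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of r² here)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


if __name__ == "__main__":
    # Toy labels only: 100 compounds spread evenly over 4 clusters.
    labels = [i % 4 for i in range(100)]
    train, test = cluster_stratified_split(labels)
    print(len(train), len(test))
```

Sampling within clusters, rather than from the pooled data, is one way to keep the test set spread across the same regions of descriptor space as the training set, which is the property the abstract emphasizes.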