Journal of Biomolecular Structure and Dynamics


To overcome the aforementioned barriers, it would be very useful to develop computational methods aimed at identifying the interactions of drug compounds with various protein targets, such as GPCRs (G-protein-coupled receptors), protein channels, enzymes, and NRs (nuclear receptors), in cellular networking based on the sequence information of the latter. The results thus obtained can be used to exclude in advance the compounds identified as not interacting with the protein targets, so as to avoid wasting time and money on those unpromising compounds (Sirois, Hatzakis, & Wei, 2005).
Actually, considerable efforts have been made in this regard. For instance, He, Zhang, Shi, Hu, and Kong (2010) developed a powerful computational method for predicting drug-target interaction networks based on functional groups and biological features. However, no Web server was provided for their method, and hence its practical application value is quite limited for most drug-development scientists. To make up for this shortcoming, four Web server predictors, called "iGPCR-Drug" (Xiao, Min, & Wang, 2013b), "iCDI-PseFpt" (Xiao, Min, & Wang, 2013a), "iEzy-Drug" (Min, Xiao, & Chou, 2013), and "iNR-Drug" (Fan, Xiao, & Min, 2014), were recently developed for identifying the interactions of drug compounds with GPCRs, ion channels, enzymes, and NRs in cellular networking, respectively (Figure 1). Their Web site addresses are listed in Table 1. Although each of the four Web server predictors could yield a higher success rate than the original prediction method (He et al., 2010) for the same purpose, the benchmark data-set used to train and test each of the four predictors was taken from He et al. (2010), and hence has the following problem. For the benchmark data-set in He et al. (2010), the number of the non-interactive pair samples is much larger than that of the interactive pair samples. Although this might reflect the real world, in which the interactive pairs are always the minority compared with the non-interactive ones, using this kind of highly unbalanced benchmark data-set to train a predictor would lead to the outcome that many interactive drug-target pairs might be mispredicted as non-interactive ones (Sun, Wong, & Kamel, 2009). Since the minority interactive drug-target pairs are our focus in drug development, we should take some action to optimize the benchmark data-set so as to minimize this kind of misprediction while also avoiding overprediction. This study was initiated in an attempt to address this kind of problem.
As demonstrated in a series of recent publications (see, e.g., Chen, Feng, Deng, & Lin, 2014b; Guo, Deng, Xu, Ding, & Lin, 2014; Lin, Deng, Ding, & Chen, 2014; Liu, Xu, Lan, Xu, & Zhou, 2014a; Liu, Xiao, & Qiu, 2015; Liu et al., 2014b; Qiu, Xiao, & Chou, 2014a; Qiu, Xiao, & Lin, 2014b, 2014c; Xu, Wen, Wen, & Wu, 2014c; Xu, Zhou, Liu, He, & Zou, 2014a) in response to the suggestion from Chou (2011), to establish a really useful predictor for a biological system, we need to consider the following steps: (i) select or construct a valid benchmark data-set to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly Web server for the predictor that is accessible to the public. Below, let us elaborate how to deal with these steps one by one.
The original data used in He et al. (2010) and in Fan et al. (2014), Min et al. (2013), and Xiao et al. (2013a, 2013b) were collected from KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kotera, Hirakawa, Tokimatsu, Goto, & Kanehisa, 2012) at http://www.kegg.jp/kegg/.
The original benchmark datasets used for the Web server predictors iGPCR-Drug (Xiao et al., 2013b), iCDI-PseFpt (Xiao et al., 2013a), iEzy-Drug (Min et al., 2013), and iNR-Drug (Fan et al., 2014), as listed in Table 1, can be summarized as follows:

$$
\begin{cases}
S_{\text{GPCR-Drug}}(1860) = S^{+}_{\text{GPCR-Drug}}(620) \cup S^{-}_{\text{GPCR-Drug}}(1240) \\
S_{\text{Chl-Drug}}(4116) = S^{+}_{\text{Chl-Drug}}(1372) \cup S^{-}_{\text{Chl-Drug}}(2744) \\
S_{\text{Ezy-Drug}}(8157) = S^{+}_{\text{Ezy-Drug}}(2719) \cup S^{-}_{\text{Ezy-Drug}}(5438) \\
S_{\text{NR-Drug}}(258) = S^{+}_{\text{NR-Drug}}(86) \cup S^{-}_{\text{NR-Drug}}(172)
\end{cases} \tag{1}
$$

where $S_{\text{GPCR-Drug}}(1860)$ is the benchmark data-set for the iGPCR-Drug predictor (Xiao et al., 2013b), and it contains 1860 GPCR-drug pairs, of which 620 are interactive pairs belonging to the positive subset $S^{+}_{\text{GPCR-Drug}}(620)$ while 1240 are non-interactive pairs belonging to the negative subset $S^{-}_{\text{GPCR-Drug}}(1240)$, and $\cup$ represents the union in set theory; $S_{\text{Chl-Drug}}(4116)$ is the benchmark data-set for the iCDI-PseFpt predictor (Xiao et al., 2013a), and it contains 4116 channel-drug pairs, of which 1372 are interactive pairs belonging to the positive subset $S^{+}_{\text{Chl-Drug}}(1372)$ while 2744 are non-interactive pairs belonging to the negative subset $S^{-}_{\text{Chl-Drug}}(2744)$; $S_{\text{Ezy-Drug}}(8157)$ is the benchmark data-set for the iEzy-Drug predictor (Min et al., 2013), and it contains 8157 enzyme-drug pairs, of which 2719 are interactive pairs belonging to the positive subset $S^{+}_{\text{Ezy-Drug}}(2719)$ while 5438 are non-interactive pairs belonging to the negative subset $S^{-}_{\text{Ezy-Drug}}(5438)$; and $S_{\text{NR-Drug}}(258)$ is the benchmark data-set for the iNR-Drug predictor (Fan et al., 2014), and it contains 258 NR-drug pairs, of which 86 are interactive pairs belonging to the positive subset $S^{+}_{\text{NR-Drug}}(86)$ while 172 are non-interactive pairs belonging to the negative subset $S^{-}_{\text{NR-Drug}}(172)$. Here, an "interactive" pair means a pair whose two counterparts interact with each other in the drug-target networks as defined in the KEGG database (Kotera et al., 2012), while a "non-interactive" pair means that its two counterparts do not interact with each other in the drug-target networks.
All the detailed data for the four benchmark datasets can be found in the Supplementary Materials of Fan et al. (2014), Min et al. (2013), and Xiao et al. (2013a, 2013b), or can be directly downloaded from their corresponding Web server predictors, whose Web site addresses are explicitly given in Table 1.
As we can see from Equation (1), for the benchmark data-set used to train and test each of the aforementioned four predictors, the size of the negative subset is two times the size of the positive subset. Although this might reflect the real world, in which the non-interactive pairs are always the majority compared with the interactive ones, a predictor trained with such a skewed benchmark data-set would have the consequence that many interactive drug-target pairs might be mispredicted as non-interactive ones (Sun et al., 2009). Actually, the information of most interest to drug-development scientists is that about the interactive pairs. Therefore, it is worthwhile to find an effective approach to optimize the unbalanced benchmark data-set and minimize the consequence of this kind of misprediction.
In this study, we use the NCR (neighborhood cleaning rule) (Laurikkala, 2001) and the SMOTE (synthetic minority over-sampling technique) (Chawla, Bowyer, Hall, & Kegelmeyer, 2011) treatments to optimize the aforementioned skewed benchmark datasets. The former is to remove some redundant negative samples from the negative subset so as to reduce its statistical noise, which can be likened to the sample-screening procedure in computational proteomics (see, e.g., Chou & Shen, 2006). The latter is to add some hypothetical positive samples into the positive subset so as to enhance the ability to identify the interactive pairs, which can be likened to the seed-propagation approach in Zhang and Chou (1995) and the Monte Carlo sampling approach in Chou (1993) and Zhang and Chou (1992) for expanding the positive subsets.
In this study, we applied the NCR treatment (Laurikkala, 2001) according to the following criteria: (i) for each of the samples in the benchmark data-set, find its three nearest neighbors; (ii) if the sample concerned belongs to a negative subset and at least two of its three nearest neighbors belong to the positive subset, remove the sample from the benchmark data-set; (iii) if, however, it belongs to a positive subset, then remove those of its nearest neighbors from the benchmark data-set that belong to the negative subset; (iv) if the number of samples in a negative subset is less than 200, as in the case of the NR-drug system, no action will be taken to remove the negative samples.
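As a minimal sketch of criteria (i)-(iv), and not the implementation used in this study, the NCR treatment could be written in Python as follows. The Euclidean distance, the function names, the (vector, label) data layout, and the choice to collect all removals in one pass over the original data-set before deleting anything are assumptions made for illustration.

```python
import math


def nearest_neighbors(index, samples, k=3):
    """Indices of the k samples closest (Euclidean distance) to samples[index]."""
    target = samples[index][0]
    others = [j for j in range(len(samples)) if j != index]
    others.sort(key=lambda j: math.dist(target, samples[j][0]))
    return others[:k]


def ncr_clean(samples, k=3, min_negatives=200):
    """Apply criteria (i)-(iv) to a list of (feature_vector, label) pairs.

    label is +1 for an interactive pair and -1 for a non-interactive pair.
    Returns a new list with the flagged negative samples removed.
    """
    n_negative = sum(1 for _, label in samples if label == -1)
    if n_negative < min_negatives:
        # Criterion (iv): small negative subsets are left untouched.
        return list(samples)

    to_remove = set()
    for i, (_, label) in enumerate(samples):
        neighbors = nearest_neighbors(i, samples, k)        # criterion (i)
        if label == -1:
            n_positive = sum(1 for j in neighbors if samples[j][1] == +1)
            if n_positive >= 2:                             # criterion (ii)
                to_remove.add(i)
        else:
            # Criterion (iii): drop negative neighbors of a positive sample.
            to_remove.update(j for j in neighbors if samples[j][1] == -1)
    return [s for i, s in enumerate(samples) if i not in to_remove]
```

With real data, each feature vector would be the 736-component vector of a drug-target pair; the default threshold of 200 mirrors criterion (iv).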
After the aforementioned NCR treatment, the number of samples in the negative subsets was reduced (except for the NR-drug system, to which criterion (iv) applies), and hence, Equation (1) would become

(2)

See Supporting Information S1 for the detailed data obtained by the NCR treatment described above. Subsequently, to further optimize the benchmark datasets of Equation (2), the SMOTE approach (Chawla et al., 2011) was adopted to create some hypothetical samples for the positive subsets by the linear interpolation scheme. Finally, the benchmark datasets thus obtained can be formulated as

(3)

As we can see from Equation (3), the four benchmark datasets optimized via the NCR (Laurikkala, 2001) and SMOTE (Chawla et al., 2011) treatments are well balanced, each having positive and negative subsets of equal size.
Note that the hypothetical samples generated via the linear interpolation scheme in SMOTE can only be expressed by their feature vectors as defined in the next section, but not by real sample codes as given in the Online Supporting Information S1. Nevertheless, it is perfectly reasonable to do so, since the data directly used to train a predictor were actually the samples' feature vectors, not their codes. This is the key to optimizing an imbalanced benchmark data-set in the current study, and the rationale of such an approach will be further elucidated in Section 3.2 later.
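The linear interpolation scheme can be illustrated by the short Python sketch below. It is a simplified rendering of the SMOTE idea rather than the code used in this study: the neighbor count k = 5, the random seed, and the function name are assumptions. Each hypothetical sample is a point on the line segment between an existing positive feature vector and one of its nearest positive neighbors, which is why it exists only as a feature vector and has no sample code.

```python
import math
import random


def smote(positives, n_new, k=5, rng=None):
    """Create n_new hypothetical positive feature vectors by linear interpolation.

    positives: list of feature vectors (lists of floats) of real positive samples.
    k:         number of nearest positive neighbors to choose from (assumed value).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        x = positives[i]
        # Rank the other positive samples by Euclidean distance to x.
        others = sorted(
            (p for j, p in enumerate(positives) if j != i),
            key=lambda p: math.dist(x, p),
        )
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

For the GPCR-drug system, for example, 188 such hypothetical vectors added to the 620 experimental positive samples would give the 808 positive samples needed to match the negative subset.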
To provide an intuitive picture, a flowchart is given in Figure 2 to illustrate the process of optimizing an imbalanced benchmark data-set.

Sample representation
Each of the samples in the current network system contains a drug compound and a target protein. The latter can be a GPCR, ion channel, enzyme, or NR.
Note that all the components in Equation (6) are subjected to a standard conversion as described by the following equation:

$$
\phi_k \Leftarrow \frac{\phi_k - \langle \phi \rangle}{\mathrm{SD}(\phi)} \qquad (k = 1, 2, \ldots, 736) \tag{7}
$$

where $\langle \phi \rangle$ means the average of the 736 components in Equation (6), and SD means the corresponding standard deviation. The converted values obtained by Equation (7) will have a zero mean value, and will remain unchanged if they go through the same conversion procedure again (Chou & Shen, 2007).
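The two properties stated above (zero mean, and invariance under repeated conversion) can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which definition of SD was used.

```python
import statistics


def standard_conversion(components):
    """Subtract the mean and divide by the standard deviation of the components."""
    mean = statistics.fmean(components)
    sd = statistics.pstdev(components)  # population SD (assumed definition)
    if sd == 0:
        raise ValueError("all components are identical; SD is zero")
    return [(c - mean) / sd for c in components]


vector = [1.0, 2.0, 4.0, 7.0]          # toy stand-in for the 736 components
once = standard_conversion(vector)
twice = standard_conversion(once)       # a second pass changes nothing
```

After the first pass the mean is 0 and the SD is 1, so the second pass subtracts 0 and divides by 1, which is why the values remain unchanged.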
As an illustration, the 736 standard converted components for each of the 172 positive samples (including 86 hypothetical samples created by SMOTE) or each of the 172 negative samples in the NR-drug system (cf. Equation (3)) are given in the Supporting Information S2.

Operation engine or algorithm
In this study, the operation engine was SVM (support vector machine), which is based on the structural risk minimization principle from statistical learning theory. SVM has been widely used in the realm of bioinformatics (see, e.g., Chen et al., 2014b; Chen, Feng, & Lin, 2013, 2014a; Ding, Deng, Yuan, & Liu, 2014; Feng, Chen, & Lin, 2013; Guo et al., 2014; Liu et al., 2014a; Liu et al., 2014b; Liu, Wang, & Chou, 2005; Qiu et al., 2014a; Xu, Wen, Shao, & Deng, 2014b). The basic idea of SVM is to construct a separating hyperplane so as to maximize the margin between the positive data-set and the negative data-set. The training points nearest to the hyperplane are called support vectors. SVM maps an input vector from the input space into a vector in a higher-dimensional Hilbert space, where the mapping is determined by a kernel function, and the hyperplane is constructed in that space based on the training data-set. A trained SVM can output a class label (in our case, interactive pair or non-interactive pair) based on the mapped vector of the input vector. For a brief formulation of SVM and how it works, see Cai, Zhou, and Chou (2003) and Chou and Cai (2002); for more details about SVM, see the monograph by Cristianini and Shawe-Taylor (2000). In this study, the LIBSVM package (Chang & Lin, 2005), which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used as an implementation of SVM, and the popular radial basis function (RBF) was taken as the kernel function. For the current SVM classifier, there were two uncertain parameters: the penalty parameter C and the kernel parameter γ. Their values will be given later.
A package of prediction methods thus obtained is called iDrug-Target, which consists of four predictors; that is,

$$
\text{iDrug-Target} =
\begin{cases}
\text{iDrug-GPCR}, & \text{for drug-GPCR interaction} \\
\text{iDrug-Chl}, & \text{for drug-channel interaction} \\
\text{iDrug-Ezy}, & \text{for drug-enzyme interaction} \\
\text{iDrug-NR}, & \text{for drug-NR interaction}
\end{cases} \tag{8}
$$

where the two parameters for the SVM operation engine are given by

(9)

which were determined by optimizing the 5-fold cross-validation success rate for each of the four predictors on its corresponding benchmark data-set (cf. Equation (3)) through a two-dimensional grid search as illustrated in Figure 3.
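A two-dimensional grid search of this kind can be sketched in Python as below. The sketch is deliberately generic: `cv_score` is a hypothetical callable that would train the RBF-kernel SVM with the given C and γ and return its 5-fold cross-validation success rate, and the exponent ranges are assumptions (they follow the ranges commonly suggested for LIBSVM), not the grid actually used for Figure 3.

```python
def grid_search(cv_score,
                log2_c_range=range(-5, 16, 2),
                log2_gamma_range=range(-15, 4, 2)):
    """Return (C, gamma, score) with the highest cross-validation score.

    cv_score(C, gamma) is supplied by the caller; here it stands for the
    5-fold cross-validation success rate of an RBF-kernel SVM.
    """
    best = (None, None, float("-inf"))
    for log2_c in log2_c_range:
        for log2_gamma in log2_gamma_range:
            c, gamma = 2.0 ** log2_c, 2.0 ** log2_gamma
            score = cv_score(c, gamma)
            if score > best[2]:
                best = (c, gamma, score)
    return best
```

Searching on a logarithmic grid keeps the number of SVM trainings manageable while still covering several orders of magnitude for both parameters; a finer grid around the best point can follow if needed.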

Results and discussion
As mentioned in the beginning of this study, one of the important procedures in developing a new predictor is how to properly and objectively evaluate its quality (Chou, 2011), which actually comprises two aspects.
One is what metrics should be taken to quantitatively measure the prediction accuracy, and the other is what test method should be used to perform the test.Below, let us address these problems.

A set of four metrics for performance measurement
In order to provide an intuitive and easier-to-understand quantitative scale, here, let us adopt the criteria proposed in Chou (2001a). According to those criteria, the rates of correct predictions for the interactive drug-target pairs in the positive subset and the non-interactive pairs in the negative subset are, respectively, defined by

$$
\begin{cases}
\Lambda^{+} = \dfrac{N^{+} - N^{+}_{-}}{N^{+}}, & \text{for the interactive pairs} \\[2ex]
\Lambda^{-} = \dfrac{N^{-} - N^{-}_{+}}{N^{-}}, & \text{for the non-interactive pairs}
\end{cases} \tag{10}
$$

where $N^{+}$ is the total number of the interactive drug-target (e.g. drug-GPCR) pairs investigated, while $N^{+}_{-}$ is the number of the interactive drug-target pairs incorrectly predicted as non-interactive drug-target pairs; $N^{-}$ is the total number of the non-interactive drug-target pairs investigated, while $N^{-}_{+}$ is the number of the non-interactive drug-target pairs incorrectly predicted as interactive drug-target pairs. Thus, the overall success prediction rate is given by (Chou, 2001b)

$$
\Lambda = \frac{\Lambda^{+} N^{+} + \Lambda^{-} N^{-}}{N^{+} + N^{-}} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \tag{11}
$$

It is obvious from Equations (10) and (11) that, if and only if none of the interactive drug-target pairs and the non-interactive drug-target pairs are mispredicted, that is, $N^{+}_{-} = N^{-}_{+} = 0$ and $\Lambda^{+} = \Lambda^{-} = 1$, we have the overall success rate $\Lambda = 1$. Otherwise, the overall success rate would be smaller than 1.
It is instructive, however, to point out that the following equation is often used in the literature for examining the performance quality of a predictor (see, e.g., Chen, Liu, Yang, & Chou, 2007)

$$
\begin{cases}
\mathrm{Sn} = \dfrac{TP}{TP + FN} \\[2ex]
\mathrm{Sp} = \dfrac{TN}{TN + FP} \\[2ex]
\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN} \\[2ex]
\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{cases} \tag{12}
$$

where TP represents the true positive; TN, the true negative; FP, the false positive; FN, the false negative; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient.
The relations between the symbols in Equation (11) and those in Equation (12) are given by

$$
\begin{cases}
TP = N^{+} - N^{+}_{-} \\
TN = N^{-} - N^{-}_{+} \\
FP = N^{-}_{+} \\
FN = N^{+}_{-}
\end{cases} \tag{13}
$$

Substituting Equation (13) into Equation (12) and also considering Equation (11), we obtain

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
\mathrm{Acc} = \Lambda = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}
\end{cases} \tag{14}
$$

From the above equation, we can see the following. When $N^{+}_{-} = 0$, meaning that none of the interactive drug-target pairs was mispredicted to be a non-interactive drug-target pair, we have the sensitivity Sn = 1; while $N^{+}_{-} = N^{+}$, meaning that all the interactive drug-target pairs were mispredicted to be non-interactive drug-target pairs, we have the sensitivity Sn = 0. Likewise, when $N^{-}_{+} = 0$, meaning that none of the non-interactive drug-target pairs was mispredicted, we have the specificity Sp = 1; while $N^{-}_{+} = N^{-}$, meaning that all the non-interactive drug-target pairs were incorrectly predicted as interactive drug-target pairs, we have the specificity Sp = 0. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the interactive drug-target pairs in the positive subset and none of the non-interactive drug-target pairs in the negative subset was incorrectly predicted, we have the overall accuracy Acc = $\Lambda$ = 1; while $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the interactive drug-target pairs in the positive subset and all the non-interactive drug-target pairs in the negative subset were mispredicted, we have the overall accuracy Acc = $\Lambda$ = 0. The MCC is usually used for measuring the quality of binary (two-class) classifications. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the interactive drug-target pairs in the positive subset and none of the non-interactive drug-target pairs in the negative subset was mispredicted, we have MCC = 1; when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have MCC = 0, meaning no better than random prediction; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, we have MCC = −1, meaning total disagreement between prediction and observation. As we can see from the above discussion, it is much more intuitive and easier to understand when using Equation (14) to examine a predictor for its sensitivity, specificity, overall accuracy, and Matthews correlation coefficient, particularly the last.
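To make the four metrics concrete, the following Python sketch computes Sn, Sp, Acc, and MCC directly from the four counts: the total numbers of interactive and non-interactive pairs and the numbers of each that were mispredicted. The function and argument names are hypothetical, the formulas are the form commonly quoted for this four-count notation, and returning NaN when the MCC denominator vanishes is a choice made here, not something specified in the text.

```python
import math


def four_metrics(n_pos, n_neg, n_pos_missed, n_neg_missed):
    """Return (Sn, Sp, Acc, MCC) from the four counts.

    n_pos:        total number of interactive pairs investigated
    n_neg:        total number of non-interactive pairs investigated
    n_pos_missed: interactive pairs mispredicted as non-interactive
    n_neg_missed: non-interactive pairs mispredicted as interactive
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    # MCC is undefined when every sample is assigned to the same class.
    mcc = numerator / denominator if denominator > 0 else float("nan")
    return sn, sp, acc, mcc
```

The limiting cases discussed in the text can be reproduced by calling the function with no mispredictions, with half of each subset mispredicted, and with everything mispredicted.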
It should be pointed out, however, that the set of metrics as defined in Equation (14) or Equation (12) is valid only for single-label systems. For the multi-label systems, whose emergence has become more frequent in systems biology (Chou, Wu, & Xiao, 2011, 2012; Lin, Fang, & Xiao, 2013) and systems medicine (Chen, Zeng, Cai, & Feng, 2012; Xiao, Wang, Lin, & Jia, 2013c), a completely different set of metrics as defined in Chou (2013) is needed.

Jackknife and target-jackknife cross-validation
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent data-set test, subsampling (or K-fold cross-validation) test, and jackknife test (Chou & Zhang, 1995). However, of the three test methods, the jackknife test is deemed the least arbitrary because it can always yield a unique result for a given benchmark data-set, as elaborated in Chou and Shen (2010) and demonstrated by Equations 28-30 in Chou (2011). Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., Du, Jiang, & He, 2006; Hajisharifi, Piryaiee, Mohammad Beigi, Behbahani, & Mohabatkar, 2014; Mohabatkar, Beigi, Abdolahi, & Mohsenzadeh, 2013; Mondal & Pai, 2014; Nanni, Brahnam, & Lumini, 2014; Shen & Chou, 2010; Shen, Yang, & Chou, 2007; Xiao, Wu, & Chou, 2011; Xu, Shao, & Wu, 2013). During the process of jackknife test, all the samples in the benchmark data-set will be singled out one by one and tested by the predictor trained by the remaining samples.
Notes to Table 2: ^a Trained by the optimized benchmark datasets as defined in Eq. 3, the iDrug-Target package contains four predictors for identifying the networking interactions of drugs with GPCRs, channels, enzymes, and NRs, respectively (cf. Eq. 4). The rates reported in this table were derived by the target-jackknife cross-validations on the original experimental benchmark datasets used by FunD (He et al., 2010) and iGPCR-Drug (Xiao et al., 2013b), iCDI-PseFpt (Xiao et al., 2013a), iEzy-Drug (Min et al., 2013), and iNR-Drug (Fan et al., 2014), respectively. See Section 2.5 for further explanation. ^b See He et al. (2010) for the FunD prediction method and its reported success rates. ^c See Xiao et al. (2013b) for the iGPCR-Drug predictor and its reported success rates. ^d See Xiao et al. (2013a) for the iCDI-PseFpt predictor and its reported success rates. ^e See Min et al. (2013) for the iEzy-Drug predictor and its reported success rates. ^f See Fan et al. (2014) for the iNR-Drug predictor and its reported success rates.
When conducting the jackknife test on the optimized benchmark data-set of Equation (3), however, some special consideration is needed. Take the optimized benchmark data-set for the GPCR-drug system as an example: both its positive subset and negative subset contain 808 samples. But, of the 808 positive samples, only (808 − 188) = 620 are from experimental observations (cf. Equations (2) and (3)) and the remaining 188 are from the SMOTE treatment (Chawla et al., 2011). Also, in the negative subset, (1240 − 808) = 432 experimental samples have been removed by NCR (Laurikkala, 2001) (cf. Equations (1) and (2)). Since the validation should be carried out strictly based on the experimental data only, a special jackknife test, the so-called "target-jackknife test", was introduced. During the process of the target-jackknife test, only the experiment-confirmed samples are in turn singled out as a target (or test sample) for cross-validation. Accordingly, although the predictor is trained by the optimized benchmark data-set that includes both experimental and hypothetical samples, all of the experiment-confirmed samples, and only those, are the targets used to count its success rates, regardless of whether they are part of a subset or were removed from the benchmark data-set during the optimization process. For instance, for the aforementioned GPCR-drug system (cf. Equation (3)), only the 620 experimental positive samples need to be singled out for cross-validation; however, the 432 experimental negative samples need to be validated as well, even though they have been removed from the negative subset during the optimization process.
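The bookkeeping of the target-jackknife test can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: a one-nearest-neighbor rule is used as a hypothetical stand-in for the SVM engine, the function names and data layout are assumptions, and hypothetical samples that SMOTE may have derived from the held-out target are left in the training set because the text does not say how they were treated.

```python
import math


def one_nn_predict(train, vector):
    """Stand-in classifier (assumption): label of the nearest training vector."""
    return min(train, key=lambda s: math.dist(s[0], vector))[1]


def target_jackknife(optimized, experimental, predict=one_nn_predict):
    """Count mispredictions over experiment-confirmed samples only.

    optimized:    (vector, label) pairs used for training, i.e. experimental
                  samples kept after NCR plus hypothetical SMOTE samples.
    experimental: every experiment-confirmed (vector, label) pair, including
                  negatives that NCR removed from the training set.
    Returns (n_pos, n_neg, n_pos_missed, n_neg_missed).
    """
    n_pos = n_neg = n_pos_missed = n_neg_missed = 0
    for target in experimental:
        # Hold the target out of the training set if it is part of it.
        train = [s for s in optimized if s != target]
        predicted = predict(train, target[0])
        if target[1] == +1:
            n_pos += 1
            n_pos_missed += predicted != +1
        else:
            n_neg += 1
            n_neg_missed += predicted != -1
    return n_pos, n_neg, n_pos_missed, n_neg_missed
```

The four counts returned are exactly the quantities needed to evaluate the sensitivity, specificity, accuracy, and MCC on the experimental data.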

Comparison with the existing predictors
The scores for the four metrics as defined in Equation (12) or (14) achieved by the current iDrug-Target predictors via the target-jackknife tests are given in Table 2, where, to facilitate comparison, the corresponding scores by the existing predictors are also listed. From the table, we can see the following. (i) The scores of the overall accuracy (Acc) achieved by the four predictors in the iDrug-Target package are remarkably higher than those of the existing predictors for the same purposes. (ii) The scores of the Matthews correlation coefficient (MCC) by the iDrug-Target package are also remarkably higher than those of the existing predictors. These facts indicate that the current predictors in the iDrug-Target package not only can yield higher prediction accuracy but also are more stable and consistent.
Shown in Figure 4 is a graphic comparison of the four predictors in the iDrug-Target with their counterparts via the ROC (receiver operating characteristic) curves and PR (precision-recall) curves.As we can see from the figure, the areas under both the ROC and PR curves for the four predictors in the iDrug-Target package are obviously larger than those of their counterparts, indicating a clear improvement of the new predictors in comparison with the old ones.
It is instructive to point out that, although the four predictors in the iDrug-Target package were trained by the four optimized benchmark datasets, in which some experimental negative samples were removed from the original benchmark datasets to balance out the sizes of the subsets, the removed negative samples were still counted in the target-jackknife cross-validation. On the other hand, although some hypothetical positive samples were added to form the optimized benchmark datasets, only the experimental samples were counted in calculating the metrics scores.
In other words, the objects counted during the cross-validation, regardless of whether they are positive or negative samples, are exactly the same as those counted by the other methods in Table 2.

Web server and user guide
For those who are interested in using the iDrug-Target package, but not its mathematical details, a Web server was established.Below, let us give a step-by-step guide on how to use the Web server to get the desired results.
Step 1. Open the Web server at http://www.jci-bioinfo.cn/iDrug-Target/, and you will see the top page of iDrug-Target on your computer screen, as shown in Figure 5. Click on the Read Me button to see a brief introduction to the iDrug-Target package and the caveats when using it.
Step 2. Click one of the four predictors according to your need. For instance, if you wish to predict the drug-GPCR interaction, click the button iDrug-GPCR, and follow the instructions on the screen to get your desired results.
Step 3. If you wish to predict the interaction of drugs with other targets, click the Close button to bring you back to the top page.Then, repeat Step 2 but click a different predictor such as iDrug-Chl, iDrug-Ezy, or iDrug-NR as you desire.

Conclusion
The strategy of optimizing the training data-set via the NCR (Laurikkala, 2001) and SMOTE (Chawla et al., 2011) approaches can remarkably improve the prediction quality of a predictor, as indicated by the rigorous target-jackknife tests in which only the experiment-confirmed data were examined. This is particularly true for the case when the predictor was originally trained by a highly unbalanced or skewed benchmark data-set in which the negative subset is overwhelmingly larger than the positive one.
It is anticipated that the new package called iDrug-Target developed in this paper with the optimized training datasets will become a very useful high-throughput tool for both basic research and drug development.
It is anticipated that the current strategy and novel technique can also be used to improve all those existing statistical predictors that were trained by highly unbalanced training datasets.

Figure 1. Graphical representation to show the drug-target interactions in cellular networking. Note: Panel (a) is for drug-GPCR interaction; (b) for drug-channel; (c) for drug-enzyme; and (d) for drug-NR, where the drug is represented by a green square, the target protein by a magenta circle, and the interaction between the two by a gray edge. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this paper.)

Figure 2. A flowchart to show the process of converting an imbalanced benchmark data-set to a balanced one by NCR (neighborhood cleaning rule) and SMOTE. Note: In the figure, $N_1$ and $N_2$ represent the numbers of samples in the original positive and negative subsets, respectively; $n_2$, the number of the negative samples removed by the NCR treatment; and $n_1$, the number of the positive hypothetical samples created by SMOTE and added to the final balanced benchmark data-set. See the relevant text for further explanation.

Figure 3. Three-dimensional plot to show how to find the optimal values of C and γ via a two-dimensional grid search. Panel (a) for iDrug-GPCR predictor; (b) for iDrug-Chl; (c) for iDrug-Ezy; and (d) for iDrug-NR. See Section 2.3 as well as Equations (8) and (9) for further explanation.

Figure 4. The ROC and PR curves to show the predictor's quality. Note: The green line is for the existing predictors (a) iGPCR-Drug (Xiao et al., 2013b), (b) iCDI-PseFpt (Xiao et al., 2013a), (c) iEzy-Drug (Min et al., 2013), and (d) iNR-Drug (Fan et al., 2014), while the red line is for the corresponding predictors in the iDrug-Target package. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this paper.)

Table 1. Web server predictors for identifying drug-target interaction networks based on the same benchmark data-set as in He et al. (2010).

Table 2. A comparison of iDrug-Target^a with the existing predictors on the same experiment-confirmed data.