Retrieving novel C5aR antagonists using a hybrid ligand-based virtual screening protocol based on SVM classification and pharmacophore models

C5aR antagonists have been thought as potential immune mediators in various inflammatory and autoimmune diseases, and discovery of C5aR antagonists has attracted much attention in recent years. The discovery of C5aR antagonists was usually achieved through high-throughput screening, which usually suffered a high cost and a low success rate. Currently, the fast developing computer-aided virtual screening (VS) methods provide economic and rapid approaches to the lead discovery. In this account, we proposed a hybrid ligand-based VS protocol that is based on support vector machine (SVM) classification and pharmacophore models for retrieving novel C5aR antagonists. Performance evaluation of this hybrid VS protocol in virtual screening against a large independent test set, T-CHEM, showed that the hybrid VS approach significantly increased the hit rate and enrichment factor compared with the individual SVM classification model-based VS and pharmacophore model-based VS, as well as molecular docking-based VS in that the receptor structure was created by homology modeling. The hybrid VS approach was then used to screen several large chemical libraries including PubChem, Specs, and Enamine. Finally, a total of 20 compounds were selected from the top ranking hits, and shifted to the subsequent in vitro and in vivo studies, which results will be reported in the near future.

Over the past decade, a number of C5aR antagonists have been discovered (Brodbeck et al., 2008;Fonseca et al., 2009;Hu, Xie, Chen, & Fang, 2011). Even so, there is only one small molecular C5aR antagonist, CCX-168, in clinical trials currently. Thus, discovering more potent C5aR antagonists is still needed, which could provide more lead candidates for drug development. Formerly, the discovery of C5aR antagonists was achieved through high-throughput screening, which usually suf-fered a high cost and a low success rate da Silva, Carvalho, & Taft, 2007;Yang, 2010). Currently, the fast developing computer-aided virtual screening (VS) methods provide economic and rapid approaches to the lead discovery (Atri et al., 2011;Chen et al., 2009Chen et al., , 2011Chen et al., , 2010Hage-Melim, da Silva, Semighini, Taft, & Sampaio, 2009;Huang et al., 2010;Schwardt, Kolb, & Ernst, 2003). Commonly used VS methods include structure-based methods, such as molecular docking and ligand-based approaches, for example pharmacophore-based VS (Andrianov, 2008;Andrianov & Anishchenko, 2009;Kahlon, Roy, & Sharma, 2010). Nevertheless, structure-based methods are not suitable for the case of C5aR since for which there is no crystal structure reported at present; although the structure of C5aR could be created by homology modeling (Higginbottom et al., 2005;Lee et al., 2008), the reliability of structure generated together with the existing deficiencies of molecular docking including receptor flexibility, scoring function, and solvent problems has significant impact on the performance of DB-VS (Higginbottom et al., 2005;Mohan, Gibbs, Cummings, Jaeger, & DesJarlais, 2005). Thus, the ligand-based methods are the primary option for the VS of C5aR antagonists. The most widely used ligand-based method is the pharmacophore-based VS (PB-VS). Although PB-VS is one of the best VS methods, it still bears some inherent limitations including low speed, high false-positive rate, and low enrichment factor, especially in screening large chemical databases (Kirchmair, Markt, Distinto, Wolber, & Langer, 2008;Yang, Chen, Shen, Kristal, & Hsu, 2005). Lately emerging support vector machine (SVM)-based VS (SB-VS) seems superior to the PB-VS in some aspects, in particular the screening speed (Han et al., 2008;Liew, Ma, Liu, & Yap, 2009;Liu et al., 2009). However, the SB-VS still suffers a low hit rate and a high false-positive rate, which might originate from the lack of consideration of information regarding the three dimensional (3D) pharmacophore features of small molecules (Mohan et al., 2005). A combination of these methods is expected to mutually compensate for these limitations and capitalize on their mutual strengths. The goal of this investigation is to establish a hybrid ligand-based VS approach that combines the SB-VS and PB-VS for discovering new potent antagonists of C5aR. The performance of this hybrid VS approach will be evaluated through comparing with that of individual SB-VS and PB-VS, as well as molecular docking-based VS (DB-VS) in that the structure of C5aR is created by homology modeling. Then, this hybrid VS approach will be applied to screen several large chemical libraries, including PubChem (18,831,686 compounds), Specs (202,408 compounds), and Enamine (980,000 compounds) for identifying novel C5aR antagonists. Finally, some of the retrieved hit compounds will be selected and shifted to the subsequent in vitro and in vivo studies.

Support vector machine
In this investigation, an optimal SVM modeling method, namely SVM method combined with genetic algorithm (GA) for feature selection and conjugate gradient method (CG) for parameter optimization (GA-CG-SVM), which was proposed by our group recently, was used . A brief description regarding SVM and GA-CG-SVM has been presented in Supplementary material.

Data-sets
We collected 456 compounds from different literature resources, which include 401 positives (C5aR antagonists, IC 50 < 50 μM) and 55 negatives (C5aR nonantagonists, IC 50 > 50 μM) Blagg et al., 2008aBlagg et al., , 2008bBrodbeck et al., 2008;Chen et al., 2010;Schnatbanm et al., 1999;Pellas & Wennogle, 1999;Fan et al., 2010;Qu et al., 2009;Sanganee et al., 2009;Sumichika et al., 2002;Waters et al., 2005;Zenon Konteatis & Springer, 1997). Using the protocol of "Generate Training and Test Data" of Discovery Studio 3.1(DS 3.1), we grouped the 401 positive compounds into a training set, a test set, and an independent validation set based on the fingerprints FCFP_6, ECFP_6, FPFP_6, and EPFP_6. The finally formed training set contains 282 positive compounds, which was used for SVM modeling. The test set for validation of the SVM model includes 60 positive compounds and 55 collected negative compounds (all the training set and test set compounds are given in Supplementary material). The remaining 59 positive compounds and 79,534 decoys from ChEMBL (GPCR SARfari) database comprise the independent test set, termed as T-CHEM, which was used for evaluating the performance of the proposed hybrid VS approach.
As we have known, in the SVM modeling, an adequate amount of negatives in the training set would be very important for the quality of SVM model (Mohan et al., 2005). However, the collection of negatives is difficult since very few of literature definitely indicates that a compound is inactive. Therefore, we adopted a method suggested by Chen et al. to generate putative negatives without the knowledge of actual negatives (Han et al., 2008;Liu et al., 2009). In the generation of the putative negatives, we clustered the 18.83 M compounds from the PubChem into 8913 compound families by the K-means clustering method based on their molecular descriptors calculated by the DS 3.1 software. Finally, we randomly selected 17,853 putative negatives from those families that do not contain any of the known C5aR antagonists. These compounds together with 282 known C5aR antagonists constitute the training set for the SVM modeling.

Details for the SVM modeling
All the SVM modeling calculations were carried out by using GA-CG-SVM program Zhang et al., 2009). In this investigation, the initial descriptors were calculated by the DS 3.1 program package. The termination criterion is that the generation number reaches 200 or that the fitness value does not improve during the last 10 generations. The crossover rate was set to 0.5. The mutation rate was 0.1. The starting values of the parameters (C, γ) were set to (8192, 0.38) in the program running.

Pharmacophore modeling
The HipHop algorithm implemented in DS 3.1 program package was employed for the pharmacophore modeling. Five C5aR antagonists (compounds 1-5, see Figure 1) were used as the training set compounds; these compounds were chosen since they are the most active compounds and bear structural diversity. Compound 1 (W-54,011, IC 50 = 2.2 nM) (Sumichika et al., 2002), compound 3 (IC 50= 1.5nM) (Blagg et al., 2008a(Blagg et al., , 2008b, and compound 5 (IC 50 = 7 nM) ) (see Figure 1) were selected as "Reference Compound" specifying a "Principal" value of 2 and a "MaxOmitFeat" value of 0. The "Principal" and "MaxOmitFeat" values of the remaining two compounds were set to 1 and 0. The "Minimum Interfeature Distance" value was set to 1.0 Å. All the remaining HipHop parameters were kept at their default values.

Homology modeling of the structure of C5aR
The homology modeling was carried out using Modeler program within DS 3.1. The sequence of C5aR was obtained by NCBI database (http://www.ncbi.nlm.nih. gov/) showing in Table SI (see Supplementary material). The structure of bovine rhodopsin A (PDB Code: 1F88) was used as template for the homology modeling (Higginbottom et al., 2005); the amino acid identity between C5aR and the bovine rhodopsin A is 21.4%. The model with the lowest PDF (Probability Density Functions) total energy was selected and subjected to loop refinement and side chain refinement. The energy minimization was carried out using the conjugated gradient method with iterations of 50,000 steps and a RMS gradient of 0.001.

Establishment of the SVM classification model
Firstly, we constructed a training set for generating a classification model of C5aR antagonists and nonantagonists. The training set contains 18,135 compounds, including 282 known C5aR antagonists and 17,853 putative nonantagonists (a detailed description about the generation of the training set is given in Section 2). Initially, 252 molecular descriptors, which cover diverse molecular properties such as geometrical, topological, and electronic properties, were selected. These descriptors were preprocessed to remove obvious bad ones. Here, the following descriptors were removed: (1) descriptors with too many zero values, (2) descriptors that are highly correlated with others (correlation coefficients > 95%), and (3) descriptors with very small standard deviation values (<0.5%). Seventy-nine molecular descriptors passed through this process. The descriptor values were then scaled to a range of 1 to +1, which was necessary since the use of descriptors with their values distributed in considerable different ranges would impact the quality of SVM models generated. The scaled descriptors were subjected to the GA-CG algorithm for a further descriptor selection. Finally, 35 molecular descriptors including constitutional descriptors (8), estate keys (8), lipophilicity descriptors (1), hydrogen-bonding descriptors (2), surface area and volume descriptors (5), and topological descriptors (12), were selected to build the SVM classification model (see Table 1).

Validation of the SVM classification model
The SVM classification model just developed was first validated by a fivefold cross-validation method (Hamilton, Pantelic, Hanson, & Teasdale, 2007). In the fivefold cross-validation, the training set was randomly partitioned into five subsets. Of the five subsets, one single subset was retained as the validation set, and the remaining four subsets were used as the training set. The crossvalidation process was then repeated five times, with each of the five subsets used exactly once as a validation set. The five results obtained from these folds were then averaged to produce a single estimation. The fivefold cross-validation results for the training set are given in Table 2. From Table 2, we can see that, of the 282 C5aR antagonists, 245 were correctly predicted (TP, Table 2). The prediction accuracy of C5aR antagonists (SE,    Table 2). The accuracy for the prediction of C5aR nonantagonists (SP, Table 2) was 99.87%. The total prediction accuracy (Q) was 99.66%. These demonstrated that the generated SVM classification model was quite good for the prediction of training set agents. A good SVM model needed the capability not only to classify the training set agents correctly into positives and negatives, but also to predict external agents which are outside of the training set accurately. Here, an independent validation set was employed to evaluate the established SVM model. The independent validation set comprises 60 antagonists and 55 nonantagonists. Of the 60 antagonists, 44 (TP, Table 2) were correctly predicted, indicating a prediction accuracy of 80.0% for the antagonists (SE, Table 2). For the 55 nonantagonists, 53 (TP, Table 2) were properly predicted. The accuracy for the prediction of noninhibitors (SP, Table 2) was 96.67%. Of all the 115 test agents (antagonists and nonantagonists), 97 were correctly predicted and 18 were wrongly predicted (see Table 2). The overall prediction accuracy (Q) was 88.70%, which was comparable with that for the training set.

Pharmacophore modeling
A pharmacophore model was an ensemble of essential chemical features necessary to ensure the optimal superamolecular interactions with a specific biological target and to trigger its biological response (Daisy et al., 2011). The purpose of pharmacophore modeling was to extract these essential chemical features from a set of known ligands, called training set. In this investigation, five C5aR antagonists (compounds 1-5, see Figure 1) were selected to form the training set; these compounds were chosen since they are the most active compounds and have different scaffolds. The HipHop algorithm implemented in DS 3.1 program package was used to generate common feature based pharmacophore models. In the HipHop run, 10 pharmacophore hypotheses were finally generated. The top ranking hypothesis Hypo1, which will be used for the subsequent virtual screening, contains one hydrogen-bond acceptors (A), one ringaromatic feature (Z), three hydrophobic features (H), and one hydrophobic aromatic feature (R). The 3D spatial relationship and geometric parameters of Hypo1 are shown in Figure 2A. Figure 2B depicts the alignments of Hypo1 with one active compound (compound 1). Compound 1 was mapped very well with Hypo1, which gives a fitness value (FitValue) of 5.99; a fitness value is a measure of how well the ligand fits the pharmacophore hypothesis.

Validation of the pharmacophore model Hypo1
To examine whether the pharmacophore model Hypo1 has the capacity to differentiate C5aR antagonists and nonantagonists, we constructed a test set, named T-DB, which comprised of 330 known C5aR antagonists and 4580 decoys (the selection method for the decoys see Supplementary material) from Drug Bank library. Each of the T-DB compounds was mapped onto Hypo1 by using the protocol of "Ligand Pharmacophore Mapping" within the DS 3.1. The "Maximum Omitted Feature" was set to 2, and the "Fitting Method" adopted was "Flexible". Of the 330 C5aR antagonists, 311 were found to map very well with Hypo1, indicating a yield of 94.24%. And 1053 of the 4580 (22.99%) decoys were also mapped very well with Hypo1. These results showed, on the one hand, that Hypo1 was capable of discriminating to some extent the C5aR antagonists and nonantagonists. On the other hand, sole use of Hypo1 in VS also suffered a high false-positive rate, implied the necessity of employing a hybrid VS strategy.

Performance evaluation of the hybrid SB/PB-VS and the individual SB-VS and PB-VS in virtual screening
A large independent test set, called T-CHEM, was constructed to evaluate the overall performance of the hybrid SB/PB-VS and the individual SB-VS and PB-VS in virtual screening. The T-CHEM comprises 59 known C5aR antagonists (these compounds have never been used in previous modeling and model validations), and 79,534 decoys obtained from ChEMBL (GPCR SARfari) library. The yield (percentage of predicted compounds in known antagonists), hit rate (percentage of known antagonists in predicted compounds), and enrichment factor (ratio of hit rate to the percentage of known antagonists in T-CHEM), which showed the magnitude of hit rate improvement over random selection, were evaluated for the performance of virtual screening.
First, SB-VS and PB-VS were solely employed to screen the T-CHEM database. The screening results are given in Table 3. From Table 3, we can see that for the individual use of SB-VS, the number of predicted hits is 836, in which 54 are positives, indicating a yield of 91.53% (54 out of the 59 antagonists, see Table 3). The hit rate and enrichment factor were 6.46 and 87.08%, respectively. The time used was just around 5 min (0.083 h) on a desktop PC equipped with an Intel E5420 (2500 MHz) processor if the molecular descriptors have been calculated (the time used for molecular descriptor calculations for all the T-CHEM compounds is about 0.22 h on the same computer). For PB-VS, the number of predicted positives was 15,216, and the total hits was 43, showed a yield of 72.88% (43 out of the 59 antagonists, see Table 3). The hit rate and enrichment factor were 0.283 and 3.81%, respectively. The time cost for the screening of T-CHEM by PB-VS was about 3.8 h.
Then, the hybrid VS approach was adopted to screen the T-CHEM library, in which the faster SB-VS was used as the first filter, followed by the slower PB-VS. The final number of positive compounds predicted was 330; in which the hits was 42 with a yield of 71.19% (42 out of the 59 antagonists, see Table 3). The hit rate and enrichment factor were 12.73 and 171.57%, respec-tively, which were significantly higher than the corresponding values of individual VS methods. The total time used was about 0.67 h.

A comparison of the hybrid SB/PB-VS and molecular docking-based VS (DB-VS)
As mentioned before, ligand-based VS methods were reasonably the primary choice for retrieving C5aR antagonists from virtual chemical databases because of no crystal structure reported for C5aR at present. Nevertheless, DB-VS could be another option since the structure of C5aR can be generated by homology modeling. Here, we made a comparison between the performances of the hybrid SB/PB-VS and DB-VS. In order to carry out molecular docking, we established a 3D structure of C5aR using homology modeling in advance using Modeler program within DS 3.1. The established model was evaluated by Profiles-3D. The calculated total "verify score" for the C5aR homology model was 146.06, with the "verify expected high score" and "verify expected low score" being 156.25 and 70.31, respectively. The calculated "verify score" for each of residues in C5aR model is shown in Figure 3A. Apparently, most of the  residues have a score large than zero 324/343, implied rational structures for these residues. The established C5aR model was further evaluated using Ramachandran plot, which was generally used to check the correctness of predicted torsion angles of residues in proteins. Figure 3B shows the Ramachandran plot of the homology model of C5aR. We can see that only few of residues (1.17%) locate in the region of unfavorable torsion angles. Overall, the validation results demonstrated that the established homology model of C5aR would be a reasonable structure, which will be used in the following docking studies. A regular DB-VS process was then performed to screen T-CHEM, in which GOLD was used for docking and the C5aR structure generated by homology modeling was taken as the reference receptor. The DB-VS gave 10,054 positives; here a compound was defined as a positive if its value of scoring function ChemPLP was greater than 52 (the threshold value of 52 was determined based on an initial test study made by us, in which we found that a compound has large possibility to be a positive if the value of scoring function ChemPLP is greater than 52). Among these positives, 16 were the known hits that indicated a yield of 27.12% (16 known antagonists out of 59, see Table 3). The hit rate and enrichment factor were 0.159 and 2.15%, respectively. The time cost was about 648 h on the same computer as before. Obviously, compared with the hybrid SB/PB-VS, the DB-VS was much worse in terms of the hit rate and enrichment factor as well as the screening speed. Even compared with the individual SB-VS and PB-VS, the DB-VS was still in the lower position. This situation was understandable because of the less reliability of receptor structure obtained by homology modeling, in addition to the existing deficiencies of molecular docking including receptor flexibility, scoring function, and solvent problems.

Application of the hybrid VS approach for retrieving novel C5aR antagonists
The optimal VS method, named the hybrid ligand-based VS approach, was used to screen several large chemical libraries including PubChem (18,831,686 compounds), Specs (202,408 compounds), and Enamine (980,000 compounds). All the compounds (20,014, 094) were first filtered by SB-VS. And 1031 compounds passed through this filter. These compounds were then subjected to screen by PB-VS. We selected 200 compounds as hits from top ranking ones based on the fitness values, which have been presented in Supplementary material. And a total of 20 compounds were visually chosen from the hit list and shifted to the subsequent experimental assays, which results will be reported in the near future.

Conclusion
In this investigation, we proposed a hybrid ligand-based virtual screening approach, namely SB/PB-VS, based on SVM classification and pharmacophore models for the identification of potential C5aR antagonists. The performance of SB/PB-VS approach in VS was evaluated, which results showed that it significantly increased the hit rate and enrichment factor compared with the individual use of SB-VS and PB-VS. This hybrid SB/PB-VS was also compared with DB-VS, in which the receptor structure was constructed by homology modeling. The results show that the hybrid SB/PB-VS is significantly more superior to DB-VS in terms of hit rate, enrichment factor, and screening speed. The hybrid VS approach was then used to screen several large chemical libraries including PubChem, Specs, and Enamine. Finally, a total of 20 compounds were selected from the top ranking compounds, and shifted to the subsequent in vitro and in vivo studies, which results will be reported in the near future. Overall, this study offers a strategy for the discovery of novel C5aR antagonists; this strategy has a potential to apply to other GPCRs whose structures are unknown.

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.10.1080/07391102.2012.698245.