Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition

With the explosive growth of protein sequences entering protein data banks in the post-genomic era, there is a pressing need for automated methods that can rapidly and effectively identify protein–protein binding sites (PPBSs) from sequence information alone. To address this problem, we propose a predictor called iPPBS-PseAAC, in which each amino acid residue site of the proteins concerned is treated as a 15-tuple peptide segment generated by sliding a window along the protein chain with its center aligned with the target residue. The working peptide segment is further formulated by a general form of pseudo amino acid composition via the following procedures: (1) it is converted into a numerical series via the physicochemical properties of amino acids; (2) the numerical series is subsequently converted into a 20-D feature vector by means of the stationary wavelet transform technique. The operation engine that runs the prediction is a two-layer ensemble classifier formed by many individual “Random Forest” classifiers, with the 1st layer voting out the best training data-set from many bootstrap systems and the 2nd layer voting out the most relevant one from seven physicochemical properties. Cross-validation tests indicate that the new predictor is very promising, meaning that many important key features, which are deeply hidden in complicated protein sequences, can be extracted via the wavelet transform approach; this is quite consistent with the fact that many important biological functions of proteins can be elucidated from their low-frequency internal motions. The web server of iPPBS-PseAAC is accessible at http://www.jci-bioinfo.cn/iPPBS-PseAAC, by which users can easily obtain their desired results without needing to follow the complicated mathematical equations involved.


Introduction
All cellular processes depend on precisely orchestrated interactions between proteins (Chou & Cai, 2006). A critical step in understanding the biological function of a protein is identification of the interface sites on which it interacts with other protein(s). Characterization of protein interactions is important for many problems, ranging from rational drug design to analysis of various biological networks (see, e.g. Fan, Xiao, & Min, 2014; Min, Xiao, & Chou, 2013; Xiao, Min, Lin, & Liu, 2015; Xiao, Min, & Wang, 2013a, 2013c; Zhong & Zhou, 2014; Zhou, 2015). The number of experimentally determined structures of protein-protein and protein-ligand complexes is still quite small, as reflected by the fact that the number of entries in UniprotKB/Swissprot (UniProt, 2013) is much larger than that in the Protein Data Bank (Berman et al., 2000). The limited availability of structures often restricts the identification of binding sites of proteins and their functional annotation. Furthermore, the chemical or biological experimental methods are expensive, time-consuming and labor-intensive. Therefore, as a complement to the experimental methods, there is a strong demand for computational methods that identify the protein-protein binding sites (PPBSs) from sequence information alone (Gallet, Charloteaux, Thomas, & Brasseur, 2000; Valencia & Pazos, 2002).
Given a protein sequence, how can we identify which of its constituent amino acid residues are located in the binding site? Ofran and Rost (2003) and Yan, Dobbs, and Honavar (2004) have reported the following findings: (1) the residues involved in this kind of interaction usually tend to form clusters in sequences within four neighboring residues on either side; and (2) 97-98% of interface residues have at least one additional interface residue and 70-74% have at least four additional interface residues. Their analysis indicates that the neighboring residues of an actual interface residue have a higher potential for being interface residues, suggesting that fragments of protein sequences (referred to as sub-sequences hereafter) may contain useful information or features for discriminating between interaction and non-interaction sites. Several approaches have been proposed for predicting protein-protein interaction sites from amino acid sequence. Kini and Evans (1996), based on their observations on the frequency of proline residues occurring near the interaction sites, proposed a method for predicting the potential PPBSs by detecting the presence of proline brackets. Shortly afterward, using multiple sequence alignment to detect correlated changes of the interacting protein domains, Pazos, Helmer-Citterich, Ausiello, and Valencia (1997) offered a different method to predict the contacting residue pairs. In 2000, Gallet et al. (2000) introduced an approach to identify the interacting residues by analyzing the sequence hydrophobicity with the method developed by Eisenberg, Schwarz, Komaromy, and Wall (1984). In 2003, Ofran and Rost (2003) used sub-sequences of nine consecutive residues to develop a neural network-based method with a post-processing filter to predict interface residues. Subsequently, Yan et al. (2004) also used sub-sequences of nine residues to develop a two-stage classifier by combining support vector machine (SVM) and Bayesian network classifiers, achieving a higher accuracy. Two years later, Wang et al. (2006) also developed a predictor in this regard by using SVM with features extracted from spatial sequence and evolutionary scores based on a phylogenetic tree.
Since the three-dimensional (3D) structures are unknown for most proteins, the sequence-based method plays an important role in protein binding site prediction. Unfortunately, several issues (Chen & Jeong, 2009; Sikic, Tomic, & Vlahovicek, 2009) make the sequence-based approach particularly difficult. The main problems are as follows: (i) the effective features common to all the binding sites are hard to extract because the biological properties responsible for protein-protein interactions are not fully understood; (ii) the prediction of binding sites is a highly imbalanced classification problem because the number of non-binding sites of a protein pair is substantially larger than that of binding ones, and hence it is prone to bias; (iii) there is no good benchmark data-set owing to the lack of a unique definition for the binding sites, as reflected by the fact that one definition of the binding sites is based on the distance between the carbon atoms concerned, but another on the change of the accessible surface area (ASA) value between the bound and unbound states.
The present study was initiated in an attempt to develop a new approach to predict PPBSs, in the hope of helping to deal with the aforementioned problems.
As demonstrated in a series of recent publications (Ding, Deng, Yuan, & Liu, 2014;Jia, Liu, & Xiao, 2015;Liu, Xu, Lan, Xu, & Zhou, 2014;Qiu, Xiao, & Lin, 2015;Xu, Wen, Wen, & Wu, 2014;Xu, Zhou, Liu, He, & Zou, 2015) in using Chou's 5-step rule (Chou, 2011), to develop a really useful sequence-based predictor for a biological system, we should make the following five procedures very clear: (1) how to construct or select a valid benchmark data-set to train and test the predictor; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) how to introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) how to properly perform cross-validation tests to objectively evaluate its anticipated accuracy; (5) how to establish a user-friendly web-server that is accessible to the public. Below, we are to address the five procedures one-by-one.

Benchmark data-set
Two benchmark data-sets were used for the current study. One is the "surface-residue" data-set and the other is "all-residue" data-set, as elaborated below.
The protein-protein interfaces are usually formed by those residues that are exposed to the solvent after the two counterparts are separated from each other. Consider a protein sample with L residues, expressed as

$$P = R_1 R_2 R_3 \cdots R_L \qquad (1)$$

where $R_1$ represents the 1st amino acid residue of the protein P, $R_2$ the 2nd residue, and so forth. The residue $R_i$ $(i = 1, 2, \ldots, L)$ is deemed a surface residue if it satisfies the condition of Equation (2), where $\mathrm{ASA}(R_i|P)$ is the ASA of $R_i$ when it is a part of protein P, $\mathrm{ASA}(R_i)$ is the accessible surface area of the free $R_i$, which is actually its maximal ASA as given in Table 1 (Ofran & Rost, 2003), and $\phi(R_i)$ is the ratio of the two. Furthermore, the surface residue $R_i$ is deemed an interfacial residue (Jones & Thornton, 1996) if it satisfies the condition of Equation (3), where $\mathrm{ASA}(R_i|PP)$ is the accessible surface area of $R_i$ when it is a part of the protein-protein complex.
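The two-stage labelling described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the cutoff values are deliberately left as required arguments because they are not stated in the text here, and the interface test is assumed to be a loss of ASA on going from the free chain to the complex.

```python
def relative_asa(asa_in_chain, max_asa):
    """Ratio of a residue's ASA inside the chain to its maximal (free) ASA."""
    return asa_in_chain / max_asa


def classify_residue(asa_in_chain, asa_in_complex, max_asa,
                     surface_cutoff, interface_cutoff):
    """Label one residue as 'buried', 'surface' or 'interface'.

    surface_cutoff   : hypothetical threshold on the ASA ratio
    interface_cutoff : hypothetical threshold on the ASA lost on complexation
    Both cutoffs and the strict/non-strict inequalities are assumptions.
    """
    if relative_asa(asa_in_chain, max_asa) <= surface_cutoff:
        return "buried"
    if asa_in_chain - asa_in_complex > interface_cutoff:
        return "interface"
    return "surface"


# Arbitrary example numbers, chosen only to exercise each branch.
print(classify_residue(10.0, 10.0, 100.0, surface_cutoff=0.25, interface_cutoff=1.0))
print(classify_residue(60.0, 20.0, 100.0, surface_cutoff=0.25, interface_cutoff=1.0))
print(classify_residue(60.0, 60.0, 100.0, surface_cutoff=0.25, interface_cutoff=1.0))
```

Only residues that pass the surface test are examined for the interface test, which mirrors the order of the two definitions in the text.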
If only the surface residues are considered, as done in Wang, Huang, and Jiang (2014) for the 99 polypeptide chains extracted by Deng, Guan, Dong, and Zhou (2009) from the 54 heterocomplexes in the Protein Data Bank, the results obtained can be formulated as

$$S_{\mathrm{surf}} = S^{+}_{\mathrm{surf}} \cup S^{-}_{\mathrm{surf}} \qquad (4)$$

where $S_{\mathrm{surf}}$ is called the "surface-residue data-set" that contains a total of 13,771 surface residues, of which 2,828 are interfacial residues belonging to the positive subset $S^{+}_{\mathrm{surf}}$, while 10,943 are non-interfacial residues belonging to the negative subset $S^{-}_{\mathrm{surf}}$, and $\cup$ is the symbol of union in set theory.
If all the residues are considered, as done in Chen and Jeong (2009), however, the corresponding benchmark data-set can be expressed by

$$S_{\mathrm{all}} = S^{+}_{\mathrm{all}} \cup S^{-}_{\mathrm{all}} \qquad (5)$$

where $S_{\mathrm{all}}$ is called the "all-residue data-set" that contains a total of 27,442 residues, of which 2,828 are interfacial residues belonging to the positive subset $S^{+}_{\mathrm{all}}$, while 24,614 are non-interfacial residues belonging to the negative subset $S^{-}_{\mathrm{all}}$. For readers' convenience, given in S1 Data-set is a combination of the two benchmark data-sets, where those labeled in column 3 are all the residues determined by experiments, those in column 4 are surface and non-surface residues, and those in column 5 are interface and non-interface residues.
As pointed out in a comprehensive review (Chou & Shen, 2007a), there is no need to separate a benchmark data-set into a training data-set and a testing data-set for examining the quality of a prediction method if it is tested by the jackknife test or subsampling (K-fold) cross-validation test because the outcome thus obtained is actually from a combination of many different independent data-set tests.

Flexible sliding window approach
For a protein chain as formulated by Equation (1), the sliding window approach (Chou, 2001a) and flexible sliding window approach (Chou & Shen, 2007b) are often used to investigate its various post-translational modification (PTM) sites (see, e.g. Qiu, Xiao, & Lin, 2014; Qiu et al., 2015; Xu, Ding, & Wu, 2013; Xu, Shao, Wu, Deng, 2013; Xu, Wen, & Shao, 2014) and HIV (human immunodeficiency virus) protease cleavage sites (Chou, 1996). Here, we also use it to study PPBSs. In the sliding window approach, a scaled window is denoted by $[-n, +n]$ (Chou, 2001a). Its width is $2n + 1$, where n is an integer. When sliding it along a protein chain P (Equation (1)), one can see through the window a series of consecutive peptide segments as formulated by

$$P_n(R_0) = R_{-n} R_{-(n-1)} \cdots R_{-2} R_{-1} R_0 R_{+1} R_{+2} \cdots R_{+(n-1)} R_{+n} \qquad (6)$$

where $R_{-n}$ represents the n-th upstream amino acid residue from the center, $R_{+n}$ the n-th downstream amino acid residue, and so forth. The amino acid residue $R_0$ at the center is the targeted residue. When its sequence position in P (cf. Equation (1)) is less than n or greater than $L - n$, the corresponding $P_n(R_0)$ is defined not by P of Equation (1) but by the following dummy protein chain

$$P(\mathrm{dummy}) = R_n \cdots R_2 R_1 \Updownarrow R_1 R_2 \cdots R_L \Updownarrow R_L R_{L-1} \cdots R_{L-n+1} \qquad (7)$$

where the symbol $\Updownarrow$ stands for a mirror, the dummy segment $R_n \cdots R_2 R_1$ stands for the image of $R_1 R_2 \cdots R_n$ reflected by the mirror, and the dummy segment $R_L R_{L-1} \cdots R_{L-n+1}$ for the mirror image of $R_{L-n+1} \cdots R_{L-1} R_L$ (Figure 1). Accordingly, P(dummy) of Equation (7) is also called the mirror-extended chain of protein P.
Thus, for each of the L amino acid residues in protein P (Equation (1)), we have a working segment as defined by Equation (6). In the current study, the $(2n+1)$-peptides $P_n(R_0)$ can be further classified into different categories (the symbol $\in$ used for this purpose represents "a member of" in set theory).
Note: B stands for D or N; Z for E or Q; and X for an undetermined amino acid. Amino acids are represented by their one-letter codes.
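The mirror-extended sliding window can be illustrated with a short sketch. The function names are hypothetical, and the code assumes that the chain has at least n residues and that the mirror image includes the terminal residue itself, as the description of the dummy segments suggests.

```python
def mirror_extend(sequence, n):
    """Pad both termini with the mirror images of the first and last n residues."""
    if n < 1 or len(sequence) < n:
        raise ValueError("need 1 <= n <= len(sequence)")
    return sequence[:n][::-1] + sequence + sequence[-n:][::-1]


def sliding_segments(sequence, n):
    """Return one (2n+1)-residue segment per residue, centred on that residue."""
    extended = mirror_extend(sequence, n)
    width = 2 * n + 1
    # Residue i of the original chain sits at index i + n of the extended chain,
    # so the window starting at index i is centred on it.
    return [extended[i:i + width] for i in range(len(sequence))]


# Toy chain of five residues with n = 2 (the study itself uses n = 7).
for segment in sliding_segments("ABCDE", 2):
    print(segment)
```

With this construction every residue, including those near the termini, yields a segment of the same width, so the downstream feature vectors all have the same dimension.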

Using pseudo amino acid composition to represent peptide chains
One of the most challenging problems in computational biology today is how to effectively formulate the sequence of a biological sample (such as protein, peptide, DNA, or RNA) with a discrete model or a vector that can considerably keep its sequence order information or capture its key features. The reasons are as follows.
(1) If using the sequential model, i.e. the model in which all the samples are represented by their original sequences, it is hardly able to train a machine that can cover all the possible cases concerned, as elaborated in Chou (2011).
Their numerical values are given in Table 2. Thus, the peptide segment $P_n$ of Equation (10) can be encoded into seven different numerical series, as formulated by Equation (11), where $\Phi^{(1)}_1$ is the hydrophobicity value of $R_1$ in Equation (9), $\Phi^{(2)}_2$ the hydrophilicity value of $R_2$, and so forth. Note that before substituting the physicochemical values of Table 2 into Equation (10), they are all subjected to the following standard conversion

$$\Phi \Leftarrow \frac{\Phi - \langle \Phi \rangle}{\mathrm{SD}(\Phi)} \qquad (12)$$

where the symbol $\langle\ \rangle$ means taking the average of the quantity therein over the 20 amino acid types, and SD means the corresponding standard deviation. The converted values via Equation (12) will have zero mean value over the 20 amino acid types, and will remain unchanged if they go through the same standard conversion procedure again.
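The standard conversion and its two stated properties (zero mean, and no change on a second pass) can be checked numerically. The sketch below uses placeholder numbers rather than the values of Table 2, and it assumes the population standard deviation; the text does not say which SD definition was used.

```python
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(values):
    """Shift and scale a {amino acid: value} map to zero mean and unit SD."""
    numbers = list(values.values())
    mean = statistics.fmean(numbers)
    sd = statistics.pstdev(numbers)  # population SD: an assumption
    return {aa: (v - mean) / sd for aa, v in values.items()}


# Placeholder property values, NOT taken from Table 2.
raw = {aa: float(i * i) for i, aa in enumerate(AMINO_ACIDS)}

once = standardize(raw)
twice = standardize(once)

print(abs(statistics.fmean(list(once.values()))) < 1e-9)
print(all(abs(once[aa] - twice[aa]) < 1e-9 for aa in AMINO_ACIDS))
```

The second pass leaves the values unchanged because they already have zero mean and unit SD, so subtracting the mean and dividing by the SD is the identity.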

Stationary wavelet transform approach
The low-frequency internal motion is a very important feature of biomacromolecules (see, e.g. Gordon, 2008; Madkan, Blank, Elson, Geddis, & Goodman, 2009; Martel, 1992, as well as a Wikipedia article at http://en.wikipedia.org/wiki/Low-frequency_collective_motion_in_proteins_and_DNA). Many marvelous biological functions in proteins and DNA and their profound dynamic mechanisms, such as switch between active and inactive states (Wang & Chou, 2009; Wang, Gong, Wei, & Li, 2009), cooperative effects (Chou, 1989a), allosteric transition (Chou, 1987; Schnell & Chou, 2008; Wang & Chou, 2010), intercalation of drugs into DNA (Chou & Mao, 1988), extra electron motion in DNA (Zhou, 1989), and assembly of microtubules (Chou, Zhang, & Maggiora, 1994), can be revealed by studying their low-frequency internal motions as summarized in a comprehensive review (Chou, 1988). Low-frequency Fourier spectrum was also used by Liu, Wang, and Chou (2005) to develop a sequence-based method for predicting membrane protein types. In view of this, it would be intriguing to introduce the stationary wavelet transform (SWT) into the current study. The SWT (Shensa, 1992) is a wavelet transform algorithm designed to overcome the lack of shift-invariance of the discrete wavelet transform (DWT) (Mallat, 1989). Shift-invariance is achieved by removing the downsamplers and upsamplers in the DWT and upsampling (inserting zeros into) the filter coefficients by a factor of $2^{j-1}$ in the jth level of the algorithm. The SWT is an inherently redundant scheme, as the output of each level of SWT contains the same number of samples as the input; so for a decomposition of N levels, there is a redundancy of N in the wavelet coefficients. Shown in Figure 2 is the block diagram depicting the digital implementation of SWT. As we can see from the figure, the input peptide segment is decomposed recursively in the low-frequency part.
The concrete procedure of using the SWT to denote the $(2n+1)$-tuple peptides is as follows. For each of the $(2n+1)$-tuple peptides generated by sliding the scaled window $[-n, +n]$ along the protein chain concerned, the SWT was used to decompose it based on the amino acid values encoded by the seven physicochemical properties as given in Equation (11). The Daubechies wavelet number 1 (Db1) was selected because it possesses a lower vanishing moment and easily generates non-zero coefficients for the ensemble learning framework that will be introduced later.
In the preliminary study, we tested the sensitivity of the predicted outcome versus the value of parameter n from 4 to 10, and observed that when $n = 7$, i.e. when the working segments are 15-tuple peptides, the outcomes thus obtained were most promising, as shown in Figure 3. Accordingly, we only consider the case of $n = 7$ hereafter.
Using the SWT approach, we have generated 5 subbands (Figure 2), each of which has four coefficients: (1) $a_i$, the maximum of the wavelet coefficients in the subband $i$ $(i = 1, 2, \ldots, 5)$; (2) $b_i$, the corresponding mean of the wavelet coefficients; (3) $c_i$, the corresponding minimum of the wavelet coefficients; (4) $d_i$, the corresponding standard deviation of the wavelet coefficients. Therefore, for each working segment, we can get a feature vector that contains $\Omega = 5 \times 4 = 20$ components by using each of the seven physicochemical properties of Equation (11). In other words, we have seven different modes of PseAAC, as given in Equation (13).
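To make the 20-component construction concrete, here is a minimal pure-Python sketch of an undecimated Haar (Db1) transform followed by the four summary statistics per subband. Several choices are assumptions rather than details taken from the text: four decomposition levels (giving four detail subbands plus one final approximation subband), periodic boundary handling, the sign convention of the high-pass filter, the population standard deviation, and the function names.

```python
import math
import statistics


def haar_swt(signal, levels=4):
    """Undecimated Haar transform with periodic boundaries.

    Returns [detail_1, ..., detail_levels, approximation_levels]; every
    subband has the same length as the input, which is the redundancy
    that characterises the SWT.
    """
    approx = list(signal)
    size = len(approx)
    subbands = []
    for level in range(1, levels + 1):
        step = 2 ** (level - 1)  # filter taps are spread apart by 2^(j-1)
        shifted = [approx[(i + step) % size] for i in range(size)]
        detail = [(a - b) / math.sqrt(2) for a, b in zip(approx, shifted)]
        approx = [(a + b) / math.sqrt(2) for a, b in zip(approx, shifted)]
        subbands.append(detail)  # high-frequency part of this level
    subbands.append(approx)      # low-frequency part of the last level
    return subbands


def swt_features(signal, levels=4):
    """Maximum, mean, minimum and SD of each subband, concatenated."""
    features = []
    for band in haar_swt(signal, levels):
        features += [max(band), statistics.fmean(band),
                     min(band), statistics.pstdev(band)]
    return features


# A made-up 15-value numerical series standing in for one encoded peptide.
series = [0.1 * i for i in range(15)]
print(len(swt_features(series)))  # 5 subbands x 4 statistics
```

Only the approximation is fed to the next level, which corresponds to decomposing recursively in the low-frequency part. A library implementation could be substituted, but some of them require the input length to be a multiple of a power of two, which a 15-residue segment is not.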

Ensemble RF algorithm
The RF algorithm is a powerful algorithm, which has been used in many areas of computational biology (see, e.g. Kandaswamy, Martinetz, Moller, Sridharan, & Pugalenthi, 2011; Lin et al., 2011; Pugalenthi, Kandaswamy, Vivekanandan, & Kolatkar, 2012). The detailed procedures and formulation of RF have been very clearly described in Breiman (2001), and hence there is no need to repeat them here. It should be pointed out, however, that the number of negative samples in the current case is much larger than that of positive ones, whereas most classifiers (including RF) usually work properly only for benchmark data-sets consisting of balanced subsets. To deal with such a situation, an asymmetric bootstrap approach was adopted as elaborated in Jia, Xiao, Liu and Jiao (2011) and illustrated in Figure 4. As shown in the figure, in order to construct a balanced data-set to train each of the sub-classifiers, we randomly picked the negative training samples from $S^{-}_{\mathrm{all}}$ or $S^{-}_{\mathrm{surf}}$ so that their number equals that of the corresponding positive samples in $S^{+}_{\mathrm{all}}$ or $S^{+}_{\mathrm{surf}}$, respectively. Also, as shown in Equation (13), a peptide segment concerned in the current study can be formulated with seven different PseAAC modes, each of which can be used to train the RF predictor. Accordingly, we have a total of seven individual predictors for identifying PPBS, denoted $\mathrm{RF}(k)$ $(k = 1, 2, \ldots, 7)$ (Equation (14)), where $\mathrm{RF}(k)$ represents the RF predictor based on the kth physicochemical property (cf. Equation (13)).
Figure 3. A histogram to show the results of AUC (Fawcett, 2005) obtained by using different values of n for the working peptides. As we can see, when $n = 7$, i.e. when the working segments are $(2n+1) = 15$-tuple peptides (cf. Equation (10)), the outcomes thus obtained were most promising. For more explanation, see the text in the Section 2.2 and the legend of Figure 6 later.
Figure 4. A flowchart to illustrate the 1st-layer ensemble classifier, a voting system using the bootstrap approach to deal with the situation when the number of negative samples is overwhelmingly larger than that of positive ones, as done in Jia et al. (2011). In the figure, RF denotes the RF classifier, $S^{+}$ denotes either $S^{+}_{\mathrm{surf}}$ or $S^{+}_{\mathrm{all}}$, and $S^{-}$ denotes either $S^{-}_{\mathrm{surf}}$ or $S^{-}_{\mathrm{all}}$ (cf. Equations (4)-(5)). See the text for more explanation.
Now, the problem is how to combine the results from the seven individual predictors to maximize the prediction quality. As indicated by a series of previous studies, using the ensemble classifier formed by fusing many individual classifiers can remarkably enhance the success rates in predicting protein subcellular localization , 2007c and protein quaternary structural attribute (Shen & Chou, 2009a). Encouraged by the previous investigators' studies, here we also develop an ensemble classifier by fusing the seven individual predictors $\mathrm{RF}(k)$ $(k = 1, 2, \ldots, 7)$ through a voting system, as formulated by

$$\mathrm{RF}^{E} = \mathrm{RF}(1) \,\forall\, \mathrm{RF}(2) \,\forall\, \cdots \,\forall\, \mathrm{RF}(7) \qquad (15)$$

where $\mathrm{RF}^{E}$ stands for the ensemble classifier, and the symbol $\forall$ for the fusing operator. For the detailed procedures of how to fuse the results from the seven individual predictors to reach a final outcome via the voting system, see Equations (30)-(35) in Chou and Shen (2007a), where a clear derivation is given and hence there is no need to repeat it here. To provide an intuitive picture, a flowchart is given in Figure 5 to illustrate how the seven individual RF predictors are fused into the ensemble classifier. The final predictor thus obtained is called "iPPBS-PseAAC", where "i" stands for "identify", "PPBS" for "protein-protein binding site", and "PseAAC" for "pseudo amino acid composition" approach.
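The two voting layers rest on two simple operations: drawing balanced training sets and fusing individual decisions. The sketch below illustrates both in isolation, without any RF training. It is not the authors' implementation: the function names are hypothetical, negatives are drawn without replacement, and a plain majority vote stands in for the fusion rule that the text defers to Chou and Shen (2007a).

```python
import random
from collections import Counter


def balanced_training_sets(positives, negatives, n_sets, seed=0):
    """Pair all positives with an equally sized random draw of negatives.

    Each returned (positive, negative) pair could train one sub-classifier.
    """
    rng = random.Random(seed)
    positives = list(positives)
    negatives = list(negatives)
    return [(positives, rng.sample(negatives, len(positives)))
            for _ in range(n_sets)]


def majority_vote(decisions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(decisions).most_common(1)[0][0]


# Toy sample identifiers: 5 positives against 50 negatives.
sets = balanced_training_sets(range(5), range(100, 150), n_sets=3)
print([len(neg) for _, neg in sets])

# Seven hypothetical decisions D(1)..D(7), one per physicochemical property.
print(majority_vote([1, 0, 1, 1, 0, 0, 1]))
```

Using an odd number of voters, such as the seven property-specific predictors, avoids ties for a two-class problem.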

Result and discussion
As pointed out in the Introduction section, one of the important procedures in developing a predictor is how to properly and objectively evaluate its anticipated success rates (Chou, 2011). Toward this, we need to consider the following two aspects: one is what kind of metrics should be used to quantitatively measure the prediction accuracy; the other is what kind of test method should be adopted to derive the metrics values, as elaborated below.

Success rate metrics and validation approach
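For measuring the success rates in identifying PPBS, a set of four metrics is often used in the literature. They are: (1) overall accuracy or Acc, (2) Matthews correlation coefficient or MCC, (3) sensitivity or Sn, and (4) specificity or Sp (see, e.g. Chen, Liu, & Yang, 2007). Unfortunately, the conventional formulations for the four metrics are not quite intuitive for most experimental scientists, particularly the one for MCC. Interestingly, by using the symbols and derivation as used in Chou (2001a) for studying signal peptides, the aforementioned four metrics can be formulated by a set of equations given below (Chen et al., 2013; Lin et al., 2014; Qiu, Xiao, & Chou, 2014):

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le \mathrm{Sn} \le 1 \\[2mm]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le \mathrm{Sp} \le 1 \\[2mm]
\mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le \mathrm{Acc} \le 1 \\[2mm]
\mathrm{MCC} = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}, & -1 \le \mathrm{MCC} \le 1
\end{cases} \qquad (17)
$$

where $N^{+}$ represents the total number of PPBSs investigated, whereas $N^{+}_{-}$ is the number of true PPBSs incorrectly predicted to be non-PPBSs; $N^{-}$ is the total number of the non-PPBSs investigated, whereas $N^{-}_{+}$ is the number of non-PPBSs incorrectly predicted to be PPBSs.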
According to Equation (17), the following is clear. When $N^{+}_{-} = 0$, meaning that none of the true PPBSs is incorrectly predicted to be a non-PPBS, we have the sensitivity $\mathrm{Sn} = 1$. When $N^{+}_{-} = N^{+}$, meaning that all the PPBSs are incorrectly predicted to be non-PPBSs, we have the sensitivity $\mathrm{Sn} = 0$. Likewise, when $N^{-}_{+} = 0$, meaning that none of the non-PPBSs is incorrectly predicted to be a PPBS, we have the specificity $\mathrm{Sp} = 1$; whereas when $N^{-}_{+} = N^{-}$, meaning that all the non-PPBSs are incorrectly predicted to be PPBSs, we have the specificity $\mathrm{Sp} = 0$. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the PPBSs in the positive data-set and none of the non-PPBSs in the negative data-set are incorrectly predicted, we have the overall accuracy $\mathrm{Acc} = 1$ and $\mathrm{MCC} = 1$; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the PPBSs in the positive data-set and all the non-PPBSs in the negative data-set are incorrectly predicted, we have the overall accuracy $\mathrm{Acc} = 0$ and $\mathrm{MCC} = -1$; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have $\mathrm{Acc} = 0.5$ and $\mathrm{MCC} = 0$, meaning no better than a random guess. As we can see from the above discussion, Equation (17) makes the meanings of sensitivity, specificity, overall accuracy, and Matthews correlation coefficient much more intuitive and easier to understand, particularly the meaning of MCC.
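The limiting cases discussed above can be verified with a few lines of code. The sketch assumes the usual form of these four metrics written in terms of the two error counts; the function and argument names are hypothetical, and returning 0 for MCC when its denominator vanishes is a convention chosen here, not one stated in the text.

```python
import math


def binding_site_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sn, Sp, Acc and MCC from totals and error counts.

    n_pos        : total number of true binding sites
    n_pos_missed : binding sites predicted as non-binding
    n_neg        : total number of true non-binding sites
    n_neg_missed : non-binding sites predicted as binding
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    mcc = numerator / denominator if denominator else 0.0  # convention only
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Limiting cases with arbitrary totals of 100 positives and 200 negatives.
print(binding_site_metrics(100, 0, 200, 0))      # no errors
print(binding_site_metrics(100, 100, 200, 200))  # everything wrong
print(binding_site_metrics(100, 50, 200, 100))   # half of each class wrong
```

The third case shows why MCC is informative for imbalanced data: it is zero whenever the two error rates sum to one, regardless of how unequal the class sizes are.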
It should be pointed out, however, that the set of metrics as defined in Equation (17) is valid only for the
Figure 5. A flowchart to illustrate the 2nd-layer ensemble classifier that exploits all the different groups of features, where D(1) means the decision made by $\mathrm{RF}(1)$, D(2) means the decision made by $\mathrm{RF}(2)$, and so forth. See the text as well as Equations (11) and (15) for further explanation.
With the evaluation metrics available, the next thing is what validation method should be used to generate the metrics values.
In statistical prediction, the following three cross-validation methods are often used to derive the metrics values for a predictor: independent data-set test, subsampling (or K-fold cross-validation) test, and jackknife test (Chou & Zhang, 1995). Of the three methods, the jackknife test is deemed the least arbitrary because it always yields a unique outcome for a given benchmark data-set, as elucidated in Chou (2011) and demonstrated by Equations (28)-(32) therein. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g. Dehzangi et al., 2015; Hajisharifi et al., 2014; Khan, Hayat, & Khan, 2015; Kumar, Srivastava, Kumari, & Kumar, 2015; Mondal & Pai, 2014; Shen et al., 2007; Xiao et al., 2011). However, to reduce the computational time, in this study we adopted the 10-fold cross-validation, as done by most investigators with SVM and RF algorithms as the prediction engine. In the 10-fold cross-validation test, all the samples in the benchmark data-set are divided into 10 approximately equal-sized subsets. Each of the 10 subsets is then singled out one-by-one and tested by the predictor trained with the samples in the remaining subsets. The performance measures are then calculated as an average over the 10 different single-out subsets or divisions. In other words, during the process of 10-fold cross-validation, both the training data-set and testing data-set are actually open, and each subset is in turn moved between the two. The 10-fold cross-validation test can exclude the "memory" effect, just like conducting 10 different independent data-set tests.
Listed in Table 3 are the values of the four metrics (cf. Equation (17)) obtained by the current iPPBS-PseAAC predictor using the 10-fold cross-validation on the surface-residue benchmark data-set $S_{\mathrm{surf}}$ (Equation (4)) and the all-residue benchmark data-set $S_{\mathrm{all}}$ (Equation (5)), respectively. See S1 Data-set for the details of the two benchmark data-sets. To facilitate comparison, the corresponding results obtained by the existing methods (Chen & Jeong, 2009; Deng et al., 2009) are also given there.
As we can see from the table, the new predictor iPPBS-PseAAC proposed in this paper remarkably outperformed its counterparts, particularly in Acc and MCC; the former stands for the overall accuracy, and the latter for the stability. At first glance, the value of Sn by Deng et al.'s method (Deng et al., 2009) is higher than that of the current predictor when tested on the surface-residue benchmark data-set; however, its corresponding Sp value is more than 30% lower than that of the latter, indicating that the method (Deng et al., 2009) is very unstable with extremely high noise.
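The partitioning step of the 10-fold cross-validation described above can be sketched as follows. The random shuffling, the fixed seed, and the function name are assumptions; the text only states that the samples are divided into 10 approximately equal-sized subsets.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# The surface-residue data-set contains 13,771 samples.
for train, test in k_fold_indices(13771):
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so each is predicted once by a model that never saw it during training, and the metrics are then averaged over the 10 folds.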
Because graphic approaches can provide useful intuitive insights (see, e.g. Althaus et al., 1993; Chou, 1989b, 2010; Chou & Forsen, 1980; Wu, Xiao, & Chou, 2010; Zhou, 2011), here we also provide a graphic comparison of the current predictor with its counterparts via the receiver operating characteristic (ROC) plot (Fawcett, 2005), as shown in Figure 6. According to ROC (Fawcett, 2005), the larger the area under the curve (AUC), the better the corresponding predictor is. As we can see from the figure, the area under the ROC curve of the new predictor is remarkably greater than those of its counterparts, fully consistent with the AUC values listed in Table 3, once again indicating a clear improvement of the new predictor in comparison with the existing ones.
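For readers who wish to compute an AUC from prediction scores, a minimal sketch is given below. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name and the toy scores are made up for illustration, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels use 1 for binding sites and 0 for non-binding sites.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy scores: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

An AUC of 1 means every binding site is ranked above every non-binding site, while 0.5 corresponds to random ranking.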
All the above facts have shown that iPPBS-PseAAC is really a very promising predictor for identifying PPBSs. Or at the very least, it can play a complementary role to the existing prediction methods in this area. Particularly, none of the existing predictors has provided a web server. In contrast, a user-friendly and publicly accessible web server has been established for iPPBS-PseAAC at http://www.jci-bioinfo.cn/iPPBS-PseAAC, which is no doubt very useful for the majority of experimental scientists in this or related areas without the need to follow the complicated mathematical equations. Why could the proposed method be so powerful? This is because many key features, which are deeply hidden in complicated protein sequences, can be extracted via the wavelet transform approach. Just as in dealing with the extremely complicated internal motions of proteins, the key is to grasp the low-frequency collective motion (Gordon, 2008; Madkan et al., 2009) for in-depth understanding or revealing the dynamic mechanisms of their various important biological functions (Chou, 1988), such as cooperative effects (Chou, 1989a), allosteric transition (Chou, 1987; Schnell & Chou, 2008), assembly of microtubules (Chou et al., 1994), and switch between active and inactive states (Wang & Chou, 2009). Furthermore, a dual ensemble technique was used in this study: one for dealing with the unbalanced training data-set via a bootstrap voting system (Figure 4), and one for selecting the most relevant one from seven classes of different physicochemical properties (Figure 5).
Table 3. Comparison of the iPPBS-PseAAC with the other existing methods via the 10-fold cross-validation on the surface-residue benchmark data-set (Equation (4)) and the all-residue benchmark data-set (Equation (5)).
Note: Text in bold indicates the predictor proposed in this paper and its results. a: Results reported by Deng et al. (2009). b: Results reported by Chen and Jeong (2009). c: Results obtained by the current predictor using the same cross-validation method on the same benchmark data-set.

Web server and user guide
As emphasized in a recent review (Chou, 2015), a publicly accessible web-server is very important for the impact of a prediction method. To enhance the value of its practical applications, the web-server for iPPBS-PseAAC has been established at http://www.jci-bioinfo.cn/iPPBS-PseAAC. Furthermore, to maximize the convenience for the majority of experimental scientists, a step-by-step guide is provided below.
Step 1. Opening the web-server at http://www.jci-bioinfo.cn/iPPBS-PseAAC, you will see the top page of iPPBS-PseAAC on your computer screen, as shown in Figure 7. Click on the Read Me button to see a brief introduction about the iPPBS-PseAAC predictor.
Step 2. Either type or copy/paste the query protein sequences into the input box at the center of Figure 7. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with the symbol > in the first column, followed by lines of sequence data in which amino acids are represented using single-letter codes. Except for the mandatory symbol >, all the other characters in the single initial line are optional and only used for the purpose of identification and description. The sequence ends if another line starting with the symbol > appears; this indicates the start of another sequence. For examples of sequences in FASTA format, click the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the two query protein sequences in the Example window as the input, after 20 s or so, you will see the following on the screen of your computer: (1) Sequence-1 contains 63 amino acid residues, of which 19 are highlighted in red, meaning that they belong to the binding site. (2) Sequence-2 contains 224 residues, of which 11 are highlighted in red, meaning that they belong to the binding site. All these predicted results are fully consistent with experimental observations except for residues 44 and 63 in sequence-1 and residue 52 in sequence-2, which are overpredicted.
Step 4. As shown on the lower panel of Figure 7, you may also choose the batch prediction by entering your email address and your desired batch input file (in FASTA format of course) via the "Browse" button. To see the sample of batch input file, click on the button Batch-example.
Step 5. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iPPBS-PseAAC.
Step 6. Click the Supporting Information button to download the benchmark data-set used in this study.
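Before submitting a batch file, it can be useful to check that it follows the FASTA layout described in Step 2. The reader below is a minimal sketch written for this purpose; it is not part of the web-server, the function name is hypothetical, and the example sequences are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text.

    A record starts at a line beginning with '>' and runs until the next
    such line. Blank lines are skipped, and any sequence lines appearing
    before the first header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 demo\nMKV\nLLA\n>seq2\nGGH\n"
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

Sequence lines belonging to one record are joined, so a protein split over several lines is recovered as a single string.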

Conclusion
In the new PPBS predictor, each of the protein residue sites investigated is treated as a 15-tuple peptide generated by sliding the scaled window [−7,+7] (Chou, 2001b) along a protein chain with its center aligned with the amino acid residue concerned. The working peptide segment is further formulated by a general form of PseAAC via the following procedures: (1) it is converted into a numerical series via the physicochemical properties of amino acids; (2) the numerical series is subsequently converted into a 20-D feature vector by means of the SWT technique.
The operation engine to run the PPBS prediction is a dual ensemble formed by two voting systems with one for finding the best training data-set and the other for finding the most relevant physicochemical property.
It was demonstrated via cross-validations that the new predictor established with the above procedures is very powerful and promising. We anticipate that iPPBS-PseAAC predictor will become a very useful high throughput tool for identifying PPBSs, or at the very least, a complementary tool to the existing prediction methods in this area.

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.org/10.1080/07391102.2015.1095116.