iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model

As one of the most important posttranslational modifications (PTMs), ubiquitination plays an important role in regulating varieties of biological processes, such as signal transduction, cell division, apoptosis, and immune response. Ubiquitination is also named “lysine ubiquitination” because it occurs when an ubiquitin is covalently attached to lysine (K) residues of targeting proteins. Given an uncharacterized protein sequence that contains many lysine residues, which one of them is the ubiquitination site, and which one is of non-ubiquitination site? With the avalanche of protein sequences generated in the postgenomic age, it is highly desired for both basic research and drug development to develop an automated method for rapidly and accurately annotating the ubiquitination sites in proteins. In view of this, a new predictor called “iUbiq-Lys” was developed based on the evolutionary information, gray system model, as well as the general form of pseudo-amino acid composition. It was demonstrated via the rigorous cross-validations that the new predictor remarkably outperformed all its counterparts. As a web-server, iUbiq-Lys is accessible to the public at http://www.jci-bioinfo.cn/iUbiq-Lys. For the convenience of most experimental scientists, we have further provided a protocol of step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of its development process.

Because of its importance and complexity, identification of ubiquitination sites in proteins is one of the greatest challenges in gaining a full understanding of the regulatory roles of ubiquitination regulation and the molecular mechanism of the ubiquitin system. However, it is both time-consuming and labor-intensive to use conventional experimental approaches to determine the potential ubiquitination sites. Fortunately, some studies showed that ubiquitination occurs when an ubiquitin is covalently attached to lysine (K) residues of targeting proteins (Haglund & Dikic, 2005;Hershko & Ciechanover, 1998;Radivojac et al., 2010;Tung & Ho, 2008). Based on this, several computational approaches were developed to predict lysine ubiquitination sites. Tung and Ho (2008) built a prediction model using a set of physicochemical properties. Radivojac et al. (2010) used the random forest algorithm to develop a predictor called UbPred. Cai et al. (2012) proposed a predictor based on the nearest neighbor algorithm, in which the feature set was constructed by extracting the conservation scores and disorder scores from protein sequences, followed by utilizing the maximum relevance and minimum redundancy principle to identify the key features (Cai et al., 2012). Zhao, Li, Ma, and Yin (2011) offered a computational method to identify the lysine ubiquitination sites based on the ensemble approach. Chen et al. (2011) proposed a predictor called CKSAAP_UbSite using the composition of k-spaced amino acid pairs.
Although each of the aforementioned methods has its own merit and did play a role in stimulating the development of this area, they all need improvement from one or more of the following aspects. (i) The benchmark data-set used by the previous investigators needs to be either updated by incorporating some new and experiment-confirmed data, or refined by removing redundancy and duplicate sequences. (ii) The prediction quality can be further improved by introducing the state-of-the-art machine learning techniques. (iii) All the samples should be formulated purely based on the sequence information alone to lift the limitation imposed by some of the existing methods that require the availability of the structural information as well. (iv) A user-friendly web-server should be established to maximize the convenience of the most experimental scientists and to complement with those existing methods which do not have a webserver or whose web-servers are currently not working.
The present study was initiated in an attempt to develop a new predictor for identifying the ubiquitination sites in proteins from the aforementioned four aspects.
As elaborated in (Chou, 2011) and carried out in a series of recent publications (Chen, Feng, & Deng, 2014;Ding, Deng, Yuan, Liu, & Lin, 2014;Fan, Xiao, & Min, 2014;Guo, Deng, Xu, Ding, & Lin, 2014;Liu, Xu, Lan, Xu, & Zhou, 2014;Liu, Zhang et al. 2014;Qiu, Xiao, & Chou, 2014;Xu, Wen, Shao, & Deng, 2014;Xu, Wen, Wen, & Wu, 2014), to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark data-set to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly capture their essence and intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform crossvalidation tests to objectively evaluate the anticipated accuracy; (v) establish a user-friendly web-server that is accessible to the public. Below, we are to describe how to deal with these steps one-by-one.

Benchmark data-set
To develop a statistical predictor, it is fundamentally important to establish a reliable and stringent benchmark data-set to train and test the predictor. If the benchmark data-set contains some errors, the predictor trained by it must be unreliable and the accuracy tested by it would be completely meaningless. In this study, the benchmark data-set was derived from the Swiss-Prot database (version 2013_06) at http://www.ebi.ac.uk/uniprot. Only those proteins were collected that had clear experimental annotations about their ubiquitination sites. For facilitating the description later, we adopted the peptide formulation by Chou in studying HIV protease cleavage sites (Chou, 1993), specificity of GalNAc-transferase (Chou, 1995b), and signal peptide cleavage sites (Chou, 2001b). According to Chou's scheme, a potential ubiquitination peptide, that is a peptide with Lys (namely K) located at its center (Figure 2), can be expressed as where the subscript n is an integer (cf. Equation (1)), R Àn represents the n-th upstream amino acid residue from the center, R n the n-th downstream amino acid residue, and so forth ( Figure 2). The ð2n þ 1Þ-tuple peptides P n ðKÞ can be further classified into the following categories: where 2 represents "a member of" in the set theory. As pointed out by a comprehensive review (Chou & Shen, 2007), there is no need to separate a benchmark data-set into a training data-set and a testing data-set for examining the performance of a prediction method if it is tested by the jackknife test or subsampling (K-fold) cross-validation test. Thus, the benchmark data-set for the current study can be formulated as where S þ n only contains the samples of P þ n ðKÞ, that is the ubiquitination peptides; S À n only contains the samples of P À n ðKÞ, that is the non-ubiquitination peptide (cf. Equation (2)); and [ represents the symbol for "union" in the set theory.
Since the length of the peptide P n ðKÞ is 2n þ 1 (Equation (1)), the benchmark data-set with different values of n will contain peptides of different numbers of amino acid residues, as formulated by S n contains the peptides of 9 residues, when n ¼ 4 11 residues, when n ¼ 5 13 residues, when n ¼ 6 15 residues, when n ¼ 7 . . . . . .
The detailed procedures to construct S n are as follows.
(i) Sliding a window of ð2n þ 1Þ amino acids ( Figure 3) along each of the protein sequences taken from a protein database collected were only those peptide segments with K (lysine) at the center, that is the potential ubiquitination-site-containing peptides. (ii) If the upstream or downstream in a protein was less than n, the lacking residue was filled with the same residue of its closest neighbor. (iii) The peptide samples thus obtained were put into the positive subset S þ n if their centers have been experimentally annotated as the ubiquitination sites, otherwise into the negative subset S À n . (iv) The peptide samples thus obtained were further subject to a screening procedure to winnow those that were identical to any other. (v) Excluded from the benchmark data-set were also those that were self-conflict, namely occurring in both ubiquitination subset S þ n and non-ubiquitination subset S À n . By following the aforementioned procedures, thirteen such benchmark datasets S n (n ¼ 3; 4; 5; Á Á Á ; 15) had been constructed. However, it was observed via preliminary trials (Table 1) that when n ¼ 6, that is the peptide samples concerned were formed by 13 residues, the corresponding results were most promising. Accordingly, we choose S n¼6 as the benchmark data-set for further investigation. Such a choice is also quite consistent with the treatment of the previous investigators (Aguilar & Wendland, 2003;Cai et al., 2012;Chen et al., 2011;Haglund & Dikic, 2005;Herrmann et al., 2007;Hershko & Ciechanover, 1998;Hicke & Dunn, 2003;Hoeller et al., 2006;Pickart, 2001;Schwartz & Ciechanover, 1999;Welchman et al., 2005;Zhao et al., 2011).
According to a comprehensive review (Chou, 2011), PseAAC can be generally formulated as where T is the transpose operator, while X an integer to reflect the vector's dimension. The value of X as well as the components w u (u ¼ 1, 2, Á Á Á , X) in Equation (5) will depend on how to extract the desired information from a protein/peptide sequence. Below, we are to describe how to extract the useful information from the benchmark datasets to define the peptide samples via Equation (5). Biology is a natural science with historic dimension. All biological species have developed beginning from a very limited number of ancestral species. It is true for protein/peptide sequence as well (Chou, 2004). Their evolution involves changes of single residues, insertions and deletions of several residues (Chou, 1995a), gene doubling, and gene fusion. To incorporate this kind of evolution information into Equation (5), we may consider the following.
According to (Schaffer et al., 2001), the sequence evolution information for a peptide with 2n þ 1 amino acid residues can be expressed by a ð2n þ 1Þ Â 20 matrix, as given by where m 0 i;j represents the original score of amino acid residue in the i-th ði ¼ 1; 2; Á Á Á ; 2n þ 1Þ sequential position of the peptide that is being changed to amino acid type j ðj ¼ 1; 2; Á Á Á ; 20Þ during the evolution process. Here, the numerical codes 1, 2, …, 20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes . The ð2n þ 1Þ Â 20 scores in Equation (6) were generated using PSI-BLAST (Schaffer et al., 2001) to search the UniProtKB/ Swiss-Prot database (Release 2011_05) through three iterations with .001 as the E-value cutoff for multiple sequence alignment against the sequence of the peptide P. In order to make every element in the matrix of Equation (6) within the range of 0-1, a conversion was performed through the standard sigmoid function to make it become (1 6 i 6 2n þ 1, 1 6 j 6 20Þ (8) Now, we can use the gray model approach to extract more useful information from Equation (7) to define the components of Equation (5). According to the gray system theory (Deng, 1989), if the information of a system investigated is fully known, it is called a "white system"; if completely unknown, a "black system"; if partially known, a "gray system". The model developed on the basis of such a theory is called "gray model", which is a kind of nonlinear and dynamic model formulated by a differential equation. The gray model is particularly useful for solving complicated problems that are lack of sufficient information, or need to process uncertain information and to reduce random effects of acquired data. Following the same approach as done by Lin, Fang, and Xiao (2013), we can set the dimension of Equation (5) is X ¼ 3 Â 20 ¼ 60 and each of its 60 components is defined by In the above equation As explained in Section 2.1, the value for parameter n in Equations (6)-(8) and (11)-(12) was set at 6. In other words, each of peptide samples considered in this study contains 13 amino residues with K at its center (see also Equation (1) and Supporting Information S1). Thus, Equation (3) can be reduced to where S ¼ S 6 contained 7855 peptides, of which 659 samples were of ubiquitination belonging to the positive data-set S þ ¼ S þ 6 , while 7196 of non-ubiquitination belonging to the negative data-set S À ¼ S À 6 . The detailed sequences of the 659 þ 7196 ¼ 7855 peptide samples and their positions in proteins are given in Supporting Information S1. Now, each of these peptide sequences in the benchmark data-set S can be formulated by a 60-D vector via Equation (9), which is obtained by incorporating sequence evolutionary information into the general PseAAC (Equation (5)) using the gray system model (Equations (10)-(12)).

SVM operation engine
We used the SVM (Hearst, 1998;Vapnik, 1999) as the operation engine to perform predictions. SVM is a powerful and popular method for pattern recognition that has been widely used in computational biology as reflected by a series of publications (see, e.g. (Cai, Zhou, & Chou, 2003;Fan et al., 2014;Guo et al., 2014;Han et al., 2014;Liu, Wang, Zou, Dong, & Chen, 2013;Liu, Zhang et al., 2014;Mohabatkar et al., 2011;Qiu et al., 2014;Wan, Mak, & Kung, 2013;Xie, Fu, & Nie, 2013). The basic idea of SVM is to transform the data into a high dimensional feature space and then determine the optimal separating hyperplane using a kernel function. To handle a multi-class problem, "one-versus-one (OVO)" and "one-versus-rest (OVR)" are generally applied to extend the traditional SVM. For a brief formulation of SVM and how it works, see the papers . For more details about SVM, see a monograph (Cristianini & Shawe-Taylor, 2000).
The SVM software was downloaded from the LIB-SVM package (Chang & Lin, 2011), which provided a simple interface. Due to its advantages, the users can easily perform classification prediction by properly selecting the built-in parameters c and c. In order to maximize the performance of the SVM algorithm, the two parameters in the RBF kernel were preliminarily optimized through a grid search strategy. To obtain the optimized parameters, the search function SVMcgFor-Class was used that was downloaded from http://www. matlabsky.com.
The predictor obtained via the aforementioned procedures is called iUbiq-Lys. To provide an intuitive picture, a flowchart is provided in Figure 4 to illustrate the prediction process of iUbiq-Lys.
How to properly and objectively evaluate the anticipated accuracy of a new predictor and how to make it easily accessible and user-friendly are the two key issues that will have important impacts on its application value. Below, we are to address these problems.

Metrics for scoring prediction quality
In the literature, the following four metrics are often used to score the quality of a predictor at four different angles where TP represents the number of the true positive; TN, the number of the true negative; FP, the number of the false positive; FN, the number of the false negative; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Mathew's correlation coefficient. To most biologists, unfortunately, the four metrics as formulated in Equation (14) are not quite intuitive and easy to understand, particularly the equation for MCC. Therefore, we prefer to use the formulation proposed recently in Xu, Ding, & Wu, 2013;Xu, Shao, & Wu, 2013) based on the symbols introduced by Chou (Chou, 2001b) in predicting signal peptides. According to the formulation, the same four metrics can be expressed as (15) where N þ is the total number of the ubiquitination peptides investigated, while N þ À is the number of the ubiquitination peptides incorrectly predicted as the nonubiquitination peptides; N À is the total number of the non-ubiquitination investigated, while N À þ is the number of the non-ubiquitination incorrectly predicted as the ubiquitination peptides (Chou, 2001b). Now, it is crystal clear from Equation (15) that when N þ À ¼ 0 meaning none of the ubiquitination peptides was incorrectly predicted to be a non-ubiquitination peptide, we have the sensitivity Sn ¼ 1. When N þ À ¼ N þ meaning that all the ubiquitination peptides were incorrectly predicted as the non-ubiquitination peptides, we have the sensitivity Sn ¼ 0. Likewise, when N À þ ¼ 0 meaning none of the non-ubiquitination peptides was incorrectly predicted to be the ubiquitination peptide, we have the specificity Sp ¼ 1; whereas N À þ ¼ N À meaning all the non-ubiquitination peptides were incorrectly predicted as the ubiquitination peptides, we have the specificity Sp ¼ 0. When N þ À ¼ N À þ ¼ 0 meaning that none of the ubiquitination peptides in the positive data-set S þ and none of the non-ubiquitination peptides in the negative data-set S À was incorrectly predicted, we have the overall accuracy Acc ¼ 1 and MCC ¼ 1; when N þ À ¼ N þ and N À þ ¼ N À meaning that all the ubiquitination peptides in the positive data-set S þ and all the nonubiquitination peptides in the negative data-set S À were incorrectly predicted, we have the overall accuracy Acc ¼ 0 and MCC ¼ À1; whereas when N þ À ¼ N þ =2 and N À þ ¼ N À =2 we have Acc ¼ :5 and MCC ¼ 0 meaning no better than random prediction. As we can see from the above discussion based on Equation (15), the meanings of sensitivity, specificity, overall accuracy, and Mathew's correlation coefficient have become much more intuitive and easier to understand.
It is instructive to point out; however, the set of metrics in Equations (14) and (15) is valid only for the single-label systems. For the multi-label systems, such as those for the subcellular localization of multiplex proteins (see, e.g. Wu et al., 2011;Xiao et al., 2011)) where a protein may have two or more locations, and those for the functional types of antimicrobial peptides (see, e.g. (Xiao, Wang, Lin & Jia, 2013)) where a peptide may possess two or more functional types, a completely different set of metrics is needed as elaborated in (Chou, 2013).

Jackknife cross-validation
With a set of clear and valid metrics as defined in Equation (15), to measure the quality of a predictor, the next thing we need to consider is how to objectively derive the values of these metrics for a predictor.
In statistical prediction, the following three crossvalidation methods are often used to calculate the metrics of Equation (15) for evaluating the quality of a predictor: independent data-set test, subsampling test, and jackknife test (Chou, 2011). However, of the three test methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset. The reasons are as follows. (i) For the independent data-set test, although all the samples used to test the predictor are outside the training data-set used to train it so as to exclude the "memory" effect or bias, the way of how to select the independent samples to test the predictor could be quite arbitrary unless the number of independent samples is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than the other predictor for a given independent testing data-set might fail to keep so when tested by another independent testing data-set. (ii) For the subsampling test, the concrete procedure usually used in the literatures is the fivefold, sevenfold or 10-fold crossvalidation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark data-set is an astronomical figure even for a very simple data-set, as demonstrated by Equations (28)-(30) in (Chou, 2011). Therefore, in any actual subsampling cross-validation tests, only an extremely small fraction of the possible selections are taken into account. Since different selections will always lead to different results even for a same benchmark data-set and a same predictor, the subsampling test cannot avoid the arbitrariness either. A test method unable to yield a unique outcome cannot be deemed as a good one. (iii) In the jackknife test, all the samples in the benchmark data-set will be singled out one-by-one and tested by the predictor trained by the remaining samples. During the process of jackknifing, both the training data-set and testing data-set are actually open, and each sample will be in turn moved between the two. The jackknife test can exclude the "memory" effect. Also, the arbitrariness problem as mentioned above for the independent data-set test and subsampling test can be avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark data-set. Accordingly, the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors (see, e.g. (Du et al., 2006;Hajisharifi et al., 2014;Kandaswamy et al., 2011;Mei, 2012b;Mohabatkar et al., 2011;Mohabatkar et al., 2013;Niu, Hu, Zheng, Huang, & Feng, 2012;Sahu & Panda, 2010;Shen, 2010)). Accordingly, in this study we also used the jackknife cross-validation method to calculate the metrics in Equation (15) although it would take more computational time.

Results and discussion
As shown in the Supporting Information S1, the number of samples in the negative data-set S À is almost 11 times the size of the positive data-set S þ (cf. Equation (13)). To make the tests under a balance situation, we divided the negative data-set S À into 11 subsets; that is where the subset S À ðiÞ ði ¼ 1; 2, Á Á Á , or 10) contained 659 non-ubiquitination peptides, while S À ð11Þ contained 7196 À 659 Â 10 ¼ 606 non-ubiquitination peptides. The detailed sequence data for the 11 negative subsets are given in the Supporting Information S2. Listed in Table 2 are the jackknife test results in identifying the ubiquitination sites by iUbiq-Lys, respectively, on the benchmark datasets SðiÞ ¼ S þ [ S À ðiÞði ¼ 1; 2; Á Á Á ; 11Þ and S ¼ S þ [ S À . As we can see from the table, the average results by the current iUbiq-Lys predictor on the 11 benchmark datasets is 90.06% for Acc, .82 for MCC, 80.99% for Sn, and 99.06% for Sp.
Compared with SðiÞ, although the results obtained from the benchmark data-set S are basically about the same except for MCC, which is dropped down from .82 to .50. This is because the positive data-set and negative data-set are extremely uneven in size; the latter is 11 times the size of the former. Even though, however, it is still higher than or comparable to the MCC values reported by the existing methods, as shown in Table 3. Particularly, as we can see from the table, most of the existing methods did not provide a web-server, and hence, their practical usage is quite limited. To overcome such a shortcoming, we have established a web-server for the current predictor with the website at http://www. jci-bioinfo.cn/iUbiq-Lys. Moreover, for the convenience of the vast majority of experimental scientists, below we are to give a step-to-step guide for how to use the web-server to get the desired results.

Web-server and user guide
Step 1. Open the web-server at http://www.jci-bioinfo.cn/ iUbiq-Lys and you will see the top page of the predictor on your computer screen, as shown in Figure 5. Click on the Read Me button to see a brief introduction about iUbiq-Lys predictor and the caveat when using it Step 2. Either type or copy/paste the sequences of query proteins into the input box shown at the center of Figure 5. The input sequence should be in the FASTA format. All the input sequences should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with the symbol ">" in the first column, followed by lines of sequence data in which amino acids are represented using single-letter codes. Except for the mandatory symbol ">", all the other characters in  The positive data-set S þ contains 659 peptides with an ubiquitination site at each of their centers (cf. Equations (1)-(2)), and the negative data-set S À contains 7196 peptides with a non-ubiquitination site at each of their centers (see Supporting Information S1). Since the latter is about 11 times the size of the former, and hence is randomly divided into 11 subsets (Equation (16)  the single initial line are optional and only used for the purpose of identification and description. The sequence ends if another line starting with the symbol ">" appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button. Note that if your input protein sequences contain any symbol other than the 20 native amino acid codes (ACDEFGHIKLMNPQRSTVWY), an error message will be prompted on your screen.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the two query protein sequences in the Example window as the input, after clicking the Submit button, you will see the following on your screen. (i) The 1st protein contains 6 K residues; of which only the one located at the sequence position 254 is of ubiquitination site, while all the others are of non-ubiquitination site. (ii) The 2nd protein contains 23 K residues; of which only the two located at the sequence positions 485 and 490 belong to the ubiquitination site, while all the others belong to non-ubiquitination site. All these results are fully consistent with experimental observations.
Step 4. As shown on the lower panel of Figure 5, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format) via the "Browse" button. To see the sample of batch input file, click on the button Batchexample. The maximum number of the query proteins for each batch input file is 50. After clicking the button Batch-submit, you will see "Your batch job is under computation; once the results are available, you will be notified by e-mail." Step 5. Click on the Supporting Information button to download the benchmark data-set used to train and test the iUbiq-Lys predictor ( Figure 5).

Conclusion
As one of the most important PTMs, lysine ubiquitination plays an important role in regulating a variety of biological processes.
A new predictor was developed for identifying the lysine ubiquitination sites in proteins based on a set of 13-tuple peptides generated as follows. Sliding a window of 13 amino acids along each of the protein sequences taken from a protein database, collected were only those peptide segments with K (lysine) at the center, that is the potential ubiquitination-site-containing peptides.
The new predictor is called iUbiq-Lys, in which each of the potential ubiquitination-site-containing peptides was formulated with a 60-D vector formed by incorporating the sequence evolutionary information into the general form (Chou, 2011) of pseudo-amino acid composition (Chou, 2001a(Chou, , 2005 or Chou's PseAAC (Cao et al., 2013;Du et al., 2014;Lin & Lapointe, 2013) via a gray system model.
It was observed from the rigorous cross-validations that the iUbiq-Lys outperformed any of the existing Figure 5. A semi-screenshot to show the web-server iUbiq-Lys at http://www.jci-bioinfo.cn/iUbiq-Lys.
predictors in this area, as reflected by remarkably higher scores in a set of four metrics generally used to measure the accuracy of a predictor.
Furthermore, for the convenience of most experimental scientists, the web-server of iUbiq-Lys has been established at http://www.jci-bioinfo.cn/iUbiq-Lys, with which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of the predictor.
It has not escaped our notice that the current approach can also be used to develop various methods for identifying the sites of other PTMs in proteins.

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.10.1080/07391102.2014.968875.