OH-PRED: prediction of protein hydroxylation sites by incorporating adapted normal distribution bi-profile Bayes feature extraction and physicochemical properties of amino acids

Hydroxylation of proline or lysine residues in proteins is a common post-translational modification event, and such modifications are found in many physiological and pathological processes. Nonetheless, the exact molecular mechanism of hydroxylation remains under investigation. Because experimental identification of hydroxylation is time-consuming and expensive, bioinformatics tools with high accuracy represent desirable alternatives for large-scale rapid identification of protein hydroxylation sites. In view of this, we developed a support vector machine-based tool, OH-PRED, for the prediction of protein hydroxylation sites using the adapted normal distribution bi-profile Bayes feature extraction in combination with the physicochemical property indexes of the amino acids. In a jackknife cross-validation, OH-PRED yields an accuracy of 91.88% and a Matthews correlation coefficient (MCC) of 0.838 for the prediction of hydroxyproline sites, and an accuracy of 97.42% and an MCC of 0.949 for the prediction of hydroxylysine sites. These results demonstrate that OH-PRED significantly increases the prediction accuracy for hydroxyproline and hydroxylysine sites by 7.37 and 14.09%, respectively, when compared with the latest predictor, PredHydroxy. In independent tests, OH-PRED also outperforms previously published methods.


Introduction
Hydroxylation of proline and lysine residues is one of the most abundant protein post-translational modifications and is catalyzed by three enzymes: prolyl 4-hydroxylase, prolyl 3-hydroxylase and lysyl hydroxylase (Mitsuo & Marnisa, 2012). In recent years, protein hydroxylation has been shown to play vital physiological roles, and its dysfunction leads to many diseases such as metabolic disorders, connective tissue disorders and cancer (Gorres & Raines, 2010; Richards et al., 2006; Xie et al., 2007). Because hydroxylation is a subtle post-translational modification, adding merely 16 atomic mass units to a protein, experimental identification and characterization of protein hydroxylation sites is often time-consuming and expensive (Cockman et al., 2009; Richards et al., 2006; Webby et al., 2009). Hence, accurate computational prediction represents a valuable and efficient approach for identifying novel potential hydroxylation sites.
Several bioinformatics tools have been developed to predict protein hydroxylation sites. Yang (2009) developed the first tool using a bio-kernel support vector machine (SVM) based on a limited data-set of 37 sequences. Hu et al. (2010) developed the second tool using a nearest neighbour algorithm, and also considered the impact of the physicochemical properties, biochemical properties and evolutionary information of amino acids on performance. Xu et al. (2014) developed iHyd-PseAAC by incorporating the dipeptide position-specific propensity into the general form of pseudo-amino acid composition. More recently, Shi et al. (2015) built PredHydroxy, also using an SVM-based approach, on the position-weighted amino acid composition. PredHydroxy was found to be the best predictor, reaching an area under the receiver operating characteristic curve (AUC) of 0.827 and a Matthews correlation coefficient (MCC) of 0.690 for hydroxyproline, and an AUC of 0.874 and an MCC of 0.667 for hydroxylysine, in a jackknife cross-validation assessment.
Here, we developed a novel bioinformatics tool, called OH-PRED, to predict hydroxyproline and hydroxylysine sites separately by the adapted normal distribution bi-profile Bayes (ANBPB) feature extraction in combination with the physicochemical property indexes of the amino acids (AAPPI). Based on the results obtained by both jackknife and independent tests, OH-PRED significantly outperforms iHyd-PseAAC and PredHydroxy, and should be useful for the identification of protein hydroxylation sites.

Data-sets
To allow a comprehensive and unbiased comparison with existing methods, we used the training data-sets recently constructed by Shi et al. (2015). The protein sequences containing experimentally verified hydroxylation sites were collected from the UniProtKB/Swiss-Prot database (version 2014_1, www.uniprot.org). In total, 265 candidate proteins containing hydroxylated prolines and 34 candidate proteins containing hydroxylated lysines were collected. Homology reduction within the benchmark data-sets was performed with a similarity threshold of 70% between any two protein sequences. Sequence segments around the hydroxylation sites and non-hydroxylation sites were then extracted as positive and negative training data-sets, respectively. After removing identical sequences, the original data-sets contain 659 positive sites and 3855 negative sites for hydroxyproline from 112 proteins, and 97 positive sites and 855 negative sites for hydroxylysine from 25 proteins. The negative data-sets are much larger than the positive training data-sets (approximate ratio of 1:6), which would bias prediction in favour of negative data. Several approaches have previously been used to address class imbalance in machine learning, including over-sampling, under-sampling and voting methods (Chawla et al., 2011; Chou & Shen, 2006; Laurikkala, 2001; Zhang, 1992). We describe the set-up of the negative training data-sets below.
Consider a typical protein hydroxylation problem: while the number of possible hydroxylation sites grows quadratically with the number of proteins, the number of positive hydroxylation sites typically grows only linearly (i.e. there is a small, fixed number of hydroxylation sites per protein). Negative samples can therefore be drawn from peptides for which no definitive hydroxylation information is available, although it is not possible to verify every candidate site experimentally. It is widely acknowledged that sequence similarity is informative about protein function. Hence, within each protein, we selected the peptides least similar to the known hydroxylated peptide to construct the negative data-set. First, we computed the similarity between a given hydroxylated peptide and the other, non-hydroxylated peptides within the same protein. The BLOSUM62 scoring matrix was used to compute peptide similarity, and either the peptide segment with the lowest score or the segments with the three lowest scores were chosen as negatives. This yielded negative training sets with positive-to-negative ratios of 1:1 and 1:3, respectively. To save running time, the 1:1 training data-set was used to select the optimal features. It was also used for the comparison with PredHydroxy, which was built on equal numbers of positive and negative samples. The 1:3 ratio was adopted to construct the final predictive model in order to reduce the false positive rate (Shao et al., 2009).
After several trials (results listed in Supplementary Table S1), the positive and negative peptides were formatted as 15-mer peptides centred on the proline or lysine residue of interest.
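The selection of least-similar peptides described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the peptides are made up, and `identity_score` is a toy stand-in for a BLOSUM62-based score (any function returning a higher value for more similar peptides can be passed in instead).

```python
from typing import Callable, List


def identity_score(a: str, b: str) -> int:
    """Toy similarity (assumption): +1 per identical aligned position.

    A BLOSUM62-based score would replace this in practice.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def pick_least_similar(
    positive: str,
    candidates: List[str],
    k: int,
    score: Callable[[str, str], int] = identity_score,
) -> List[str]:
    """Return the k candidate peptides with the lowest similarity to `positive`."""
    ranked = sorted(candidates, key=lambda c: score(positive, c))
    return ranked[:k]


# Made-up 9-mers centred on proline, all from one hypothetical protein.
positive = "GPAGPPGAK"
candidates = ["GPAGPAGAK", "KLMNPQRST", "GAAGPAGAA", "WWWWPWWWW"]

negatives_1to1 = pick_least_similar(positive, candidates, k=1)  # 1:1 ratio
negatives_1to3 = pick_least_similar(positive, candidates, k=3)  # 1:3 ratio
```

Keeping the scoring function as a parameter separates the ranking logic from the choice of substitution matrix.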
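Extracting 15-mer windows around every proline or lysine can be sketched as below. The handling of residues close to a terminus is not described in this section; padding with a dummy residue `X` is an assumption made purely for illustration, as are the function name and the example sequence.

```python
def extract_windows(sequence: str, residue: str, flank: int = 7, pad: str = "X"):
    """Return (1-based position, window) for every occurrence of `residue`.

    Each window has length 2 * flank + 1 (15 for flank = 7) and is centred
    on the residue. Terminal windows are padded with `pad` (assumption).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, aa in enumerate(sequence):
        if aa == residue:
            # sequence[i] sits at padded[i + flank], so this slice is centred on it
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


# Hypothetical sequence, for illustration only.
lysine_windows = extract_windows("MKTAYIAKQRQISFVKSHFSRQ", "K")
```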

Bi-profile Bayes profile (BPB)
Given a peptide sequence S, we encoded this sequence into a probability vector $P = (p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$, where $p_i$ ($i = 1, 2, \ldots, n$) denotes the posterior probability of each amino acid at the $i$th position in the positive samples and $p_i$ ($i = n + 1, n + 2, \ldots, 2n$) denotes the posterior probability of each amino acid at the corresponding position in the negative samples. The posterior probabilities for both positive and negative samples were calculated from the occurrence of each amino acid at each position in the training data-sets (Shao et al., 2009).
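A minimal sketch of this bi-profile encoding, using made-up 3-mers rather than the 15-mers of this study; the function names are hypothetical and the per-position frequencies are plain relative occurrences.

```python
from collections import Counter


def position_frequencies(peptides):
    """Per-position relative frequency of each amino acid in a peptide set."""
    length = len(peptides[0])
    total = len(peptides)
    return [
        {aa: count / total for aa, count in Counter(p[j] for p in peptides).items()}
        for j in range(length)
    ]


def bpb_encode(peptide, pos_freq, neg_freq):
    """Concatenate the positive-profile and negative-profile probabilities."""
    pos_part = [pos_freq[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    neg_part = [neg_freq[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    return pos_part + neg_part


positives = ["APG", "GPG", "APA"]  # toy positive peptides
negatives = ["KPL", "KPG", "APL"]  # toy negative peptides
vector = bpb_encode(
    "APG", position_frequencies(positives), position_frequencies(negatives)
)
```

For a peptide of length n the resulting vector has length 2n: the first half comes from the positive profile and the second half from the negative profile.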

Adapted normal distribution bi-profile Bayes
ANBPB is a modified version of the classical BPB. In this approach, the frequency of each amino acid at each position is encoded as a random variable $X_{ij}$, where $i$ ($i = 1, 2, \ldots, 20$) denotes the $i$th amino acid in {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} and $j = 1, 2, \ldots, 15$ denotes the $j$th position. The random variables $X_{ij}$ are assumed to be independent and to obey the same binomial distribution $b(n, p)$, where $n = 659$ is the number of peptide sequences in the positive/negative set and $p = 1/20$ is the probability of each amino acid occurring at each position. According to the de Moivre-Laplace theorem, the standardized variable has a limiting cumulative distribution function that approximates the standard normal distribution $N(0, 1)$. Here, we modified the standard normalization to emphasize the distinct distribution of each amino acid at a given position. We let $V_j$ denote the standard variance of $X_{ij}$ ($i = 1, 2, \ldots, 20$), i.e. the deviation of the frequencies of the amino acids at the same $j$th position. We then define $X'_{ij}$ as the normalization of $X_{ij}$ and deem it to obey the standard normal distribution. The posterior probability $p_j$ ($j = 1, 2, \ldots, 2n$) is thus coded by the adapted normal distribution; for the explicit formula and further details of the ANBPB method, please refer to the original paper (Jia et al., 2013).
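The explicit ANBPB formula is not reproduced in this section. As a rough illustration only, the sketch below assumes the normalization $X'_{ij} = (X_{ij} - np)/V_j$, with $V_j$ interpreted as the population standard deviation of the 20 counts at position $j$, and codes each entry by the standard normal cumulative distribution $\Phi(X'_{ij})$. Both choices are assumptions that should be checked against Jia et al. (2013); the function names and toy peptides are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def anbpb_profile(peptides, p=1 / 20):
    """Per-position table mapping amino acid -> Phi((X_ij - n*p) / V_j).

    Assumption: V_j is the population standard deviation of the 20 counts
    observed at position j; this is an illustrative reading of the text.
    """
    n = len(peptides)
    profile = []
    for j in range(len(peptides[0])):
        counts = Counter(pep[j] for pep in peptides)
        x = [counts.get(aa, 0) for aa in AMINO_ACIDS]
        mean = sum(x) / len(x)
        v_j = math.sqrt(sum((c - mean) ** 2 for c in x) / len(x))
        column = {}
        for aa, c in zip(AMINO_ACIDS, x):
            z = (c - n * p) / v_j if v_j > 0 else 0.0
            # Standard normal CDF expressed through the error function.
            column[aa] = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        profile.append(column)
    return profile


def anbpb_encode(peptide, pos_profile, neg_profile):
    """Bi-profile vector: positive-profile values followed by negative ones."""
    pos_part = [pos_profile[j].get(aa, 0.5) for j, aa in enumerate(peptide)]
    neg_part = [neg_profile[j].get(aa, 0.5) for j, aa in enumerate(peptide)]
    return pos_part + neg_part


pos_profile = anbpb_profile(["APG", "GPG", "APA"])  # toy positive peptides
neg_profile = anbpb_profile(["KPL", "KPG", "APL"])  # toy negative peptides
vector = anbpb_encode("APG", pos_profile, neg_profile)
```

Under these assumptions, an amino acid that is strongly over-represented at a position receives a value close to 1, while absent amino acids fall below 0.5.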

Physicochemical property indexes of the amino acids
Thirteen physicochemical features selected from the amino acid index database (AAindex, http://www.genome.ad.jp/aaindex/) (Kawashima & Kanehisa, 2000; Kawashima et al., 2008) were used to encode each amino acid residue in a data instance. Detailed information on the properties, their accession numbers and their abbreviations is given in Supplementary Table S2. The values of each physicochemical property for each amino acid are listed in Supplementary Table S3.

SVM implementation and performance evaluation
The SVM classification method has proven to be powerful in many fields of bioinformatics (Folkman et al., in press; Jia et al., 2013; Lin et al., 2014; Liu et al., 2014; Qiu et al., 2015; Shao et al., 2009; Shi et al., 2015; Xu et al., 2015). In this work, the SVM was trained with the LIBSVM package (version 3.0) (Chang & Lin, 2011) to build the model and perform the predictions. The radial basis kernel function $k(x_i, x_j) = \exp\{-\gamma \lVert x_i - x_j \rVert^2\}$ was selected, and the parameters ($c = 4$, $\gamma = 0.25$ for the hydroxyproline prediction and $c = 4$, $\gamma = 0.125$ for the hydroxylysine prediction) were optimized with the SVMcgForClass program, which was downloaded from http://www.matlabsky.com.
The jackknife test is deemed the least arbitrary test because it always yields a unique outcome for a given benchmark data-set (Chou & Shen, 2013). Thus, we used the jackknife test to select important features and optimize all parameters. For the comparison with other methods, both the jackknife test and an independent data-set test were used.
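For reference, the radial basis kernel can be evaluated directly as in the short sketch below. It only illustrates the kernel itself, not LIBSVM's internal implementation; the function name and the feature vectors are made up.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis kernel exp(-gamma * ||x - y||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# gamma values as reported for the two predictors
k_pro = rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.25)   # hydroxyproline setting
k_lys = rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.125)  # hydroxylysine setting
```

With LIBSVM's command-line tools, the hydroxyproline setting would presumably correspond to options such as `-t 2 -c 4 -g 0.25`; the exact invocation used by the authors is not stated here.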
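The jackknife (leave-one-out) procedure can be sketched generically as below. The `train` and `predict` callables are placeholders for any classifier; a toy 1-nearest-neighbour model stands in for the SVM here, and all names and data are hypothetical.

```python
def jackknife_predictions(samples, labels, train, predict):
    """Leave each sample out once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def train_1nn(xs, ys):
    """Toy stand-in classifier: simply memorise the training pairs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Return the label of the closest memorised sample."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = min(model, key=lambda pair: sq_dist(pair[0], x))
    return nearest[1]


samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
predicted = jackknife_predictions(samples, labels, train_1nn, predict_1nn)
```

Because every sample is held out exactly once, the outcome does not depend on a random partition, which is the property the jackknife test is valued for.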
We also assessed the overall prediction performance in terms of receiver operating characteristic (ROC) curves. An ROC curve plots the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity) at different prediction thresholds. Furthermore, we calculated sensitivity (Sn), specificity (Sp), accuracy (Acc) and MCC, which were defined as follows:
$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

Results and discussion

Prediction of protein hydroxylation sites using only BPB and ANBPB
BPB was first proposed by Shao et al. for predicting protein methylation sites (Shao et al., 2009). One advantage of this method is that the feature vectors are encoded in a bi-profile manner, which contains information from both positive and negative samples. ANBPB is a modified version of the classic BPB and has proven more powerful than BPB for predicting protein O-GlcNAcylation sites (Jia et al., 2013) and protein S-nitrosylation sites (Jia et al., 2014). In this study, the ability of BPB and ANBPB to discriminate between protein hydroxylation sites and non-hydroxylation sites was first compared using the jackknife test (Table 1). The results demonstrated that the ANBPB model performs better than the BPB model in both hydroxyproline and hydroxylysine predictions. Therefore, further optimization of the predictive model was based on the ANBPB feature extraction.
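The measures Sn, Sp, Acc and MCC reported in the tables follow directly from confusion-matrix counts. A minimal sketch, using made-up counts that are not taken from this study (the function name is hypothetical):

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Illustrative counts only.
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike Acc, MCC uses all four counts in both numerator and denominator, which makes it more informative on imbalanced data-sets.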

Improving predictive performance by incorporating AAPPI
Because the AAPPIs surrounding the candidate hydroxylation sites can affect the recognition and catalytic efficiency of protein hydroxylases, incorporating physicochemical information with ANBPB might improve the accuracy of the prediction model. Thirteen representative physicochemical property indexes were selected from the AAindex database (Kawashima & Kanehisa, 2000; Kawashima et al., 2008): refractivity (AA1), flexibility (AA2), volume (AA3), transfer free energy to surface (AA4), electron-ion interaction potential values (AA5), hydrophilicity (AA6), polarity (AA7), hydrophobicity (AA8), isoelectric point (AA9), the optimized transfer energy parameter (AA10), the optimized side chain interaction parameter (AA11), residue volume (AA12) and the normalized van der Waals volume (AA13) (Supplementary Table S2). Initially, we evaluated the predictive performance of ANBPB combined with each one of the 13 physicochemical properties by jackknife cross-validation (Supplementary Table S4 for hydroxyproline prediction and Supplementary Table S5 for hydroxylysine prediction). The combinations with improved prediction accuracy are listed in Table 2. For the hydroxyproline prediction, the ANBPB + AA1 model achieved the best prediction accuracy, followed by ANBPB + AA5 and ANBPB + AA11 (Table 2). For the hydroxylysine prediction, the ANBPB + AA12 model achieved the best prediction Acc of 97.42%, followed by the second-best Acc of 96.91% achieved by ANBPB + AA3, ANBPB + AA6 and ANBPB + AA7 (Table 2). These results demonstrated that the performance of the prediction model can be increased by combining ANBPB and AAPPI. Moreover, different AAPPIs were found to contribute positively to the hydroxyproline and hydroxylysine predictions.
Refractivity, electron-ion interaction potential values and the optimized side chain interaction parameter specifically increased the prediction performance for hydroxyproline, whereas volume, hydrophilicity and polarity specifically increased the prediction performance for hydroxylysine. Because proline and lysine hydroxylation are catalyzed by prolyl hydroxylase and lysyl hydroxylase (Gorres & Raines, 2010; Xie et al., 2007), respectively, these results provide additional information about the substrate specificity of these enzymes.
We then evaluated the predictive performance of ANBPB combined with two or three of the performance-improving AAPPIs; the results of these combinations are shown in Supplementary Tables S4 and S5. The combinations that improved the accuracy are also listed in Table 2. For the hydroxyproline prediction, the best-performing model, ANBPB + AA1 + AA5 + AA11, reached an Sn of 91.20%, an Sp of 92.57%, an Acc of 91.88% and an MCC of 0.838. For the hydroxylysine prediction, however, combining ANBPB with two performance-improving AAPPIs did not improve the prediction accuracy.
The contributions of these physicochemical properties were then quantified using the average F-score (Ward-Powers, 2011). A high F-score indicates a significant difference between hydroxylated and non-hydroxylated sites. Supplementary Figure S1 shows the average F-score for AA1, AA5 and AA11 for hydroxyproline. These results demonstrated that there are significant differences at positions 3, 6, 8, 11 and 14 for AA1, at positions 3, 6, 10 and 13 for AA5, and at positions 2, 8, 10 and 13 for AA11. Supplementary Figure S2 shows the average F-score for AA12 for hydroxylysine, which reveals significant differences at positions 3, 6, 8, 11 and 14.
Finally, the SVM-based predictor OH-PRED was built using the ANBPB + AA1 + AA5 + AA11 feature extraction method and an SVM classifier with the RBF kernel function (cost parameter c = 4, kernel parameter γ = 0.25) to predict hydroxyproline sites, and using the ANBPB + AA12 feature extraction method and an SVM classifier with the RBF kernel function (cost parameter c = 4, kernel parameter γ = 0.125) to predict hydroxylysine sites.
To further evaluate the predictive performance of OH-PRED, the ROC curves were plotted and the area under the ROC curve (AUC) was calculated. For hydroxyproline, the ANBPB model reached an AUC of 0.957 and the ANBPB + AA1 + AA5 + AA11 model reached an AUC of 0.973 (Figure 1). For hydroxylysine, the ANBPB model reached an AUC of 0.991 and the ANBPB + AA12 model reached an AUC of 0.996 (Figure 2). These figures indicate that the multi-feature models are more effective than the single-feature models.
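Combining ANBPB with one or more AAPPIs amounts to concatenating feature vectors, as the sketch below illustrates. How the features were assembled by the authors is not specified beyond the "+" notation, so simple concatenation is an assumption; the property tables hold placeholder numbers rather than AAindex values, and all names are hypothetical.

```python
def encode_property(peptide, index_table, default=0.0):
    """Map each residue of the peptide to its value in one property index."""
    return [index_table.get(aa, default) for aa in peptide]


def combine_features(base_vector, *property_vectors):
    """Concatenate a base encoding with any number of property encodings."""
    combined = list(base_vector)
    for vec in property_vectors:
        combined.extend(vec)
    return combined


# Placeholder numbers, NOT real AAindex entries.
toy_index_a = {"A": 0.1, "G": 0.0, "P": 0.5}
toy_index_b = {"A": 1.0, "G": 2.0, "P": 3.0}

peptide = "APG"
base = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3]  # e.g. a bi-profile vector for a 3-mer
features = combine_features(
    base,
    encode_property(peptide, toy_index_a),
    encode_property(peptide, toy_index_b),
)
```

Each additional property contributes one value per residue, so the feature dimension grows linearly with the number of properties included.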
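The AUC can be computed without plotting the curve, since it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). A minimal sketch with made-up decision scores; the function name is hypothetical:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores only (higher means "more likely hydroxylated").
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for data-sets of the size used here; a rank-based formulation would scale better.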

Predictive performance of our model
To avoid overestimating the performance of the models, we further conducted fourfold, sixfold and eightfold cross-validations on the imbalanced benchmark data-set. The results are listed in Supplementary Table S6. The results obtained with the different cross-validations were similar, further demonstrating the robustness and reliability of OH-PRED.
We compared the predictive performance of OH-PRED with two available methods, iHyd-PseAAC (Xu et al., 2014) and PredHydroxy (Shi et al., 2015). OH-PRED and PredHydroxy were first compared using the jackknife test with an identical training data-set (Table 3). For the hydroxyproline prediction, OH-PRED reached an Acc of 91.88% and an MCC of 0.838, which were 7.37% and 0.148 higher than those of PredHydroxy, respectively. For the hydroxylysine prediction, OH-PRED reached an Acc of 97.42% and an MCC of 0.949, which were 10.09% and 0.282 higher than those of PredHydroxy, respectively.
We also tested OH-PRED and iHyd-PseAAC on the same imbalanced data-sets, given as Supplementary Information S3 and S4 in Xu et al. (2014). To reduce the effect of data imbalance on prediction performance, we set the weight parameter w1 = 3 for hydroxyproline and w1 = 1.8 for hydroxylysine. OH-PRED outperformed iHyd-PseAAC with Acc improvements of 2.51 and 12.65% for hydroxyproline and hydroxylysine, respectively (Supplementary Table S7).
To further evaluate our prediction model, we randomly set aside 10% of the samples as an independent test data-set, used the remaining 90% as a training data-set and then evaluated the prediction performance. This procedure was repeated three times, and the average predictive performance is listed in Table 4. For the hydroxyproline prediction, OH-PRED outperformed PredHydroxy and iHyd-PseAAC with accuracy improvements of 1.70 and 12.81%, respectively. For the hydroxylysine prediction, OH-PRED outperformed PredHydroxy and iHyd-PseAAC with accuracy improvements of 3.24 and 7.41%, respectively.
For an independent test, we used a data-set containing 38 positive sites for hydroxyproline from 20 proteins and 34 positive sites for hydroxylysine from 6 proteins (Supplementary Materials S1 and S2). The performances of iHyd-PseAAC, PredHydroxy and OH-PRED on this data-set are summarized in Table 5. OH-PRED was the best predictor for hydroxyproline, with an Sn of 84.21% and an Sp of 91.42%. For hydroxylysine, OH-PRED achieved the highest Sn among the three models but the lowest Sp, 54.94%. These results demonstrate that OH-PRED is a powerful tool for predicting protein hydroxyproline sites but still needs improvement for hydroxylysine prediction, especially in terms of specificity. We expect that enlarging the currently limited training data will be very helpful.
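A repeated random 90/10 hold-out split can be sketched as below. Whether the authors stratified the split by class is not stated, so the sketch uses a plain random split; the function name, the seeds and the sample identifiers are hypothetical.

```python
import random


def split_holdout(samples, test_fraction=0.1, seed=0):
    """Randomly set aside a fraction of the samples as an independent test set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = max(1, round(len(samples) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


samples = [f"peptide_{i}" for i in range(100)]
# Three repetitions with different seeds; performance would be averaged over them.
splits = [split_holdout(samples, seed=s) for s in (0, 1, 2)]
```

Fixing the seed of each repetition makes the partition reproducible, so that different predictors can be evaluated on identical splits.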

Conclusions
In this report, OH-PRED was developed to predict protein hydroxylation sites using the ANBPB feature extraction and AAPPI, and it outperforms previous methods in both jackknife and independent tests. OH-PRED should be a powerful tool for in silico identification of protein hydroxylation sites and should help to reveal their exact molecular mechanisms in physiological and pathological processes. The MATLAB package of OH-PRED is available as Supplementary files.

Supplementary material
The supplementary material for this article is available online at http://dx.doi.org/10.1080/07391102.2016.1163294

Disclosure statement
No potential conflict of interest was reported by the authors.

Note: It should be noted that the 20 hydroxyproline proteins and 6 hydroxylysine proteins may be included in the training data-set of iHyd-PseAAC.