10.1371/journal.pone.0181347 Lei Jia Yaxiong Sun Protein asparagine deamidation prediction based on structures with machine learning methods Public Library of Science 2017 3 D structure-based properties data sets 194 Asn residues forest model causes over-engineering drug discovery process Protein asparagine deamidation prediction prediction tools hotspot residues test data deamidation case deamidation half-life NG motif Sequence-based prediction method sequence-based method methods Chemical stability protein therapeutics structure-based prediction models prediction method acid residues acid asparagine dihedral angles prediction models deamidated residues 25 proteins deamidated proteins chemical modifications crystal structures protein hotspot predictions train prediction models deamidation evaluation process non-deamidated residues 2017-07-21 17:37:44 Dataset https://plos.figshare.com/articles/dataset/Protein_asparagine_deamidation_prediction_based_on_structures_with_machine_learning_methods/5232619 <div><p>Chemical stability is a major concern in the development of protein therapeutics because it affects both efficacy and safety. Protein “hotspots” are amino acid residues that are subject to various chemical modifications, including deamidation, isomerization, glycosylation, and oxidation. A more accurate prediction method for potential hotspot residues would allow their elimination or reduction as early as possible in the drug discovery process. In this work, we focus on prediction models for asparagine (Asn) deamidation. The sequence-based prediction method simply flags the NG motif (an asparagine followed by a glycine) as liable to deamidation. It still dominates the deamidation evaluation process in most pharmaceutical settings because of its convenience. However, this simple sequence-based method is less accurate and often leads to over-engineering of a protein. 
We introduce structure-based prediction models by mining available experimental and structural data on deamidated proteins. Our training set contains 194 Asn residues from 25 proteins, all of which have high-resolution crystal structures available. The experimentally measured deamidation half-lives of Asn in pentapeptides, together with 3D structure-based properties such as solvent exposure, crystallographic B-factors, local secondary structure, and dihedral angles, were used to train prediction models with several machine learning algorithms. The prediction tools were cross-validated and also tested on an external test data set. The random forest model showed high enrichment, ranking deamidated residues above non-deamidated residues while effectively eliminating false-positive predictions. Such quantitative protein structure–function relationship tools may also be applicable to other protein hotspot predictions. In addition, we discuss at length the metrics used to evaluate prediction performance on unbalanced data sets such as the deamidation case.</p></div>
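The sequence-based baseline described in the abstract, which flags every Asn immediately followed by Gly, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `find_ng_motifs` and `precision_recall`, the toy sequence, and the "observed" deamidation sites are all hypothetical. Precision and recall are used here only as familiar examples of metrics for unbalanced data; the abstract does not say which metrics the paper discusses.

```python
import re
from typing import Iterable, List, Tuple


def find_ng_motifs(sequence: str) -> List[int]:
    """Return 1-based positions of every Asn (N) immediately followed by Gly (G)."""
    seq = sequence.upper()
    # The lookahead keeps the match on the N alone, so the reported
    # position is that of the asparagine, not of the glycine.
    return [m.start() + 1 for m in re.finditer(r"N(?=G)", seq)]


def precision_recall(predicted: Iterable[int],
                     observed: Iterable[int]) -> Tuple[float, float]:
    """Precision and recall of predicted hotspot positions against observed ones."""
    predicted_set, observed_set = set(predicted), set(observed)
    true_positives = len(predicted_set & observed_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(observed_set) if observed_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy sequence and hypothetical "observed" sites,
    # invented purely to exercise the functions.
    toy_sequence = "MKNGSANGTLNAQNG"
    observed_sites = [3, 11]

    hits = find_ng_motifs(toy_sequence)
    precision, recall = precision_recall(hits, observed_sites)
    print("NG motif positions:", hits)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the NG rule flags three sites, only one of which is labeled as deamidated, and it misses a labeled site that is not followed by glycine. That is the kind of over-prediction and under-coverage the abstract attributes to the purely sequence-based approach.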