Statistical Modeling for Enhancing the Discovery Power of Citrullination from Tandem Mass Spectrometry Data
Dataset posted on 16.09.2020 by Sunghyun Huh, Daehee Hwang, Min-Sik Kim
Citrullination is a post-translational modification implicated in various human diseases, including rheumatoid arthritis, Alzheimer’s disease, multiple sclerosis, and cancers. Because citrullinated proteins are present at relatively low concentrations in the total proteome, confident identification of the citrullinated proteome is challenging in mass spectrometry (MS)-based proteomic analysis. Such MS-based analyses have reported MS features that characterize citrullination, called diagnostic ions, such as immonium ions (IMs) and neutral losses (NLs). However, systematic approaches for comprehensively searching for diagnostic ions have been lacking, and no statistical methods exist for identifying the citrullinated proteome based on these diagnostic ions. Here, we present a systematic approach to identify diagnostic IMs, internal ions (INTs), and NLs for citrullination from tandem mass (MS/MS) spectra. Diagnostic INTs consisted mainly of internal fragment ions of di- and tripeptides (two and three amino acids, respectively) containing at least one citrullinated arginine. A logistic regression model using the diagnostic IMs, INTs, and NLs was built to confidently assess citrullinated peptides that database searches identified (true positives) and to predict citrullinated peptides that database searches failed to identify (false negatives). Applying our model to complex global proteome data sets increased the accuracy of citrullinated peptide identification, thereby enhancing the size and functional interpretation of citrullinated proteomes.
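The modeling idea described above can be illustrated with a minimal sketch, assuming each MS/MS spectrum is reduced to three binary indicators for whether a diagnostic IM, INT, or NL was observed. The feature encoding, the function names (`fit_logistic`, `predict_proba`), the training settings, and the toy data are all hypothetical and invented for illustration; they are not taken from the dataset, and the authors' actual features and fitting procedure may differ.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this spectrum.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a spectrum comes from a citrullinated peptide."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy, made-up training data (NOT from the dataset).
# Hypothetical columns: [has_diagnostic_IM, has_diagnostic_INT, has_diagnostic_NL]
X = [
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0],  # labeled citrullinated
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0],  # labeled unmodified
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_logistic(X, y)
p_high = predict_proba(w, b, [1, 1, 1])  # all three ion types observed
p_low = predict_proba(w, b, [0, 0, 0])   # no diagnostic ions observed
print(f"P(cit | IM, INT, NL all present) = {p_high:.3f}")
print(f"P(cit | no diagnostic ions)      = {p_low:.3f}")
```

In such a setup, a probability threshold on the model output could be used both to re-assess peptides reported by a database search and to flag spectra the search missed; the choice of threshold is left open here.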