Data from: Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra
Xi Chen (a,b), Andrey Smelter (f,g), and Hunter N.B. Moseley (a,b,c,d,e,*)
Affiliations
(a) Department of Molecular and Cellular Biochemistry
(b) Department of Statistics
(c) Markey Cancer Center
(d) Center for Environmental and Systems Biochemistry
(e) Institute for Biomedical Informatics
University of Kentucky, Lexington, KY 40356, USA
(f) School of Interdisciplinary and Graduate Studies
(g) Department of Computer Engineering and Computer Science
University of Louisville, Louisville, KY 40292, USA
*Correspondence to: hunter.moseley@uky.edu
Introduction
This archive includes all the data generated for the manuscript Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra.
To fully replicate the analysis, you can use either the data from the official website of the Re-referenced Protein Chemical Shift Database (RefDB) [1] or the R data.frame data included here. We highly recommend using our provided file, which has already been stripped of metadata. The overall analysis is summarized in Figure 1.
Figure 1. Flow diagram of the BaMORC (assigned and unassigned) method that we developed for Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra. Unassigned BaMORC includes two algorithmic parts: grouping and correction (BaMORC). The grouping algorithm uses the density-based clustering algorithm DBSCAN to group peaks and report Ca/Cb spin systems as the input for the correction algorithm BaMORC. BaMORC uses the same pipeline as the assigned method, which includes estimation of the amino acid composition, secondary structure prediction, and optimization of the absolute difference between the estimated and actual amino acid composition, and reports a reference correction value as the final output.
Prerequisites:
You will need a computer with at least 8 GB of RAM; more is recommended. Additionally, please install the following software and packages:
1. Software: R (required) and RStudio
R |
R is a popular statistical computing environment, on which our analysis is based. For installation, please refer to the official R Project website. |
RStudio |
RStudio is an integrated development environment (IDE) for R. For installation, please refer to the official RStudio website. |
2. Libraries and packages for R:
cluster |
Used in the K-means classification. |
data.table |
Used for fast aggregation of large data (e.g. 100 GB in RAM). |
seqinr |
A biological sequence data analysis package for R, used here only for converting amino acid 3-letter codes to 1-letter codes and vice versa. |
plyr |
A set of tools for splitting, applying and combining data. |
DEoptim |
Global Optimization by Differential Evolution. |
The command line interface (CLI) installation is:
$ R
> install.packages(c("cluster", "data.table", "seqinr", "plyr", "DEoptim", "ggExtra", "ggplot2", "dplyr", "ggvis", "tidyverse", "ggpubr", "tidyr"))
3. Instructions for running the R scripts (gif).
Warning: some scripts may take up to 5 days to run, depending on the computing setup.
The best way to run the code is to open the “BaMORCPaperAnalysis.Rproj” file in the RStudio IDE, because it automatically sets the correct working directory.
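Most result files in this archive are .RData files. As a minimal sketch (not part of the original scripts), a result file can be loaded into its own environment so that its objects do not overwrite anything already in the workspace. The helper name `load_result` and the demonstration file below are hypothetical; replace the path with a real file from the archive.

```r
# Hypothetical helper (not from the archive): load an .RData file into a
# fresh environment and return it, leaving the global workspace untouched.
load_result <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env()
  load(path, envir = env)
  env
}

# Demonstration with a throwaway file standing in for a real result file.
demo_path <- file.path(tempdir(), "demo_result.RData")
demo_value <- 1.5  # placeholder value, not a real result
save(demo_value, file = demo_path)

res <- load_result(demo_path)
print(ls(res))  # names of the objects stored in the file
```

Calling `ls()` on the returned environment shows which objects a given result file contains before you use them.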
Data and Results
Links to Data/Results | Corresponding Figures / Tables | Description |
All |
Expected value (mean) and standard deviation (sd):
· RefDB.caStat.csv: mean and sd for Ca, calculated using all of the data from RefDB.
· RefDB.cbStat.csv: mean and sd for Cb, calculated using all of the data from RefDB. |
|
Figures 3-4, Supplemental Figure 1 |
All the RefDB carbon data, stripped of metadata and split into separate files based on resonance type. |
|
Figure 5 |
Covariance values calculated from different populations:
· CarbonCovTable_BC_old.csv: covariance values calculated using all of the data from RefDB.
· CarbonCovTable_BC.csv: covariance values calculated using only the filtered data.
Code:
· Refer to Covariance Comparison. (The code that generates the figures and tables is included in the data-generating code.) |
|
Table 2, Supplemental Figure 2 |
Reference correction results calculated with different covariances for the BMR6032 dataset:
· ResultA.RData: using covariance matrix A.
· ResultB.RData: using covariance matrix B.
· ResultC.RData: using covariance matrix C.
· ResultD.RData: using covariance matrix D-Revised, with cysteine separated into two different residues (oxidized and reduced).
· ResultD_noSep.RData: using covariance matrix D.
· ResultE.RData: using covariance matrix E.
Code:
· Refer to BMR6032 Analysis. (The code that generates the figures and tables is included in the data-generating code.) |
|
Supplemental Table 3, Figure 7 |
Reference correction results calculated with different covariances across the RefDB data:
· Results_A.RData: using covariance matrix A.
· Results_B.RData: using covariance matrix B.
· Results_C.RData: using covariance matrix C.
· Results_D.RData: using covariance matrix D.
· Results_E.RData: using covariance matrix E.
· Results_E_revised.RData: using covariance matrix D-Revised.
· Results_E_revised_OL.RData: using covariance matrix D-Revised and the overlap matrix.
· Results_E_revised_OL_90perc.RData: using covariance matrix D-Revised and the overlap matrix for data with 90% completion.
Code:
· Figure_7.R: generates Figure 7.
· S_Table_3.R: generates Supplemental Table 3. |
|
Supplemental Figure 4 |
Reference correction results from the method with and without glycine as an additional classifier:
· Results_NoGly.RData: results from the method not using glycine as an additional classifier.
· Results_WithGly.RData: results from the method using glycine as an additional classifier.
Code:
· GlycineAnalysis.R: generates Supplemental Figure 4. |
|
Supplemental Table 4, Figure 8 |
Robustness analysis data for BaMORC against varying amounts of missing alpha and beta carbon chemical shifts (95% to 5%):
· noGlyNoPrior_XX.RData, where XX stands for the percentage of data completion. For example, the file for 50% completion is noGlyNoPrior_50.RData.
Code:
· Figure_8.R: generates Figure 8.
· S_Table_4.R: generates Supplemental Table 4. |
|
Figure 9 |
Comparison of results between Assigned BaMORC and LACS for data with 90% completion:
· Assigned_BaMORC.RData: reference correction found using the Assigned BaMORC algorithm.
· LACS.RData: reference correction found using the LACS algorithm.
Code:
· Figure_9.R: generates Figure 9. |
|
Supplemental Table 5, Supplemental Figure 5 |
Reference correction results on all RefDB data, using secondary structure information from RefDB and JPred4 [2]:
· Results_D_revised_OL_90perc.RData: using secondary structure information from RefDB.
· JPred_Results.RData: using secondary structure information from JPred4.
Code:
· S_Figure_5.R: generates Supplemental Figure 5.
· S_Table_5.R: generates Supplemental Table 5. |
|
Table 1 |
Reference correction results from real-life data:
· Unassigned_BaMORC_Results.RData: results of testing Unassigned BaMORC on 10 real-life datasets with secondary structure information from JPred4.
Code:
· Table_1.R: generates Table 1. |
|
Supplemental Figure 7 |
Amino acid and secondary structure frequency residuals vs. reference correction values for RefDB datasets:
· Resitual.vs.Reference.RData
Code:
· S_Figure_7.R: generates Supplemental Figure 7. |
|
Supplemental Figure 8 |
Comparison between the global optimization and grid search approaches:
· DEoptim_10.RData: using the DEoptim global optimization algorithm with a maximum of 10 iterations.
· DEoptim_20.RData: using the DEoptim global optimization algorithm with a maximum of 20 iterations.
· DEoptim_50.RData: using the DEoptim global optimization algorithm with a maximum of 50 iterations.
· Results_E_revised_OL.RData: using the grid search approach.
Code:
· S_Figure_8.R: generates Supplemental Figure 8. |
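The robustness data in the table above follow the noGlyNoPrior_XX.RData naming pattern. As a small sketch of how the file names could be generated in R, the example below assumes that XX runs from 95 down to 5 in steps of 5 and is written without zero padding; neither the step size nor the padding is stated in this archive, so check the actual file names before relying on it.

```r
# Assumption: completion percentages run from 95 to 5 in steps of 5.
# The archive states only the range and the 50% example file name.
completion <- as.integer(seq(95, 5, by = -5))

# Build the file names following the noGlyNoPrior_XX.RData pattern.
robust_files <- sprintf("noGlyNoPrior_%d.RData", completion)

print(head(robust_files, 3))
```

The resulting character vector can be passed to `file.exists()` to see which of the assumed names are actually present.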
Source Code and Outputs
Output files will not exist until the corresponding script has run successfully; nothing appears in the output folder before then.
Source Code Folder Links | Source Code Description | Expected Outputs (exist only after the code has run!) |
This source code illustrates why cysteine should be treated as two different residues:
· Cysteine_Analysis.R: plots alanine and cysteine scatter plots with correlation and marginal histograms, and saves 3 sets of data: alanine, cysteine, and cysteine separation, which includes a class indicator. |
Results sub-directory will be: output/Cysteine_Analysis/
Data:
· Alanine_Beta_Strand.RData
· Alanine_Coil.RData
· Alanine_Helix.RData
· Cysteine_Beta_Strand.RData
· Cysteine_Coil.RData
· Cysteine_Helix.RData
· Cysteine_Seperated_Beta_Strand.RData
· Cysteine_Seperated_Coil.RData
· Cysteine_Seperated_Helix.RData
Figures:
· Alanine_Beta_Strand.png
· Alanine_Coil.png
· Alanine_Helix.png
· Cysteine_Beta_Strand.png
· Cysteine_Coil.png
· Cysteine_Helix.png
· Cysteine_Seperated_Beta_Strand.png
· Cysteine_Seperated_Coil.png
· Cysteine_Seperated_Helix.png |
|
This source code outputs the new covariance and the new datasets, using the RMSD value of each dataset as the filtering criterion:
· DataFiltration_CovRecal.R: generates figures and outputs two result data files. |
Results sub-directory will be: output/DataFiltration_CovRecalculation/
Data:
· Filtered_Cov.RData
· Filtered_Datasets.RData
Figures:
· Figure1.png: histogram of Ca;
· Figure2.png: histogram of Cb;
· Figure3.png: histogram of Ca-Cb;
· Figure4.png: histogram of absolute value of (Ca-Cb);
· Figure5.png: histogram of
· Figure6.png: histogram of
· Figure7.png: histogram of |
|
This source code generates a figure comparing the covariance values before and after the data filtering step:
· Covariance_comparison.R |
Results sub-directory will be: output/Covariance_Comparison/
Figure:
· Cov_comparison.png |
|
This source code tests 6 different types of covariance matrices on the bmr6032 dataset:
· Covariance_A.R: testing with covariance matrix A
· Covariance_B.R: testing with covariance matrix B
· Covariance_C.R: testing with covariance matrix C
· Covariance_D.R: testing with covariance matrix D
· Covariance_E.R: testing with covariance matrix E
· Covariance_E-Revised.R: testing with covariance matrix D-Revised, which uses the RMSD value of each dataset as a criterion to further filter out datasets that are likely not derived from a single NMR experiment |
Results sub-directory will be: output/BMR6032_Analysis/
Data:
· BMR6032_Analysis/Covariance_A.RData
· BMR6032_Analysis/Covariance_B.RData
· BMR6032_Analysis/Covariance_C.RData
· BMR6032_Analysis/Covariance_D.RData
· BMR6032_Analysis/Covariance_E.RData
· BMR6032_Analysis/Covariance_E-Revised.RData
Figures:
· BMR6032_Analysis/Covariance_A.png
· BMR6032_Analysis/Covariance_B.png
· BMR6032_Analysis/Covariance_C.png
· BMR6032_Analysis/Covariance_D.png
· BMR6032_Analysis/Covariance_E.png
· BMR6032_Analysis/Covariance_E-Revised.png |
|
This source code generates the overlap matrix and classifier weights:
· OverlapMatrix_Weights_Calculation.R |
Results sub-directory will be: output/OverlapMatrix_Weights/
Data:
· Overlap_Weights.RData |
|
This source code generates reference correction results on all the RefDB data, using the D-Revised covariance matrix with the prediction overlap matrix:
· E-Revised_OLMatrix.R |
Results sub-directory will be: output/E-Revised_OLM/
Data:
· results_E.Revised_OLMatrix.RData |
|
This source code generates reference correction results on RefDB data with 90% completion, using the D-Revised covariance matrix with the prediction overlap matrix:
· E-Revised_OLMatrix_90.R |
Results sub-directory will be: output/E-Revised_OLM_90/
Data:
· results_E.Revised_OLMatrix_90.RData |
|
This source code generates reference correction results on all the RefDB data, with and without the glycine residue as a classifier:
· Gly.R
· noGly.R |
Results sub-directory will be: output/Glycine_Analysis/
Data:
· results_gly.RData
· results_nogly.RData |
|
This source code generates reference correction results on RefDB data with 90% completion, using secondary structure information from JPred4 [2]:
· JPred_Analysis.R |
Results sub-directory will be: output/JPred_Analysis/
Data:
· JPred_Analysis_result.RData |
|
This source code generates robustness test results for data completion from 90% to 50%:
· Robust_Testing_XX.R, where XX stands for the percentage of data completion. For example, the script for 50% completion is Robust_Testing_50.R. |
Results sub-directory will be: output/Robust_Testing/
Data:
· Robust_Testing_XX.RData, where XX stands for the percentage of data completion. For example, the file for 50% completion is Robust_Testing_50.RData. |
|
This source code generates Assigned BaMORC results with RefDB data:
· Assigned_BaMORC.R: uses assignment information to estimate the reference correction value. |
Results sub-directory will be: output/Assigned_BaMORC/
Data:
· Assigned_BaMORC.RData |
|
This source code generates BaMORC results with RefDB data via a global optimization algorithm:
· BaMORC_GO.R: similar to the original BaMORC script, but uses a global optimization algorithm to reduce running time. |
Results sub-directory will be: output/BaMORC_GO/
Data:
· BaMORC_GO.RData |
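Because the output files exist only after the corresponding scripts have run, it can help to check which result sub-directories are already populated. The sketch below is not part of the archive: the function name `check_outputs` is hypothetical, the sub-directory names are copied from the table above, and it assumes the working directory is the project root that contains the output folder.

```r
# Result sub-directories listed in the table above.
expected_dirs <- c(
  "Cysteine_Analysis", "DataFiltration_CovRecalculation",
  "Covariance_Comparison", "BMR6032_Analysis", "OverlapMatrix_Weights",
  "E-Revised_OLM", "E-Revised_OLM_90", "Glycine_Analysis",
  "JPred_Analysis", "Robust_Testing", "Assigned_BaMORC", "BaMORC_GO"
)

# Hypothetical helper: report, for each expected sub-directory, whether it
# exists and how many files it currently holds.
check_outputs <- function(root = "output", dirs = expected_dirs) {
  paths <- file.path(root, dirs)
  data.frame(
    directory = paths,
    exists = dir.exists(paths),
    n_files = vapply(paths, function(p) length(list.files(p)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(check_outputs())
```

A row with `exists = FALSE` or `n_files = 0` suggests that the corresponding script has not yet been run or did not finish.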
Bibliography
1. Zhang, H., Neal, S. & Wishart, D. S. RefDB: a database of uniformly referenced protein chemical shifts. J. Biomol. NMR 25, 173–195 (2003).
2. Drozdetskiy, A., Cole, C. & Procter, J. JPred4: a protein secondary structure prediction server. Nucleic acids … (2015).