Data from: Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra

Xi Chen (a,b), Andrey Smelter (f,g), and Hunter N.B. Moseley (a,b,c,d,e,*)

Affiliations

(a) Department of Molecular and Cellular Biochemistry

(b) Department of Statistics

(c) Markey Cancer Center

(d) Center for Environmental and Systems Biochemistry

(e) Institute for Biomedical Informatics

University of Kentucky, Lexington, KY 40356, USA

(f) School of Interdisciplinary and Graduate Studies

(g) Department of Computer Engineering and Computer Science

University of Louisville, Louisville, KY 40292, USA

 

*Correspondence to: hunter.moseley@uky.edu

 

Introduction

This archive includes all the data generated for the manuscript Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra.

To fully replicate the analysis, you can use either the data from the Re-referenced Protein Chemical Shift Database (RefDB) [1], available on the official RefDB website, or the R data.frame data included in this archive. We highly recommend using the provided file, from which the metadata has already been stripped. The overall analysis is outlined in Figure 1.

Figure 1. Flow diagram of the BaMORC methods (assigned and unassigned) that we developed for automatic 13C chemical shift reference correction of unassigned protein NMR spectra. Unassigned BaMORC has two algorithmic parts: grouping and correction (BaMORC). The grouping algorithm uses the density-based clustering algorithm DBSCAN to group peaks and report Ca/Cb spin systems as input for the correction algorithm, BaMORC. BaMORC runs a pipeline consisting of amino acid composition estimation, secondary structure prediction, and optimization of the absolute difference between the estimated and actual amino acid composition, and reports a reference correction value as the final output.

 

Prerequisites:

You will need a computer with at least 8 GB of RAM; more is recommended. Additionally, please install the following software and packages:

1.    Software: R (required) and RStudio

R is a popular statistical analysis software, on which our analysis is based. For installation, please refer to the official R project website.

RStudio is an integrated development environment (IDE) for R. For installation, please refer to the RStudio official website.

2.    Libraries and packages for R:

cluster

Used in the K-means classification.

data.table

Used for handling fast aggregation of large data (e.g. 100GB in RAM).

seqinr

A biological sequence data analysis package for R, used only for converting amino acid 3-letter codes to 1-letter codes and vice versa.

plyr

A set of tools for splitting, applying and combining data.

DEoptim

Global Optimization by Differential Evolution.

The command line interface (CLI) installation is:

$ R

> install.packages(c("cluster", "data.table", "seqinr", "plyr", "DEoptim", "ggExtra", "ggplot2", "dplyr", "ggvis", "tidyverse", "ggpubr", "tidyr"))
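Before running any script, it can help to check which of the packages named in this README are still missing. The following is a minimal sketch using base R only; the helper name missing_packages is hypothetical and not part of the archive.

```r
# Packages named in this README (the install command plus DEoptim).
required <- c("cluster", "data.table", "seqinr", "plyr", "DEoptim",
              "ggExtra", "ggplot2", "dplyr", "ggvis", "tidyverse",
              "ggpubr", "tidyr")

# Hypothetical helper: return the subset of `pkgs` that is not installed
# in any of the current library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Passing the whole character vector to install.packages() installs everything in a single call.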

 

3.    Instructions for running the R scripts (gif).

Warning: some scripts might take up to 5 days to run, depending on the computational setup.

[Animated screen capture (gif): 2017-08-01_18-50-40.gif]

The best way to run the analysis code is to open the “BaMORCPaperAnalysis.Rproj” file in the RStudio IDE, which automatically sets the correct working directory.

 

Data and Results

Columns: Links to Data/Results; Corresponding Figures / Tables; Description

RefDB Statistics

All

Expected values (mean) and standard deviations (sd):

·      RefDB.caStat.csv: calculated mean and sd for Ca using all of the data from RefDB,

·      RefDB.cbStat.csv: calculated mean and sd for Cb using all of the data from RefDB.

RefDB Clean Data

Figures 3-4

Supplemental Figure 1

All the RefDB carbon data, stripped of metadata and split into separate files by resonance type.

Covariances Analysis

Figure 5

Covariance values calculated from different populations:

·      CarbonCovTable_BC_old.csv: Covariance values calculated using all of the data from RefDB,

·      CarbonCovTable_BC.csv: Covariance values calculated using only filtered data.

Code:

·      Refer to Covariance Comparison. (The code that generates the figures and tables is included in the data-generating code.)

BMR6032 Results

Table 2, Supplemental Figure 2

Reference correction results calculated with different covariances for BMR6032 dataset:

·      ResultA.RData: Using covariance matrix A,

·      ResultB.RData: Using covariance matrix B,

·      ResultC.RData: Using covariance matrix C,

·      ResultD.RData: Using covariance matrix D-Revised, with cysteine separated as two different residues (ox-red),

·      ResultD_noSep.RData: Using covariance matrix D,

·      ResultE.RData: Using covariance matrix E.

Code:

·      Refer to BMR6032 Analysis. (The code that generates the figures and tables is included in the data-generating code.)
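Most results in this archive are stored as .RData files, and the names of the objects inside them are not listed here. One way to inspect such a file without overwriting objects in the global workspace is to load it into its own environment. This is a minimal sketch in base R: load_results is a hypothetical helper, and the temporary file with its made-up data frame only stands in for an archive file such as ResultA.RData.

```r
# Hypothetical helper: load an .RData file into a fresh environment and
# return that environment, leaving the global workspace untouched.
load_results <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

# Stand-in for a real archive file: a temporary .RData file with made-up content.
tmp <- tempfile(fileext = ".RData")
demo_result <- data.frame(id = c("a", "b"), correction = c(0.5, -1.2))
save(demo_result, file = tmp)

res <- load_results(tmp)
print(ls(res))        # names of the objects stored in the file
str(res$demo_result)  # structure of one loaded object
```

For a real file, replace tmp with the path to the .RData file of interest and use ls(res) to discover which objects it contains.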

All RefDB Reference Correction Results

Supplemental Table 3, Figure 7

Reference correction results calculated with different covariances across RefDB data:

·      Results_A.RData: Using covariance matrix A,

·      Results_B.RData: Using covariance matrix B,

·      Results_C.RData: Using covariance matrix C,

·      Results_D.RData: Using covariance matrix D,

·      Results_E.RData: Using covariance matrix E,

·      Results_E_revised.RData: Using covariance matrix D-Revised,

·      Results_E_revised_OL.RData: Using covariance matrix D-Revised and overlap matrix,

·      Results_E_revised_OL_90perc.RData: Using covariance matrix D-Revised and overlap matrix for data with 90% completion.

Code:

·      Figure_7.R: will generate Figure 7,

·      S_Table_3.R: will generate Supplemental Table 3.

Analysis on Glycine

Supplemental Figure 4

Reference correction results from methods that do and do not include the glycine residue as a classifier:

·      Results_NoGly.RData: result from method not using glycine as additional classifier,

·      Results_WithGly.RData: result from method using glycine as additional classifier.

Code:

·      GlycineAnalysis.R: will generate the Supplemental Figure 4.

Robust Analysis

Supplemental Table 4, Figure 8

Robustness analysis data for BaMORC against varying amounts of missing alpha and beta carbon chemical shifts (95% - 5%):

·      noGlyNoPrior_XX.RData. Here XX stands for the percentage of the data completion. For example, 50% completion will be noGlyNoPrior_50.RData.

Code:

·      Figure_8.R: will generate Figure 8,

·      S_Table_4.R: will generate Supplemental Table 4.
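The noGlyNoPrior_XX.RData naming pattern described above can be expanded programmatically. The sketch below assumes completion levels in 5% steps between 95% and 5%; the step size is an assumption, since only the range is stated here, and completion_files is a hypothetical helper.

```r
# Hypothetical helper: build file names following the noGlyNoPrior_XX.RData pattern.
completion_files <- function(percentages, prefix = "noGlyNoPrior_", ext = ".RData") {
  paste0(prefix, percentages, ext)
}

# Assumption: completion levels from 95% down to 5% in steps of 5%.
percentages <- seq(95, 5, by = -5)
files <- completion_files(percentages)

# Keep only the files that are actually present in the current directory.
existing <- files[file.exists(files)]
print(head(files, 3))
```

Filtering with file.exists() keeps the sketch usable even if the archive contains only some of the assumed levels.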

Assigned BaMORC

Figure 9

Results comparison between Assigned BaMORC and LACS for 90% completion data:

·      Assigned_BaMORC.RData: Using Assigned BaMORC Algorithm to find referencing correction.

·      LACS.RData: Using LACS Algorithm to find referencing correction.

Code:

·      Figure_9.R: will generate Figure 9.

JPred Results

Supplemental Table 5, Supplemental Figure 5

Reference correction results on all RefDB data using secondary structure information from RefDB and JPred4 [2]:

·      Results_D_revised_OL_90perc.RData: Using secondary structure information from RefDB.

·      JPred_Results.RData: Using secondary structure information from JPred4.

Code:

·      S_Figure_5.R: will generate Supplemental Figure 5,

·      S_Table_5.R: will generate Supplemental Table 5.

Unassigned BaMORC Results

Table 1

Reference correction results from real-life data:

·      Unassigned_BaMORC_Results.RData: results of testing Unassigned BaMORC on 10 real-life datasets with secondary structure information from JPred4.

Code:

·      Table_1.R: will generate the Table 1.

Residual of Residual vs. Reference Correction Values

Supplemental Figure 7

Amino Acid and Secondary Structure Frequency Residual of Residual vs. Reference Correction Values for RefDB datasets:

·      Resitual.vs.Reference.RData,

Code:

·      S_Figure_7.R: will generate the Supplemental Figure 7.

Global Optimization vs. Grid Search

Supplemental Figure 8

Comparison between global optimization and grid search approaches:

·      DEoptim_10.RData: Using the DEoptim global optimization algorithm with the maximum number of iterations set to 10.

·      DEoptim_20.RData: Using the DEoptim global optimization algorithm with the maximum number of iterations set to 20.

·      DEoptim_50.RData: Using the DEoptim global optimization algorithm with the maximum number of iterations set to 50.

·      Results_E_revised_OL.RData: Using grid search approach.

Code:

·      S_Figure_8.R: will generate the Supplemental Figure 8.
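To illustrate the difference between the two approaches compared here, the sketch below minimizes a toy one-dimensional objective with a grid search and, if the DEoptim package is installed, with differential evolution. The objective function, search bounds, step size, and target value of 1.3 are made up for illustration and are not the objective used by the BaMORC scripts.

```r
# Toy objective (made up): absolute difference from an offset of 1.3.
objective <- function(x) abs(x - 1.3)

# Grid search: evaluate every candidate value and keep the minimizer.
grid_search <- function(fn, lower, upper, step) {
  candidates <- seq(lower, upper, by = step)
  values <- vapply(candidates, fn, numeric(1))
  candidates[which.min(values)]
}

best_grid <- grid_search(objective, lower = -5, upper = 5, step = 0.1)
print(best_grid)

# Differential evolution, run only if DEoptim is available.
if (requireNamespace("DEoptim", quietly = TRUE)) {
  set.seed(1)
  fit <- DEoptim::DEoptim(objective, lower = -5, upper = 5,
                          control = DEoptim::DEoptim.control(itermax = 20,
                                                             trace = FALSE))
  print(fit$optim$bestmem)  # best parameter value found
}
```

The grid search cost grows with the number of candidate values, while DEoptim's cost is governed by the population size and itermax, which is the trade-off the DEoptim_10, DEoptim_20, and DEoptim_50 result files explore.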

 

Source Code and Outputs

Output files do not exist until the corresponding script has run successfully, so the output folder stays empty until then.
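A quick way to see which expected outputs have not been produced yet is to compare a list of file names against the output sub-directory. This is a minimal sketch in base R; missing_outputs is a hypothetical helper, and the two file names are taken from the Cysteine Analysis entry of this README.

```r
# Hypothetical helper: return the expected files that are not present in `dir`.
missing_outputs <- function(dir, files) {
  paths <- file.path(dir, files)
  files[!file.exists(paths)]
}

# Two of the outputs listed for the Cysteine Analysis scripts.
expected <- c("Alanine_Coil.RData", "Alanine_Coil.png")
still_missing <- missing_outputs(file.path("output", "Cysteine_Analysis"), expected)
print(still_missing)
```

An empty result means every listed file is present; otherwise the corresponding script still needs to be run.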

Columns: Source Code Folder Links; Source Code Description; Expected Outputs (exist only after the code has been run)

Cysteine Analysis

This source code illustrates that cysteine should be treated as two different residues:

·      Cysteine_Analysis.R: draws alanine and cysteine scatter plots with correlation and marginal histograms, and saves 3 sets of data: alanine, cysteine, and cysteine separation, which includes a class indicator.

Results sub-directory will be: output/Cysteine_Analysis/

Data:

·      Alanine_Beta_Strand.RData

·      Alanine_Coil.RData

·      Alanine_Helix.RData

·      Cysteine_Beta_Strand.RData

·      Cysteine_Coil.RData

·      Cysteine_Helix.RData

·      Cysteine_Seperated_Beta_Strand.RData

·      Cysteine_Seperated_Coil.RData

·      Cysteine_Seperated_Helix.RData

Figures:

·      Alanine_Beta_Strand.png

·      Alanine_Coil.png

·      Alanine_Helix.png

·      Cysteine_Beta_Strand.png

·      Cysteine_Coil.png

·      Cysteine_Helix.png

·      Cysteine_Seperated_Beta_Strand.png

·      Cysteine_Seperated_Coil.png

·      Cysteine_Seperated_Helix.png

Data Filtration and Covariance Re-Calculation

The source code outputs the new covariance and the new datasets, using the RMSD values of each dataset as the filtering criterion:

·      DataFiltration_CovRecal.R: generates figures and outputs two result data files.

Results sub-directory will be: output/DataFiltration_CovRecalculation/

Data:

·      Filtered_Cov.RData

·      Filtered_Datasets.RData

Figures:

·      Figure1.png: histogram of Ca;

·      Figure2.png: histogram of Cb;

·      Figure3.png: histogram of Ca-Cb;

·      Figure4.png: histogram of absolute value of (Ca-Cb);

·      Figure5.png: histogram of ;

·      Figure6.png: histogram of ;

·      Figure7.png: histogram of ;

Covariance Comparison

The source code will generate a figure comparing covariance values before and after the data filtering step:

·      Covariance_comparison.R

Results sub-directory will be: output/Covariance_Comparison/

Figure:

Cov_comparison.png

BMR6032 Analysis

The source code was used to test 6 different types of covariance matrices on the bmr6032 dataset:

·      Covariance_A.R: testing with covariance matrix A

·      Covariance_B.R: testing with covariance matrix B

·      Covariance_C.R: testing with covariance matrix C

·      Covariance_D.R: testing with covariance matrix D

·      Covariance_E.R: testing with covariance matrix E

·      Covariance_E-Revised.R: testing with covariance matrix D-Revised, using the RMSD values of each dataset as a criterion to further filter out datasets that are likely not derived from a single NMR experiment

Results sub-directory will be: output/BMR6032_Analysis/

Data:

·      BMR6032_Analysis/Covariance_A.RData

·      BMR6032_Analysis/Covariance_B.RData

·      BMR6032_Analysis/Covariance_C.RData

·      BMR6032_Analysis/Covariance_D.RData

·      BMR6032_Analysis/Covariance_E.RData

·      BMR6032_Analysis/Covariance_E-Revised.RData

Figures:

·      BMR6032_Analysis/Covariance_A.png

·      BMR6032_Analysis/Covariance_B.png

·      BMR6032_Analysis/Covariance_C.png

·      BMR6032_Analysis/Covariance_D.png

·      BMR6032_Analysis/Covariance_E.png

·      BMR6032_Analysis/Covariance_E-Revised.png

Overlap Matrix and Weights Calculation

The source code will generate the overlap matrix and classifier weights:

·      OverlapMatrix_Weights_Calculation.R

Results sub-directory will be: output/OverlapMatrix_Weights/

Data:

·      Overlap_Weights.RData

E-Revised + Overlap Matrix

The source code will generate reference correction results on all the RefDB data using the D-Revised covariance matrix with the prediction overlap matrix:

·      E-Revised_OLMatrix.R

Results sub-directory will be: output/E-Revised_OLM/

Data:

·      results_E.Revised_OLMatrix.RData

E-Revised + Overlap Matrix with 90% Completion

The source code will generate reference correction results on RefDB data with 90% completion using the D-Revised covariance matrix with the prediction overlap matrix:

·      E-Revised_OLMatrix_90.R

Results sub-directory will be: output/E-Revised_OLM_90/

Data:

·      results_E.Revised_OLMatrix_90.RData

Glycine Analysis

The source code will generate reference correction results on all the RefDB data, both with and without the glycine residue:

·      Gly.R,

·      noGly.R

Results sub-directory will be: output/Glycine_Analysis/

Data:

·      results_gly.RData

·      results_nogly.RData

JPred Analysis

The source code will generate reference correction results on 90% completion RefDB data using secondary structure information from JPred4 [2]:

·      JPred_Analysis.R

Results sub-directory will be: output/JPred_Analysis/

Data:

·      JPred_Analysis_result.RData

Robust Testing

The source code will generate robustness test results with data completion from 90% to 50%:

·      Robust_Testing_XX.R. Here XX stands for the percentage of the data completion. For example, 50% completion will be Robust_Testing_50.R.

Results sub-directory will be: output/Robust_Testing/

Data:

·      Robust_Testing_XX.RData. Here XX stands for the percentage of the data completion. For example, 50% completion will be Robust_Testing_50.RData.

Assigned BaMORC

The source code will generate Assigned BaMORC results with RefDB data:

·      Assigned_BaMORC.R. Using assignment information to estimate the referencing correction value.

Results sub-directory will be: output/Assigned_BaMORC/

Data:

·      Assigned_BaMORC.RData.

BaMORC Global Optimization

The source code will generate BaMORC results with RefDB data via global optimization algorithm:

·      BaMORC_GO.R. Similar to the original BaMORC script, but using a global optimization algorithm to reduce running time.

Results sub-directory will be: output/BaMORC_GO/

Data:

·      BaMORC_GO.RData.

 

 

Bibliography

1.    Zhang, H., Neal, S. & Wishart, D. S. RefDB: a database of uniformly referenced protein chemical shifts. J. Biomol. NMR 25, 173–195 (2003).

2.    Drozdetskiy, A., Cole, C. & Procter, J. JPred4: a protein secondary structure prediction server. Nucleic acids … (2015).