pcbi.1009541.s001.pdf (3.74 MB)

Supplementary text, figures, and tables.

journal contribution
posted on 2021-10-29, 17:37 authored by Petar I. Penev, Claudia Alvarez-Carreño, Eric Smith, Anton S. Petrov, Loren Dean Williams

Fig A. ROC curves for classifiers with different parameters built from the BaliBASE dataset and tested against the rProt dataset. The parameters shown are the segment boundaries (length threshold), the TwinCons (TWC) intensity for detection of positive positions (intensity threshold), and the percentage of gaps above which alignment columns are removed (gap threshold). Each subplot represents a different combination of the intensity and length thresholds. Colored lines within subplots represent different gap thresholds. Removing only alignment positions with more than 80–90% gaps produces a better distinction between true positives and true negatives. Complete data, including all tested parameters and datasets, are available in S2 Dataset.

Fig B. ROC curves of classifiers trained with different penalties and gamma parameters. The first four subplots (A-D) test different penalties and the last four (E-H) test different gamma values. Each subplot represents a different dataset that was used for training and testing: (A) and (E) are PROSITE, (B) and (F) are BaliBASE, (C) and (G) are INDELible, and (D) and (H) are the rProtein dataset. For testing, each dataset was split into 3 folds. Each fold produces a ROC curve; the mean of the three results is plotted as a single curve, and the standard deviation of the true positive rate is plotted as a shaded region around it. Complete data are available in S3 Dataset.

Fig C. ROC curves generated from HHalign alignments from the four datasets: BaliBASE, rProtein, INDELible, and PROSITE. Colored lines within subplots represent different gap thresholds used for column exclusion.

Fig D. Comparison of structural mapping between Zebra2 and TwinCons. A) Zebra2 results and B) TwinCons results from the sequence alignment for uL2 between archaeal and bacterial sequences, mapped on the E. coli uL2 structure from PDB 4V9D [22]. C) Zebra2 results and D) TwinCons results from the same sequence alignment, mapped on the P. furiosus uL2 structure from PDB 4V6U [36].
In panels A) and C) red indicates signatures. In panels B) and D) dark green indicates alignment positions with high conservation of residues, purple indicates signature positions, and gray indicates heavily gapped regions in the composite alignment. Orange circles indicate signature positions.

Fig E. TwinCons mapped for a short α-helix region in uL2 with analogous sequence between Bacteria and Archaea. Residues depicted here are listed in Table C in S1 Appendix. (A) Stick representation of E. coli uL2. (B) Stick representation of P. furiosus uL2. (C) Cartoon representation of E. coli uL2. (D) Cartoon representation of P. furiosus uL2. (E) and (F) show a different angle for the E. coli and P. furiosus uL2. Conserved residues are colored green, signatures are colored purple, and random positions are white. Heavily gapped regions, present in a single group, are colored gray. Figure generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig F. TwinCons segment with significant sequence similarity between (A, B) bL33 and (C, D) aL42. The segment is shown as a full-opacity cartoon; non-segment regions are shown as a transparent cartoon. Conserved residues are colored green, signatures are colored purple, and random positions are white. Heavily gapped regions, present in a single group, are colored gray. Segment definitions are available in S6 Dataset. Figure generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig G. TwinCons score for Archaea and Bacteria composite alignments of the small and large subunits. (A) Secondary structure of the P. furiosus 16S rRNA with mapped TwinCons. (B) Secondary structure of the P. furiosus 5S and 23S rRNAs with mapped TwinCons. (C) Surface representation of the 16S rRNA for the P. furiosus ribosome. (D) Surface representation of the 5S and 23S rRNAs for the P. furiosus ribosome in crown view.
Both the small and large subunits are shown from the subunit interface direction. Gray indicates heavily gapped regions, present only in bacterial or archaeal sequences; dark green indicates regions highly conserved between bacterial and archaeal sequences; dark purple indicates signature regions between bacterial and archaeal sequences; white indicates sequence-variable regions. In panels (A) and (B) blue numbers indicate helical numbering and ribosomal domains are indicated in brown. Panels (A) and (B) were generated with RiboVision; panels (C) and (D) were generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig H. TwinCons signatures differ based on the substitution matrix used. TwinCons results mapped on (A) metacaspase, (C) caspase, and (B) β-sheet superimposition of both structures, using the Blosum62 matrix. TwinCons results mapped on (D) metacaspase, (F) caspase, and (E) β-sheet superimposition of both structures, using structure-informed substitution matrices. A position with a differing result is highlighted in red between panels (B) and (D). The set of residues representing the composite alignment column for the highlighted position is shown between (B) and (E). Structure-informed matrices produce a stronger signature signal between the two groups for this alignment position. Structures were generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig I. Distribution of TwinCons scores from the E. coli rRNA, based on three composite alignments between archaeal and bacterial sequences of 23S, 16S, and 5S rRNA. (A) Histogram of TwinCons scores showing three peaks in the distribution: around the minimum score, around zero, and around the maximum score. (B) Scatter plot of TwinCons scores with group assignment by the k-means clustering algorithm. The y-axis holds randomly assigned values and is only illustrative. Scores from different groups are colored with the viridis gradient.
The red and green lines indicate the calculated thresholds of the groups spanning the lowest (red) and highest (green) scores. Thresholds calculated from each composite alignment are available in Table B in S1 Appendix.

Fig J. TwinCons segments with significant sequence similarity between the P-loop domains of (A, C) aIF5 and (B, D) EF-Tu. Segments are shown as a full-opacity cartoon, while non-segment regions are shown as a transparent cartoon. GDP from the EF-Tu structure is shown as sticks. Conserved residues are colored green, signatures are colored purple, and random positions are white. Heavily gapped regions, present in a single group, are colored gray. Segment definitions are available in S6 Dataset. Figure generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig K. TwinCons segment with significant sequence similarity between bS1 and domain 7 of RNAP, mapped on the RNAP7 structure. (A) and (B) show two views of the segment mapped on the RNAP7 structure. The segment is shown as a full-opacity cartoon, while non-segment regions are shown as a transparent cartoon. Conserved residues are colored green, signatures are colored purple, and random positions are white. Heavily gapped regions, present in a single group, are colored gray. Segment definitions are available in S6 Dataset. Figure generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Fig L. TwinCons segment with significant sequence similarity between bL34 and aL37. (A) Representation of E. coli bL34, (B) representation of P. furiosus aL37, (C) view of E. coli bL34 rotated by 90 degrees, and (D) view of P. furiosus aL37 rotated by 90 degrees. The segment is shown as a full-opacity cartoon; non-segment regions are shown as a transparent cartoon. Conserved residues are colored green, signatures are colored purple, and random positions are white. Heavily gapped regions, present in a single group, are colored gray.
Segment definitions are available in S6 Dataset. Figure generated with PyMOL. PDB IDs and chains used for the figure are available in Table E in S1 Appendix.

Table A. Substitution matrices available for TwinCons calculation and references for full descriptions.

Table B. TwinCons thresholds calculated with five k-means clusters for different subsets of rRNA. The first two rows, tagged with ‘ribosome’, include sequences from the 23S, 5S, and 16S rRNAs. Entries tagged with LSU include sequences from the 23S and 5S rRNAs. Entries tagged with SSU include only sequences from the 16S rRNA. TwinCons was calculated against the Archaea-Bacteria composite alignment of the rRNA. Standard deviations were calculated after repeating the calculation 100 times. The full script used to generate these data can be found at https://github.com/LDWLab/TWC_distribution.

Table C. TwinCons and ConSurf statistics for the α-helical region in uL2. Positions with low Shannon entropy, low ConSurf score, and high TwinCons score are detected as highly conserved. Positions with TwinCons below -0.6 are detected as signature positions. Signature positions detected with TwinCons that are detected as conserved by ConSurf are highlighted in blue.

Table D. Composite alignments used in the sequence similarity analysis.

Table E. Protein and rRNA structures used to map the sequence similarity analysis. When multiple PDBs are used in a single row, they are separated by a semicolon. When multiple chains are used from a single PDB, they are separated by &.
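
The threshold calculation described for Fig I and Table B (k-means grouping of TwinCons scores, with thresholds taken from the lowest and highest groups) can be illustrated with a minimal Python sketch. This is not the authors' script, which is at the GitHub URL given in Table B. The function names `kmeans_1d` and `score_thresholds`, the quantile-based initialisation, and the choice of the upper edge of the lowest cluster and the lower edge of the highest cluster as thresholds are all assumptions made for illustration. The 100 repetitions used for the standard deviations are not reproduced here.

```python
import statistics


def kmeans_1d(scores, k=5, max_iter=100):
    """Group 1-D scores into k clusters with Lloyd's algorithm.

    Assumption: initial centers are taken at evenly spaced quantiles so the
    result is reproducible (random initialisation is also common).
    Returns the non-empty clusters ordered from lowest to highest center.
    """
    ordered = sorted(scores)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each score joins its nearest center.
        clusters = [[] for _ in range(k)]
        for s in ordered:
            nearest = min(range(k), key=lambda i: abs(s - centers[i]))
            clusters[nearest].append(s)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        updated = [statistics.fmean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    paired = sorted((centers[i], c) for i, c in enumerate(clusters) if c)
    return [c for _, c in paired]


def score_thresholds(scores, k=5):
    """Return (low, high) thresholds from the extreme clusters.

    Assumption: low is the largest score in the lowest cluster and high is
    the smallest score in the highest cluster.
    """
    clusters = kmeans_1d(scores, k)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    return max(clusters[0]), min(clusters[-1])
```

Calling `score_thresholds` on a list of per-position scores returns the two cut-offs. Under the assumptions above, scores at or below the first would fall in the lowest-scoring group and scores at or above the second in the highest-scoring group.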

(PDF)
