Modularity in protein structures: study on all-alpha proteins

Modularity is known as one of the most important features of protein’s robust and efficient design. The architecture and topology of proteins play a vital role by providing necessary robust scaffolds to support organism’s growth and survival in constant evolutionary pressure. These complex biomolecules can be represented by several layers of modular architecture, but it is pivotal to understand and explore the smallest biologically relevant structural component. In the present study, we have developed a component-based method, using protein’s secondary structures and their arrangements (i.e. patterns) in order to investigate its structural space. Our result on all-alpha protein shows that the known structural space is highly populated with limited set of structural patterns. We have also noticed that these frequently observed structural patterns are present as modules or “building blocks” in large proteins (i.e. higher secondary structure content). From structural descriptor analysis, observed patterns are found to be within similar deviation; however, frequent patterns are found to be distinctly occurring in diverse functions e.g. in enzymatic classes and reactions. In this study, we are introducing a simple approach to explore protein structural space using combinatorial- and graph-based geometry methods, which can be used to describe modularity in protein structures. Moreover, analysis indicates that protein function seems to be the driving force that shapes the known structure space.


Introduction
Biological systems are complex, but modular in nature. Modularity has been observed in almost all levels. Starting from biomolecules (e.g. protein structures), pathways (e.g. gene regulation, metabolic networks), and cellular to organism level, the concept of modularity has been explored (Hleap, Susko, & Blouin, 2013;Lammert, Schug, & Onuchic, 2009;Pereira-Leal, Levy, & Teichmann, 2006). Integration and complex interplay among different levels of cellular components (or modules) are crucial for survival of an organism (Rorick, 2012;Tasdighian et al., 2013). In generic terms, modules could be described as the mutually exclusive individual components of a system which act coherently (Pereira-Leal et al., 2006;Wagner, Pavlicev, & Cheverud, 2007). Concept of modularity is the key feature of design as it provides necessary robustness and flexibility to the system and are found in (or adopted) in various engineering principles and computer programming models (Kitano, 2004). But in case of biological systems, defining "module" becomes a challenging problem, especially in the context of protein structures (Banerji & Ghosh, 2011;Rorick, 2012). Identification of the smallest component of protein's modular structural arrangement will provide an insight into its evolution and structural-functional space.
Using component-based approach (by considering the arrangement of secondary structural elements (SSEs)), in this study, we have presented a systematic approach to identify structural modules in alpha (α) helical proteins. In strict geometrical sense, protein fold could be defined as architecture of SSEs with their chain connectivity, relative orientation, and packing; also referred as topology (Skolnick et al., 2009;Srinivasan & Rose, 1995;Taylor et al., 2009). It has been proposed that folds are explored by genetic mechanisms of gene duplication, insertion, deletion, and circular permutations, which are incorporated into protein structure (Kolodny et al., 2013) and can be accounted by analyzing protein structures at secondary structure level (Gordeev & Efimov, 2013;Przytycka, Srinivasan, & Rose, 2002). Hence, using topology, fold space can be explored aptly. There has been different attempts to identify structural modules in proteins without considering the evolutionary or functional relation (Efimov, 1997;Gordeev & Efimov, 2013;Skolnick & Gao, 2013;Taylor, 2002;Wathen & Jia, 2013). PCBOST is an emerging approach which tries to mimic the hierarchical folding by constructing a "structural tree" which can trace back to a "root motif" (Efimov & Kondratova, 2003;Gordeev & Efimov, 2013). In its recent addition of α/β structural tree (Gordeev & Efimov, 2013), it has been illustrated that occurrence of some structural motifs are higher than others whose reasons are yet to be acknowledged. Similar observation has been recorded for β -proteins. The theory of limited fold space and biased distribution of protein among different folds have been supported by recent studies (Cuff et al., 2009;Grainger, Sadowski, & Taylor, 2010;Hleap et al., 2013), which not only increases the perplexity of the problem but also makes it more interesting to understand the contributing factors shaping protein fold space. Limited use of structural space may be induced by different structural properties Figure 1. Secondary structures (SS) are numbered sequentially in protein chains. Contact matrix captures details of SS pair wise contact formed in a protein as shown in, (a) 2B4 J-C (pdbid-chain) has 5 alpha helices with 4 helix pair in contact, as per defined criteria (see text). Using relative helix orientation, matrix element has been assigned as "a" (anti-parallel), "p" (parallel), or "r" (orthogonal) for helices which are in contact otherwise "0". Off-diagonal elements of the "Contact Matrix" capture the type of contact between SS as well as its distance in terms of intervening SS. First off-diagonal element (shown in "red") shows the contact between adjacent SS in protein chain and henceforth. "Contact String" has been made by stitching the off-diagonal elements in to one string, which has been stacked with respect to chain SS distance (shown in different color blocks). Similarly, contact string has been extracted (example (b) Contact String for 4 HS protein). (Paola et al., 2012;Tasdighian et al., 2013) or due to certain functional objective (Myers-Turnbull et al., 2014) or because of easy accessibility of folded structure (Cossio et al., 2010;Gianni et al., 2009;Onuchic & Wolynes, 2004).
Graph abstraction or network-based methods have been widely applied to explain and understand complex systems across disciplines (Kloczkowski et al., 2009;Kun & Scheuring, 2009;Lorenz, Jeng, & Deem, 2011;Tasdighian et al., 2013). Protein structures are also explained in terms of edges and nodes (contact networks), which can be describe as a short molecular formula for macromolecule (Dokholyan et al., 2002;Hleap et al., 2013;Kamat & Lesk, 2007). In this study, we have proposed an efficient methodology, to annotate and compare protein structures based on contact between SSEs (or tertiary contact). In present study, we have focused on all-alpha proteins by considering alpha helices (Kabsch & Sander, 1983) as the secondary structure and defined contact among them. By scanning known structural space, we have found that the protein structure space is modular and act as "building blocks" (or "legos") for larger protein structures. In addition, it has been observed that some of the modules are repeatedly used to build proteins having diverse functionality.

Material and methods
Protein structures have been extracted from Protein Data Bank (PDB) (Berman et al., 2000). PDB, being the primary repository of bimolecular structural data, is subjected to multiple levels of redundancy (Berman et al., 2000;Thiruv et al., 2005) introduced mainly by evolutionary conserved proteins and different variants of same protein. These redundancies have been removed by clustering approach based on sequence identity cut-offs and selecting representative protein sequences among these clusters (Li & Godzik, 2006;Wang & Dunbrack, 2005).
To encounter rapid growth in structural data and for robustness of proposed result, we have formed four datasets of same structural quality (protein structure soled by X-ray diffraction method, resolution ≤2.5 Å and R-free <.30) but different sequence homology cut-offs and releases (Table S1). The main data-set D1 (#April, 2014) is composed of 977 protein chain with alpha helix ranging from 3 to 10. Similarly, data-set D2 (#June, 2013), D3 (Astral25 #June, 2009, SCOP v1.75), and D4 (with 25% sequence identity of D1) have been prepared for the study. Details of data selection process have been  Note: Data-set D1 has been grouped into different. a HS (Helix Set = number of alpha helices in a pdb chain) and number of protein chains (# PDBs) and corresponding total pattern count has been provided in the first three columns.
b "x f " is the expected pattern frequency i.e. fraction of protein chain with a particular pattern in each HS if all patterns have equal proportion of representation in the data-set (x f = 1/(Total_Pattern(HS))). In each HS, all patterns have been classified into two sets ("Prevalent" and "Others") depending on their (c) observed pattern frequency (x ob = 1/(number_of_representing_pdb_chain)). For both classes, number of patterns (# patterns); percentage of pattern in the set (% patterns), and corresponding percentage of protein chain (% PDBs) has been shown in corresponding sections. The last row shows the sum of protein chains, total number of patterns across HS, and number of prevalent as well as other patterns observed in data-set D1.
provided in the supplementary ( Figure S1, S2, and Table S1). We have focused our analysis mainly on the latest data-set D1, whereas other data-sets (D2, D3, and D4) have been checked for consistency of results.
In each protein chain, contact between two helices has been considered if at least three contacting residues are established as per atom-atom contact (Chothia, 1981;Hleap et al., 2013;Konagurthu, Stuckey, & Lesk, 2008). An atom to atom contact is considered to be formed if distance between atoms is less than sum of their van der Waals radii with a threshold of .6 Å. If residues from different helices have at least one atom to atom contact, Relative percentile scores in each case has been shown by dark dotted line (Q = 75) for first quartile and gray dotted line for third quartile (i.e. Q < 25). For each case, pattern which is observed in respective HS has been shown by dark bars (labeled as "Pattern") and as modules in higher HS has gray colored bars (labeled as "modules").
we consider them as contacting residues. For a helix contact pair to form, it must have minimum of three criteria for selection of contact residues (Chothia, 1981). The contact definition has been validated for different threshold values ( Figure S3) and found that the used criteria is comparable to reported studies (Day et al., 2010;Konagurthu, Lesk, & Allison, 2012;Plaxco, Simons, & Baker, 1998;Yuan, Chen, & Kihara, 2012;Zuo, Wang, & Wang, 2006). Orientation of contacting helices have been assigned to be anti-parallel (a), parallel (p), and orthogonal(r) based on their relative helix-axis angles (Chothia, 1981;Chothia & Lesk, 1985) as illustrated in Figure 1.
For concise representation of contact among SSEs in proteins, we have developed "contact string". Similar to "fingerprints", given a set of SSEs, "contact string" can uniquely represent all different contact patterns. For efficiency in comparison and analysis, contact string have been comprised of contact information segments (separated by "-"), which describes the sequential distance in terms of SSEs (i.e. the first segment of contact string describes contact between SSEs which are sequentially adjacent and henceforth). Contact string is an abstraction of contact matrix (D), which represents all possible pairwise contact between SSEs as illustrated in Figure 1. This simplified representation can be utilized as a tool in various protein structure studies like 3D structure comparison, identifying folding patterns, and analyzing structural modules. However, increase in alpha helix content (h) simultaneously amplifies possible number of patterns (2 h − 1) and length of contact string (l = h × (h − 1)/2). This increases gap between observed and theoretically possible patterns (Figure 2). For systematic study, proteins in data-sets have been grouped as per number of alpha helix content (referred henceforth as Helix Set/HS) and patterns have been evaluated by constructing contact string. By considering the relative orientation of each contacting helix pair, the possible pattern space will further expand with the exponent of 4 (i.e. 4 h ). But, for statistical significance of protein distribution in each HS, pattern wise study has been undertaken (p-value <.01 Table S3, S4 for details)) (Mann, 1999). 5 (x f = .0101) 0100-000-00-0 .0529 5.P-1 0001-000-00-0 .04761 5.P-2 … 0000-001-00-0 .0105 5.P-39 1111-001-10-0 .0052 5.P-40 … 0000-000-01-0 .0052 5.P-99 6 (x f = .0088) 01010-0000-000-00-0 .0291 6.P-1 11000-1010-000-00-0 .0218 6.P-2 … 00001-0000-000-00-0 .0145 6.P-15 11111-0101-000-00-0 .0072 6.P-16 … 00000-0000-000-01-0 .0072 6.P-113 Notes: Example of ranking and indexing of patterns observed in data-set D1 in each HS has been shown in above table. For each HS (shown from 3 to 6 HS), corresponding expected pattern frequency (x f ) has been reported in parenthesis of first column. Observed pattern frequency (x ob ) of each pattern in a HS has been provided. Patterns in each HS has been ranked in descending order of x ob . All observed patterns has been indexed as described in the last column representing helix set and corresponding rank associated to that pattern (pattern index = {HS}.P-{Rank}). In each HS, dotted rows show the discontinuity in pattern index (for detail index pattern see supplementary data).

2672
T. Khan and I. Ghosh Modules or substructures of a protein (with HS >3) could be explored using contact string. Similar to subgraph search method, for a protein with "h" alpha helix (i.e. HS = "h"), all possible substructures with HS (from 3 to h − 1 helix set) have been generated. Contact strings of helix subsets have been formed in sliding window-fashion (window size = 1 alpha helix), scanning from N to C terminal of protein chain. Formed module set is comprised patterns of all present substructures of helix subsets (i.e. HS = [3, h − 1]) in a protein with "h" alpha helix.
To compare different structural properties, we have analyzed structural descriptors of proteins within patterns of each HS. Structural descriptors like relative contact order (rCO) (Plaxco et al., 2000), normalized radius of gyration (nRg) (Ivankov et al., 2009), and loss in solvent Helix Set (X f ) Cartoon of prevalent patterns (PP; X f < X ob) 6 (.0088) 7 (.0102) Notes: Illustration of cartoon representation of patterns that belongs to prevalent set in data-set D1. Cartoon represents alpha helices (in circles) with chain connectivity (bold lines) from N (in-arrow) to C terminal (out-arrow). Contact between alpha helices has been shown in dotted gray lines. In each HS, corresponding pattern index, contact string (in braces), and observed pattern frequency (X ob ) has been shown. accessibility of helix (LSAH) of protein chains have been calculated and patternwise comparison has been performed. Used descriptors are reported to have the ability to describe higher variance in structural properties of proteins (Paola et al., 2012;Tasdighian et al., 2013).
Protein function has been analyzed based on homology (Engelhardt et al., 2011;de Lima Morais et al., 2011) and enzyme reaction comparison methods . Function annotation has been assigned and compared based on latest release of functional data of superfamily database (de Lima Morais et al., 2011). Whereas, enzyme reaction similarity has been analyzed using EC-BLAST using similarity scores of bond change (BC), reaction center (RC), and reaction structure similarity Rahman et al., 2014). All similarity scores are reported in Jaccard coefficient ranging between 0 and 1. Using the reaction information from KEGG (Hattori et al., 2003) and (Alcántara et al., 2012) database, mapped reactions have been produced and their similarities have been evaluated.
All protocols (protein contact string generation, classify motif belongs to prevalent or not, calculation of protein descriptors and functional annotation) have been automated using python programming language, with SQL database (available as "ProLego" in http://ccbb.jnu. ac.in/proLego/index.html).

Results and discussion
We have investigated modularity in protein structure in a non-hierarchical approach, treating each HS (Helix Set) independently and analyzed observed pattern space. By clustering proteins in data-set as per number of alpha helix content, possible pattern (or topology) space for each HS has been explored ( Figure 2). Comparing number of observed patterns to number of theoretically possible patterns from 3 to 10 HS, it can be observed that with current known structures only a fraction of possible topology space can be explored. Analyzing proteins from 2009 Astral data-set (i.e. D3) and current data-set (D1), a marginal expansion in observed patterns over the years can be noticed (from 524 in D3 to 660 in D1) (Table S2). However, study from all data-set suggests that observed pattern space is only fraction of possible pattern space ( Figure S2).
Observed pattern space in each HS confirms the presence of certain patterns that have occurred more than expected frequency (x f ) ( Table 1). For comparative analysis across HS, we have ranked observed patterns in each HS, as per fraction of occurrence in "relative percentile scores" (Q-score) and "observed frequency" (x ob ). These percentile scores are further classified into quartiles depending on Q-score (Figure 3). With corresponding x ob , each pattern has been assigned a "pattern index" which represents the relative ranking of that pattern in HS (Table 2). In each HS, only a small fraction of patterns have Q-score ≥ 75, which have observed frequency (x ob ) greater than expected frequency (x f ) (~12% in all HS). These high frequent pattern sets in each HS has been referred as "prevalent patterns". Figure 3 shows the observed pattern space in 3 and 4 HS and their relative percentile scores. In case of 3 HS, prevalent patterns (like "01-0" and "11-1") are more frequent (as Q-score > 75), occurring in~50% of structure, out of 7 possible patterns (Table 1 and Figure 3(a)). Similarly, as shown in Table 1 and Figure 3(b), frequent patterns for 4 HS proteins have been identified. The prevalent pattern sets are found to be consistent for all four data-sets with marginal variation in quartile scores. Across HSs, prevalent patterns could be identified for cases when available representative structure is statistically comparable to observed pattern space (results from χ 2 -test for statistical significance Table S3, S4).
Limitation in observed pattern space can be explored by inclusion of subpatterns or "modules" in each HS. Using the contact string, structural modules in protein structure can be analyzed as described in "Material and methods" section. Inclusion of modules as patterns for each HS increases the observed pattern space by~3.4 times in data-set D1 (from 660 to 2296) ( Figure 2) and similar growth rate has been observed across data-sets (see Table S2, Figure S4). However, the expansion in observed pattern space is marginal as compared to possible space (as shown in Figure 2), although the 3-fold increase in pattern space is reflected in new patterns. Analyzing distribution of percentile rank, we have observed the relatively high frequency of same prevalent patterns (Figure 3) (i.e. up to 7 HS Table S4). Some changes have been noticed in the quartile ranking of prevalent patterns, but fraction of observed occurrence (x ob ) is always higher than expected (x f ). We have checked the presence of prevalent pattern sets in D3 by explicitly analyzing orthogonal and aligned helix orientation (Supplementary Section 1.2) and found similar sets of prevalent pattern sets with considerable deviation of observed frequency (x ob ≥ x f (+.005)). Combining these prevalent patterns and modules (statistically significant) across all HS, a prevalent pattern (PP) set has been proposed (Table 3). Cartoon depiction of representative protein structures from some of the prevalent pattern sets has been illustrated in Figure 4.
From observations (Tables 1 and 3 and Figure 3), it might be proposed that the prevalent patterns (of lower HS) has been reused by protein as building blocks for complex and large proteins (as higher HSs). The recurrence of common structural motifs (α-hairpin and αβ -motifs) (Broom et al., 2012) has recently been identified and discussed (Cuff et al., 2009;Gordeev & Efimov, 2013;Orengo et al., 1997). These motifs or supersecondary structures are smaller than a domain but could be used as "Lego-blocks" or components to give more complex architecture ("Russian doll effect") by duplication and (/or) fusion events (Broom et al., 2012;Höcker, 2014;Orengo et al., 1997). It has also been shown that protein domain evolve their optimal structural design by exploring structure space at the level of contact between SSEs which can be referred as modules (Emmert-Streib & Mushegian, 2007). In the present study, we have vindicated the known hypothesis of modular architecture and identified prevalent patterns in case of mainly alpha proteins.
We attempted to analyze the rationale behind prevalence of certain pattern sets using different structural descriptors (described in "Material and methods"). Pattern wise analyses are found to be within known average values and deviations (Ivankov et al., 2009) but havenot provided significant difference between prevalent and other patterns ( Figure 5, S6). To understand why protein space has few prevalent pattern sets, it is crucial to study the underlying functional requirements that the protein scaffold fulfill. We have compared protein function with known functional terms (de Lima Morais et al., 2011) as well as by analyzing enzyme and its reaction (Tables 4, 5 and Figure 6). Data-set D1 has 224 distinct number of distinct scop superfamilies (sf) (SCOP v1.75, all-alpha has 507 sf). Different proportions of superfamily functional terms (from SUPERFAMILY) have been observed in data-set HS. X-axis shows the pattern index (as described in Table II) and corresponding average structural descriptor values are given in the y-axis. In all cases, prevalent pattern sets are marked as star, whereas, other patterns are marked as circle. D1 (Figure 6(a)) with major fractions from "Regulations" (29.5%), "Metabolism" (23.5%) representing biological processes like anabolic, catabolic processes, and DNA binding. A comparative study on number of distinct functional terms associated with proteins in prevalent with rest (or "other") pattern sets has been performed. As each HS has been evaluated independently, functional diversity has been estimated for patterns considering number of different functional terms that helix set (HS) covers. Averaging functional diversity scores for prevalent and other pattern, we can observe that prevalent pattern set has higher functional diversity score as compared to the "other" pattern set in each HS (Figure 6(b)). This trend is distinctly observed for HS 3-7, but for higher HS such difference diminishes. Such pattern is also visible by analyzing functional diversity at detail functional levels ( Figure S7).
Distribution of enzymes across structural classes is uneven and the presence of~82% non-enzyme in D1 data-set ( Figure S8) is inevitable due to the choice of protein class (alpha protein), which have also been observed in earlier study (Hegyi & Gerstein, 1999).
Comparing EC numbers at substrate specificity level (i.e. 4th level), higher functional diversity for prevalent pattern has been observed in most of the HSs (Table 5). However, presence of only 79 unique fully resolved EC numbers limits the underlying statistical significance (especially for HS 7 onwards). Hence, quantitative estimation of the contribution of different patterns to protein functions in terms of enzyme reaction will be crucial. Similarity between enzyme reactions may not be comprehended quantitatively only by comparison of EC numbers hence, we have used EC-BLAST  (as discussed in "Material and methods" section). Independent of protein sequence or structure, this method can quantify the possible diversity in reactions performed by a set of proteins with a specific scaffold Martinez Cuesta et al., 2014).
For example, although proteins in prevalent and other pattern sets has different ECs (Table 4 for HS 5), comparing reactions on the basis of substrate similarity (SS), BC and RC indicates relatively higher reaction diversity for enzymes with prevalent patterns that other patterns. The similarity scores of all three scoring matrices being low (<.3) shows that enzyme reactions performed by proteins having prevalent patterns are diverse in nature (Table 5), whereas, that of proteins containing other than prevalent patterns have comparatively higher (SS = .98; BC = .78) similarity scores representing similar reactions.  Notes: Enzymes are compared in three scores by EC-BLAST on balance reaction files of tabulated enzymes. The case has been extracted from HS 5, prevalent-patterns (a) while, (b) shows the similarity scores for "other" pattern of 5 HS.
This ability to perform diverse function by prevalent pattern sets has been identified across all HS's and palpable from analyzing different functional annotation methods from homology-based to reaction-based methods ( Figure S9, S10). This distinct difference leads to conclude that function might be the driving force for proteins to repeat (or reuse) "prevalent modules".

Conclusion
It has been well known that evolution uses the preexisting parts (or modules) to invent new functionality or to conserve certain function (Hleap et al., 2013;Lorenz et al., 2011). As observed in the present study, it can be proposed that protein structures evolve using limited set of patterns which act as "building block" to form more complex proteins and emerge newer modules, if necessary, mostly to satisfy cell's functional requirements. These frequent or prevalent modules can be responsible for providing the required scaffold for wide spectrum of catalytic functions as present study shows. Influence of polarity between scaffold and active site regions on protein functional evolution has been reported in the context of enzyme activity with limited fold sets (Dellus-Gur et al., 2013;Khersonsky & Tawfik, 2010). The balance between scaffold and active sites not only increases the chance of evolution, but also influences organism's fitness by improving functional efficiency (Tomatis et al., 2008). In present study, using non-hierarchical analysis of all-alpha proteins, we have reported prevalent patters (total 82 among 5 types of HS from 3 to 7) which arẽ .14 fraction of observed pattern space (Table 1). Figure 6. Figure shows the percentage of different "general" functions in data-set (D1). (a) Illustrates percentage of 7 "general" functions in observed in data-set D1 and (b) shows the average functional diversity score (with standard deviation as error bar) for prevalent (star) and other (circle) pattern sets in each HS. X-axis in Figure 6(b) shows HS (from 3 to 10). For each HS, average functional diversity score for prevalent and other pattern has been given in y-axis.
These patterns (or modules) may provide necessary structure for diverse functional enzyme activities in proteins they occur and help in designing proteins with desired functions. It extends the support for the hypothesis that evolution shapes the protein structure space in the lights of function and prefers to reuse the existing known folds instead of exploring the astronomical available space, explaining the occurrence of the large gap between combinatorial possibility and observed patterns (Figure 2), and current limitations of unknown space (or "Dark matter") of structure and function universe (Taylor et al., 2009). However, topological preference caused by easy kinetic accessibility to diverse structural space (Cossio et al., 2010;Leopold, Montal, & Onuchic, 1992;Wang et al., 2012) has not been addressed in the present study but in progress using advanced techniques like coarse-graining methods (Takada, 2012).

Author disclosure statement
The authors declare that no conflict of interests exist.