A sequence space search engine for computational protein design to modulate molecular functionality

Abstract De-novo protein design explores the untapped sequence space that is otherwise less discovered during the evolutionary process. This necessitates an efficient sequence space search engine for effective convergence in computational protein design. We propose a greedy simulated annealing-based Monte-Carlo parallel search algorithm for better sequence-structure compatibility probing in protein design. The guidance provided by the evolutionary profile, the greedy approach, and the cooling schedule adopted in the Monte Carlo simulation ensures sufficient exploration and exploitation of the search space leading to faster convergence. On evaluating the proposed algorithm, we find that a dataset of 76 target scaffolds report an average root-mean-square-deviation (RMSD) of 1.07 Å and an average TM-Score of 0.93 with the modeled designed protein sequences. High sequence recapitulation of 48.7% (59.4%) observed in the design sequences for all (hydrophobic) solvent-inaccessible residues again establish the goodness of the proposed algorithm. A high (93.4%) intra-group recapitulation of hydrophobic residues in the solvent-inaccessible region indicates that the proposed protein design algorithm preserves the core residues in the protein and provides alternative residue combinations in the solvent-accessible regions of the target protein. Furthermore, a COFACTOR-based protein functional analysis shows that the design sequences exhibit altered molecular functionality and introduce new molecular functions compared to the target scaffolds. Communicated by Ramaswamy H. Sarma


Introduction
Computational protein design (CPD), also known as the inverse folding problem, deals with the prediction of amino acid sequences that folds to defined three-dimensional structures.Such predictions enhance our fundamental understanding regarding the sequence-structure compatibility of proteins.CPD is one of the most important protein engineering techniques widely used since the last few decades for the construction of biocatalysts (Bommarius & Paye, 2013) and pharmaceuticals (Chevalier et al., 2017;Fischman & Ofran, 2018).Also, CPD has enabled the design of membrane proteins (Slovic et al., 2003;Yin et al., 2007), enzymes (Chen et al., 2009;Murphy et al., 2009;Privett et al., 2012), and protein complexes (Bale et al., 2016) to subsequently make way for protein-based therapeutics.In recent applications, CPD has been used to identify critical interface residues in protein-protein interactions responsible for disease conditions (Banerjee et al., 2018;Banerjee & Mitra, 2020;Rajesh et al., 2019).
De-novo protein design allows us to explore the hugely untapped sequence space that is otherwise less explored in the evolutionary process relying mostly on incremental mutation and selection operations.A sequence-structure compatibility measure (energy function) and a search strategy guiding the residue mutations (while probing the sequence space) are the two principal components of CPD.
Several energy functions like statistical potential-based (Boas & Harbury, 2007;Jones, 1994), physics-based (Bazzoli et al., 2011;Dahiyat & Mayo, 1997;Koehl & Levitt, 1999;Mitra et al., 2013b), machine learning-based (Mitra et al., 2013a), and hybrid (combination of different potentials) (Leaver-Fay et al., 2011;O'Meara et al., 2015) have been explored in the context of CPD.The sequence space can be probed adhering to either deterministic or stochastic search strategies.In Dead End Elimination (DEE) (Desmet et al., 1992) high energy rotamer combinations are eliminated to arrive at a sequence corresponding to the global minimum energy conformation.However, the DEE method doesn't scale up to larger proteins and there is no guarantee of convergence for all target scaffolds.Stochastic search strategies like Genetic Algorithms (Jones, 1994) and Monte Carlo (Mitra et al., 2013b;Pal et al., 2022) simulations are computationally inexpensive and are often used to predict amino acid sequences that optimally fit a target protein structure.
In this work, we propose a greedy simulated annealingbased Monte Carlo search algorithm for efficient convergence in CPD.We consider a completely random amino acid sequence to initiate the Monte Carlo simulation.Subsequently, we adhere to a pre-computed distribution of amino acids observed in existing proteins while generating candidates (sequences) during the sequence-structure compatibility search process.At each step, we create a fixed number of sequences by randomly mutating the candidate sequence found to best fit the target protein structure till the preceding step in the simulation.A fixed threshold on the maximum number of mutations is enforced and an evolutionary profile based on the structural homologs of the target protein is constructed and consulted to mutate random amino acids while generating the candidates.The candidates are generated and their compatibility to the target structure is evaluated with the help of an empirical physics-based force field (Schymkowitz et al., 2005).We exploit the inherent parallelism to evaluate the candidate sequences in parallel and subsequently rely on our greedy algorithm to consider the best candidate in a particular step of the Monte Carlo simulation.We adopt a parametric cooling schedule for simulated annealing with the help of which the Boltzmann temperature for the Metropolis criteria is updated in each step of the simulation.While the considered distribution of amino acids and the evolutionary profile of the target protein guide the sequence search, the greedy approach ensures that the search space is sufficiently exploited during the simulation.The cooling schedule adopted in the proposed algorithm makes sure that the sequence search doesn't get entrapped in the local minima ensuring sufficient exploration of the search space.
We consider a dataset of 76 protein structures to evaluate the goodness and efficacy of the proposed protein design algorithm.Considering the I-TASSER (Yang et al., 2015) predicted structures of the designed sequences, we find that our algorithm reports an average root-mean-square-deviation (RMSD) of 1.07 Å and an average TM-score of 0.93 with the target protein structures.High sequence recapitulation in designed sequences is one of the hallmarks of a good protein design algorithm (Ding & Dokholyan, 2006;Kuhlman & Baker, 2000;Leaver-Fay et al., 2013).While starting simulation from a completely random sequence, the proposed protein design algorithm reports an overall sequence identity of 34% with the native sequence of the target protein structures.The overall sequence identity increases to 48.7% and 59.4% for the overall buried (solvent inaccessible) residues and for buried hydrophobic residues, respectively.Considering hydrophobic and non-hydrophobic residues, we compute intra-group conservation of residues in the designed proteins.The proposed algorithm reports an overall intra-group sequence identity of 68.4% which increases to 77.7% and 93.4% for buried residues and buried hydrophobic residues respectively.A high average Pearson correlation coefficient (PCC) of 0.80 between the decrease in Boltzmann temperature values and the decrease in FoldX (Schymkowitz et al., 2005) energy of the candidate sequences for the 76 target proteins validate the goodness of the cooling schedule considered for simulated annealing.Our 100 ns Molecular Dynamics (MD) simulations on a representative set of design sequences (after modeled by protein folding software I-TASSER (Yang et al., 2015)) suggests well-folded and stable conformations.We also carry out a ClustENMD (Kaynak et al., 2021) based analysis to establish the fitness of the designed sequences to near native conformers of the proteins in the representative dataset.The designed proteins report lower mean FoldX energy for the ensemble of conformations for all the target proteins.A COFACTOR (Roy et al., 2012) based analysis reveals new molecular functions and altered binding sites in the designed sequences corresponding to the target proteins in the representative dataset.
The low root-mean-square-deviation and high sequence conservation both establish the goodness of the proposed protein design algorithm.The designed sequences also make way for modified molecular functionalities and altered binding patterns while energetically fitting well in the target protein structures.The complete code and executable of the proposed algorithm can be accessed at http://cse.iitkgp.ac.in/ $pralay/resources/SADIE/.

The protein design algorithm
In Figure 1, we present the SADIE (Simulated Annealing based greeDy approach to proteIn dEsign) algorithm for denovo protein design.The three-dimensional backbone only structure (hereafter referred to as scaffold) of the target protein acts as an input to the SADIE algorithm wherein we carry out an intelligent decoy sequence-scaffold compatibility search across a fixed number of iterations.We define decoy sequence as an amino acid sequence that is probed for scaffold compatibility and will be denoted as a decoy in further discussions.We consider a non-redundant subset of the PDB (excluding proteins with incomplete backbone coordinates) to identify the distribution of the 20 amino acids in the consituent proteins and now onwards the same will be referred to as the curated dataset.We ensure that the proportion of each of the 20 amino acids adheres to the identified residue distributions (falls within the mean ± one standard deviation range) in each of the decoys (D i ) generated during the sequence search.This allows us to ensure that no amino acid (or group of amino acids) is represented disproportionately and the resulting decoys follow the residue distribution observed in existing protein structures.We generate 20 random numbers from the mean±(one standard deviation) range (amino acid-specific) for the individual amino acids for random reshuffling of the chosen residues to prepare the initial decoy (D 1 ).
Once the initial decoy is generated, we iteratively search the sequence space for Max_steps instances.At each step, we adopt a greedy approach in which Branch_fact number of temporary decoys (TDs) are created by mutating the decoy accepted in the previous step.We mutate the sequences in consultation with an evolutionary profile (Evo_prof) (Banerjee et al., 2021) of the target protein.Keeping the target scaffold in mind, all amino acids might not be equally probable at each residue index and in such a situation the evolutionary profile encodes the amino acid variability at each residue position.Although the probability of the amino acids to be present at each index is encoded according to the amino acid variability at each position, all amino acids are still considered for mutation with a certain probability.The structural homologs represent similar fold level information and hence their residue-specific amino acid distribution is crucial in guiding the sequence space search.The protein structures from the curated dataset with TM-Score (Zhang & Skolnick, 2005) more than 0.7 are considered as structural homologs of a target scaffold.However, the TM-Score is gradually relaxed up to 0.5 in instances where we do not get required number of structural homologs.In case we still do not identify the necessary number of homologous structures, each amino acid is considered to be equally probable for each residue position.The detailed mathematical formulation of the evolutionary profile can be accessed from Banerjee et al. (2021).
We randomly choose residue positions for mutation and allow a maximum percentage for of residues to be substituted at each iteration (Mut_rate).For each TD, we consult the Evo_prof to randomly mutate the previously accepted decoy.Each TD is then accommodated in the target scaffold using FASPR (Huang et al., 2020).FASPR is an open-source tool for fast and accurate side-chain packing of amino acid sequences in target scaffolds.It is important to note that SADIE considers a fixed backbone of the target protein and does not carry out explicit structure prediction of the decoy sequences.The FASPR method optimizes the side chain packing of the decoy sequences in the target protein and does not alter the backbone of the input structure.We use an empirical physics-based force field (FoldX (Schymkowitz et al., 2005)) to compute the DG folding for each of the TDs accommodated in the target scaffold.A lesser value of DG folding indicates a better energetic fit which in turn indicates towards better decoy-scaffold compatibility.The energy e TDj for each TD is computed ase TDj ¼ ðDG where DG target folding correspond to the FoldX computed DG folding of the target scaffold containing the native amino acid sequence in its experimentally reported conformation and DG TD j folding (DG D1 folding ) correspond to the FoldX computed DG folding of the target scaffold accomodating the TD j (D 1 ) respectively.K is a constant and is empirically considered as 50.
In our greedy approach, at each step Branch_fact number of TDs and e TD s are generated and evaluated parallelly.The TD (TD best ) with the highest e TD (e TD best ) is considered and is compared with the energy of the decoy shortlisted in the previous iteration (e D iÀ1 ).If e TD best > e D iÀ1 , the TD best is chosen as the best decoy of the present iteration.Else, the TD best can still be chosen as the prevailing decoy in the present iteration provided the Metropolis criteria (line 12 in Figure 1) for the Boltzmann temperature (temp i ) corresponding to the particular step is satisfied.Otherwise the D iÀ1 is retained as the shortlisted decoy of the present iteration.It is to be noted that even if all the TDs have lesser DG folding than the decoy shortlisted in the previous iteration, the greedy approach retains the best and proceeds one step in the iteration.The greedy approach explores Branch_fact number of TDs generated by different mutation permutations from D iÀ1 , and hence results in better exploration of the sequence search space.We adopt a parametric cooling schedule from Atiqullah (2004) for simulated annealing according to which the Boltzmann temperature for the Metropolis criteria is updated at each iteration.The temperature is updated as: The values of a and f are again considered from Atiqullah (2004) where a ¼ 2 and f ¼ 1/3.The chosen values of a and f result in the reduction of the initial temperature to half when the number of iterations reduces to one-third.The value of temp i and temp Max_steps is empirically chosen to be 0.002 and 0.0002 respectively.The chosen cooling schedule ensures that there is no rapid decrement in temperature (quenching) thereby disallowing entrapment in a local minima conformation.During the middle stages of simulation, the decoys are expected to be in the locality of the optimum, and towards the end, slow temperature decrement is carried out to ensure flat convergence.
On the completion of Max_steps greedy exploration of decoy-scaffold compatibility, the final decoy D Max steps is the output generated from the SADIE algorithm.We estimated optimum performance with Max_steps ¼ 7500 and Branch_fact ¼ 20.However, a user can adjust SADIE as per his/her own specifications.

Dataset
We use the same dataset as was used by Banerjee et al. (2021) to evaluate the performance of the SADIE algorithm.The dataset consists of 76 proteins varying in length from 52 to 197 residues and belonging to diverse SCOP classes (a, b, a þ b, a/b, and small proteins).All the chosen proteins have complete backbone coordinates and a high X-ray crystallographic resolution ( 1.6 Å).The PDB (Berman et al., 2007) files corresponding to these 76 proteins are considered for de-novo protein design.

Molecular dynamics simulation
We consider a representative dataset for detailed Molecular Dynamics (MD) simulation and COFACTOR (Roy et al., 2012) based analysis of the designed sequences.The five proteins chosen for the representative dataset consist of heterogeneous nuclear ribonucleoprotein K domain (hnRNPK, PDB ID: 1ZZK, chain A), thioredoxin domain (PDB ID: 1R26, chain A), cytokine-independent survival kinase phox homology domain (CISK-PX, PDB ID: 1XTE, chain A), light oxygen voltage domain (Lov2, PDB ID: 2V0U, chain A), and Translation Initiation Factor 1 (TIF1, PDB ID: 3I4O, chain A) proteins.
We perform MD simulation using Gromacs 5.1.5(Pronk et al., 2013) with the OPLS-AA (Jorgensen & Tirado-Rives, 1988) force field.Initially, the design predicted structures and the target scaffolds are solvated in a cubic box while making sure that all the constituent atoms are at 1.0 nm from their respective box edges.Next, we use the steepest descent minimization algorithm for 1000 steps to minimize the energy of the solvated system.Further, the energy-reduced systems are equilibrated for 100 ps at 298 K to maintain the desired temperature of 298 K. Subsequently, the systems are equilibrated for 1 ns at 298 K to arrive at the desired pressure of 1 bar.Finally, we simulate for 100 ns at 298 K in the solvated systems.

ClustENMD based sampling of conformation space
ClustENMD (Kaynak et al., 2021) is a hybrid method combining Anisotropic Network Model (ANM) with MD simulations for the efficient sampling of the conformational space of a target protein.For the present implementation we consider the first 5 softest (global) modes with a cutoff distance of 15 Å and a maximum RMSD of 1 Å for ANM computations.We rely on the default amber99_obc implicit solvent model and amber99sbildn force field in ClustENMD for OpenMM (Eastman et al., 2017) energy minimization and subsequent 2fs MD simulations on the ANM conformations.Finally for each protein 90 near native conformations generated using three runs of ClustENMD are considered.

Analyzing goodness of the design sequences
We carry out computational protein design on the target scaffold of the 76 proteins using the SADIE algorithm.Subsequently, we use I-TASSER (Yang et al., 2015) to model the three-dimensional structures of the design sequences (henceforth referred to as design predicted structures, designed structures and designed proteins interchangeably).We perform comparative analysis of the design sequence and the modeled design structure with the target scaffolds.A comparative analysis of the sequences as designed by the SADIE and the sequences as designed by existing techqniues is detailed in Table 1.While assessing the foldability of the design sequences, we find that SADIE reports an average TMscore of 0.93 indicating high fold level similarity between the target scaffolds and modeled design proteins.In Table S1 (column 6), we find that 94.7% (72 out of 76) modeled design proteins report TM-scores greater than 0.8, while 80.1% (61 out of 76) design predicted structures report TMscores greater than 0.9.We also find that SADIE reports an average RMSD of 1.07 Å between the target scaffold and the modeled design proteins.The proposed method performs significantly better than the other two methods and the low RMSD value further indicates the goodness of the design sequences.96.1% (73 out of 76) design predicted structures

Recapitulation of native sequence
Native sequence recapitulation is one of the salient attributes used to benchmark protein design algorithms (Ding & Dokholyan, 2006;Kuhlman & Baker, 2000;Leaver-Fay et al., 2013).The native sequence recapitulation rate is the ratio of the number of residues those are not mutated in a design sequence to the number of residues in the design seqeunce.
A high sequence recapitulation rate reflects the capability of an algorithm to generate native-like protein sequences.It is important to note that the Monte Carlo simulation starts from a completely random sequence for de-novo protein design.However, certain residues in the native amino acid sequences are crucial to the structural integrity of the individual proteins.An efficient sequence space search will be able to retain such residues in de-novo protein design.While analyzing, we separately evaluate the native sequence recapitulation in the buried residues that contribute to the hydrophobic core of the protein.As per the definition by Kuhlman and Baker (2000), the residues that have more than (less than) 20 C b (15 C b ) atoms within their 10 Å sphere radius are considered as buried (exposed) residues.Here, we consider the C a atoms for the Glycine residues.
Several design studies by Kuhlman and Baker (2000), Saunders & Baker (2005), Ding and Dokholyan (2006), Bazzoli et al. (2011), and Mitra et al. (2013a, 2013b;Shultis et al., 2019) report overall sequence recapitulation ranging from 24% to 33% on different datasets.SADIE reports an overall native sequence recapitulation of 34% while comparing the designed sequences with the corresponding native residues in the target proteins.On further analysis (Table S2), we find that 48.7% of the buried residues are conserved in the design sequences.A much higher native sequence recapitulation of 59.4% is observed in the hydrophobic core residues in the design sequences corresponding to the target proteins.In Figure 2 (and in Table S3), we illustrate the native sequence recapitulation of individual amino acids in the entire protein, buried residues and in the exposed residues corresponding to the design sequences.Considering buried hydrophobic residues, we find significant (greater than 50%) conservation in the Proline (90.5%),Glycine (77%), Alanine (66.7%),Leucine (64.2%),Valine (62.4%) and Isoleucine (52.9%) residues.The Phenyalanine (44.8%),Tryptophan (37.2%) and Methionine (30.5%) residues are comparatively less conserved among the hydrophobic residues in the designed sequences.Remarkably, the Proline (65.6%) and Glycine (75%) residues are even well conserved in the exposed regions of the target proteins.

Intra-group residue conservation
In addition, we report the intra-group residue conservation in the designed sequences (Table S2).Intra-group residue conservation measures the proportion of hydrophobic (nonhydrophobic) residues in native amino acid sequences that are replaced by hydrophobic (non-hydrophobic) residues in the corresponding designed sequences.We find high intragroup residue conservation of 68.4% in the entire protein for the design sequences.Considering the intra-group conservation of buried residues, we find that 93.4% of hydrophobic residues are replaced by the same or other hydrophobic residues in the design sequences.This indicates that SADIE successfully conserves the hydrophobic core of the target proteins that in turn contributes to good quality folded conformations corresponding to the design sequences.Although Banerjee et al. (2021) report higher (41%) native sequence recapitulation than SADIE, the predicted structures corresponding to the design sequences report lower TM-Score and RMSD values with the target structures.Our further analysis indicates that despite higher native sequence recapitulation, the design sequences in Banerjee et al. have lower recapitulation (56.7%) of hydrophobic buried residues.Similarly, Banerjee et al. report lower intra-group conservation (91.4%) of hydrophobic buried residues in the designed sequences.Although SADIE has lower overall native sequence recapitulation, the higher recapitulation of hydrophobic residues in the buried region ensures the conservation of the hydrophobic core of the target protein.SADIE simultaneously ensures the non-conservation of large number of residues in the exposed regions in the designed sequences.The resulting modeled structures have different surface characterization than the target proteins resulting in the possibility of modified binding interactions and novel functionalities.

Efficacy of the cooling schedule
In Table S1, we list (column 8) the PCC between the drops in FoldX energy in the subsequent decoys with the reduction in temperature as computed using simulated annealing for 7500 steps for the 76 designed sequences.A high PCC of 0.80 indicates that the intended effect of the cooling schedule is achieved, where the reduction in temperature ensures that the decoys reach the locality of the optimum during the middle stages and the FoldX energy demonstrate flat convergence towards the end of the simulation.The proposed greedy approach along with the cooling schedule ensures that we arrive at good quality designed sequences in only 7500 iterations in a single simulation trajectory.The convergence of SADIE in a maximum of 7500 iterations is a significant improvement from the maximum 3,00,000 simulation steps (spread across 10 simulation trajectories) used in Mitra et al. (2013a) and Banerjee et al. (2021).This establishes the efficiency of the proposed sequence space search engine algorithm.
We predict the secondary structures of the design sequences using PSSPRED (Yan et al., 2013) and compare them with the native secondary structure of the target

Molecular dynamics based validation of the representative dataset
We conduct an in-depth structure-based analysis on a representative dataset of the design proteins.In Table 2, we find that the design structures report the lowest RMSD of 1.35 Å with the respective target scaffolds listed in the representative dataset.The design structures aligned with the target scaffolds are also illustrated in Figure 3(a)-(e).We observe a marginal rise in the TM-Score (0.91) suggesting better fold level similarity with the target scaffolds.The PSSPRED predicted secondary structures corresponding to the designed sequences also exhibit higher similarity (87%) with the native secondary structures of the target scaffolds.
The RMSD, TM-Score, and secondary structure similarity values of the individual proteins in the representative dataset are provided in Table S4.Additionally, we go through the structure prediction of the designed sequences corresponding to the representative dataset using the state of art RoseTTAFold software (Baek et al., 2021).The predicted models and their RMSD with the target proteins are provided in Figure S1.The design predicted structures report an impressive average RMSD of 0.77 Å with the correponding target proteins.We perform MD simulations using Gromacs 5.1.5(Pronk et al., 2013) to assess the conformational stability of the design structures.In Figure 3(f)-(j), we present the RMSD plots of the modeled design structures along with the target scaffolds for the representative proteins.We find that in two (PDB ID: 1R26, chain A, PDB ID: 2V0U, chain A) out of the five instances the modeled design structures report lower RMSD than the target scaffold for almost the entire duration of the 100 ns MD simulation.In the remaining three instances, no significant trend in unfolding of the modeled design structures is observed in the RMSD plots.The RMSD plots of the five modeled design structures suggest stable folded conformations indicating that the designed sequences fit well in the target scaffolds of the proteins in the representative dataset.The average RMSD values of the target scaffold and modeled design structures across the 100 ns of MD simulation are provided in Table S4.The radius of gyration (Rg) measure is computed as the mass-weighted root mean squared distances of all atoms from the center of mass of a given protein.The Rg measures the compactness of proteins and an unfolded conformation of a globular protein will have a much higher Rg value in comparison to its properly folded conformation.Figure S2 shows that the Rg value of the target scaffolds and the design structures for 100 frames across 100 ns of MD simulation fluctuates within similar boundaries for the five proteins.Interestingly, we observe that the averaged Rg value for each of the five design structures (Table S4, column 9) is lesser than the averaged Rg value of the corresponding target scaffolds (Table S4, column 8).Such an observation again indicates no trends of unfolding as the design predicted structures stay more compact than their corresponding target scaffolds across the duration of the MD simulation.
Finally, we perform a root-mean-square-fluctuation (RMSF) analysis of individual residues for the target scaffolds and design structures across 100 ns of MD simulation.The RMSF plots for the five modeled design structures are provided in Figure 3(k)-(f).Additionally, we compute the PCC (Table S4, column 10) between the fluctuation of individual residues for the target scaffolds and design structures.A high average PCC of 0.71 suggests that the individual residues in the design predicted structures fluctuate similarly to the corresponding residues in the target scaffolds.The RMSD, Rg, and RMSF based analysis of the design predicted structures across 100 ns of MD simulation indicate stable conformations and establish the goodness of the designed sequences for the representative dataset.

ClustENMD based fitness analysis of the designed sequences
We execute ClustENMD simulations to sample the near native conformations of the target proteins in the representative dataset.Analyzing the designed proteins using the same ClustENMD simulations helps us in evaluating the fitness of the design sequences in the context of the ensemble of conformers similar to the target protein.Corresponding to each target protein and designed protein pair in the representative dataset, 90 near native conformers are constructed using ClustENMD.The mean ± standard deviation (SD) and the minimum FoldX energy values for the target protein and the corresponding designed protein in the representative dataset are provided in Table S5.The fact that the designed proteins report lower mean FoldX energies for all the five instances indicate that the designed sequences are better compatible with the near native conformations of the target protein.In four out of five instances the design proteins report lower minimum FoldX energy indicating that there is a conformation corresponding to the designed sequence that is energetically fitter than any conformation sampled by the target protein.Overall, this suggests that SADIE designs sequences that are compatible with the corresponding target scaffolds as well as their near native energetically better conformers.

Modulating molecular functionality using protein design
Protein design deals with the identification of new amino acid sequences that will energetically fit well into a target scaffold.However, the variance in the distribution of amino acids in the context of the target scaffold, particularly in the exposed regions can potentially alter the molecular functionality and interacting landscape of a protein.We perform a COFACTOR (Roy et al., 2012) based analysis of the design structures in the representative dataset to analyze the change in molecular functionality and interacting residues introduced in the designed sequences.
Table S6 describes the COFACTOR predicted (with confidence score greater than 0.5) Gene Ontology (GO) molecular function for the target scaffold (containing the native amino acid sequence) and the design structure (containing the designed sequence) for the proteins in the representative dataset.The green colored entries in Table S6 correspond to the GO molecular function which is found in both the target scaffold and the modeled design structure for a corresponding protein.We find a higher confidence score for molecular functions like catalytic activity and protein disulfide oxidoreductase activity in the design structure for the thioredoxin domain (PDB ID: 1R26, chain A).However, there are certain functionalities like heterocyclic compound binding and protein disulfide isomerase activity that are particularly predicted for the target scaffold and modeled design structure respectively.In CISK-PX (PDB ID: 1XTE, chain A), we identify only the phosphatidylinositol binding function associated with the target scaffold, while the same functionality is predicted with a higher confidence score for its designed counterpart.
Several different molecular functions like phospholipid binding, catalytic activity, etc. are listed for the modeled design structure of the CISK-PX protein.Considering hnRNPK (PDB ID: 1ZZK, chain A), we find that the modeled design structure retains four out of the six molecular functions of the target scaffold with similar prediction confidence scores.However, molecular functions like heterocyclic compound binding and transcription factor activity, sequence-specific DNA binding are particularly predicted for the target scaffold and modeled design structure of hnRNPK respectively.Similarly, in LOV2 (PDB ID: 2V0U, chain A) there are three out of six molecular functions in the target scaffold which are reported in the modeled design structure with a higher confidence score.Again, molecular functions like transferase activity and ATP binding are found to be particularly exclusive to the target scaffold and modeled design structure respectively.Finally, in TIF1 (PDB ID: 3I4O, chain A), we find that the modeled design structure retains only three out of the eight molecular functions of the target scaffold.Certain functionalities like heterocyclic compound binding, ribonucleoprotein complex binding, etc. predicted in the target scaffold are absent in the modeled design structure of the TIF1 protein.The COFACTOR-based analysis indicates that the designed sequences predicted by SADIE modulate existing functionality and also introduce a new molecular function to the target scaffolds of the proteins in the representative dataset.
In addition, our binding site analysis as predicted by COFACTOR is listed in Table S7 for only two representative proteins.The maximum pairwise confidence score of prediction for the binding sites of the remaining three proteins is below 0.35 and hence has not been considered for further analysis.We find that 73.3% (11 out of 15) binding sites have been preserved in the design predicted structure of the hnRNPK protein.Interestingly, in the LOV2 protein, while 65.4% (17 out of 26) binding site residues are preserved, two new binding site residues are predicted in the modeled design structure.Although the target scaffold and the modeled design structure have an excellent RMSD of 0.82 Å, SADIE introduces variations in the binding landscape of the LOV2 protein.

Conclusion
For the last two decades, computational protein design has been an area of active research.With immense potential in areas of protein engineering and molecular therapeutics, the prediction of novel amino acid sequences that will fold to a target three-dimensional structure is an important and challenging problem.In spite of good amount of research, there is still enough potential for the formulation of novel energy functions and efficient search algorithms.In the present work, we propose greedy simulated annealing-based Monte Carlo search algorithm for efficient convergence in protein design.While an evolutionary profile guides the investigation, the greedy approach and the cooling schedule implemented during the Monte Carlo simulation ensures sufficient exploration and exploitation of the sequence-structure compatibility search space.The inherent parallelism in the proposed algorithm confirms the exploration of multiple sequences in a single step, which in turn leads to an efficient implementation of the greedy search strategy.
The SADIE algorithm when evaluated on a dataset of 76 proteins reports an RMSD of 1.07 Å and a TM-Score of 0.93 with the corresponding target proteins.Starting from random amino acid sequences, the proposed algorithm successfully generates alternative residue sequences that are structurally similar to the three-dimensional scaffolds of the target proteins.The high sequence recapitulation observed in the design sequences, more so in the buried and hydrophobic buried residues again establishes the goodness of the proposed algorithm.A high (93.4%)intra-group recapitulation of hydrophobic residues in the buried region indicates that SADIE preserves the core residues in the protein and provides alternative residue combinations in the exposed regions of the target protein.While this approach maintains the structural specifications in the designed sequences, the variability of residues in the exposed regions possibly gives rise to new molecular functionalities and alternative binding preferences.A high PCC (0.80) between the reduction in Boltzmann temperature and the corresponding reduction in FoldX energy in the decoy sequences achieves the intended effect in the cooling schedule.The proposed simulated annealing approach ensures that the decoys reach to the minimum locality during the middle stages and demonstrate flat convergence during the end of the simulation.Rigorous MD simulation-based validation of the design sequences for a representative dataset further establishes the efficacy of the proposed algorithm.A COFACTOR-based analysis shows that the designed sequences corresponding to the target structures in the representative dataset exhibit altered molecular functionality.The same idea can be extended to the selective design of interface residues to modulate affinities in protein-protein interactions.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
The author(s) reported there is no funding associated with the work featured in this article.

Figure 1 .
Figure 1.The SADIE Algorithm for computational de-novo protein design.

Figure 2 .
Figure 2. Native sequence recapitulation of the 20 amino acids in the entire protein, buried residues and exposed residues in the 76 designed sequences.

Figure 3 .
Figure 3.Further analysis with the proteins bearing PDB IDs: 1R26 (chain A), 1XTE (chain A), 1ZZK (chain A), 2V0U (chain A), and 3I4O (chain A), respectively.(a-e) The design structures (in lime green) aligned with the target scaffolds (in magenta); (f-j) RMSD of the predicted structures (in lime green) and of the target scaffolds (in magenta) for 100 ns of MD simulation; (k-o) RMSF of the individual residues in the design predicted structures (in lime green) and in the target scaffolds (in magenta) for 100 ns of MD simulation.

Table 1 .
Comparison of SADIE with existing methods based on goodness of design measures.

Table 2 .
Comparison of SADIE with existing methods on the representative dataset.
scaffolds.SADIE reports an average secondary structure similarity of 82.5% for the 76 target scaffolds despite a 34% sequence identity with the native sequences