Molecular evolution and structural analyses of the spike glycoprotein from Brazilian SARS-CoV-2 genomes: the impact of selected mutations

Abstract The COVID-19 pandemic caused by SARS-CoV-2 has reached by February 2022 more than 380 million cases and 5.5 million deaths worldwide since its beginning in late 2019, leading to enhanced concern in the scientific community and the general population. One of the most important pieces of this host-pathogen interaction is the spike protein, which binds to the hACE2 cell receptor, mediates the membrane fusion and is the major target of neutralizing antibodies against SARS-CoV-2. The multiple amino acid substitutions observed in this region, specially in RBD have enhanced the hACE2 binding affinity and led to several modifications in the mechanisms of SARS-CoV-2 pathogenesis, improving the viral fitness and/or promoting immune evasion, with potential impact in the vaccine development. In this work, we identified 48 sites under selective pressures, 17 of them with the strongest evidence by the HyPhy tests, including VOC related mutation sites 138, 142, 222, 262, 484, 681, and 845, among others. The coevolutionary analysis identified 28 sites found not to be conditionally independent, such as E484K-N501Y. The molecular dynamics and free energy estimates showed the structural stabilizing effect and the higher impact of E484K for enhanced binding affinity between the spike RBD and hACE2 in P.1 and P.2 lineages (specially with L452V). Structural changes were also identified in the hACE molecule when interacting with B.1.1.7 RDB. Despite some destabilizing substitutions, a stabilizing effect was identified for the majority of the positively selected mutations. Communicated by Ramaswamy H. Sarma


Introduction
The Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the etiological agent of the COVID-19 pandemic. The virus is composed of an enveloped positive-sense singlestranded RNA genome, which encodes 16 non-structural proteins (nsp1-16), four structural proteins (Spike, Membrane, Envelope, and Nucleocapsid), and other accessory proteins (ORFs 3a,6,7a,7 b,8,10) (Fehr and Perlman 2015;Liu et al., 2014). The spike (S) glycoprotein is necessary for the viral binding to ACE2 host cell receptor, it is common to multiple coronaviruses (e.g. MERS-CoV and SARS-CoV) (Fehr and Perlman 2015), and is the major target of neutralizing antibodies against SARS-CoV-2 Yuan et al., 2020). Spike has a homotrimeric form where each monomer (protomer) is composed by a S1 subunit that mediates the receptor binding and an alpha-helix rich S2 subunit that promotes the subsequent membrane fusion .
Since the emergence of the first SARS-CoV-2 in the early Wuhan epidemic, several mutations have been identified, both occurring isolated or as a signature of a broader mutational complex. Although S protein represents only 13% of the viral genome, substitution at this portion, specially in Receptor Binding Domain (RBD), has been overrepresented suggesting immunodominance of this protein (Greaney et al. 2021;Weisblum et al., 2020). These mutations can enhance receptor binding, either directly or allosterically, alter viral fitness or promote immune evasion, ensuing occasional reinfection and possibly impacting the efficacy of developed vaccines (Greaney et al. 2021;Weisblum et al., 2020;Harvey et al., 2021;Korber et al. 2020).
The S protein mutation D614G (aspartic acid to glycine substitution at amino acid position 614) has been progressively dominant worldwide since March 2020, playing an essential role in infectivity and augmented viral load (Korber et al. 2020;Groves et al., 2021;Jackson et al., 2021;Yurkovetskiy et al., 2020). This substitution may have emerged independently and promptly became dominant in several viral strains, outcompeting D614 harboring viruses, even where they originally appeared. D614G abolishes a hydrogen bond between this position and a threonine present in S2 of the neighbour protomer. The consequent conformational changes acts allosterically favouring the maintenance of RBD in an activated "up" or "open" position, which exhibits its ACE2 binding site (Receptor Binding Motif) for longer periods. This allows a more efficient binding to ACE2 (Groves et al., 2021), with consequent higher replication and viral loads. However, G614 mutants are similarly (or even more) susceptible to immune neutralization as the original D614 variant (Plante et al., 2021;Weissman et al., 2021). Notably, this enhanced epitope exposure of "up" RBD may be a "weakness", since RBD is poor in predicted O and Nlinked glycan sites required for immune shielding. In theory, this undesirable effect could be compensated by accumulation of other RBD mutations, leading to increased infectiousness, and rendering D614G unnecessary. Specifically, the "HMN 19B variant", recently described in France, is a direct descendent from the earlier 19B that have predominated before D614G harboring lineages took over (Fourati et al., 2021).
The codon 501 in RBD has been considered another major mutational hotspot. Substitutions at this position have been associated with increased binding affinity to ACE2 (Gu et al., 2020;Starr et al., 2020). The non-synonymous mutation from an asparagine to a tyrosine at position 501 (N501Y) has first appeared in samples from Wales and other parts of the world. However, its actual emergence and dissemination have been related to coevolution with different mutations. The deletion of H69-V70 in the N-terminal-domain, co-occurring in B.1.1.7 lineage, is particularly important for the N501Y emergence (Meng et al. 2021). N501Y has increasingly been demonstrated in coevolution with many different mutational signatures, for instance, in the Variants of Concern (VOCs) B.1.351 and P.1, as well as in the more recent HMN 19B aforementioned. Data suggests that some N501Y harboring viruses could be 50% more transmissible and up to 61% more lethal, mainly due to major changes in the electrostatic interaction between Spike and ACE2 receptors (Washington et al., 2021;Davies et al., 2021). This would be related to the substitution of a relatively small asparagine (N) for a larger aromatic tyrosine (Y), allowing for an extra contact site with ACE2 (Gu et al., 2020;Ali et al., 2021;Tang et al., 2021). Interesantly, N501Y increases host diversity allowing direct infection of non-humanized mice (Rathnasinghe et al., 2021).
The E484K mutation in RBD has been drowning progressive attention. It is an amino acid replacement of glutamic acid to lysine at amino acid position 484, which could significantly alter the complementarity of antibodies to the RBD region leading to immune evasion (Greaney et al. 2021;Weisblum et al., 2020;Baum et al., 2020). In fact, both in a Deep Mutational Scanning (DMS) experiment and in a selective pressure study using 19 types of mAbs, different substitutions at this position (E484K/Q/P/A/D/G) were consistently demonstrated as being the most critical for decreasing antibody neutralization titers (Greaney et al. 2021;Harvey et al., 2021). Since this mutation arose independently in multiple lineages (Ferrareze et al., 2021), at the beginning it may represent a common evolutionary solution for viral maintenance. E484K emerged in a few selected lineages where it co-exists with other signature substitutions (e.g. in B.1.351 and in P1), similarly to the phenomenon previously described for N501Y. Importantly, the combination of E484K and N501Y seems to induce more conformational changes than the N501Y mutant alone, potentially altering antibodies binding to this region and resulting in the immune evasion phenomena (Nelson et al., 2021).
The emergence of three independently evolving lineages (B.1.1.7, B.1.351, and P.1) Rambaut, 2020;Tegally et al., 2021), all characterized by a constellation of mutations (including the aforementioned del 69-70, K417N/T, E484K, and N501Y convergent mutations) in RBD, has prompted concerns about the evolutionary forces leading to the SARS-CoV-2 adaptation. The presence of these sets of consistently present substitutions are highly suggestive of coevolutionary mutational processes (Martin et al., 2021). In addition to these convergent mutations, other substitutions of these VOCs are also associated with positive selection. For example, there is evidence that eight out of the 10 lineagedefining mutations in the spike protein of the P.1 lineage are under diversifying positive selection . It is possible to detect selection along prespecified lineages that affect certain subsets of codons in a protein-coding gene. Sites 484 and 501 (located in the RBD) definitively show a pattern of nucleotide diversification that can be linked with positive selection . On the other hand, alternative evolutionary processes that could have led to this plethora of substitutions have been rarely described thus far. Recombination, which is a recurrent and well documented process of molecular evolution of coronaviruses, has been described between B.1.1.7 and other lineages , but their real role in SARS-CoV-2 evolution remains elusive.
The present work aimed to evaluate the effect of positively selected mutations maintained in the Brazilian SARS-CoV-2 lineages and to check for mutational coevolution evidence. Additionally, we evaluated the impact of some mutations identified in some VOC and VOI lineages (C.37, B.1.1.7, P.1 and P.2) of Brazilian samples on the structural stability of the spike protein, as well as their possible association with more aggressive infection profiles by binding affinity estimation in the RBD -hACE2 complex.

Spike phylogenetic analysis
The multiple sequence alignment with NC_045512.2 as reference was performed by the MAFFT v.7 web server (Katoh et al. 2019) with default parameters and 1PAM/j ¼ 2 0 scoring matrix for closely related DNA sequences. For the spike sequence analysis, the region between the positions 21,562 and 25,384 (spike regions in the NC_045512.2 reference genome) were selected from the multiple sequence alignment previously performed. The deletion of sequence duplicates (identical spike sequences) kept 2,901 unique spike sequences. The clade assignment and variant calling was performed by Nextclade (https://clades.nextstrain.org/). The phylogenetic analysis was started by the inference of the best evolutionary model by ModelTest-NG (Darriba et al., 2020), which identified GTR þ R4 in all selection strategies. The phylogenetic tree reconstruction was performed by the Maximum Likelihood method in the IQ-TREE program (Nguyen et al., 2015), using 1,000 replicates of ultrafast bootstrap (Hoang et al., 2018) and a Shimodaira-Hasegawa-like approximate likelihood ratio test (SH-aLRT) with 1,000 replicates (Guindon et al., 2010), 2,000 iterations and the optimization of the UFBoot trees by NNI on bootstrap alignment. The tree visualization was performed by FigTree software (http://tree.bio.ed. ac.uk/software/figtree/).

Spike selection and coevolutionary analyses
The multiple sequence alignment of the 2,901 unique spike sequences and the phylogenetic tree previously built were used as input in the HyPhy program (https://www.hyphy.org/). To perform different selection tests, the methods: (i) Fast Unconstrained Bayesian AppRoximation (FUBAR) (Murrell et al., 2013), (ii) Fixed Effects Likelihood (FEL) (Kosakovsky Pond & Frost, 2005), and (iii) Single-Likelihood Ancestor Counting (SLAC) (Kosakovsky Pond & Frost, 2005) were evaluated. The analysis of coevolution across sites in the spike sequences was performed by BGM (Bayesian Graphical Model) (Poon et al., 2007), which detects coevolutionary interactions between amino acid positions in a protein using a Markov Chain Monte Carlo (MCMC) method.

Spike structural stability
The analysis of the spike structural stability by changes in free energies upon mutation (DDG kcal/mol) was performed using PDB ID 6XR8 (chain A) as the structure of the fulllength prefusion conformation of the SARS-CoV-2 spike protein (Cai et al., 2020) using five methodologies: DynaMut (Rodrigues et al., 2018), FoldX Suite v5.0 (Schymkowitz et al., 2005), iMutant3.0 (Capriotti et al., 2005), MAESTRO (Laimer et al., 2016), and PremPS  DynaMut. It performed the analyses of vibrational entropy and total energy using PDB ID 6XR8 (chain A). DynaMut implements a Normal Mode Analysis (NMA) through Bio3D (Grant et al., 2006) and ENCoM (Frappier & Najmanovich, 2014) approaches, providing rapid and simplified access to insightful analyses about protein motions. Moreover, DynaMut also enables rapid analysis of the impact of mutations on dynamics and stability of proteins resulting from vibrational entropy changes.
FoldX. For the energy estimation, the analysis and correction of structural problems resulting from crystallography in PDB ID 6XR8 were performed using the RepairPDB command and additional parameters -water ¼ crystal, -pH ¼ 7.4 and -temperature ¼ 309.65. Posteriorly, the mutagenesis process was applied for all the positively selected sites with the BuildModel function (additional parameters: -water ¼ crystal, -pH ¼ 7.4, -temperature ¼ 309.65 and -numberOfRuns ¼ 5).
Mutant3. It was selected to estimate the free energy changes by a support vector machine (SVM)-based tool. This system predicted the sign of the protein stability change upon mutation and as a regression estimator predicted the related DDG values in the physiological pH (7.4) and temperature (36.5 C).
Maestro:. The Multi AgEnt STability pRedictiOn software was used to estimate the changes in unfolding free energy upon point mutation through a machine learning system.
PremPS. The estimation of the unfolding Gibbs free energies was performed by a random forest regression scoring function using the ProTherm database for parameterization. Negative values indicate stabilization by the decrease of the free energies.

Spike RBD comparative homology modelling
To perform an accurate estimation of the binding free energy associated with mutations in the spike protein belonging to different lineages, spike RBD -ACE2 protein complexes were modelled for the reference SARS-CoV-2 spike (YP_009724390.1) and lineages C.37, B.1.1.7, P.1, P.2, and P.2 þ 452. The PDB file 6M0J was selected as the best template (2.45 Ð and 194 amino acids) to the modeling using the MODELLER pipeline (Webb & Sali, 2016). Five models were generated for each sequence. The thirty resulting structures were evaluated in relation to stereochemical parameters and structural quality by the programs PROCHECK (Laskowski et al., 1993) and VERIFY3D (Eisenberg et al., 1997), available on SAVES v6.0 web server (https:// saves.mbi.ucla.edu/). The analysis of the Ramachandran plot statistics, G-factors and residue properties (PROCHECK), as well as the evaluation of the 3D-1D score using VERIFY3D, allowed the selection of the best modelled RBD structures to be subsequently used for the energy calculations. The generation of the spike RBD -ACE2 complexes for each different lineage was performed with the PyMOL software by the structural alignment of the RBD models with the 6M0J crystal template, which was the source of the ACE2 structural coordinates.
Molecular dynamics and binding free energy estimation.
The structures of the fragments comprising residues 333 until 526 of the wild-type (reference) spike protein, as well as of the variants P.1 (K417T, E484K, N501Y), P.2 (E484K), P.2 þ 452 (E484K, L452V), C.37 (L452Q, F490S), and B.1.1.7 (N501Y) complexed with the human ACE2 protein (residues 19 until 615) in the pdb format were used as input for classical molecular dynamics simulations using GROMACS (Abraham et al., 2015) with the AMBER03 force field (Duan et al., 2003). The AMBER family is one of the most widely used and validated force fields for classical simulations of biological molecules, with many optimizations and refinements leading to a satisfactory level of accuracy (Georgoulia & Glykos, 2019). Among the AMBER family the AMBER03, a third-generation point-charge all-atom force field, is one of the most used within the GROMACS suite of programs. In the preparation of the systems for the molecular dynamics simulations, the GROMACS default protonation state for pH 7 was used.
The spike-ACE2 complexes were simulated in cubic boxes with periodic boundary conditions, solvated with TIP3P water molecules (Jorgensen et al., 1983) and with sodium and chloride ions corresponding to physiological concentration. The van der Waals interactions were calculated using a cutoff radius of 1.2 nm and the electrostatic interactions were calculated with the Particle Mesh Ewald (PME) method (Darden et al., 1993). All systems were initially energetically minimized using conjugate gradients and steepest descent algorithms. After minimization, they were submitted to a 500 ps position-restrained simulation where the coordinates of both proteins were restrained, allowing the solvent and ions to relax without disturbing the geometry of the complex. After the position restrained simulation, a thermalization phase consisting of a sequence of three unrestricted molecular dynamics simulations with 5 ns each in temperatures of 200 K, 240 K, and 280 K was carried out. The production phase consisted of 200 ns long simulation runs, in the NPT ensemble, using a Nos e-Hoover thermostat (Hoover, 1985;Nos e, 1984) and a Parrinello-Rahman barostat (Parrinello & Rahman, 1981).
After simulations, the trajectories were visually analyzed using VMD (Humphrey et al., 1996) and GROMACS tools to quantify the structural and thermodynamic stability of the complexes, the number of hydrogen bonds and the interface area. The interface area was computed taking SASA (Solvent Accessible Surface Area) (Eisenhaber et al., 1995) of ACE2 and spike minus SASA of the complex and dividing the result by two (because the contact was counted twice).
Spike RBD-ACE2 binding free energy calculation. The spike-ACE2 binding free energy was calculated using Molecular Mechanics Poisson-Boltzmann Surface Area (MM/ PBSA) binding free energy calculations (Baker et al., 2001;Homeyer & Gohlke, 2012), employing sets of 200 configurations (1 ns spaced snapshots obtained from the molecular dynamics trajectories) for each system. The calculations were carried out using the program g_mmpbsa (Kumari et al., 2014), which is compatible with GROMACS, using as settings a gridspace of 0.5 A, salt concentration of 0.150 M, solute dielectric constant of 2, and estimating the nonpolar solvation energy using the solvent accessible surface area (SASA). The contributions of the residues to the binding free energy were calculated with the same program. Although this method does not yield a quantitative estimate of absolute binding free energies, nevertheless it is known to be reliable in the estimation of relative binding free energies comparing similar systems (Rifai et al., 2019), as in the present case.

Results
Spike protein phylogenetic analyses. The spike phylogenetic analysis of 2,901 unique sequences showed the formation of a large monophyletic group containing two main subclades: one formed by P.1 and P.1.2 sequences and another formed by the B.1.1.7. B.1.1.28 and P.2 sequences were located at the basis of the P.1 subclade, while some B.1.1.33/N.9 sequences are basal to the B.1.1.7 subclade. Interesting to note that B.1.1.7 formed one monophyletic group ( Figure 1A). Based on spike sequences, the P.2, B.1.1.28 and B.1.1.33 lineages did not form individual clades. Finally, it was possible to observe the formation of a large basal clade with B.1.1.28 and B.1.1.33 sequences, with the B lineage (among others) as its ancestor. Despite P.2 being considered as derivative from B.1.1.28 by the inclusion of the E484K mutation (D614G þ E484K þ V1176F), the presence of this substitution along the tree (as well D614G and V1176F) indicated that the formation of all the small clades is probably due to the high variable combinations of mutations found in these lineages.
The genetic analysis of these 2,901 unique Brazilian spike sequences showed that 575 amino acid sites presented missense substitutions in different lineages, with a mutation rate of 7.79 amino acid substitutions per spike sequence/genome, since January 2020. With a range between 0 and 16 amino acid substitutions and a different average count in each lineage ( Figure 1B), the spike sequence set was mostly represented by lineages such as P.1 (45.71%), P.2 (14.34%), B.1.1.28 (13.82%), B.1.1.33 (10.62%), and B.1.1.7 (4.03%), from 53 identified lineages. The evaluation of the mutation rate before and after the P.1 first occurrence (October 2020) indicated that the spike sequences showed an estimated amino acid substitution of 2.12 events per genome between January and September, 2020. However, from October 2020 up to June 06 2021, this rate was increased to 8.99 amino acid mutations for each spike sequence.

Spike selection analyses. The selection analysis performed
with HyPhy for site-to-site tests evaluated the presence of diversifying and purifying selection in the spike protein using a set of 2901 sequences, which represents the genetic variability for 11,054 genomes (24 sequences were excluded due to the presence of >5% of ambiguous sites or truncating insertions in the final alignment). The FUBAR method identified 20 sites under adaptive pressure along the phylogeny (Table 1 and Supplementary File 2). Among these, the mutation in site 484 is represented by three missense substitutions: (i) E484K, which was found in 71.36% of all genomes, belonging to 11 different lineages, (ii) E484Q, present in 0.22% of the sequences, including the B.1, B.1.1.28 and P.5 lineages, and (iii) a new substitution found in only one genome, E484D. Other sites related to known mutations from VOC lineages such as P.1, B. The FEL method was mainly tested to detect negatively selected sites, since more powerful methods such as FUBAR do not evaluate purifying selection. With the initial assumption that the selective pressure for each site is constant along the entire phylogeny, FEL indicated sites that are evolving under positive and negative selective pressures in the spike protein sequence. As result, 239 sites were marked as targets of purifying selection (Supplementary File 3) along the phylogeny, while other 44 sites were suggested to be under adaptive selection. Interestingly, all positively selected sites indicated by the FUBAR method were found by FEL (Table 1 and Figure 2A), including those with low frequency in the analysed genomes. Some of them were indeed related to the VOC lineages. Among the sites identified only by the FEL test are the residues 29, 49, 63, 76, 78, 367, 483, 553, 585, 689, 747, 1027, 1084, 1124, 1133, and 1167.
Finally, using a modified version of the Suzuki-Gojobori counting algorithm, the SLAC method assumes that the selective pressure for each site is constant. Despite being the weakest method of this group (Kosakovsky Pond & Frost, 2005), the analysis of pervasive site-selection with SLAC indicated the presence of 130 sites under negative selection, of which 112 were previously found by FEL (Supplementary File 3 and Figure 2B). In relation to the 29 positively selected sites, 17 of them were previously identified by the FUBAR and FEL methods (Table 1), with eight also found by FEL (12, 26, 98, 570, 653, 684, 846, and 1078) and four only identified by SLAC (27, 701, 769, and 1228). Considering the consensus sites identified by the three selection methods, the residues 5, 21,67,75,138,142,222,262,484,572,614,681,688,845,1176, 1219, and 1264 presented the strongest evidences for positive selection pressure on the spike protein (Table 1 and   Figure 2A). Following are those recognized by two methods such as sites 12, 26, 98, 417, 501, 570, 653, 684, 1078, and 1260. These missense substitutions were identified in up to 15 different lineages (except for mutations in site 614). This is the case of V1176F, which arose as a B.1.1.28 lineage-defining mutation and is now spread along the phylogenetic tree. The arithmetical average calculation (without site 614) showed an occurrence of 3.22 lineages per each amino acid substitution type, while N501Y was found in 10 lineages, N501T was described in only four. Despite the occurrence of different substitutions in the same site for different lineages, generating an average of 2.19 amino acid mutation possibilities for each mutated site, some lineages were predominantly found. From the 37 identified lineages (excluding those from site 614), the variability of P.1 genomes comprised 45 of the 48 sites under adaptive selection considering all predicted sites by FUBAR, FEL, and SLAC, in some cases, covering multiple substitutions on the same site (e.g.: A67S/V/P, P681H/L/R). For the sites predicted by the three methods  Spike structural analyses. To analyze the impact of the spike sites under adaptive selection on the protein stability and host-pathogen interaction, two main categories of structural analyses were applied related to: (i) the folding/unfolding free energies and the vibrational entropy: a reference spike structure PDB ID 6XR8 (relative to the NC_045512.2 genome) was used to estimate the energy changes caused by these positively selected single mutations in the prefusion   conformation of the spike protein and (ii) protein molecular dynamics: SARS-CoV-2 spike RBD models belonging to different viral lineages in a complex with the human ACE2 were simulated using molecular dynamics and submitted to estimation of binding free energies to provide a better understanding of the effect of those sites under positive selection (single and combined) in the viral fitness.
Spike structural stability. The evaluation of the structural stability of the spike protein considered the results of five different methodologies in order to find a majority consensus. The application of DynaMut, FoldX, iMutant, MAESTRO, and PremPS provided the estimate of the unfolding and total free energies, as well as the vibrational entropy of the mutated structures. Despite some sites presenting opposite trends in relation to the stabilizing/destabilizing impact, which denotes the specific differences in energy calculations among these algorithms, a majority consensus result (found by 3 or more tests) was obtained (Table 3, Supplementary File 5). Sites without resolution in the crystallographic structure, located at N/C-terminal regions, as those in the furin cleavage site (among others), were not analyzed for the energy estimation. The PremPS algorithm was used to calculate the changes in unfolding Gibbs free energy generated by each single mutation in the full-lenght prefusion conformation of the spike protein. As observed in Table 3, important mutations for the VOC lineages B.1.1.351, B.1.1.7 and P.1, such as E484K and N501Y, seem to stabilize the protein structure by the decrease of the unfolding free energy change (-0.16 and À0.89 kcal/mol, respectively). Substitutions as those from residues 138 and 585 presented the lowest DDG values, which may suggest a higher stabilizing effect of the mutations D138Y and L585F in lineages as P.1 and P.2, among others. However, for site 585 (Supplementary File 5), a destabilizing effect was found by three methods (iMutant, DynaMut and FoldX).
The analysis of the impact of the positively selected mutations on protein dynamics and stability was performed with DynaMut. The higher vibrational entropy (DDG Vib ENCoM) found in sites 21 (R!S) (Table 3), 78 (R!S/T) and 417 (K!T/ N) (Supplementary File 5) indicated an increased molecular flexibility (>0.5 kcal/mol) due to the loss of molecular interactions. Moreover, the consensus results correlated with the predicted destabilizing impact. In other sites, such as 49 (Supplementary File 5), the conformational molecular rigidity caused by a reduced vibrational entropy (À2.809 kcal/mol) follows the stabilizing effect predicted by DynaMut, MAESTRO, and FoldX. Especially, site 689 (Supplementary File 5) achieved the lowest DDG vib , with an estimated reduction in the molecular flexibility of À4.514 kcal/mol in the serine to isoleucine substitution. Although the substitution of a negatively charged glutamic acid by a positively charged lysine increased the vibrational entropy in the E484K-mutated protein, slightly decreasing the molecular rigidity (DDG vib ¼ þ0.246 kcal/mol), a potential stabilizing effect of this mutation was predicted by PremPs, DynaMut, MAESTRO, and FoldX analyses (Table 3).
The iMutant3.0 web server was used to evaluate the changes in the thermodynamic stability of folded proteins (Table 3, Supplementary File 5). Using the physiological pH (7.4) and temperature (36.5 C) as parameters, the SVM algorithm provided the estimation of the free energy change (DDG) between the wild-type and the mutated 6XR8 spike structures. According to the analysis of the thermodynamic properties, the mutations marked as positively selected seem to destabilize the protein structure. However, the majority consensus evaluation did not confirm this trend, suggesting 51 substitution events (from 33 sites) as stabilizing mutations.
Finally, the FoldX pipeline repaired the 6XR8 PDB file by the correction of residues with bad torsion angles or van der Waals clashes and performed the energy minimization testing different rotamer combinations. As result of the mutagenesis and total energy estimation, the mutations in sites T29A (Supplementary File 5), G142D (Table 3), and A570D (Supplementary File 5) were considered as highly destabilizing, increasing the total energy in 1.936, 2.848 and 2.939 kcal/mol, respectively. Despite some energy differences being categorized as neutral, the increase or decrease of these values indicated by sign (±) suggests the potential effect of these mutations on the structural stability of spike, in the tested conditions.   Sites with no covered position by the structural analysis: 5, 12, 75, 76, 681, 684, 688, 1167, 1176, 1219, 1228, 1260, and 1264 The comparison between the mutational effect predicted by the five methods and the majority consensus result are presented in Figure 3. The evaluation of all existing substitutions for each positively selected site by the majority consensus indicated the prevalence of stabilizing mutations, which was observed for the important VOC and VOI mutated sites. The analysis of the positively selected sites according to the prevalent substitution per site obtained a similar profile, with 25.71% of the mutated sites generating a destabilizing effect in the spike prefusion structure (Figure 4).
Spike RBD-ACE2 structural modelling. In order to generate optimized RBD structures with corrected rotamer conformations upon mutations, a homology modelling protocol was applied. In total, 30 theoretical models with 194 amino acid length for the spike RBD -ACE2 protein complexes were obtained using PDB ID 6M0J as template, being five models generated for each lineage (reference, P.1, P.2, C.37, B.1.1.7, and P.2 þ 452). The model quality was evaluated by analysis of DOPE score and through PROCHECK/ VERIFY3D parameters (Supplementary File 6). The models covered the region between the amino acid positions 333 and 526 of the reference spike protein sequence (relative to the genome NC_045512.2) and from five SARS-CoV-2 lineages: C.37 (L452Q þ F490S), B.1.1.7 (N501Y), P.1 (K417T þ E484K þ N501Y), P.2 (E484K), and P.2 þ 452 (E484K þ L452V) ( Figure 5). The final selected models for the reference as well as for the lineages C.37, B.1.1.7, P.1, P.2, and P.2 þ 452 had 88.7 up to 89.9% of the residues in favoured regions and 10.1 up to 11.3% of the residues in allowed regions, which indicated the good model quality. There were no residues in generously allowed or disallowed regions (Supplementary File 7). The overall G-Factor values ranged between À0.06 and À0.12. For all selected models, the compatibility of the atomic model with the amino acid sequence was defined with 100% of the residues achieving an average 3 D-1D score ! 0.2 in the VERIFY3D analysis.

Molecular dynamics and binding free energy estimation.
The analysis of the RMSD, center of mass and minimum distances (Supplementary File 8) indicated that all systems were structurally stable, in the considered time window, with overall conservation of the arrangement of the complex, as can also be seen by visual inspection of the snapshots (Figures 6  and 7). In Figure 6 we show 50 ns spaced snapshots on the whole trajectories whereas in Figure 7 the last snapshot of each simulation is shown, fitted to the reference (wild-type spike). Concerning the structural evolution, the lowest changes were found in the reference and lineages P.1 and C.37, meaning higher structural stability. Some structural changes in spike were found in the P.2 and B.1.1.7 lineages, whereas for P.2 þ 452 and B.1.1.7 the structural changes were also noticeable in relation to the ACE2 molecule. For all systems, the spike-hACE2 interaction was characterized by the presence of many close contacts, which is reflected in a substantial interface area, and many hydrogen bonds ( Figure  8 and Table 4). Figure S8 shows the ACE2 -Spike distance plots (center of mass and minimum distance). Despite the similar results, indicating that the relative position between the spike and ACE2 is not altered during the simulation time, the lineages P.2 þ 452, C.37 and B.1.1.7 showed a small alteration in the  distance, with a slight approximation between spike and ACE2 (Supplementary File 8). In the reference system, there were on average 9.51 ± 2.09 hydrogen bonds and an interface area 10.15 ± 0.55 nm 2 . The systems P.2, P.2 þ 452, and C.37 displayed a higher number of hydrogen bonds and interface area than the reference, whereas the lineage P.1 showed larger interface area but less hydrogen bonds and the lineage B.1.1.7 showed smaller interface area and number of hydrogen bonds. About the number and temporal evolution of the hydrogen bonds, the order followed: P.2 > P.2 þ 452 > C.37 > reference > P.1 > B.1.1.7. The intensity of the hydrogen bonds in the lineages P.2 and P.2 þ 452 is reflected in the interaction intensity with ACE2. As for the contact area between spike and ACE2, proportional to the intensity of the van der Waals interactions (also reflected in the interaction intensity), the order followed with a slight change: P.2 þ 452 > P.2 > P.1 > C.37 > reference > B.1.1.7 ( Figure 8 and Table 5). Table 5 shows the binding free energies calculated using MM/PBSA. In all cases, the binding free energy of the mutant spikes was found to be more negative (stronger binding) than the native one and the order is roughly similar to the intensity of interface area or hydrogen bonds: P.2 þ 452 > P.2 > P.1 > B.1.1.7 > C.37 > reference. Specifically, for P.2 þ 452 and P.2, the binding intensity was significantly stronger than the reference, while for B.1.1.7, C.37 and reference the interaction intensity was almost the same (Table 5).
In relation to the hotspots for the RBD -ACE2 interaction, the residue distances for the wild-type (reference) and the mutated sites were calculated ( Supplementary File 9-13). The spike K417 interacts with hACE2 residues T27, D30, K31, and H34 in the wild type, while the mutation for T417 in P.1 maintains the same contacts. In the P.2 lineage, K417 is affected by the E484K mutation, changing the residue interactions to the hACE2 D30, N33, H34, and P389, which is the same profile identified in B.1.1.7, despite the higher distance variation during the simulation time. The impact of the E484K þ L452V combination in P.2 þ 452 changes the residue interaction of K417 to D30, H34, Q388, and P389. For the C.37 lineage, the effect of L452Q and F490S modified the residue interaction of K417 to T27, D30, H34, and P389. The residue N501 in the wild type RBD-ACE2 complex interacts with L351, G352, K353, G354, D355, and Y41. This same pattern is kept by P.1, with a slight reduction in the Y501-K353 distance. N501 of P.2, P.2 þ 452 and C.37 lineages presented the same ACE2 interactions. However, C.37 showed a slight increase in the distance value for F356 and D355 in the middle and end of the simulation time. Finally, Y501 in B.1.1.7 showed the highest variation, creating a new interaction with hACE2 residue H34.
In relation to the magnitude of interaction, all three topranked lineages bear the same E484K mutation, which yielded a particularly strong binding, remarkably stronger than in the native case. Indeed, the analysis of the contribution of the residues to the free energy of binding showed that the mutation of the negatively charged residue GLU 484 to the positive charged residue LYS 484 yielded a substantial negative free energy contribution to the stabilization of the complex. The mutation in the residue 501 (ASN to TYR, both polar non charged residues) presented in the lineages P.1 and B.1.1.7 can also contribute to the stabilization of the complex, but much less than E484K (Table 6).

Discussion
The spike protein is one of the most rapidly evolving regions in the SARS-CoV-2 genome, with slightly different evolution rates according to the clades, emerging in the beginning and middle 2020 (Pereson et al., 2021). Despite the SARS-CoV-2 evolution may be driven by non-selective forces as genetic drift and founder effect, the importance of the adaptive selection can be seen by the maintenance of novel advantageous mutations in geographically distant and different lineages, as well by the arising and fast spread of multiple nonsynonymous substitutions (MacLean et al. 2021). The emergence of the P.1 lineage, at the end 2020, with a possibly increased rate of molecular evolution, represents an important change in the SARS-CoV-2 evolutionary history . That is the main reason why lineage assignments should be carefully evaluated, since a considerable number of sequences generally described as P.2 or B.1.1.28 (e.g.) presents a mutation set consistent with the P.1 lineage and are grouped in the same monophyletic clade  even in a spike-based tree. A systematic and detailed phylogenomic analysis should be conducted in order to better understand and classify new SARS-CoV-2 genomes. The prevalence of P.1 containing mutations in the set of sites indicated to be under adaptive selective pressure may suggest the improvement of viral fitness in this lineage. In fact, a high number of mutations were found in P.1 variants, besides the 10 lineage-defining mutations (https://outbreak. info/situation-reports?pango=P.1). Interestingly, eight of them were already found to be positively selected in the study of Faria and colleagues (2021), using a smaller set of P.1 and P.2 sequences between November, 2020 and January, 2021. However, our analyses included 2901 unique spike sequences from several lineages and identified pervasive adaptive selection signatures in sites 26, 138, 417, 484, 501, and 1027, whose mutations are found occurring in more than nine Brazilian lineages. Therefore, the possible change of a neutral genetic drift to a strong selective pressure may be related to the P.1 mutational advantage facing the population-level immunity, according to Van Egeren and colleagues (2021). In this regard, it is interesting to note that P.1 has emerged in a region with up to 75% demonstrated previous seroprevalence of SARS-CoV-2. Similarly, lineagedefining mutated sites from B.1.1.7 (570 and 681) and C.37 lineages (75 and 76) were also identified in the positive selection tests, strongly suggesting that similar phenomena of selective pressure have been the crucial in the emergence of others lineages as well. In March, 2020, the general rates of nucleotide substitution found when considering the whole SARS-CoV-2 genome suggested a major contribution of neutral evolution (Bai et al., 2020), although strong evidence already indicates positive selection in spike protein sites such as 484, which is determinant for the adaptation to human hosts (Cagliani et al., 2020). In August, Korber and colleagues (2020) showed that site D614 may be under positive selection due to the increased G614 frequency within different populations. Another work from Emam et al. (2021) identified strong evidence of positive selection for sites P26, N148 and M153 of the spike protein, besides others in ORF1ab and E gene sequences.
According to Shindyalov et al. (1994), mutations in certain positions are fixed in a correlated manner along the evolution. This may be the result of structural or functional constraints imposed for the maintenance of the protein integrity and stability. Although close chain neighbours have higher probabilities to mutate in a conditional way, even distant structural positions can follow this pattern through a number of different mechanisms. Coevolution may be the result of compensatory, fitness recovery interactions. Alternatively, mutations that enhance infectivity, such as N501Y, may increase the chance of further viral evolution. Recently, for example, the acquisition of type I interferon resistance, a seemingly key factor for pathogenesis in SARS-CoV, has been found in the newer S harboring mutant lineages (Guo et al., 2021;Kim & Shin, 2021). A third potential cause of viral coevolution through mutation linkage could be viral immune escape. The H69-V70 deletion, for instance, present in NTD from the B.1.1.7 lineage, abolishes an important immunogenic loop by allowing the inward movement of the amino acids 71-75 from that domain. The analysis performed by Islam et al. (2021) indicated the potential coevolution between the mutations D614G from spike and P323L from NSP12, as consequence of an affected RNA-dependent RNApolymerase (RdRp) fidelity (P323L) and the improved cell fusion and entry (D614G). In this work, the analysis of coevolution with the Bayesian Graphical Model showed that 28 pairs of sites (including the couple 484 À 501) can be statistically related with a posterior probability ! 0.8, of which 13 pairs of sites with a posterior probability ! 0.9, indicating that they are probably not conditionally independent. The co-occurrence of E484K and N501Y is an interesting fact, since N501Y emerged independently in P.1 and B.1.351 lineages (Winger and Caspari 2021), as occurred with E484K (Ferrareze et al., 2021).
We have demonstrated that, individually, some mutations may stabilize or destabilize the protein structure, since their occurrence triggers different effects on energy balances and potentially affects viral fitness. As shown by the result of our majority consensus analysis and demonstrated by Pucci and Rooman (2021), mutations such as A222V (B.1.617.2), N501Y  (Winger and Caspari, 2021). Several studies Gan et al., 2021) have demonstrated the increased binding affinity upon the amino acid changes E484K þ N501Y þ K417T (P.1), superposing the K417T effect that decreases the binding affinity. The threonine substitution at site 417 avoids the salt bridge formation with D30 in hACE2. Consequently, there is expected diminution in the binding affinity for the mutated structures (Winger and Caspari, 2021). As previously elucidated, the binding free energy between RBD and ACE2 reflects the infectivity (Qu et al. 2005). Previous analysis of the substitutions L452R, F490S, and N501Y presented relatively high binding free energy changes suggesting that these sites may generate more infectious lineages . Our findings corroborate the improved binding pattern found in P.1 models, showing the greater contribution of E484K mutation to the binding free energy reduction and the RBD -hACE complex stabilization in P.1 and P.2 lineages, especially in the presence of a mutated L452V site. Indeed the Free Energy of Binding is much more negative for the lineages P.2 and P.2 þ 452, mainly due to stronger electrostatic interactions and also to a lesser degree, van der Waals interactions. These findings are well correlated with the larger interface area and number of hydrogen bonds, yielding a stronger interaction. Despite L452Q (C.37) suggesting similar properties to L452R (B.1.617.2), increasing viral infectivity (Acevedo et al., 2021), nothing is known about the valine substitution and its effect on P.2 viruses. However, a significant interaction effect with E484K was observed in this study, also with indication of structural changes in the ACE2 molecule (as for B.1.1.7 model). Interestingly, the escape from neutralizing antibodies may be related to mutants with increased ACE2 binding affinities (Van Egeren et al., 2021). As shown by our molecular dynamic simulations, the comparison of the combined binding energies for the evaluated lineages indicates a major reduction in the binding free energy of those containing E484K, with only a small difference between the reference and lineages such as C.37 and B.1.1.7.
The selection of theoretically "destabilizing" or "stabilizing" mutations still remains to be completely elucidated. One possible explanation for the maintenance of such mutations would be the occurrence of "copy choice" recombinations capable of selecting very quickly for a set of mutations that have coevolved for the aforementioned reasons. The other possibility would be rapid evolution mediated by lineages enhanced with greater replicative fitness, as a consequence of key RBD mutations and/or other allosteric interactions (e.g., D614G). The higher viral turnover would account for a tremendous selective pressure and fast viral selection, provided the viruses find an appropriate environment for transmissions. In this regard, it is interesting to note that both B.1.351 and P.1 appeared to have emerged in a very intense selective pressure scenario. In some Amazonas State cities in Brazil, where the P.1 lineage has emerged, up to 75% exposure were found for previous first-wave lineage . Similarly, B.1.351 has emerged in South Africa in a scenario with up to 30% of the population being previously exposed to other lineages (Zhou et al. 2021). Taken together this evidence supports the role of consecutive accumulation of mutations rather than recombinations in order to explain the emergence of multiple lineages with specific sets of mutations, characteristic of the VOC viruses. Finally, it is possible that an immunosuppressed host would be critical for selecting these viruses, since the first occurrence of important, but intrinsically "destabilizing" substitution, would prevent further evolution if the virus encounters a robust immune response. In other hand, an immunocompromised patient would allow the progressively mutated virus to evolve, eventually acquiring fitness recovery and/or escape mutants permitting transmission to normal hosts. Although this hypothesis is merely speculative at this time, previous works involving multiple genotyping of a single chronically immunosuppressed patient (Choi et al. 2020;Kemp et al. 2021) suggested that this could be a way of rapidly selecting virus harboring mutations that, on an individual basis, are "destabilizing".

Conclusions
In this work, we showed that some VOC and VOI related mutations (142, 222, 262, 484, 501, and 681, among others) are suggested to be under adaptive pressure by the consensus analysis of the HyPhy selection tests. These results potentially indicate the maintenance of the mutations in SARS-CoV-2 populations along the time in front of some fitness improvement, as seen in B.1.1.7 and P.1 lineages. The mutated sites 484 and 501, specifically, are indicated to be conditionally dependent. Although most of the positively selected mutations seem to affect the spike structure in a stabilizing trend, different patterns were found to the same site according to the amino acid substitution. Energy differences were also observed by the molecular dynamics simulations, where E484K-L452V presented the greatest impact in the binding free energies, adding a more negative weight to the RBD -hACE complex and stabilizing the structure. On the other hand, lineages as B.1.1.7, C.37 achieved energy values similar to the reference (wild-type) structure, despite B.1.1.7 also input structural hACE2 modifications. In conclusion, this work aimed to help the elucidation and understanding of the evolutionary paths of SARS-CoV-2 in the pathogenesis and adaptation by the analysis of the structural effects of spike mutations.