In silico assessment of DNA damage response gene variants associated with head and neck cancer

Abstract Head and neck cancer (HNC), the sixth most common cancer globally, stands first in India, especially Northeast India, where tobacco usage is predominant, which introduces various carcinogens leading to malignancies by accumulating DNA damages. Consequently, the present work aimed to predict the impact of significant germline variants in DNA repair and Tumour Suppressor genes on HNC development. WES in Ion ProtonTM platform on ‘discovery set’ (n = 15), followed by recurrence assessment of the observed variants on ‘confirmation set’ (n = 40) using Sanger Sequencing was performed on the HNC-prevalent NE Indian populations. Initially, 53 variants were identified, of which seven HNC-linked DNA damage response gene variants were frequent in the studied populations. Different tools ascertained the biological consequences of these variants, of which the non-coding variants viz. EXO1_rs4150018, RAD52_rs6413436, CHD5_rs2746066, HACE1_rs6918700 showed risk, while FLT3_rs2491227 and BMPR1A_rs7074064 conferred protection against HNC by affecting transcriptional regulation and splicing mechanism. Molecular Dynamics Simulation of the full-length p53 model predicted that the observed coding TP53_rs1042522 variant conferred HNC-risk by altering the structural dynamics of the protein, which displayed difficulty in the transition between active and inactive conformations due to high-energy barrier. Subsequent pathway and gene ontology analysis revealed that EXO1, RAD52 and TP53 variants affected the Double-Strand Break Repair pathway, whereas CHD5 and HACE1 variants inactivated DNA repair cascade, facilitating uncontrolled cell proliferation, impaired apoptosis and malignant transformation. Conversely, FLT3 and BMPR1A variants protected against HNC by controlling tumorigenesis, which requires experimental validation. These findings may serve as prognostic markers for developing preventive measures against HNC.


Introduction
Head and neck squamous cell carcinoma (HNSCC), the most common form of cancer in the head and neck region (Heroiu Cataloiu et al., 2013), is the sixth most common type of cancer (Vigneswaran & Williams, 2014) and constitutes 5% of the entire cancer cases worldwide (Dhull et al., 2018), with males found to be affected significantly more than females, with a ratio ranging from 2:1 to 4:1 (Bray et al., 2013). According to GLOBOCAN 2018, India stands first in head and neck cancer (HNC) incidence, where a total of 205,325 HNC cases are reported, which is the highest number of HNC patients worldwide (Bray et al., 2018). In Northeast (NE) India, HNC incidence is the highest (54.48%) as compared to other Indian states, acquiring 30%-40% of cancers at all sites (Bhattacharjee et al., 2018), and is the sixth most common cause of death in males and seventh in females (Choudhury & Ghosh, 2015).
Around 75% of HNC cases have been attributed to tobacco smoking, use of smokeless tobacco (SLT), alcohol consumption and betel quid chewing, as they are considered as the primary risk factors for HNC. Besides these non-dietary habits, the prevalence of HPV in HNC is around 25% (Tumban, 2019), where HPV16 was a more virulent subtype, followed by HPV18 (Joshi et al., 2014). It had been found that the chance of HNC development in a smoker is about ten times higher compared to a non-smoker depending on the period and extent of smoking (Sturgis & Cinciripini, 2007). In contrast, SLT consumers develop around 15 times more risk of oral cancers (Datta et al., 2014). Abundant carcinogens produced from prolonged usage of various tobacco products react with the DNA of human tissues forming DNA adducts and subsequently resulting in DNA damage and mutations in crucial DNA repair genes (DRGs), leading to uncontrolled cell growth (Gupta et al., 2018;Lacko et al., 2009). Deleterious variations in DRGs distress other genes like tumour suppressor genes (TSGs), which allow cells to undergo malignant transformation (Torgovnick & Schumacher, 2015). If the DNA damage is too large, TSGs signal cells to undergo apoptosis. Genetic variations in TSGs themselves cause a loss of checkpoint control, allowing the accumulation of DNA damage, which ultimately leads to carcinogenesis (Abreu Velez & Howard, 2015). In this background, DNA repair pathways and TSGs play a crucial role in sustaining genomic integrity, as they make proteins regulating cell growth and proliferation and also repair cells from DNA damage within the genome. Thus, the assessment of genetic variation in these gene groups is vital for understanding the underlying mechanism of HNC development.
Various reports implicate the participation of several candidate DRGs and TSGs with a high degree of HNC risk (Borchiellini et al., 2017;Carrera-Lasfuentes et al., 2017;Jiang et al., 2015;Joyce & Kasi, 2019;Rao et al., 2017;Saeed et al., 2017;Sahu et al., 2016;Tinhofer et al., 2016). In our previous NGS-based study, we have identified alterations in the intronic regions of HLTF, RAD52, PARP4, RECQL5, EXO1 and PER1 genetic regions and exonic variants in TDP2, MSH6, BRCA2, PALB2 and TP53 genes working together during a particular phase of DNA repair mechanism for HNC causation (Das et al., 2018). This approach has also been applied on HNC, particularly on Nasopharyngeal carcinoma (NPC), which documented mutations in TSGs, BRCA1 and BRCA2 genes as the topmost mutated genes followed by PALB2 and TP53 genes (Fountzilas et al., 2018). A recent study also identified mutations in the TP53 gene and genomic alterations in the PI3K pathway as being among the most frequent events in metastatic HNSCC (Galot et al., 2020). Recently, another Indian study has revealed recurrent mutations within TSGs such as ATM, TP53, SMAD4, PIK3CA and ERBB4 in clinically diagnosed, potentially malignant lesions and tissues of oral cancer (Jayaprakash et al., 2020). In NE India, a prospective cohort study evaluated the association of p16 ink4a encoded by the CDKN2A gene in locally advanced HNSCC as a useful prognostic factor (Mohanty et al., 2020), and other reports investigated the expression profiling of the promoter regions of FLT3, EPB41L3 and SFN genes within the tribal population of Meghalaya in oral cancer (Khongsti et al., 2019). Moreover, alternative splicing variants in TSGs like MLL3 and RPS9 were repeatedly found to be associated with HNC (Sharma et al., 2019).
Moreover, in-depth details on a given protein's function being altered by different mutations become clear, once the protein's structure is known (Friedman et al., 2013). Computer-aided studies like Molecular dynamics simulations (MDS) sometimes complement molecular studies and provide new insights into various aspects from small drug-like compounds to proteins, DNA and even entire viruses, when combined with related experimental data (Baker, 2015). K-Ras protein, an important therapeutic target for human cancers, is a hotspot for oncogenic mutations. MDS analysis elucidated that the mutations G12A and G12V have a better binding efficiency with GTP among the others, which enabled us to understand the mechanisms of K-Ras induced oncogenic tumours (Doss & Zayed, 2020). In lung cancer and HNC patients, as revealed by MDS, the mutation R252H drastically affects the native conformation of the E2F1 protein, which plays a key role in the regulation of cell proliferation and antiproliferative processes . Another simulation study revealed that p53 R213G mutation, having a role in metastatic lung cancer, mainly affect its stability by interfering with p53 interactions with other macromolecules (Akbari et al., 2021). In India, a recent study identified EZH2 mutation (A490T) which was targeted computationally and provided insight into its role in breast cancer susceptibility (Gautam et al., 2021). However, there is a lack of MDS-based publications carried out on the HNC-prevalent NE Indian population.
In NE India, the prevalence of HNC is the highest (Bhattacharjee et al., 2018). Consequently, morbidity and mortality associated with HNC have become a significant source of concern. Thus, in this pilot study, we have endeavoured to give an overview of the germline variants in TSGs based on Whole Exome Sequencing (WES), where various statistical and bioinformatics tools were employed to screen significant TSG variants associated with HNC. Besides, we have also considered the WES-derived DRG variants from our previous study (Das et al., 2018) along with these significant TSG variants to estimate their recurrence using Sanger sequencing in the studied HNC-prevalent populations. Moreover, the possible role of the finally identified variants from both the gene groups has also been predicted in HNC causation using different in silico tools. Further, to understand their biological consequences, gene-gene interaction and gene ontology study were undertaken, whereby we aimed to achieve a comprehensive understanding of the combined mechanism of action of these two major cancercausing gene groups on HNC development in this part of India. In addition, we tried to elucidate the impact of the deleterious coding variant on the respective protein structure and function using molecular dynamics simulations (MDS) for predicting the disease risk. The current findings may serve as prognostic markers, which can provide insight into developing any preventive measures or novel clinical implications against HNC.

Data collection on major cancer-causing gene groups
The information on TSGs was retrieved from TSGene 2.0 database , consisting of a total number of 1217 of genes. We assessed various germline databases such as ExAC (Lek et al., 2016;Walsh et al., 2017), Human Gene Mutation Database (HGMD) (Stenson et al., 2017) and ClinVar (Landrum et al., 2016) for comparing our data to identify the clinically significant variants. The information on 228 DRGs was retrieved from our previous study (Das & Ghosh, 2017 (Seo et al., 2017), in the 'discovery set' samples that comprised of nine patients (cases) and six healthy (controls) individuals belonging to the ethnic groups of Manipur (Manipuri), Mizoram (Mizo), and Nagaland (Naga). Furthermore, genotype calling by Torrent Variant Caller (TVC v.5.4.0) plug-in and annotation of the variants using Ion Reporter TM and wANNOVAR (Chang & Wang, 2012) software were done to ascertain significant germline variants, as mentioned in our previous study (Das et al., 2018). Next, we used Haploview v4.2 for evaluating Hardy-Weinberg equilibrium, genotype frequencies and allele frequencies (Barrett et al., 2005). Moreover, the minimum genotype frequency of 50% and MAF > 0.05 was used as a cut-off for the initial screening of the variants. The gPLINK2 program was implemented to conduct the allelic association test of the dataset (Purcell et al., 2007). Additionally, we observed the linkage disequilibrium (LD) pattern for identifying strong LD blocks (having D' ¼ 1 and r 2 > 0.8). To determine the harmful effect of the identified variants, we followed two approaches. Firstly, the variants were analysed using different in silico tools, following which checked in clinical and other databases like ClinVar, HGMD ExAC and dbSNP databases (Landrum et al., 2016;Sherry et al., 2001;Stenson et al., 2017;Walsh et al., 2017). In silico amino acid substitution (AAS) tools like SIFT (Ng & Henikoff, 2003), PolyPhen-2 (Adzhubei et al., 2013), I-Mutant 3.0 (Capriotti et al., 2005), PROVEAN (Choi et al., 2012), PANTHER (Mi et al., 2016), Mutation Assessor (Reva et al., 2011) and SNAP2 (Hecht et al., 2015) examined the possible impact of the substitution on the structure and function of the respective protein. In contrast, AlamutV R Visual (v2.10.0) software (http://www.interactive-biosoftware.com/alamut-visual) was used to identify the effect of non-coding variants. In this study, all the exonic-non-synonymous variants were considered as coding variants, while the intronic, exonic-synonymous and other regulatory variants were considered as non-coding variants.

Recurrence assessment of the identified variants on 'confirmation set'
We assessed the recurrence of the identified TSG variants and the previously reported DRG variants (Das et al., 2018) in an independent 'confirmation set' samples (20 HNC patients and 20 healthy individuals without a family history of cancer) by sequencing in the 3500 Genetic Analyzer (Applied Biosystems). Afterwards, to check their association with the HNC causation, logistic regression analysis with 1000 bootstrap sampling was carried out in SPSS software [IBM SPSS Statistics 21] after adjusting with age, gender, populations and various non-dietary habits. Thus, the adjusted odds ratio (OR Adjusted ) and 95% confidence interval (CI) were estimated, and p-values of <0.05 were considered to be statistically significant.

Predicting biological consequences of the recurrent coding and non-coding variants
For biophysical validation of the coding variant, HOPE (Venselaar et al., 2010) and NetSurfP v1.1 (Petersen et al., 2009) determined the impact of the genetic variation on the structure and solvent accessibility of the respective protein.
The reliability of the NetSurfP v1.1 prediction is in the form of a Z-score. The ConSurf web server (http://consurf.tau.ac.il) provided the patterns of evolutionarily conserved regions, crucial for the function and structure of the respective protein. For the non-coding variants, we assessed deviations in putative transcription factor binding sites (TFBSs) using MATCH (Kel et al., 2003) from the geneXplain platform and PROMO (v3.0.24) (Farre et al., 2003;Messeguer et al., 2002). Furthermore, GeneMANIA v3.5.0 was used to carry out pathway analysis by constructing a gene-gene network to visualise other genes interacting via some critical pathways (Warde-Farley et al., 2010). Additionally, ShinyGO v0.61 retrieved the over-represented Gene Ontology (GO) categories associated with our identified genes (Ge et al., 2020).

3 D structure prediction
To show the impact of the identified coding variant on the respective putative protein, we undertook the 3 D structure prediction approach, for which the amino acid sequence of the query apo form of protein was retrieved from UniProt (ID-P04637, Organism-Human). To generate the 3 D structure, we used the tool for a homology modelling approach, MODELLER 9.23 (https://salilab.org/modeller/download_installation.html) (Eswar et al., 2006). The best predicted 3 D model was chosen based on the DOPE (discrete optimised protein energy) score; an assessment method implemented in the MODELLER (model with the lowest score was considered). Additionally, I-TASSER (https://zhanglab.ccmb.med.umich.edu/I-TASSER/) was adopted to predict the possible 3 D structure using the threading method for the missing region of the protein that was not generated by homology modelling (Zhang, 2008). The generated model was refined using ModRefiner (Xu & Zhang, 2011) and Galaxy Refine server (Heo et al., 2013). For the amino acid substitutions, RMSD difference and model visualisation, UCSF Chimera 1.14 (Pettersen et al., 2004), and for energy minimisation, Swiss-PDB Viewer software was used (Guex & Peitsch, 1997). Usually, a higher RMSD value implies a more significant deviation between mutant and native structure (Han et al., 2006), and a higher energy value indicates lesser stability of the respective structure (Desai & Chauhan, 2016). Besides, the model evaluation was done using ProSA webserver (Wiederstein & Sippl, 2007), Ramachandran plot analysis by RAMPAGE server (Lovell et al., 2003), where a model having 90% of its residues in the allowable regions is considered good (Laskowski et al., 1993) and assessment of non-bonded atomic interactions by ERRAT, where overall quality score >50 indicates higher quality model (Colovos & Yeates, 1993), which denotes the percentage of the protein that falls below the rejection limit of 95%, suggesting good model quality (Mora Lagares et al., 2020). The native and mutant protein models fulfilling all these criteria were finally chosen for the MDS analysis.

Molecular dynamics simulations (MDS) and trajectory analysis
The purpose of MDS is to investigate the mechanism of structural impacts of the variation on a protein by studying the movement of atoms and molecules over a given period.
The MDS was conducted using GROMACS version 2020.2 (Van Der Spoel et al., 2005) for the wild-type and mutant models, applying the OPLS-AA/L force field (Jorgensen et al., 1996). After generating topologies using the SPC water model, both native and mutant structures were placed in a cubic water box, and a distance of 10 Å was set apart in between the protein surface and the simulation box. After neutralising the system with ions, the standard energy minimisation was carried out followed by two-step position restrained equilibration, that is, NVT (Number of particles, Volume, and Temperature) and NPT (Number of particles, Pressure, and Temperature) ensemble, for which, V-rescale (tcouple) coupling of 300 K and Parrinelo-Rahman isotropic pressure coupling (pcouple) pressure of 1 bar were applied for 100 ps. Then, the production MD run for 150 ns with a time step of 2 fs using coulomb type Electrostatic evaluation scheme by means of PME, that is, Particle Mesh Ewald for long-range electrostatics was obtained for the native and mutant structures. The trajectories were analysed using gmx_rms, gmx_rmsf, gmx_gyrate, gmx_hbond and gmx_sasa of GROMACS utilities for the post-MD analysis such as root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (RoG), H-bonds and solvent accessible surface area (SASA), correspondingly. We further performed the principal component analysis (PCA) to identify large-scale collective motions of the wild-type and mutant proteins on the trajectories generated by MDS. For this, firstly, covariance matrix using gmx_covar, and then gmx_anaieg tool was applied for the construction of PCA. Porcupine plot analysis was executed in Pymol using python script 'modevectors.py' to visualize the direction and extent of the principal motions of the protein (Schrodinger, 2015). Then gmx_sham utility was used to calculate the Free Energy Landscape (FEL) of the given structures to analyse protein folding patterns and conformations based on corresponding Gibbs free energy. The 3 D graphs of the FEL were plotted using PC1 and PC2 components in OriginPro 2021b (OriginLab, 2021). Lastly, the timeline interface of the VMD software program was used for secondary structural analysis (Humphrey et al., 1996).

Status of the TSG variants obtained from WES on the 'discovery set'
After primary analysis of the WES data, out of 1217 TSGs reported in TSGene 2.0 database, we found only 901 genes polymorphic, which contained 4972 variants (Supplementary  Table S1 Figure S2 and Table S2).
Upon applying the initial cut-off on these 4972 variants, we screened out 2775 variants, of the 99 variants were uniquely present in cases (25 coding, and 74 non-coding), 30 variants were uniquely present in controls (8 coding and 22 non-coding), and the remaining 2646 variants (540 coding and 2106 non-coding) were found common in both case and control groups (Supplementary Table S2). We considered these common case-control variants for allelic association test, where 92 variants showed significant association (pvalue < 0.05) with HNC (Supplementary Table S3), from which 19 variants were excluded for being in the same LD block (Supplementary Figure S3). Thus, 73 common case-control variants (13 coding and 60 non-coding) (highlighted in Supplementary Table S3) were finally selected for further analysis.

Bioinformatics analysis of the coding and noncoding TSG variants
Initial analysis of 46 coding variants (25 unique-case, 8 unique-control, and 13 common case-control variants) using SIFT and PolyPhen-2 revealed six variants that showed deleterious effects in both the tools (Supplementary Table S4). These six variants were further assessed in five different AAS  occurred recurrently (!60%) in our studied populations (marked in Supplementary Table S8), which were further analysed to identify their association with HNC causation. For this, the demographic characteristics of our study subjects (15 samples from the discovery set and 40 samples from the confirmation set) were evaluated (Table 1). It showed that the mean age (±SE) of our study population was 46.6 (±1.4) years, whereas the mean age of the case group was 48.9 (±1.6) years and that for the control group was 44.0 (±2.2) years. Although there were no significant differences among the case and control groups, we continued further analyses for exploratory purposes.

Risk assessment of the identified HNC-linked recurrent variants
We performed the logistic regression analysis to assess the risk of the finally obtained genetic variants ( Table 2). The risk-allele containing genotypes of CHD5_rs2746066 variant were found to be associated with a 1.84-fold enhanced risk of HNC. Besides, homozygous alternate alleles of RAD52_rs6413436, TP53_rs1042522, CHD5_rs2746066, HACE1_rs6918700 and heterozygous TG genotype of EXO1_rs4150018 showed 2.49-fold, 2.51-fold, 3.58-fold, 6.00fold and 6.13-fold increased risk of HNC, respectively. On the other hand, the risk-allele containing genotypes of RAD52_rs6413436, FLT3_rs2491227, BMPR1A_rs7074064, homozygous alternate alleles of BMPR1A_rs7074064, heterozygous TC genotype of RAD52_rs6413436, heterozygous AG genotype of FLT3_rs2491227 and heterozygous AT genotype of HACE1_rs6918700 were found to have a protective role (OR Adjusted < 1) against HNC in our studied populations.

Biological consequences of the coding and noncoding variants
From logistic regression, we finally obtained seven variants to be associated with HNC, of which six variants were noncoding, and the remaining one was coding. In silico tools predicted that those six variants were involved in alteration of TFBSs, for example, due to the HACE1_rs6918700 variant, TFBSs were added, and for the BMPR1A_rs7074064 variant, TFBSs were lost, whereas for the rest of the variants TFBSs were affected in both ways. Additionally, these variants were also predicted to affect the splicing mechanism (Table 3). On the other hand, in the case of the observed coding variant, TP53_rs1042522, which resulted in the Arg > Pro change at the 72nd position, was identified to be exposed and located in the variable region (Supplementary Figure S4). It was also predicted to decrease the stability of p53 protein by affecting the interactions with the cell cycle and apoptosis regulator 2 (CCAR2) protein and other molecules/residues due to the charge difference. Moreover, the introduction of the more hydrophobic residue, proline, probably disturbed its correct folding. Besides, a significant reduction in the absolute solvent accessibility (ASA) of the Pro residue (60.22) compared to that of the wild-type Arg residue (73.60), and drift in relative solvent accessibility (RSA) of wild-type (0.32, Z-score: À1.908), and mutant (0.42, Z-score: À1.913) protein was observed, which influenced the activity of the p53 protein. These predictions for the coding variant were further validated using MDS analysis. Apart from this, to identify the biological significance of the finally obtained seven variants belonging to EXO1, HACE1, RAD52, TP53, CHD5, FLT3 and BMPR1A genes, a gene-gene interaction network was constructed ( Figure  1(A)), and an enrichment map of the over-represented Gene Ontology (GO) categories for these genes was also generated (Figure 1(B)). Pathway analysis showed that the HACE1 gene shares genetic connections with the RPA3 gene that is connected with TP53, EXO1, RAD52 and CHD5 genes. On the other hand, FLT3 and BMPR1A genes shared a similar protein domain and were genetically linked with RAD52 and HACE1 genes via the BMP2 gene. Additionally, over-represented GO-based analysis showed that these genes were involved in various biological processes like intracellular signal transduction, regulation of response to stimulus, regulation of cell proliferation and DNA repair (Supplementary Table S9).

3 D-structure prediction and model evaluation for the observed p53 variant
For generating the 3 D structure (Supplementary Figure S5), we selected 10 identical PDB files, and the regions matched to our query sequence are mentioned in Table 4. The residues 63-91, 292 and 295-318 could not be modelled by the homology modelling approach due to the unavailability of PDB structures. Thus, the regions 61-100 and 291-320 were predicted by employing the threading method (Table 4 and  Supplementary Table S10). The best protein model was chosen 'tp53.B99990005.pdb' on the basis of the lowest DOPE score (Supplementary Table S11). After model refinement, when R72P substitution was induced, the difference in RMSD value (0.02 Å) and increment in total energy (À18470.7 to À17620.9 kJ/mol) was observed, which suggested a deviation from the native structure and loss of stability due to this variation, respectively. Additionally, the z-score plots indicated the accuracy of the predicted models (Supplementary Figure S6). And, the Ramachandran plot predicted that 95.1% native model residues remained in the favoured region, 3.3% and 1.5% in allowed and outlier region, respectively, whereas 93.6% mutant model residues lied in the favoured region, 4.1% and 2.3% in allowed and outlier region, respectively (Supplementary Figure S7). Finally, for the native model, the overall quality score for non-bonded interactions between different atom types was predicted as 76.35 and that for the mutant was 76.13 (Supplementary Figure S8), suggesting good model quality.

MDS analysis
MDS analysis was performed for the time period of 150 nanoseconds (ns) for both the models of p53 to understand the conformational changes and dynamic stability due to the R72P variation. The trajectory files obtained were studied for various parameters and summarised in Table 5. The backbone RMSD calculated for 150 ns simulation illustrated that both native and mutant p53 showed an average RMSD of 0.65 and 0.70 nm, respectively (Figure 2(A)), whereas the C-alpha RMSD demonstrated that both native and mutant proteins showed an average value of 0.65 and  0.71 nm, respectively (Figure 2(B)), till the end of the simulations when the protein systems converged. This indicated that the mutant p53 was comparatively less stable than the native p53, and the variation decreased the backbone stability of the protein leading to the destabilization of its structure. The native attained the maximum RMSD value at 133.90 ns (0.78 nm) and achieved stable confirmation after 134 ns, whereas the mutant reached the maximum RMSD value after 69.25 ns (0.82 nm) and achieved stable confirmation after 70 ns and maintained throughout the simulation period. This suggested a stable dynamic behaviour for both the models, and hence, these trajectories were utilised for further analysis. To understand the effect of the R72P substitution upon the local flexibility, the RMSF analysis of C-alpha  atoms for each residue of the 150 ns trajectory showed that overall fluctuation of the mutant was higher (0.23 nm) in comparison to the native (0.21 nm), suggesting that the variation is making the mutant more flexible than the native one (Figure 2(C)). The RMSF graph of the mutant model also showed that there is an increased fluctuation in the region 59-90 residues compared to native model. Moreover, the RoG value was observed to be higher (2.35 nm) in mutant Pro72 as compared to the native Arg72 (2.32 nm), signifying that the variation decreases the overall compactness of the mutant p53 protein (Figure 2(D)). However, the average SASA value was decreased in Pro72 (225.54 nm 2 ) as compared to Arg72 (229.66 nm 2 ) (Figure 2(E)). Likewise, the total number of intermolecular H-bond with solvent in the native was 973, whereas in the mutant 964 H-bonds were observed, reflecting the potential influence of the variation on the structural stabilisation factors of the protein (Figure 2(F)). Furthermore, from the secondary structure plot of the native and mutant models, we noticed an overall decrease in turns (the secondary structure elements) mainly in the region 67-111 residues and in the C-terminal part within the mutant structure as compared to the native model (Supplementary Figure S9).

Principal component analysis (PCA) and assessment of free energy landscape (FEL)
In PCA, the overall flexibility of the wild-type and mutant proteins was confirmed from the trace values (for wild 26.45 nm 2 and mutant 27.86 nm 2 ) of the covariance matrix.
The Eigen values of the wild-type and mutant proteins were plotted against the corresponding Eigen vector index for the first fifty modes of motion (Figure 2(G)), wherein in the case of native and mutant structure the first five Eigen vectors accounted for 73.85% and 81.22%, respectively. Furthermore, we selected the first two principal components (PC1, PC2) to analyse the projection of trajectories during the wild-type and mutant simulations in the phase space (Figure 2(H)), which clearly showed that the mutant protein covered a larger region of phase space, while the native occupied a comparatively smaller region of phase space, pointing the variation highly altered the structural stability and flexibility. Consequently, in the porcupine plot, where the direction of the arrow shows the direction of movement of the protein, and the length of the arrows indicates the magnitude of the motion, we observed larger concerted motions in the overall structure of the mutant (Figure 2(J)) as compared to the native (Figure 2(I)). Furthermore, these two principal components were used to construct the FEL, where the Gibbs free energy spectrum for native showed multiple stable conformations (Figure 3(A,C)), and the mutant showed single stable conformation (Figure 3(B,D)).

Discussion
Major cancer-causing genes like DRGs and TSGs play a critical role in sustaining genomic integrity by regulating cell growth/proliferation and repairing DNA damage within the genome. Therefore, alterations in these genes with persistent tobacco consumption can cause large scale accumulation of DNA damage and loss of control in cell cycle checkpoints, leading to carcinogenesis (Gupta et al., 2018;Torgovnick & Schumacher, 2015). This warrants a detailed study of these gene groups to get a comprehensive understanding of HNC occurrence in NE India, where the maximum HNC cases are being reported (Bhattacharjee et al., 2018). In this context, the present study initially adopted the WES-based approach on the 'discovery set' to screen HNC-linked potential hotspot variants belonging to these gene groups. In the next phase, after estimating the recurrence of these variants on an independent 'confirmation set', we ultimately obtained seven significant variants, whose biological consequences were predicted using different in silico tools, which have been summarised ( Figure 4) and discussed in the following sections.
The genes belonging to DNA repair cascade viz. EXO1, RAD52 and TP53 usually get altered due to prolonged exposure to various carcinogenic agents from tobacco products (Dylawerska et al., 2017). The Comparative Toxicogenomics Database (Davis et al., 2019) also confirmed it, which revealed that the tobacco-related carcinogenic polycyclic aromatic hydrocarbons (PAHs) like benzo[a]pyrene (BAP) (Ajayi et al., 2019;Kreuzer et al., 2020), N-nitrosamines (Chatterjee et al., 2019), bisphenol A (BPA) (Liang et al., 2020), nicotine (Gao et al., 2020), and other major compounds like hydrogen peroxide (Aboonabi et al., 2020) and methyl methanesulfonate (Kodandaraman et al., 2020) were significantly associated with the ectopic expression of these genes. Consequently, our findings suggested that these genes conferred the highest HNC risk in NE India, where those nondietary practices are predominant as compared to other Indian states . Among these genes, EXO1 is an exonuclease that is involved in several pathways like mismatch repair (MMR), nucleotide excision repair (NER) and double-strand break repair (DSBR) in maintaining genomic stability (Jiricny, 2006). It is also associated with the regulation of the cell cycle checkpoints, replication fork maintenance, and post-replicative DNA repair pathways (Keijzers et al., 2018). Earlier, the observed intronic EXO1_rs4150018 variant was known to have connections with colorectal cancer (Gao et al., 2011), but in this study, this variant served as a possible determinant of HNC, as it incorporated new TFBS for the factor TFII-I that can differentially influence the process of transcriptional regulation (Tanikawa et al., 2011). Likewise, the intronic risk variant, RAD52_rs6413436 was known to be associated with lung cancer Lieberman et al., 2016) and gastric cancer (Carrera-Lasfuentes et al., 2017). However, here, this variant is probably affecting the carcinogen-induced DNA damage repairing process, and thus, it influences tumour cells division by altering the cell cycle checkpoints Lok & Powell, 2012). Interestingly, the presence of wild T allele of RAD52_rs6413436 variant decreased the risk of HNC by regulating the expression of this gene, which was in accordance with a previous study (Lieberman et al., 2016). Apart from this, it has been reported that 50% of all human cancers are related to the alteration of the TP53 gene (Levine, 2019). On that account, the only observed coding variant rs1042522 resulting in the R72P substitution in exon 4 of the TP53 gene, has been predicted to destabilise itself by affecting CCAR2 interacting DNA binding proline-rich domain (Qin et al., 2015). Usually, the wild p53-Arg72 variant is a more potent tumour suppressor due to its strong binding to MDM2, which enhances its transport to the mitochondria for initiating apoptosis (Dumont et al., 2003), and its capability of suppressing the growth of transformed cells by inducing the expression of the cell cycle inhibitor, p21-WAF1 (Su et al., 2003;Thomas et al., 1999). On the other hand, the alternate p53-Pro72 variant is known to be less effective for inducing cell cycle arrest in response to DNA damage (Hu et al., 2005). However, this TP53 variant is known to be associated with differential risk of several cancers (Gunaratna et al., 2019).
Among the observed TSG variants, the identified HACE1_rs6918700 variant was previously recognised to be one of the contributors in Wilms' tumour (Slade et al., 2010), and till date, its role in cancer growth has not been clearly understood. However, in this study, this variant has been newly reported to have a significant association with the HNC risk. Due to the genetic inactivation of HACE1 gene, NF-jB activation and apoptotic processes get impaired (Tortola et al., 2016), and RAC1, a well-known target of the ubiquitin ligase activity of HACE1, remains in a GTP-bound form, whereby increasing the malignant cell proliferation and migration in cancers (El-Hachem & Ballotti, 2018). Moreover, alterations in TFBSs result into a variable expression of this gene, which ultimately leads to accumulation of reactive oxygen species (ROS) causing DNA damages, dysregulation of cyclin D1-mediated control of the cell cycle progression and metabolic reprogramming of cells involved in cell transformation (Cetinbas et al., 2015;Daugaard et al., 2013;Rotblat et al., 2014). Thus, the observed polymorphic HACE1 is a crucial component of the cellular ROS detoxification, which fails to protect against oxidative stress, leading to carcinogenesis. Besides, the heterozygous AT genotype showed protection due to the presence of wild-type allele, but its actual mechanism of action is yet to be identified. Another intronic TSG variant, CHD5_rs2746066, was finally observed in this study to be associated with the HNC risk, which was predicted to hinder the role of CHD5 gene in regulating the cell cycle pathway (Lang et al., 2011). CHD5 forms a nucleosome remodelling and deacetylation (NuRD)-type chromatin remodelling complex and normally functions predominantly as a transcriptional regulator, RNA processing, DNA damage response, and cell cycle regulation (Murawska & Brehm, 2011;Stanley et al., 2013). Therefore, reduced CHD5 activity is linked to decreased apoptosis, failed to activate DNA damage response, increased proliferation and invasiveness in HNC (Wang et al., 2011). On the other hand, the remaining two intronic TSG variants, BMPR1A_rs7074064 and FLT3_rs2491227, were also newly reported in this study, which were predicted to have some protective roles against HNC. BMPR1A, the Bone Morphogenetic Protein (BMP), usually secrete cytokines/growth factors belonging to the Transforming Growth Factor b (TGF-b) family. Consequently, it has been found that the inactive BMPR1A impairs mammary tumour formation and metastasis, which makes it a potential tumour suppressor in human breast cancer (Pickup et al., 2015). On that account, loss of TFBS due to this variation probably leads to transcriptional repression of this gene, which, in turn, acts as a protagonist for controlling tumour progression. The other protective variant belonging to FLT3 gene that encodes for tyrosine kinase and induces multiple intracellular signalling pathways to maintain a balance between cell proliferation and differentiation (Stirewalt & Radich, 2003), was predicted to add certain TFBSs, which probably leads to transcriptional activation of this gene. This, in turn, may have affected the abovementioned balance, initiating apoptosis, whereby conferring protection against HNC. However, further experimental verifications of these claims are necessary.
In this study, the non-coding variants were ultimately found to be associated with transcriptional regulation and aberrant splicing mechanism, the impact of which has been predicted by subsequent gene-gene interaction as well as GO analysis. Likewise, the structural and functional impact of the observed R72P change in TP53, as predicted by HOPE and NetSurfP, was further confirmed by performing structural dynamics and stability study through MDS for 150 ns, as secondary structure formation, protein flexibility, and essential dynamics are the crucial determinants of protein interactions, and therefore, any fluctuations in these factors could lead to non-intuitive consequences (De Oliveira et al., 2019). Moreover, we thought that the combination of simulation approaches with the current information regarding the role of this TP53 variant in apoptosis induction, suppression of cell growth and transcriptional activation might help in solving the discrepancies associated with this infamous polymorphism (Volodko et al., 2015). Consequently, for the first time, we generated the full-length p53 protein model, as it has already been reported that the oncogenic form of this protein typically consists of the full-length protein with a single amino-acid substitution in its DNA-binding domain (Lu et al., 2007). On that account, the total energy of the mutant Pro-72 model showed higher energy than the wild-type Arg-72 model, suggesting its lesser stability (Rajasekaran & Sethumadhavan, 2010), which was also supported by the backbone as well as C-alpha RMSD differences, where the mean RMSD was increased in the mutant model speculating destabilization of the structure (Figure 2(A,B)). Furthermore, the RMSF and RoG analysis recommended that this Arg > Pro substitution made the mutant structure more flexible by reducing its overall compactness (Agrahari et al., 2019;Priya Doss et al., 2014), which was in accordance with the RSA value obtained from NetSurfP. On the other hand, we observed reduced mean SASA value for the mutant model (Table 5), which suggested that a structural rearrangement due to the replacement of polar Arg to non-polar hydrophobic Pro residue, made itself less accessible from the solvent and thus, lowered the surface area (Karn & Emerson, 2019). This observation is congruent with ASA value obtained from NetSurfP. On the other hand, decrease in intermolecular Hbonds in the mutant model is consistent with increased RoG value showing loss of compactness, probably influenced the structural stabilisation factors by altering the physicochemical characteristics of core regions, whereby affecting overall rigidity and folding (Jakubowski, 2021) of its structure (Ramireddy et al., 2018). Besides, in PCA, the higher average trace value, larger coverage in the phase space and a larger deviation in motion pattern of the mutant pointed highly altered structural stability and flexibility (Wang et al., 2019), which is in agreement with our RMSD, RMSF and RoG results, thus, enhancing the validity of the performed analysis. Lastly, the FEL analysis suggested that the mutant has less global minima conformation (Figure 3(A,D)), so it is thermodynamically more unfavourable as compared to native protein (Kumar et al., 2019), which also suggested that because of the high-energy barrier, the transition between active and inactive conformation is more difficult for the mutant as compared to native (Agrahari et al., 2019), recommending probable association of the mutant p53 with the HNC risk. All these results coincide with the decreased stability in the mutant structure as evident from RMSD, RMSF, RoG, intermolecular H-bonding, PCA and FEL analysis in the current study.
Additionally, we observed decreased turns (the secondary structure elements) in the mutant model, which is associated with impaired conformational stability and folding ability (Marcelino & Gierasch, 2008).

Conclusions
To the best of our knowledge, the current pilot-scale study on the NE Indian population is the first to provide an overall in silico assessment regarding the role of germline variations in major cancer-causing genes such as DRGs and TSGs related to HNC. Any damages in these genes cause loss of cell cycle checkpoint control and genomic instability, due to the accumulation of DNA damages from prolonged exposure to various carcinogens, ultimately leading to carcinogenesis, which emphasises the importance of these gene groups. Our findings reveal that EXO1, RAD52 and TP53 having vital roles in the DSBR pathway is the most error-prone, whereas the TSGs like CHD5 and HACE1 confer risk for HNC, due to their inability to activate the DNA repair cascade, thereby facilitating uncontrolled cell proliferation and impairing apoptosis. Conversely, other TSGs like FLT3 and BMPR1A show a protective role against HNC by controlling tumour growth. Here, these newly reported non-coding variants have been predicted to be associated with the alterations in the transcriptional regulation and splicing mechanism, whose impact has also been predicted by subsequent pathway and GO analysis. In contrast, for the first time, we generated the full-length p53 protein model that was utilised in the MDS analysis, which, in turn, provided evidence for the functional and structural impact of the observed R72P change on the putative protein. At length, these HNC-linked causal variants identified in the NE Indian population may be used for diagnostic/prognostic purposes and risk assessment, but other clinical implications such as their influence on therapeutic modalities may be further explored.