Systematical analysis of underlying markers associated with Marfan syndrome via integrated bioinformatics and machine learning strategies

Abstract Marfan syndrome (MFS) is a hereditary disease with high mortality. This study aimed to explore potential peripheral blood markers and underlying mechanisms in MFS via a series of bioinformatics and machine learning analyses. First, we downloaded two MFS datasets from the GEO database. A total of 215 differentially expressed genes (DEGs) and 78 differentially expressed miRNAs (DEMs) were identified via the “Limma” package. Sixty DEGs, mainly enriched in abnormal transportation of structure and energy substances, were selected after protein-protein interaction (PPI) network construction, of which 20 were chosen for machine learning after filtering with three algorithms (betweenness, closeness, and degree) in Cytoscape. Four overlapping DEGs (ACTN1, CFTR, GCKR, LAMA3) were finally selected as candidate markers based on three machine-learning approaches (Lasso, random forest, and support vector machine-recursive feature elimination). Furthermore, we collected peripheral blood from MFS patients and healthy controls to validate the findings; qRT-PCR showed that the expression of all four DEGs differed significantly between MFS patients and controls. In addition, the area under the receiver operating characteristic curve was greater than 0.8 for each DEG. Single-sample gene-set enrichment analysis showed that the four DEGs were strongly associated with inflammation and myogenesis pathways. Finally, we constructed the mRNA-miRNA network based on the intersection of DEMs and predicted miRNAs targeting DEGs. In conclusion, our study provides four potential markers for MFS pathogenesis. Communicated by Ramaswamy H. Sarma


Introduction
Marfan syndrome (MFS) is a rare autosomal dominant hereditary disorder affecting about 1/10000-1/20000 individuals worldwide (Yuan & Jing, 2010). Although the morbidity of MFS is low compared with other common diseases (cardiovascular disorders and various tumors), its mortality is extremely high (Young, 1991). MFS is characterized by tall stature and severe cardiac, musculoskeletal and eye complications. In addition, MFS patients often present with aortic root dilation, aortic dissection and mitral valve prolapse (Bhasin et al., 2021; Stuart & Williams, 2007). About 2/3 of MFS patients die or require major cardiac surgery before the age of 42 (Robicsek, 2020), indicating that it is necessary to explore MFS pathogenesis for early diagnosis, prevention, and intervention. Current diagnostic methods for MFS include clinical symptoms, family history and imaging (echocardiography and magnetic resonance imaging) (Dean, 2007).
The development of MFS is complicated and has not been fully elucidated. The maintenance of connective tissue stability plays a critical role in MFS etiology. Fibrillin-1 (FBN1) mutation contributes to MFS pathogenesis. Over 1000 individual mutations in FBN1 are related to MFS (Sakai et al., 2016). Dysregulation of FBN1 aberrantly activates the TGF-β pathway, leading to the co-activation of downstream factors. However, perinatal administration of a TGF-β neutralizing antibody can potentially rescue the above phenomenon (Neptune et al., 2003).
Therefore, it is crucial to explore early prediction markers of MFS incidence to accelerate the establishment of effective treatment and prevention measures. Bioinformatics and machine learning strategies have developed rapidly in past decades. However, few studies have focused on identifying differentially expressed markers in MFS, partly because of its low incidence, even though its severe complications deserve attention. Herein, potential markers (mRNA and miRNA) for predicting MFS incidence were identified using various bioinformatics and machine learning strategies to provide new clinical ideas for developing early intervention and prevention strategies.

Data preparation and processing
Figure 1 depicts the complete study design. We downloaded microarray datasets from the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo/) (Barrett et al., 2013). The following keywords were used for the search: '(Marfan syndrome)' AND 'Homo sapiens' [porgn:_txid9606] AND 'Expression profiling by array' AND 'Series' AND 'blood'. Two datasets were finally selected for analysis based on our scope. GSE110964 (Abu-Halima et al., 2018), an mRNA expression dataset, was generated from the GPL16699 platform and included 14 samples (Control = 7, MFS = 7). GSE110965 (Abu-Halima et al., 2018), a miRNA expression dataset, was generated from GPL16770 and contained 14 samples (Control = 7, MFS = 7), suggesting that the samples were similar between the two datasets. Ssizer (Li et al., 2020), an online sample evaluation tool, was applied to assess the adequacy of the sample sizes in the selected datasets.

Data processing and identification of differentially expressed genes (DEGs) and miRNAs (DEMs)
The downloaded raw datasets underwent background adjustment and quantile normalization via the 'affy' R package (Gautier et al., 2004) from the Bioconductor project (R version: 4.1.2, www.R-project.org). For GSE110964, when multiple probes corresponded to the same gene, the median expression value was calculated and regarded as the final expression. The log2 transformation was then conducted. DEGs and DEMs between MFS and control were identified using the R 'Limma' package (Ritchie et al., 2015). The selection criteria were: absolute fold change > 1.5 and p-value < 0.05. The volcano plot and heatmap were visualized via the 'ggplot2' R package (Ito & Murphy, 2013).
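The probe-collapsing and thresholding steps described above can be sketched in a few lines of Python. This is only an illustration and not the authors' R/Limma pipeline: the function names, the toy input layout, and the convention that fold change is a linear MFS/control ratio (so values below 1 are inverted before comparison with 1.5) are all assumptions.

```python
from statistics import median


def collapse_probes(probe_rows):
    """Collapse probe-level rows to one row per gene using the per-sample median.

    `probe_rows` is a list of (gene_symbol, [expression per sample]) tuples.
    """
    by_gene = {}
    for gene, values in probe_rows:
        by_gene.setdefault(gene, []).append(values)
    # zip(*rows) walks the samples column by column across all probes of a gene
    return {
        gene: [median(column) for column in zip(*rows)]
        for gene, rows in by_gene.items()
    }


def select_degs(stats, fc_cutoff=1.5, p_cutoff=0.05):
    """Keep genes with |fold change| > fc_cutoff and p-value < p_cutoff.

    `stats` maps gene -> (fold_change, p_value). Assumption: fold_change is a
    positive linear ratio MFS/control, so a ratio below 1 means down-regulation.
    """
    selected = {}
    for gene, (fold_change, p_value) in stats.items():
        magnitude = fold_change if fold_change >= 1 else 1 / fold_change
        if magnitude > fc_cutoff and p_value < p_cutoff:
            selected[gene] = "up" if fold_change > 1 else "down"
    return selected
```

If the fold change is instead reported on the log2 scale, the equivalent cutoff would be |log2FC| > log2(1.5), roughly 0.585.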

Protein-protein interaction (PPI) network establishment
The STRING database (www.string-db.org/) was used to construct the PPI network based on the identified DEGs (Szklarczyk et al., 2021). DEGs that did not encode proteins or did not interact with other DEGs were eliminated (minimum required interaction score: 0.400). The PPI network was visualized using the remaining DEGs. The network was downloaded in 'tsv' format and imported into Cytoscape (Doncheva et al., 2019) (https://cytoscape.org/, version 3.9.1). The DEGs were further filtered using three algorithms (Betweenness, Closeness, and Degree) (Zhou et al., 2022) from the CytoHubba plug-in in Cytoscape. Each algorithm was used to generate a score for every DEG, and the DEGs were subsequently ranked according to their respective scores. The top 30 DEGs in each algorithm were identified as the node DEGs. The intersection of the node DEGs from the three algorithms was visualized via a Venn diagram and used for the subsequent analysis.
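The ranking-and-intersection logic of this step can be illustrated with a minimal Python sketch. It is not CytoHubba itself: only the degree score is computed here, the other score tables are assumed to be supplied by the caller, and all function names are hypothetical.

```python
from collections import Counter


def degree_scores(edges):
    """Count interaction partners per node in an undirected edge list.

    Assumes each interaction appears once in `edges`.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def top_k(scores, k):
    """Return the k highest-scoring node names (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


def shared_nodes(score_tables, k):
    """Intersect the top-k node sets obtained from several score tables."""
    return set.intersection(*(top_k(table, k) for table in score_tables))
```

With three score tables (betweenness, closeness, degree) and `k=30`, `shared_nodes` returns the nodes that a three-way Venn diagram would place in its central region.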

Functional enrichment analysis
For functional enrichment analysis, the Gene Ontology (GO) (biological process (BP), cellular component (CC), and molecular function (MF)) (Harris et al., 2004) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2021) databases are widely employed. Herein, the Sangerbox platform (Shen et al., 2022), a user-friendly and comprehensive bioinformatics platform, was utilized to perform functional enrichment analysis on the DEGs selected from the PPI network. A p-value < 0.05 was deemed statistically significant.
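Over-representation analysis of this kind is commonly based on a hypergeometric test. The sketch below shows that calculation in Python under the assumption that a one-sided hypergeometric tail is the statistic of interest; whether Sangerbox computes its p-values in exactly this way is not stated in the text, and the function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """One-sided p-value P(X >= n_overlap) for term over-representation.

    n_background: genes in the background set
    n_in_term:    background genes annotated to the term
    n_selected:   genes in the query list (e.g. the DEGs)
    n_overlap:    query genes annotated to the term
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

In practice the resulting p-values are usually corrected for multiple testing across terms, which matches the FDR values mentioned in the figure legends.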

Machine learning
Three machine learning approaches were applied to select markers for predicting the risk of MFS incidence. The group labels (normal and disease samples) were used as the dependent variable, and the expression matrix of dataset GSE110964 was used as the independent variables to fit the models. The least absolute shrinkage and selection operator (Lasso) (Mao et al., 2021) fits a logistic regression model and penalizes its coefficients so that only features that are relatively important for classification are retained. Support vector machine-recursive feature elimination (SVM-RFE) and random forest (RF) are supervised learning methods that allow the development of more accurate classifiers and select the most 'relevant' features from a limited number of characteristics. Specifically, RF can handle high-dimensional data and build predictive models with high accuracy (Blanchet et al., 2020); SVM-RFE is widely used to select and visualize the most relevant features through non-linear kernels by removing redundant factors and retaining the most relevant variables (Sanz et al., 2018). All three feature selection methods are classifier models. In this study, only the DEGs selected by all three algorithms were considered validated DEGs, to enhance the accuracy of their predictive value.

Diagnostic value evaluation and nomogram construction
Student's t-test was employed to compare the expression of the DEGs between groups. To assess their specificity and sensitivity, the receiver operating characteristic (ROC) curve was then drawn, and the area under the curve (AUC) and 95% confidence interval (CI) were determined. The nomogram was established based on the validated DEGs using the 'rms' R package (Jiang et al., 2021).
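For a single marker, the AUC equals the probability that a randomly chosen patient has a higher marker value than a randomly chosen control. The following Python sketch computes it directly from that definition; it is an illustration only (the function name is hypothetical and no confidence interval is computed).

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as P(case score > control score), counting ties as one half."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

For a marker that is lower in patients than in controls, this function returns a value below 0.5; the discriminative ability is then read as one minus the returned value.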

Peripheral blood collection and validation by quantitative reverse transcription polymerase chain reaction (qRT-PCR)
The basic clinical characteristics of the participants are displayed in Supplementary Table S1. Following the collection of peripheral blood, total RNA was extracted using the Blood RNA Kit (19241ES50, Yeasen, Shanghai, China) according to the manufacturer's instructions. The RNA concentration was determined using a NanoDrop (Thermo Fisher, America), and the cDNA was synthesized using the Hifair® II 1st Strand cDNA Synthesis Kit (11121ES60, Yeasen, Shanghai, China). The primers are listed in Supplementary Table S2, and β-actin served as the housekeeping gene.

Single-sample gene-set enrichment analysis (ssGSEA)
Furthermore, ssGSEA analysis was performed to evaluate the underlying mechanisms of the validated DEGs in MFS pathogenesis. Specifically, Hallmark gene sets, which cover 50 well-defined biological processes, were obtained from MSigDB (Liberzon et al., 2015) (https://www.gsea-msigdb.org/gsea/msigdb/). The 'GSVA' R package was used to determine the correlation between each sample and the hallmark gene sets (Hänzelmann et al., 2013). The correlation between the validated DEGs and hallmark gene sets was also visualized.
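Once each sample has a pathway score, relating a gene to a pathway reduces to correlating two per-sample vectors. The Python sketch below uses the Pearson coefficient; the type of correlation used in the study is not stated, so Pearson is an assumption here, as is the function name.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences.

    Here x would hold one gene's expression per sample and y the per-sample
    enrichment score of one hallmark gene set (both hypothetical inputs).
    Neither sequence may be constant, otherwise the denominator is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

A positive coefficient means the gene tends to be higher in samples where the pathway score is higher; a negative coefficient means the opposite.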

Identification of DEGs between MFS and control
A total of 215 DEGs were identified between MFS and control, of which 16 were up-regulated and 199 were down-regulated. All DEGs were visualized using the volcano plot (Figure 2(A)).

PPI network construction
The DEGs were input into the STRING database. Subsequently, 60 DEGs were identified from the constructed PPI network after the removal of isolated DEGs and DEGs that did not encode proteins (Figure 3(A)). The top 30 DEGs from the Betweenness, Closeness and Degree algorithms were visualized using the CytoHubba plug-in in Cytoscape software (Supplementary Figure S1 and Table S3), of which 20 overlapping DEGs were selected for machine learning. The 20 DEGs were visualized via the Venn diagram (Figure 3(B)).

Functional enrichment analysis
Functional enrichment analysis was conducted using the 60 DEGs from the PPI network. GO analysis showed that the DEGs were mainly enriched in 'establishment of localization' and 'regulation of transport' (BP) (Figure 3(C)); the enriched CC and MF terms are shown in Figure 3(D) and Figure 3(E), respectively. In summary, GO analysis suggested that transport activity may be relevant to predicting MFS incidence. KEGG analysis revealed that the DEGs were highly related to 'phospholipase D signaling pathway', 'glutamatergic synapse' and 'focal adhesion' (Figure 3(F), Supplementary Table S4). These results suggest that MFS is highly correlated with the abnormal transportation of structure and energy substances.

Identification of candidate DEGs via machine learning
The optimal parameter (lambda) in the Lasso model was selected using 10-fold cross-validation, taking the value that gave the lowest binomial deviance. A plot was generated to depict the relationship between the partial likelihood deviance (binomial deviance) and log(lambda). Lasso regression analysis identified seven DEGs with the lowest binomial deviance (GRM3, GCKR, NCAN, LAMA3, CFTR, ITGB3 and ACTN1) (Figure 4(A)). Gene importance was calculated for the DEGs by random forest. The top 10 DEGs based on the RF scores (SHANK2, CFTR, CLCA1, ACTN1, GCKR, ITGB3, LAMA3, FYN, GRM3 and CALD1) were selected for further analysis (Figure 4(B, C)). Furthermore, SVM-RFE revealed that seven DEGs (NRXN1, GCKR, NCAN, LAMA3, SHANK2, ACTN1 and CLCN3) exhibited the highest accuracy and the lowest error after 10-fold cross-validation, thereby qualifying them as promising candidate biomarkers for MFS (Figure 4(D)). The DEGs were then ranked by score and visualized in a bar chart (Figure 4(E)). Finally, the overlapping DEGs (ACTN1, CFTR, GCKR and LAMA3) based on the three machine-learning algorithms were visualized (Figure 4(F)) and used for expression profile comparison and ROC analysis.
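The cross-validated choice of lambda can be illustrated with a short Python sketch. It assumes the per-fold deviances have already been computed for each candidate lambda and follows the usual lambda.min and lambda.1se conventions (the latter being the largest lambda whose mean deviance lies within one standard error of the minimum); the function name and input layout are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def choose_lambda(cv_deviance):
    """Pick lambda.min and lambda.1se from cross-validation results.

    `cv_deviance` maps each candidate lambda to its list of per-fold deviances
    (at least two folds per lambda).
    """
    summary = {
        lam: (mean(folds), stdev(folds) / sqrt(len(folds)))
        for lam, folds in cv_deviance.items()
    }
    # lambda.min: the candidate with the lowest mean deviance
    lambda_min = min(summary, key=lambda lam: summary[lam][0])
    # lambda.1se: the largest lambda within one standard error of that minimum
    threshold = summary[lambda_min][0] + summary[lambda_min][1]
    lambda_1se = max(lam for lam, (avg, _) in summary.items() if avg <= threshold)
    return lambda_min, lambda_1se
```

Choosing lambda.min keeps more features, whereas lambda.1se gives a sparser model at a small cost in deviance.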

The evaluation of ROC curve and nomogram establishment
The expression of the DEGs differed notably between MFS and control in both the dataset and the clinical samples. In addition, the DEGs had good predictive value based on the ROC curve. Specifically, the expression of ACTN1 was higher in MFS than in the control group, whereas the expression of the other three genes (CFTR, GCKR and LAMA3) was lower in MFS than in the control group (Figure 5(A)). The AUC and 95% CI were 0.959 and 0.864-1.000, respectively, for ACTN1, CFTR and GCKR, and 0.816 and 0.533-1.000, respectively, for LAMA3 (Figure 5(B)). A gene was considered to have high predictive performance when its AUC exceeded 0.8. Moreover, all four DEGs showed a similar expression pattern in MFS compared with control in the clinical peripheral blood validation (Figure 5(C)). The constructed nomogram and its ROC curve are shown in Figure 5(D, E). The scores for the DEGs were determined based on their expression level. The sum of the scores (total score) corresponded to the linear predictor and thus could be used to predict the risk of MFS incidence. ROC analysis of the nomogram was conducted to evaluate its clinical utility, resulting in the identification of a gene set that is significantly associated with an increased risk of MFS.
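The step from a linear predictor to a predicted risk can be sketched as follows. This assumes the nomogram is backed by a logistic regression model, which the text does not state explicitly; the gene names, coefficients and intercept in the usage are placeholders and not values from the study.

```python
from math import exp


def linear_predictor(expression, coefficients, intercept):
    """Intercept plus the coefficient-weighted sum of expression values."""
    return intercept + sum(
        coefficients[gene] * expression[gene] for gene in coefficients
    )


def predicted_risk(lp):
    """Map a linear predictor to a probability with the logistic function."""
    return 1.0 / (1.0 + exp(-lp))
```

For example, with hypothetical coefficients `{"gene_a": 0.5, "gene_b": -1.0}`, an intercept of 0 and expression values `{"gene_a": 2.0, "gene_b": 1.0}`, the linear predictor is 0 and the predicted risk is 0.5. A nomogram presents the same calculation graphically by rescaling each term to points.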

ssGSEA analysis
The ssGSEA analysis demonstrated that ACTN1 was significantly positively associated with 'myogenesis', 'inflammatory response', 'IL6-JAK-STAT3 signaling pathway', 'epithelial mesenchymal transition' and 'angiogenesis' (Supplementary Figure S2). In contrast, the other three DEGs were negatively correlated with most of the above pathways, showing that these pathways were activated in MFS and the identified biomarkers were strongly related to inflammation and myogenesis, which may be closely related to MFS occurrence and development.

Identification of DEMs and mRNA-miRNA network construction
A total of 78 filtered DEMs (47 up-regulated and 31 downregulated) were visualized via the volcano plot (Figure 6(A)).

Discussion
MFS is a rare disease with high mortality. In this study, biomarkers of MFS were explored using various bioinformatics and machine-learning tools. Finally, four DEGs (ACTN1, CFTR, GCKR and LAMA3) were identified and validated. ACTN1 (Actinin Alpha 1) belongs to the cytoskeletal proteins. Dysregulation of muscle function is commonly observed in MFS patients. Although the role of ACTN1 in several neuromuscular disorders has been partially elucidated, the function of ACTN1 in MFS pathogenesis is unknown. Blondelle et al. (2019) found that Cullin-3-mediated degradation of ACTN1 promotes muscle development. ACTN1 is also associated with metastasis and migration of several tumors via diverse mechanisms. For instance, Chen et al. (2021) showed that ACTN1 is highly expressed in hepatocellular carcinoma and promotes tumor growth by inhibiting the Hippo signaling pathway. ACTN1 overexpression is positively correlated with a poor prognosis of oral squamous cell carcinoma (Xie et al., 2020). Furthermore, ACTN1 inhibition by Oroxylin A can inhibit breast cancer metastasis (Cao et al., 2020). Herein, ACTN1 was up-regulated in MFS and thus may reduce the stability of cytoskeletal proteins, potentially contributing to MFS.
CFTR (CF transmembrane conductance regulator) belongs to the ATP-binding cassette transporter superfamily. CFTR is also associated with cystic fibrosis (Boehm & Kazazian, 1990), a hereditary disease. Although the relationship between CFTR and MFS is unknown, it is postulated that CFTR can contribute to the occurrence of MFS since the genetic [...] model. Furthermore, CFTR inhibition can increase vascular cell proliferation and reduce pulmonary artery relaxation. In this study, CFTR was downregulated in MFS patients, suggesting that vascular homeostasis was impaired. Therefore, it is necessary to evaluate CFTR and its function in vascular regulation to determine its mechanism in MFS.
Glucokinase regulator (GCKR) participates in the regulation of pancreatic and liver secretion. Most studies have assessed the role of GCKR in diabetes mellitus (Ma et al., 2020) and fatty liver disease (Ioannou, 2021). Dysregulated secretion causes lipid disorders (Tin et al., 2016), leading to atherosclerosis and severe cardiac complications in MFS. Herein, GCKR was downregulated in MFS, suggesting that maintaining GCKR balance may improve lipid metabolism and reduce the incidence of cardiac complications.
Laminin-332 α3 chain (LAMA3) belongs to the laminin family of secreted molecules. LAMA3 plays a crucial role in preserving diverse junctions and exhibits anti-pathogen effects through its involvement in these junctions. For example, Li et al. (2020) found that LAMA3 transduction can enhance hemidesmosome formation during wound healing. LAMA3 mutation is also associated with enamel defects (Wang et al., 2022). Several variants of LAMA3 are highly related to epidermolysis bullosa (Wang et al., 2022). Junction damage is often linked to the occurrence of inflammation. Pesch et al. (2017) showed that LAMA3 disruption can induce skin inflammation and fibrosis, indicating that LAMA3 can regulate inflammation. Moreover, LAMA3 downregulation is associated with poor prognosis of different tumors (Tang et al., 2019). Herein, LAMA3 was downregulated in MFS, suggesting that enhancing LAMA3 expression could be a potential therapeutic strategy for MFS.
The mRNA-miRNA network showed that samples in both mRNA and miRNA datasets were similar (generated from peripheral blood), possibly due to the removal of sample bias. These results indicate that the DEGs were highly related to the regulation of junction, metabolism and inflammation, consistent with the ssGSEA and functional enrichment analysis results.
However, this study has some limitations. First, the number of MFS datasets and the number of samples in each dataset were limited because the morbidity of MFS is low. As a result, the AUC values of the four identified DEGs were extremely high. Herein, we used a newly developed online tool to evaluate the sample size. As shown in Supplementary Figure S3, the sample size fell in the orange region, meaning that it met one of the three criteria implemented in the tool; thus, the sample size is passable but not satisfactory. Second, although we collected clinical samples and validated the expression of the identified DEGs using qRT-PCR, the underlying mechanisms were not revealed and need to be further investigated.

Conclusion
In summary, four potential DEGs (ACTN1, CFTR, GCKR and LAMA3) were systematically identified, and a nomogram was developed to predict the risk of MFS using multiple types of bioinformatic analyses and machine learning algorithms. Our findings provide a foundation for future research on prospective candidate genes for MFS patients. In addition, the DEGs were substantially associated with the inflammatory response and the myogenesis pathway, which may provide additional guidance for personalized medicine. Meanwhile, the mRNA-miRNA network was constructed based on the DEGs for further clinical validation. Our study could contribute to a better understanding of how miRNAs mediate the etiopathological mechanisms of MFS.
In the volcano plot, red and green triangles represent the prominent up- and down-regulated DEGs in MFS compared with control (Figure 2(A)). Meanwhile, all of the up-regulated DEGs and the top 20 down-regulated DEGs were plotted in the heatmap (Figure 2(B)).

Figure 2. The volcano plot and heatmap of DEGs between MFS and control. (A) The volcano plot for GSE110964 showing all DEGs in MFS compared with healthy control. The red and green triangles denote up- and down-regulated DEGs, respectively. Each dot represents an individual gene. (B) The heatmap showing all up-regulated DEGs and the top 20 down-regulated DEGs in MFS. Red and blue brackets represent up-regulated and down-regulated DEGs, respectively. The color intensity is proportional to gene expression level. DEGs: differentially expressed genes; MFS: Marfan syndrome.

Figure 3. The visualization and selection of DEGs from the PPI network and functional enrichment analysis. (A) Visualization of the whole PPI network constructed by the STRING database using the 60 DEGs that were selected via PPI network construction after the removal of the isolated DEGs and DEGs that did not encode proteins. The edges refer to the interaction between two genes. (B) The Venn diagram illustrating the 20 overlapping DEGs that were identified through the application of the Betweenness, Closeness, and Degree algorithms in the CytoHubba plug-in from Cytoscape software. (C-E) GO analysis (biological process, cellular component, and molecular function) of the 60 DEGs identified from the PPI network. The size and color of the circle correspond to enriched gene numbers and FDR values for each item, respectively. In the bubble diagram, the horizontal axis, denoted as gene ratio, portrays the proportion of core targets engaged in each term relative to the total number of targets in the term. The size of the bubble is proportional to the quantity of core targets implicated in the term, whereas the color spectrum spanning from black to yellow represents the FDR value, where darker hues indicate lower values and lighter hues indicate higher values. (F) KEGG analysis of the 60 DEGs in MFS. The left and right parts of the circle represent enriched genes and relevant pathways, respectively. The color gradient ranging from black to yellow signifies the FDR value, where darker hues indicate lower values and lighter hues indicate higher values. PPI: protein-protein interaction network; GO: gene ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; FDR: false discovery rate; Others see Figure 2.

Figure 4. Candidate hub DEGs selected via the three machine-learning strategies. (A) Lasso regression analysis showing seven DEGs with the lowest binomial deviance. The lambda.min and lambda.1se are denoted by vertical dashed lines. (B, C) Random forest error rate versus the number of classification trees. DEGs ranked based on the scores generated via the random forest algorithm. The top 10 DEGs were visualized in a bar chart. (D) SVM-RFE algorithm indicating that seven DEGs exhibited the highest accuracy and the lowest error after 10-fold cross-validation. (E) Genes sorted based on the average rank calculated via SVM-RFE. (F) The Venn diagram showing four DEGs selected from the intersection of the three algorithms. SVM-RFE: support vector machine-recursive feature elimination; Others see Figure 2.

Figure 5. Validation of the identified DEGs in MFS and construction of the nomogram. (A) The expression profile of DEGs between healthy controls and MFS patients. (B) The ROC curve of the DEGs in predicting the risk of MFS. Each panel displays the AUC and 95% CI. A gene has high predictive value for MFS when its AUC is > 0.8. (C) The expression of the identified DEGs in MFS compared with healthy control in clinical peripheral blood by qRT-PCR. **, p < 0.01; ***, p < 0.001. (D) Construction of a nomogram using the identified DEGs. (E) The predictive value of the nomogram in MFS as revealed by the ROC curve. ROC: receiver operating characteristic; Others see Figure 2.

Figure 6. The volcano plot and heatmap of DEMs in MFS. (A) Volcano plot for GSE110965 showing DEMs. Red and green triangles represent the up-regulated and down-regulated DEMs, respectively. Each dot represents an individual miRNA. (B) Heatmap showing the top 20 up-regulated and down-regulated DEMs in MFS patients. Red and blue brackets represent up- and down-regulated DEMs, respectively. The level of miRNA expression is directly proportional to the intensity of color. DEMs: differentially expressed miRNAs; Others see Figure 2.

Figure 7. The prediction of miRNAs targeting the identified DEGs and establishment of the mRNA-miRNA network. (A) Venn diagram showing the intersection of predicted miRNAs via the miRWalk and RNA22 databases. (B) The overlapping miRNAs based on the predicted miRNAs and the identified DEMs. (C) The whole mRNA-miRNA network in MFS. Ellipses and triangles represent mRNAs and miRNAs, respectively. Green and blue represent down-regulation, while red and yellow represent up-regulation. For abbreviations: See Figure 6.
The clinical sample collection protocol was approved by the Ethics Committee Board of Shaoxing People's Hospital (Approval number: 2022 Ethics Clearance No. 120). We recruited and collected peripheral blood samples from six Marfan patients and six healthy controls from May 1st 2022 to May 28th 2023 in the Department of Vascular Surgery and Rheumatology in Shaoxing People's Hospital. All the participants provided informed consent to participate. The basic clinical characteristics are displayed in Supplementary Table S1.