Decoding the Possible Molecular Mechanisms in Pediatric Wilms Tumor and Rhabdoid Tumor of the Kidney through Machine Learning Approaches

Abstract Objective: Wilms tumor (WT) and Rhabdoid tumor (RT) are pediatric renal tumors and their differentiation is based on histopathological and molecular analysis. The present study aimed to introduce the panels of mRNAs and microRNAs involved in the pathogenesis of these cancers using deep learning algorithms. Methods: Filter, graph, and association rule mining algorithms were applied to the mRNAs/microRNAs data. Results: Candidate miRNAs and mRNAs with high accuracy (AUC: 97%/93% and 94%/97%, respectively) could differentiate the WT and RT classes in training and test data. Let-7a-2 and C19orf24 were identified in the WT, while miR-199b and RP1-3E10.2 were detected in the RT by analysis of Association Rule Mining. Conclusion: The application of the machine learning methods could identify mRNA/miRNA patterns to discriminate WT from RT. The identified miRNAs/mRNAs panels could offer novel insights into the underlying molecular mechanisms that are responsible for the initiation and development of these cancers. They may provide further insight into the pathogenesis, prognosis, diagnosis, and molecular-targeted therapy in pediatric renal tumors. Graphical Abstract


Background
Wilms tumor (WT; nephroblastoma) is a prevalent kidney malignancy in infants and children and is responsible for approximately 7% of all pediatric cancers.In light of novel chemotherapies, combined with nephrectomy and radiotherapy, the survival rate of WT-affected children has increased by almost 90%.Chemotherapy resistance, recurrence, and metastasis affect the prognosis in some patients [1].Recently, it has been confirmed that genetic and epigenetic events are the driving forces of WT [2,3].Mutations in Wilms' tumor 1 (WT1), Wilms' tumor suppressor X chromosome (WTX), and β-catenin are related to increased activity of the Wnt/β-catenin pathway, occurring in 30% of patients.Mutations in microRNA (miRNA) processing genes also are responsible for 15% of the WT cases [4].A significant number of patients remain without a recognized genetic driver.The identification of other driving factors and potential targets is important for development of therapeutic strategies.
The rhabdoid tumor of the kidney (RTK, RT) is an aggressive tumor that occurs mostly in early childhood and quickly spreads to the brain or lungs [5].Because of a poor understanding of the regulatory mechanisms and biological characteristics of the disease, extremely poor prognosis of these patients is reported, and the 5-year survival rate is about 20% to 25% [6,7].Initially, the RT was categorized as a possible variant of the WT; however, later studies confirm its distinct nature and termed it a rhabdoid tumor.The responsible defect is deletion or mutation in the SMARCB1/ hSNF5/INI-1 gene.The SMARCB1 gene acts as a tumor suppressor gene [5].
There are some overlaps in a histologic pattern that challenges the histologic differentiation of the WT and RT [8].Despite discovering causative genes, the candidate mRNAs and miRNAs driving oncogenesis in the WT and RT remain unknown.It is important to elucidate the most important signaling pathways implicated in WT and RT development.In this regard, artificial intelligence (AI) technologies will help pharmacists and clinicians develop novel targeted therapy and decision-making strategies.Identifying diagnostic or prognostic molecular markers can reduce unnecessary interventions and enhance the survival rate of patients.
We proposed that the artificial intelligence-based algorithms can be valuable implements to extract appropriate biological information about transcriptome patterns in the WT and RT.The present study aimed to detect mRNAs and miRNAs and their pathological roles in the WT and RT using machine learning and deep learning approaches.

Methods
In this study, the WT and RT-related clinical, miRNA, and mRNA data were downloaded from The Cancer Genome Atlas (TCGA) dataset; the GDC portal (https:// portal.gdc.cancer.org)[9], Table 1.Different steps were applied to the miRNA and mRNA data (graphical abstract).First, using necessary pre-processing methods, the clinical, miRNA, and mRNA data were preprocessed.By the proposed algorithms, unrelated miRNAs and mRNAs features were removed from the data.The deep classifier model was used to discriminate the defined groups based on the candidate features obtained in the previous step.The association rule mining algorithm was used for recognizing the main features with an essential role in the groups.

Reading and preprocessing
Collecting, cleaning, cross-validation, and normalization processes were applied to miRNA and mRNA data.The data were elucidated as a matrix where columns and rows respectively represent features and samples.The non-discriminatory features with a similar value in all of the samples were deleted from the data.The hold-out cross-validation technique was used to divide the data into training, validation, and test parts.For data normalization, the z-score (Equation 1) and min-max (Equation 2) were used in the feature selection, classification, and association rule mining phases, respectively.
where x = raw feature values, y = normalized value, µ = mean value of x, σ= standard deviation, min= minimum value of x, and max=the maximum value of x.

Feature selection and graph method
In machine learning and pattern recognition problems, the feature selection (FS) process is considered a significant factor since it can improve the accuracy of the classifier or the error of the regressor model [10].In the genetic data, the FS allows us to filter out irrelevant and redundant features and identify the features (miRNAs and mRNAs) that are essential in the pathogenesis of the disease [11].In so doing, a new graph-based method was applied separately to feature selection in miRNA and mRNA data [12].Using the proposed algorithms, an appropriate subset of the candidate features was identified.In miRNA data, among 1541 features, an appropriate subgroup of candidate miRNAs was extracted based on a graph algorithm.In mRNA data, because of the existence of 57,799 mRNAs, first, the filter-based method was utilized decreasing their number 1000.In this sub-step, the AMGM measure was used to assess the significance of the features (See Ref. [13]).Three phases including feature space, community detection/graph clustering, and selection of important nodes for each cluster were used in the graph-based method.In the graph representation phase, the feature set is plotted in the graph space.In the community detection phase, to cluster the graph space and to find nodes with internally high-density connectors, a community detection algorithm was used.The Louvain algorithm was used in the application phase [14].In the node selection phase, the Boppana et al. algorithm [15] was used to get the Maximum Independent set (MIS) in each cluster/community.For more details, please see Refs.[13,16].

Classification
To evaluate the identified features and categorize data based on the identified mRNAs/ miRNAs, a classification model (classifier) and a self-organizing deep auto-encoder model [17] were applied, respectively.The training procedure of the deep model and the selection of the model were performed by respectively training and validation data.The accuracy, F 1 -score, the area under the curve (AUC), confusion matrix, and receiver Operating Characteristic curve (ROC) were used to test data for evaluating the activities of the classifier.

Association rule mining
Extraction of effective associations among data items can be performed by Association Rule Mining as described previously [18].Frequent itemset generation and Rule generation are two phases of association rule mining.In the present study, the Apriori algorithm [19] was utilized for the association analysis.In Supplementary Tables 1  and 2, the pseudo-code of the Apriori algorithm is presented.

Results
The four key stages including reading and preprocessing, feature selection, classification, and filtering were completed on molecular transcripts of the WT and RT data (graphical abstract).In the pre-processing step, 1881 miRNAs from 126 cases were used as a matrix form with 126 columns and 1881 rows.mRNA data (n = 60,482) obtained from 199 samples was composed as a matrix form, 340 miRNAs and 2683 mRNAs, due to irrelevancy.By the hold-out cross-validation method, the miRNA and mRNA data were split into train and test data, with an 80%: 20% ratio.Train and test data were normalized based on the information of training data by the min-max normalization method.
After carrying out the reading and pre-processing stages, the feature selection procedure was done on miRNA and mRNA data.In miRNA data, 85 RNAs were selected from 50 communities/clusters based on the graph method.The γ and β (graph method parameters) were respectively set to 0.3 and 0.6.In mRNA data, based on the AMGM, the filter method was used to filter out the mRNAs, so that 1000 mRNAs with the highest AMGM value were picked out.Thirty-one mRNAs were selected from 12 clusters/communities based on the graph method.In mRNA data, the β and γ parameters were set to 0.5 and 0.1, respectively.Candidate miRNAs and mRNAs were illustrated based on the normalized AMGM values (Figures 1a and 2a), respectively.In Supplementary Tables 3 and 4, the lists of candidate miRNAs/mRNAs are presented.
In the classification stage, candidate features (miRNA/mRNA) were evaluated.The discrimination power of candidate miRNA/mRNA was assessed by the classifier model.With a high AUC (0.96/0.93) and accuracy (97%/93%), candidate miRNAs could separate the WT and RT groups in train/test data.In mRNA data, with high accuracy (94%/97%) and AUC (0.95/0.98), candidate features discriminated the WT and RT groups in train/test data.Various measures of the self-organizing deep auto-encoder were described in Table 2 in more detail.Figures 1b-c and 2b-c respectively present the confusion matrix and ROC curve of train/test data.The accuracy of the use approach is shown in Table 3.No similar studies associated with WT/RT subtypes were found based on mRNA data for comparison.
In the association rule mining stage, interesting associations such as feature(s)-feature(s) (miRNA/mRNA) and feature(s)-target (WT/RT) were discovered.Significant miRNAs/mRNAs were selected based on the repeat count of those features and explored their role in pediatric kidney tumors.All miRNAs/mRNAs involved in the WT and RT rules were reported in Supplementary Table 5.Based on repeat counts in WT/RT rules, mRNAs and miRNAs as graph networks were shown in Figures 3a-b and 4a-b.
Based on their support, lift, and confidence, the distribution of WT/RT association rules was clarified for miRNA and mRNA (Figures 3c-d and 4c-d).
In the last step, five significant miRNA/mRNAs based on the WT and RT association rules were identified.The top significant dysregulated miRNAs/mRNAs in WT and RT are shown as Box plots (Figure 5).The association of the top identified RNAs in WT/RT rules based on the Spearman correlation was shown in Figures 6a-b and 7a-b, respectively.The pair-plot of the main miRNAs and mRNAs involved in WT and RT rules are demonstrated in Figures 6c-d and 7c-d, respectively.
The Chromosome 19 open reading frame 24 (C19orf24, ENSG00000228300.12)and RP1-3E10.2(RP1-3E10.2,ENSG00000235196.3)were the frequent itemset in the WT and RT association rules with 360 and 302 repeat counts, respectively.Likewise, let-7a-2 and miR-199b were the frequent itemset in the WT and RT association rules with 3571 and 783 repeat counts, respectively.To investigate the relation of the candidate miRNAs and mRNAs with other features, other association rules were performed (Figures 8 and 9).More in-depth biological functions of the results are provided in the discussion part.

Discussion
A comprehensive analysis was performed on the WT and RT data based on AI methods.Throughout an Association Rule Mining Analysis, 5 mRNAs and 5 miRNAs were identified with the highest repeat count items which have significant playing roles in WT and RT subtypes, respectively.Panels of candidate mRNAs (94%/97%) and miRNAs (97%/93%) could discriminate the WT from RT in train/test data with high accuracy, respectively.Let-7a-2 and chromosome 19 open reading frame 24 (C19orf24) were identified in the WT, while miR-199b and RP1-3E10.2were detected in the RT by analysis of Association Rule Mining.The C19orf24 had a high dependency on the BCLAF1, CTCF, BRPF3, SCAF8, SMPD4, PARD6BP1, BRD1, and SRCAP.In the RT, the RP1-3E10.2had a high dependency on SMPD4, PP4R1, and CTCF.
The WT consists of a heterogeneous population of cells as a part of the nephrogenesis pathway, where cancer stem cells (CSCs) have been hypothesized to manifest the WT [24].The WT1 is a tumor suppressor gene located on chromosome 11 [25], encoding a zinc finger transcription factor.It has an impact on developmental and pathological procedures by regulating the turnover and stability of key RNAs [26].WT1 is important for the homeostasis and development of several mesodermal tissues.It regulates the balance between the mesenchyme-to-epithelial transition (MET) and epithelial-to-mesenchyme transition (EMT), needed for nephron formation and yielding coronary vascular progenitors from the epicardium, respectively.In response to Wnt signaling during embryonic development, WT1 promotes the differentiation of the embryonic kidney progenitor cells into glomeruli and tubules.When these cells continue to proliferate rather than differentiate, WT can arise.The top five mRNAs including the C19orf24, ZSCAN23, BCLAF1, SCAF8, and PARD6BP1 were identified with association rule mining analysis, the pathological roles of which have not been reported in the WT.
The C19orf24 (chromosome 19 open reading frame (ORF) 24) was the first top-identified mRNA in our association rule mining analysis.About 62 oncogenic ORFs have been recognized to be associated with cancer [27].The C19orf24 is a secreted protein encoded by the proposed C19orf24 gene on chromosome 19p13.3[28].This non-classical secreted protein lacks a discernable signal peptide sequence [29] and does not co-localize in the Golgi apparatus, proposing that it is secreted through a novel and unidentified pathway.In recurrent or metastatic colorectal cancer, an upregulated C19orf24 was identified as a positive prognostic survival factor by machine learning algorithms [30].Abnormal methylation patterns of chromosome 11p15 [31] along with loss of heterozygosity at 19q [32] are reported in the WT that may affect the expression of the C19orf24.It is reported that transcripts of chromosome 19 encode potentially unidentified candidate proteins in glioma cancer stem-cell lines [33].The role of C19orf24 in the WT has not been studied; it can be suggested that genetic alteration or dysregulated levels of C19orf24 may be involved in its etiology.Although studies on the association between ORFs and cancer are scarce, these uncharacterized ORFs proteins can represent an opportunity to identify novel targets in different diseases and disorders, especially in cancer.
The ZSCAN23 was respectively the second and 3 rd identified mRNAs in the WT and RT in our association rule mining analysis.Zinc finger and SCAN domain-containing (ZSCAN) transcription factors have different biological functions in cancer progression by acting as transcriptional repressors or activators [34].These transcription factors can be involved in angiogenesis, stem cell properties, cellular differentiation, proliferation, apoptosis, and invasion [34].microRNA, alternative splicing, DNA methylation, or gene mutation can down-or up-regulate the ZSCANs expression [34].Their aberrant expression has been reported in different types of cancer.The ZSCAN23 was among 10 prognosis hypermethylated genes that expressed in low levels in cancer cells [35] and seems to be potential biomarkers of breast cancer with an unknown role [36].A mutation in the ZSCAN23 was reported in Thyrotropin-secreting pituitary adenomas [37].The roles of the ZSCAN23 in the WT and RT have not been reported yet and there is a need to clarify its function in the pathogenesis of these tumors.The Bcl-2-associated transcription factor 1 (BCLAF1) was the 3 rd and 4 th identified transcript respectively in the WT and RT in this study.The BCLAF1 is translated to a repressor protein that interacts with BCL2 family members and its overexpression triggers apoptosis.It is involved in diverse biological procedures, such as T-cell activation, lung development, lytic infection program regulation, DNA repair and damage, and post-transcriptional processes (mRNA splicing and processing) [38].Various studies indicate that BCLAF1 may serve as a tumor suppressor or oncogene [38].It is also reported that Bclaf1 is a component of the WT1-associating protein (WTAP) complex.This complex is enriched in proteins that are participated in RNA processing and post-transcriptional regulation, including splicing, polyadenylation, and stabilization of mRNAs involved in cell cycle progression, cellular apoptosis, proliferation, and early embryo development [39].The nuclear protein WTAP interacts with WT1.In renal cell carcinoma, higher expression of the WTAP exerts an oncogenic role by stabilizing CDK2 mRNA, a cell cycle controller whose dysregulation promotes cell proliferation  259) related association rules, in which the identified mirna, its rules, mirnas were presented, by orange, yellow, and blue colors, respectively.b) graph network of enSg00000228300.12related association rules (with lift >1.33), in which enSg00000228300.12, rules, mrnas were presented, by red, yellow, and red colors, respectively.2) related association rules, in which the identified mirna, its rules, mirnas were presented, by orange, yellow, and blue colors, respectively.b) graph network of enSg00000235196.3related association rules (with support > 0.25 and lift >1.19), in which enSg00000235196.3, rules, mrnas were presented, by red, yellow, and red colors, respectively.[40].The WTAP also has tumorigenesis function in other cancers [41].Bclaf1 could also enhance angiogenesis through its binding to the HIF1A gene promoter and can promote cancer progression via the regulation of HIF-1α and its angiogenic targets transcriptionally [38].The BCLAF1 binds to HIF-1α in the nucleus, stabilizing HIF-1α during hypoxia and eventually resulting in tumor progression [42].A recent study found that Bclaf1 is one of the commonly altered genes in the WT, mapping to chromosome 16q loss [43].It seems that BCLAF1 in the WT and RT plays a significant role in both the cell cycle and posttranscriptional regulation; its role needs to be studied.
RNA processing factor SCAF8 was the 4 th candidate mRNA with a high repeat count in our study.The SCAF8 is an anti-terminator protein that is needed for correct mRNA termination by the selection of a proper terminal polyA site [44].Precise control of mRNA termination is essential for accurate gene expression.The SCAF8 inhibits the utilization of early intrinsic polyA sites to reduce the creation of nonfunctional proteins.This RNA processing factor binds to RNA Polymerase II [45] and functions as a positive elongation factor of many genes [46].Of note, different newly discovered genes in the WT are participated in transcriptional elongation [47], implying its possible role in the pathogenesis of the WT.
Par-6 family cell polarity regulator β (PARD6B) is a vital component in tight junction formation in epithelial cells and the preservation of the polarity of the apicobasal.Overexpression of the PARD6B stimulates the MAPK activation and proliferation of breast cancer cells [48,49].Inhibition of the PARD6B leads to the loss of tight junctions in MCF7 cells [50].Downregulation of the polarity complex PKCζ/Pard6/Pard3 by hypoxia in lung cancer leads to EMT and cell invasion and chemoresistance [51].The PARD6B pseudogene 1 (PARD6BP1) was the 5 th top identified mRNA in association rule mining.No published data exist for the possible role of this RNA.
The roles of the identified mRNAs in this study have not been reported in the RT.The RP1-3E10.2 was the first top mRNA identified in this study.It is reported that the RP1-3E10.2was among genes differentially represented in the primary tumor microenvironment versus metastatic tumor microenvironment of high-grade serous carcinomas (HGSCs) samples.To the best of our knowledge, the role of this mRNA has not been elucidated.
The data-mining approaches proposed that PLCB4 (Phospholipase C Beta 4) modulates the inositol lipids metabolism and encodes the ß4 isoform of PLC isoenzymes [phosphoinositide-specific phospholipase C] [52,53].The role of PLCß4 in the field of cancer remains unclear.The PLCB4 is deemed to be among the top-ranking upregulated genes in aggressive tumors [52].Elevated expression levels of the PLCB4 has been associated with poor overall survival rate in cases with solid tumors [52].It has been shown that the PLCB4 was highly expressed in CD34 + CD38 -leukemia stem cell populations that are resistant to chemotherapy [53].The PLCB4 was identified as a potential target for treating and diagnosing a case with a small-cell type variant of renal oncocytoma [54].
Ubiquitin-specific peptidase 37 (USP37) is a serious deubiquitinase for HIF2α, which subsequently promotes its protein stability and kidney tumorigenesis [55].Interaction of the USP37 with HIF2α regulates a subset of downstream genes involved in hypoxia signaling [55].The USP37 depletion causes cell proliferation reduction in clear cell renal cell carcinoma (ccRCC) and its overexpression leads to lung cancer cell migration [56].It has been proposed that the USP37 regulates the ubiquitination of various targets (e. g.Cdt1, Wings Apart-Like (WAPL), Cyclin A, and c-Myc) [55,57].Additionally, the PLAGL2-USP37-Snail1 axis represents an important mechanism in the proliferation and migration of cells [58].

Conclusions
The application of the machine learning methods can identify mRNA and miRNA patterns, discriminating WT from RT. Comprehensive association rule mining analysis recommends that transcriptional elongation and dysregulated RNA expression due to genetic and epigenetic alterations signify a fruitful area of future research in the WT.The identified miRNAs/mRNAs panels could offer novel insights into the underlying molecular mechanisms that are responsible for the initiation and development of these cancers.They may provide further insight into the pathogenesis, prognosis, diagnosis, and molecular-targeted therapy in pediatric renal tumors.Our results suggest that future studies that attempt to target common pathways or processes would be more effective than targeting single-gene mutations or a dysregulated transcript.Artificial intelligence can open the dark matter of transcriptome and proteome to identify novel diagnostic and prognostic cancer targets where science is on the edge of a new age of disease-target discovery.

Figure 1 .
Figure 1. the performance of feature selection and classification steps in the mirna data.a) Bar plot of candidate mirnas (85) based on normalized aMgM values.b) confusion matrix of training and test data.c) receiver operating characteristic (roc) curve of training and test data.the area under the curve of receiver operating characteristic (aUc-roc) is reported for each curve.

Figure 2 .
Figure 2. the performance of feature selection and classification steps in the mrna data.a) Bar plot of candidate mrnas (31) based on normalized aMgM values.B) confusion matrix of training and test data.c) receiver operating characteristic (roc) curve of training and test data.the area under the curve of receiver operating characteristic (aUc-roc) is reported for each curve.

Figure 3 .
Figure 3. graph network of top mirnas with high repeat count in a) Wt and b) rt association rules.the most frequent mirnas were displayed with the edges weight based on repeat counts.also, the aMgM value of mirnas were illustrated with the nodes size.Strength distribution of c) Wt and d) rt association rules according to their support, lift, and confidence.

Figure 4 .
Figure 4. graph network of top mrnas with high repeat count in a) Wt and b) rt association rules.the most frequent mrnas were displayed with the edges weight based on repeat counts.also, the aMgM value of mrnas were illustrated with the nodes size.Strength distribution of c) Wt and d) rt association rules according to their support, lift, and confidence.

Figure 5 .
Figure 5. Box plot of selected mirnas for a) Wt and b) rt groups.Box plot of selected mrnas for c) Wt and d) rt groups.expression values of mirnas and mrnas were normalized in the range of 0 to 1 by the min-max method.

Figure 6 .
Figure 6. the heatmap plot based on the Spearman correlation for significant mirnas involved in Wt and rt. of a) Wt and b) rt rules.the pair plot of significant mirnas of c) Wt and d) rt rules.Wt: Wilms tumor, rt: rhabdoid tumor.

Figure 7 .
Figure 7. the heatmap plot based on the Spearman correlation for significant mrnas involved in Wt and rt.mrnas of a) Wt and b) rt rules.the pair plot of significant mrnas of c) Wt and d) rt rules.

Figure 8 .
Figure 8. graph network of identified micrornas/mrnas in Wilms tumor.a) has-let-7a-2 (with lift > 1.259) related association rules, in which the identified mirna, its rules, mirnas were presented, by orange, yellow, and blue colors, respectively.b) graph network of enSg00000228300.12related association rules (with lift >1.33), in which enSg00000228300.12, rules, mrnas were presented, by red, yellow, and red colors, respectively.

Figure 9 .
Figure 9. graph network of identified of identified micrornas/mrnas in rhabdoid tumor.a) has-mir-199b (with support > 1.2) related association rules, in which the identified mirna, its rules, mirnas were presented, by orange, yellow, and blue colors, respectively.b) graph network of enSg00000235196.3related association rules (with support > 0.25 and lift >1.19), in which enSg00000235196.3, rules, mrnas were presented, by red, yellow, and red colors, respectively.

Table 1 .
Subtypes information of kidney cancer in detail.

Table 2 .
the performance metrics for classification step.

Table 3 .
comparison of classification accuracy based on mirna data.