RNA-Seq Analysis of Clinical Samples from TCGA Reveal Molecular Signatures for Ovarian Cancer

Abstract Identifying differentially expressed genes and co-expression modules lead to novel biomarkers. GO, pathway enrichment, network, and tumor stage analysis of 318 ovarian cancer samples from TCGA, categorised into primary and recurrent, pre-menopause and post-menopause, and early and late stage tumors was performed. Upregulated and downregulated genes in primary vs recurrent, early stage vs late-stage and pre-menopause vs post-menopause tumors were 84 and 62, 84 and 35, and 88 and 14, respectively. IRAK2 and CXCL8 had higher expression in recurrent tumors while REG1A had higher expression in post-menopause samples. In late stage tumors constant expression of IRAK2 and REG1A was observed, while that of CXCL8 and EGF decreased. These genes may be potential biomarkers for the diagnosis of the disease.


Introduction
Ovarian cancer is one of the most fatal gynaecological cancers with 310,000 new cases and 200,000 deaths in the year 2020 (1). The disease is often diagnosed at advanced stages and relapses after initial therapy. The recurrent tumor is difficult to treat and even secondary surgery has not been beneficial in increasing the overall survival rate. Though the disease primarily manifests in post-menopausal women, around 5% of the total cases comprise of germ cell tumors seen in pre-menopausal women. With survival rate worsening due to delay in the disease diagnosis (2), this disease stands fifth among deaths caused due to cancer.
Therapeutic intervention at the early stages of cancer was shown to be beneficial to the patients. When diagnosed in early stage, the patients have shown a survival rate of more than 90%, which drops to 25% when diagnosed at late stage (3,4). However, the diagnosis of the disease remains a challenge. The manifestations of the disease do not include specific symptoms at the initial stages. Though late-stage disease present certain symptoms they are often non-specific and the disease is diagnosed when it shows metastasis. The treatment for the disease includes cytoreductive surgery and standard chemotherapy. Unfortunately, the tumor relapses within 2-3 years and the prognosis worsens leading to fatality (4). The most common diagnostic methods available today include the transvaginal ultrasound and the estimation of biomarker cancer antigen 125 (CA125). Several clinical trials have shown that these two methods are not sensitive and specific enough to prolong the survival of the patients (5). Early diagnosis of the disease with these methods remains a challenge. This necessitates emphasis on developing novel diagnostic and prognostic methods which is more sensitive and specific as compared to the existing ones.
The last decade has seen several studies attempting to discover novel biomarkers (6). With the advancement of next generation sequencing technology, the amount of genomic data in the public domain has increased exponentially and many researchers are working towards analyzing them in an attempt to discover and develop novel diagnostic and prognostic biomarkers (7,8). In one such study, co-occurrence of mutations in IL-7R gene was discovered in high grade serous carcinoma patients (9). In a recent study, Hou et al., presented a panel of 10 genes showing differential expression signature in ovarian cancer patients by analyzing the relationship between ovarian cancer and immune metagenes (10). In another study, important genes and pathways correlating endometriosis and ovarian cancer were predicted (11). Though several studies are available wherein the databases have been used for prediction of novel biomarkers, these studies have found limited application in clinical settings as they focus on individual genes rather than the association between a group of genes and their products. Differential co-expression networks were studied for various types of cancers like chronic lymphocytic leukaemia (12), breast cancer (13) and ovarian cancer (14). Gov and Arga, in their study showed network comprising of 84 genes which were differentially coexpressed and co-regulated between healthy and ovarian cancer patients (15).
TCGA was established in an attempt to expedite research in cancer prevention, diagnosis and development of novel therapeutics. TCGA has proven to be a landmark database, forming the basis of some the significant studies related to cancer biomarkers (16,17).
In this study the data available in TCGA was used to identify differential gene expression signature and co-expression networks in ovarian cancer and to perform their tumor stage analysis. The gene expression datasets available in TCGA have been categorized into 3the data sets based on the occurrence of the tumor (primary and recurrent tumors), the data sets based on the stage of the tumor (early stage and late-stage tumors), and the data sets based on the age at diagnosis (pre-menopause and post-menopause patients). Differentially expressed genes and coexpressed networks that can be potentially used as biomarkers to predict the diagnosis of the disease were identified. Besides their potential as biomarkers, these genes may also find their application as drug targets for ovarian cancer.

Gene expression datasets
The gene expression data and clinical information of ovarian cancer patients were acquired from the publicly available TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). The datasets were divided into 3 categories based on parameters like occurrence, stage of cancer and age at diagnosis. The details of the samples and the categories are described in Table 1.
The numbers in this dataset are log2(x þ 1) transformed RSEM normalised counts, which reflect gene-level transcription estimates. HUGO probeMap was used to map genes onto human genome coordinates. The TCGA genome characterization centre at the University of North Carolina provided a reference to the method description (18). FDR (False Discovery Rate) and a p-value of <0.05 were used in all the analyses. The workflow is depicted in Figure 1.

Identification of differentially expressed genes
TCGABiolinks (19) which is an R/Bioconductor package was used for the identification of the differentially expressed genes (DEGs) between the different categories. For this, first quantile normalization of raw read count data was carried out. Functions from edger (20) package was used by TCGABiolinks for performing differential expression analysis. Log fold-change (logFc) cut off of >2 and < À2 was applied for upregulated and down regulated genes, respectively. False Discovery Rate (FDR) value limit of 0.05 and pvalue <0.05 was applied for identifying both upregulated and downregulated genes.

Determination of co-expression modules
The MCODE plugin of Cytoscape (v.3.8.2) was used to examine co-expression networks to understand the network modules (21,22). It employs a vertex-weighting algorithm that is based on the clustering coefficient Ci ¼ 2n/ki(ki À 1), where n indicates the number of edges in the neighbourhood and ki is the vertex size of the neighborhood of vertex I (21). Using the core clustering coefficient, the highly interconnected nodes of PPI network and weakly linked vertices were weighted. This algorithm analyses the weighted graph after computing and separates the highly connected regions referred to as clusters. These clusters represent the molecular complexes that are formed with differential expression of genes (DEGs) (23,24). The clusters containing at least 5 nodes (genes), average connectivity > ¼5, network density > ¼0. 30, with Kappa score (K-Core) set to 2, Degree Cutoff to 2 and Max Depth set to 100 were identified as coexpressing network modules (25,26) and were analysed further.

Topological and functional enrichment analysis
Several metrics, including node degree, network density, and clustering coefficient, were used to assess local and global topological properties of networks and their modules using the Cytoscape plugins NetworkAnalyzer (27) and Cytohubba (28). The dual-metric approach (15,29) was used to identify hub genes, with node degree as a local measure and betweenness centrality as a global measure (30). The top twenty hub genes in terms of any of the metrics were presented as hubs (31).
Database for Annotation, Visualization and Integrated Discovery (DAVID (v.6.8)) was used to perform GO terms like molecular functions, biological processes, and cellular components. Pathway enrichment analysis of identified DEGs (inside networks or modules) was also performed. DAVID (https://david.ncifcrf.gov/) is a database that provides bioinformatics tools for the functional interpretation of genes or proteins online (32). The Fisher Exact test was used to calculate p-values, which were then adjusted using Benjamini-Hochberg's method. The statistical significance was attributed if the adjusted pvalue <0.05.

Tumor stage analysis
The tumor stage analysis of the top 5 hub genes and the genes in co-expression module was performed using validated Gene Expression Profiling Interactive Analysis (GEPIA2) tool. This tool was also used to perform the analysis of hub genes which were found to be common between two categories of sample. GEPIA2 is an online tool that provides data analysis related to gene expression, tumor stage/grade, and survival (http:// gepia.cancer-pku.cn/), based on TCGA and the Genotype-Tissue Expression (GTEx) (33).

Identification of DEGs
Differential expression analysis between primary and recurrent ovarian cancer samples resulted in  Supplementary Table S1.

Primary vs. recurrent tumors
Gene Ontology (GO) enrichment analysis of the 84 upregulated and 62 downregulated genes in recurrent tumors revealed that most of the genes were found to be associated with angiogenesis, epithelial cell proliferation (Figure 3(a)) and apoptotic process. Majority of the genes were detected as a part of plasma membrane and mitochondria. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis showed their involvement in processes like TNF signalling pathway, apoptosis, etc., as listed in Table 2.

Early vs. late stage tumors
GO enrichment analysis of the 84 upregulated and 35 downregulated genes in late stage ovarian cancer samples revealed that most of the genes were found to be associated with transcription factors, cell adhesion, positive regulation of cell proliferation, positive regulation of ERK1 and ERK2 cascade etc. (Figure 3(b)). Majority of the genes were detected as a part of extracellular space and extracellular region. KEGG pathway analysis of the genes showed their involvement in pathways like cytokine-cytokine receptor interaction, chemical carcinogenesis, pancreatic secretion etc. (Table 2).

Pre-menopause vs post-menopause
GO enrichment analysis of the 88 upregulated and 14 downregulated genes in post-menopause patient samples showed association of most genes with process like transcription (Figure 3(c)). The genes were prominently located in extracellular space and extracellular region and majorly involved in pancreatic secretion as depicted by KEGG pathway analysis in Table 2.
Graphical representation of molecular function, cellular component and KEGG pathway for DEGs of all 3 categories is provided in Supplementary  Figure S2.

Identification of hub genes and generation of co-expression modules
The protein-protein interaction (PPI) network of DEGs was generated and the top 20 hub genes were selected using the MCC method for all three categories (Figure 4(a)). The top 5 hub genes viz.  and recurrent tumor samples. The statistically significant modules in the PPI coexpression network were predicted. These co-expressed modules generated by the hub genes were found to be interacting with other proteins involved in cellular physiology (Figure 4(b-d)). When hub DEGs of primary and recurrent tumors were  SLC9A1, SLC1A3 and SLC4A4 were found to be co-expressed. Similarly, CSF3, EGF, PF4, CCL20 and CCL7 were expressed in late stage tumor samples and REG1A, SPINK1, CTRC, CPA2, and CLPS genes in post-menopause tumor samples.

Tumor stage analyses of hub genes
The expression levels of top 5 hub genes unique in each category across different tumor stages is given in Figure 5(a-c). In recurrent tumor case, higher expression of CXCL8 and IRAK2 were observed. However, the expression level of CXCL8 was found to be decreasing as cancer progresses [Pr(>F) ¼ 0.00183] while that of IRAK2 remained almost unchanged. There was no measurable variation seen in the expression level of genes that were unique only to the latestage category. In post-menopause samples, significant expression of REG1A was found across all stages. Some hub genes were found to be common between the different categories ( Figure 6(a.b)) Among the common hub and coexpression module genes, the results predicted significant variation in the expression level of EGF [Pr(>F) ¼ 0.00258] gene that was common in late stage tumors and in recurrent tumors. The expression of EGF decreased gradually as the disease advanced. Other common hub genes did not show significant changes with respect to the tumor stages.

Discussion
Comprehensive studies on differential expression of genes and the co-expression modules between different patient groups is very important for cancer research. In this era of precision medicine and molecular diagnostics, the information available in the public domain can be put to good use by proper analysis. Analysis of samples from databases have identified different protein families as biomarkers. Dang et al have identified GPSM family members as potential biomarkers in breast cancer and SFXN family gene as a prognostic and therapeutic target for lung adenocarcinoma (34,35) Studies involving RNA-seq has revealed several information about the roles of circular RNA, micro RNAs, and certain genes in the pathophysiology of cancers (36,37). However, despite the progress made in the field of functional genomics, early diagnosis and appropriate prognosis of ovarian cancer remains a challenge. Appropriate biomarkers for diagnosis and prognosis of the probability of recurrence of the disease, stage of the disease, and probability of the disease in pre-menopausal women is an imminent requirement. Several previous studies have used the data available in TCGA database to understand the differential gene expression. Yang et al., identified 700 odd genes which were differentially expressed between stage II and III and 500 odd genes between stage III and IV. They also identified that COL3A1, COL1A1, COL1A2, KRAS, and NRAS genes played a role in poor prognosis of the disease (38). In yet another study, Xu et al., identified VSIG4, TGFBI, DCN, F13A1, ALOX5AP, GPX3, SFRP4 genes and correlated them with poor overall survival (35). In the present study, integrated differential expression analysis of TCGA gene expression dataset of ovarian cancer was executed (37). The datasets were divided into 3 different categories based on the occurrence of the cancer, the stage of the cancer and the age at diagnosis. Ovarian cancer is usually recurrent and relapses within 2-3 years after effective treatment. Moreover, it quickly progresses to the advanced stages, making the treatment ineffective. Though the disease is commonly seen in post menopause women, some women are also affected at the pre-menopause age. Hence classification of data into these three categories and identification of biomarkers for them gain significance. One of the major drawbacks associated with ovarian cancer is the lack of appropriate specific and sensitive biomarker for the disease diagnosis. In this study, we identified 84 upregulated and 62 downregulated genes based on the occurrence of tumor, 84 upregulated and 35 downregulated genes based on the stage, and 88 upregulated and 14 downregulated genes based on the age. Functional enrichment analysis of these DEGs revealed their association with various GO terms and KEGG pathways that are related to tumorigenesis and tumor progression. Some of these hub genes were also showed variation in expression levels as the disease stage progressed. Further investigation into this data may lead to development of potential biomarkers for diagnosis of ovarian cancer.
Most of the genes differentially expressed in recurrent tumor were found to be associated with angiogenesis, positive regulation of epithelial cell proliferation, autophagy, osteoblast differentiation, potassium ion transmembrane transport, phosphatidylinositol-4, 5-bisphosphate 3-kinase activity, TNF signalling pathway and apoptosis. For the early vs late-stage category, significant associations were found with cell adhesion, positive regulation of cell proliferation, positive regulation of ERK1 and ERK2 cascade, cell-cell signalling, female pregnancy and chemokine activity, growth factor activity and chemical carcinogenesis, whereas associations were observed with growth factor activity and serine-type endopeptidase inhibitor activity in post-menopausal women. From the PPI network analysis, hub genes were identified for each category. Though many of these hub genes were unique to each category, some common genes were also identified. CTRC, CPA2, SPINK1, CPA1 and DAZL have been observed to be differentially expressed in advanced stages of the diseases and also in patients at the post-menopausal age, signifying their potential role in progression of the disease. SLC4A4 and EGF were found to be common in advanced stages and in recurrent tumor.
In addition to this, 3 statistically significant coexpression modules linked with biological processes, molecular functions and pathways related to cancer progression were predicted. The coexpression module in recurrent tumor consists of EGF, PDGFRA, SLC9A1, SLC1A3, PIK3R1, SLC4A4 that are associated with phosphatidylinositol-4,5-bisphosphate 3-kinase activity, positive regulation of cell proliferation, growth factor activity, TNF signaling pathway and apoptosis. A co-expression module of CCL20, CSF3, PF4, CCL7 and EGF genes found in late-stage of the disease was associated with positive regulation of ERK1 and ERK2 cascade, cell-cell signaling, chemokine-mediated signaling pathway, positive regulation of cell proliferation, growth factor activity, angiogenesis, phosphatidylinositol-4,5bisphosphate 3-kinase activity. CTRC, CLPS, SPINK1, CPA2, REG1A formed a module found in post menopausal women and was associated with the growth factor activity.
In the latter part of the study, tumor stage analysis of the differentially expressed hub genes was performed. IRAK2 and CXCL8 were found to have higher expression in recurrent tumors while REG1A had higher expression in postmenopause samples.
Association of the expression levels with disease progression was observed in two genesthe EGF and CXCL8, where the expression was found to be decreasing as the disease progresses. Most of the other genes were found to have a consistent expression across different stages. Role of EGF and CXCL8 in cancers have been well documented. EGF and CXCL8 play important roles in angiogenesis during tumor progression. EGFR inhibitor is one of the common therapies for diseases like lung cancer and pancreatic cancer. However, inhibition of EGFR exhibits only modest effect on ovarian cancer patients. It is imperative to understand if lower levels of EGF in advanced stages, as predicted in this study, is contributing towards the low efficacy of EGFR inhibitors. For further understanding, it is essential to evaluate all the tumorogenic events controlled by EGF/EGFR axis. Studies have shown that knockdown of IL-8 sensitised the cells towards platinum based drugs (38). A better insight into the angiogenesis process and the role of immune cells in tumor progression will help us in understanding the trends observed in the current study. One of the major limitations of the study is that it relies mainly on the in siliso data. The data requires further validation in in vitro and in vivo models as well as in clinical samples in order to establish these genes as diagnostic biomarker for the disease in different category of patients. Nevertheless, this study brings us one step closer to the identification of potential diagnostic biomarkers for the disease.

Conclusions
In this study, we have identified a set of differentially expressed genes and co-expressed modules with a potential to be developed as diagnostic biomarkers for ovarian cancer. The samples used in the study were classified based on occurrence of the tumor, the stage of the tumor and the age at which it is diagnosed. The study helped in identifying 84 upregulated and 62 downregulated genes based on the occurrence of tumor, 84 upregulated and 35 downregulated genes based on the stage, and 88 upregulated and 14 downregulated genes based on the age. The co-expressed modules identified were capable of activating tumorogenic pathways. Several hub genes were identified and some of them were found to be common among different categories of datasets used for the study. Among these hub genes, EGF and CXCL8 showed a change in the expression level at different tumor stages, signifying their importance as biomarkers. These genes along with the existing biomarkers may be useful as probable biomarkers to study the diagnosis of the disease. However, further validation of this data in wet lab is required in establishing them as biomarkers.