Metascape Gene List Analysis Report

metascape.org [1]

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2022-04-22). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.

Table 1. Statistics of input gene lists.
Name Total Unique
MyList 47 47
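
The collapsing of identifiers described above can be sketched as follows. This is a minimal illustration, not Metascape's implementation: the `to_entrez` lookup table and the function name are hypothetical, and both the choice to skip unmapped identifiers and the way "Total" is counted are assumptions.

```python
def summarize_gene_list(identifiers, to_entrez):
    """Return (total, unique) counts after mapping identifiers to Entrez gene IDs.

    `to_entrez` is a hypothetical lookup table. Identifiers missing from it are
    skipped, which is an assumption rather than documented behaviour.
    """
    mapped = [to_entrez[i] for i in identifiers if i in to_entrez]
    # Identifiers that resolve to the same Entrez gene ID count once in `unique`.
    return len(mapped), len(set(mapped))


# Illustrative mapping: a symbol and an alias resolving to the same Entrez gene ID.
to_entrez = {"TP53": 7157, "P53": 7157, "EGFR": 1956}
total, unique = summarize_gene_list(["TP53", "P53", "EGFR", "UNKNOWN"], to_entrez)
print(total, unique)
```

In Table 1 the two counts are equal (47 and 47), so no two input identifiers collapsed to the same gene.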

Gene Annotation

The following annotations were retrieved from the latest version of the database (last updated on 2022-04-22) (Table 2).

Table 2. Gene annotations extracted.
Name | Type | Description
Gene Symbol | Description | Primary HUGO gene symbol.
Description | Description | Short description.
Biological Process (GO) | Function/Location | Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Kinase Class (UniProt) | Function/Location | Detailed kinase classes.
Protein Function (Protein Atlas) | Function/Location | Protein Function (Protein Atlas)
Subcellular Location (Protein Atlas) | Function/Location | Subcellular Location (Protein Atlas)
Drug (DrugBank) | Genotype/Phenotype/Disease | Drug information for the given gene as target.
Canonical Pathways | Ontology | Canonical Pathways
Hallmark Gene Sets | Ontology | Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, Cell Type Signatures, CORUM, TRRUST, DisGeNET, PaGenBase, Transcription Factor Targets, WikiPathways and COVID. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution [2], and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing [3]. Kappa scores [4] are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
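
The statistics named in this paragraph can be sketched with the Python standard library alone. This is an illustrative reading of the text, not Metascape's code: the function names and the example numbers (a background of 20,000 genes and a 500-gene term) are assumptions made for the example.

```python
from math import comb, log10


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k): n genes drawn from a background of N,
    of which K belong to the term, with k of the drawn genes in the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_factor(N, K, n, k):
    """Observed count divided by the count expected by chance (n * K / N)."""
    return k / (n * K / N)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical numbers: 20,000 background genes, a 500-gene term, 47 input genes, 7 hits.
p = hypergeom_pvalue(20000, 500, 47, 7)
print(f"p = {p:.2e}, log10(p) = {log10(p):.2f}")
print(f"enrichment factor = {enrichment_factor(20000, 500, 47, 7):.2f}")
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the q-value correction scales with the number of terms tested, a term with a small p-value can still have a q-value close to 1 when many terms are tested.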

Table 3. Top 9 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the base-10 logarithm of the p-value. "Log10(q)" is the base-10 logarithm of the multi-test adjusted p-value.
GO | Category | Description | Count | % | Log10(P) | Log10(q)
GO:0034645 | GO Biological Processes | cellular macromolecule biosynthetic process | 7 | 14.89 | -3.83 | 0.00
WP4255 | WikiPathways | Non-small cell lung cancer | 3 | 6.38 | -3.71 | 0.00
hsa05203 | KEGG Pathway | Viral carcinogenesis | 4 | 8.51 | -3.54 | 0.00
GO:0140352 | GO Biological Processes | export from cell | 5 | 10.64 | -3.26 | 0.00
GO:0016032 | GO Biological Processes | viral process | 4 | 8.51 | -3.22 | 0.00
GO:0033674 | GO Biological Processes | positive regulation of kinase activity | 5 | 10.64 | -3.02 | 0.00
hsa04145 | KEGG Pathway | Phagosome | 3 | 6.38 | -2.76 | 0.00
GO:0016241 | GO Biological Processes | regulation of macroautophagy | 3 | 6.38 | -2.72 | 0.00
GO:0000278 | GO Biological Processes | mitotic cell cycle | 5 | 10.64 | -2.61 | 0.00
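
The "%" and "Log10(P)" columns can be checked with a short calculation. The sketch below assumes that all 47 input genes carry at least one ontology annotation, which the percentages in Table 3 are consistent with. The helper names are illustrative only.

```python
def percent_of_list(count, list_size):
    """Percentage of input genes found in a term, rounded to two decimals."""
    return round(100 * count / list_size, 2)


def p_from_log10(log10_p):
    """Recover the p-value from its base-10 logarithm."""
    return 10 ** log10_p


# Counts taken from Table 3, with an assumed denominator of 47 annotated input genes.
for count in (7, 5, 4, 3):
    print(count, percent_of_list(count, 47))

# A Log10(P) of -3.83 corresponds to a p-value of roughly 1.5e-4.
print(p_from_log10(-3.83))
```

A Log10(q) of 0.00 corresponds to a q-value of 1, so none of the terms in Table 3 remains significant after multiple-testing adjustment.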

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape [5], where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be interactively viewed in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser. For clarity, term labels are only shown for one term per cluster, so it is recommended to use Cytoscape or a browser to visualize the network in order to inspect all node labels. The network can also be exported to a PDF file within Cytoscape, and the labels then edited using Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
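
A minimal sketch of how a kappa-based similarity and the 0.3 edge threshold could work is given below. It assumes that Cohen's kappa is computed over binary term membership of the genes in a chosen universe (here, a small made-up gene set). The report does not state which universe Metascape uses, and the function names are hypothetical.

```python
from itertools import combinations


def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for two terms' binary membership over the genes in `universe`."""
    n = len(universe)
    both = sum(1 for g in universe if g in genes_a and g in genes_b)
    neither = sum(1 for g in universe if g not in genes_a and g not in genes_b)
    in_a = sum(1 for g in universe if g in genes_a)
    in_b = sum(1 for g in universe if g in genes_b)
    observed = (both + neither) / n
    expected = (in_a * in_b + (n - in_a) * (n - in_b)) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both memberships are constant; treated as identical here.
        return 1.0
    return (observed - expected) / (1 - expected)


def similarity_edges(term_genes, universe, threshold=0.3):
    """Return (term, term, kappa) for term pairs whose kappa exceeds the threshold."""
    edges = []
    for a, b in combinations(sorted(term_genes), 2):
        score = kappa_score(term_genes[a], term_genes[b], universe)
        if score > threshold:
            edges.append((a, b, round(score, 3)))
    return edges


# Made-up terms over a made-up six-gene universe.
universe = {"g1", "g2", "g3", "g4", "g5", "g6"}
term_genes = {
    "termA": {"g1", "g2", "g3"},
    "termB": {"g1", "g2", "g4"},
    "termC": {"g4", "g5", "g6"},
}
print(similarity_edges(term_genes, universe))
```

Kappa corrects raw agreement for the agreement expected by chance, so two large terms that overlap only as much as chance predicts score near zero and are not connected.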

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: STRING [6], BioGrid [7], OmniPath [8], InWeb_IM [9]. Only physical interactions in STRING (physical score > 0.132) and BioGrid are used. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm [10] has been applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
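
The network construction step can be sketched as below. This covers only the filtering to list members and the size check that gates MCODE, not the MCODE algorithm itself. The interaction pairs are made up, the function names are hypothetical, and treating the 3 to 500 range as inclusive is an assumption.

```python
def physical_subnetwork(gene_list, interactions):
    """Keep interactions whose two partners are both in the list.

    Returns (nodes, edges). Self-interactions are dropped, and a protein with no
    partner inside the list does not appear among the nodes.
    """
    genes = set(gene_list)
    edges = {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in genes and b in genes and a != b
    }
    nodes = {g for edge in edges for g in edge}
    return nodes, edges


def eligible_for_mcode(nodes, low=3, high=500):
    """Size check described in the text; inclusive bounds are an assumption."""
    return low <= len(nodes) <= high


# Made-up interaction pairs: "X" is outside the list, so "D" ends up with no partner.
interactions = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "X"), ("D", "D")]
nodes, edges = physical_subnetwork(["A", "B", "C", "D"], interactions)
print(sorted(nodes), sorted(edges), eligible_for_mcode(nodes))
```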

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.
GO | Description | Log10(P)
GO:0043549 | regulation of kinase activity | -5.6
GO:0033674 | positive regulation of kinase activity | -5.2
GO:0051347 | positive regulation of transferase activity | -4.9

Quality Control and Association Analysis

Gene list enrichments are identified in the following ontology categories: Cell_Type_Signatures, DisGeNET, Transcription_Factor_Targets. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. The top few enriched clusters (one term per cluster) are shown in Figures 4-6. The algorithm used here is the same as that used for pathway and process enrichment analysis.
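
The three cut-offs stated above amount to a simple filter. The sketch below is illustrative: the `Term` record, its field names and the example values are hypothetical, and the real pipeline additionally clusters the surviving terms.

```python
from dataclasses import dataclass


@dataclass
class Term:
    term_id: str
    count: int          # input genes annotated with the term
    p_value: float      # hypergeometric p-value
    enrichment: float   # observed count / expected count


def select_terms(terms, max_p=0.01, min_count=3, min_enrichment=1.5):
    """Keep terms meeting all three cut-offs, most significant first."""
    kept = [
        t for t in terms
        if t.p_value < max_p and t.count >= min_count and t.enrichment > min_enrichment
    ]
    return sorted(kept, key=lambda t: t.p_value)


# Made-up terms: only T1 and T4 satisfy every cut-off.
terms = [
    Term("T1", count=5, p_value=1e-4, enrichment=6.0),
    Term("T2", count=2, p_value=1e-5, enrichment=9.0),   # too few genes
    Term("T3", count=4, p_value=0.02, enrichment=3.0),   # p-value too large
    Term("T4", count=3, p_value=1e-3, enrichment=2.0),
    Term("T5", count=6, p_value=5e-3, enrichment=1.2),   # enrichment too low
]
print([t.term_id for t in select_terms(terms)])
```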

Figure 4. Summary of enrichment analysis in Cell Type Signatures [11].

GO | Description | Count | % | Log10(P) | Log10(q)
M40025 | BUSSLINGER DUODENAL DIFFERENTIATING STEM CELLS | 5 | 11.00 | -4.00 | -0.35
M41652 | TRAVAGLINI LUNG PROXIMAL BASAL CELL | 6 | 13.00 | -3.40 | -0.21
M39217 | ZHENG CORD BLOOD C8 PUTATIVE LYMPHOID PRIMED MULTIPOTENT PROGENITOR 2 | 3 | 6.40 | -3.30 | -0.19
M39263 | HU FETAL RETINA BLOOD | 4 | 8.50 | -3.00 | -0.14
M39125 | AIZARANI LIVER C24 EPCAM POS BILE DUCT CELLS 3 | 3 | 6.40 | -2.50 | 0.00
M41651 | TRAVAGLINI LUNG BASAL CELL | 3 | 6.40 | -2.50 | 0.00
M40026 | BUSSLINGER DUODENAL TRANSIT AMPLIFYING CELLS | 3 | 6.40 | -2.50 | 0.00
M39175 | MURARO PANCREAS MESENCHYMAL STROMAL CELL | 5 | 11.00 | -2.40 | 0.00
M41670 | TRAVAGLINI LUNG LYMPHATIC CELL | 3 | 6.40 | -2.40 | 0.00
M41749 | RUBENSTEIN SKELETAL MUSCLE NK CELLS | 3 | 6.40 | -2.30 | 0.00
M40010 | BUSSLINGER GASTRIC ISTHMUS CELLS | 4 | 8.50 | -2.30 | 0.00
M41690 | TRAVAGLINI LUNG BASOPHIL MAST 2 CELL | 4 | 8.50 | -2.10 | 0.00
M41712 | FAN OVARY CL10 PUTATIVE EARLY ATRESIA GRANULOSA CELL | 3 | 6.40 | -2.00 | 0.00
M41689 | TRAVAGLINI LUNG BASOPHIL MAST 1 CELL | 3 | 6.40 | -2.00 | 0.00

Figure 5. Summary of enrichment analysis in DisGeNET [12].

GO | Description | Count | % | Log10(P) | Log10(q)
C0740392 | Infarction, Middle Cerebral Artery | 5 | 11.00 | -5.50 | -0.99
C0266568 | Persistent Hyperplastic Primary Vitreous | 3 | 6.40 | -4.70 | -0.63
C2062441 | Influenza A | 7 | 15.00 | -4.60 | -0.63
C0027708 | Nephroblastoma | 7 | 15.00 | -4.50 | -0.63
C0279583 | Childhood T Acute Lymphoblastic Leukemia | 4 | 8.50 | -4.40 | -0.63
C0337428 | Fibrinogen assay | 3 | 6.40 | -4.10 | -0.36
C0024301 | Lymphoma, Follicular | 6 | 13.00 | -4.00 | -0.36
C0007193 | Cardiomyopathy, Dilated | 6 | 13.00 | -3.90 | -0.35
C0011853 | Diabetes Mellitus, Experimental | 6 | 13.00 | -3.80 | -0.34
C1333015 | Childhood Kidney Wilms Tumor | 5 | 11.00 | -3.70 | -0.31
C0006413 | Burkitt Lymphoma | 6 | 13.00 | -3.70 | -0.30
C0038273 | Stereotypic Movement Disorder | 4 | 8.50 | -3.60 | -0.30
C0035242 | Respiratory Tract Diseases | 4 | 8.50 | -3.60 | -0.30
C0266464 | Polymicrogyria | 4 | 8.50 | -3.60 | -0.30
C0036205 | Sarcoidosis, Pulmonary | 3 | 6.40 | -3.60 | -0.30
C1384583 | Congenital absence of germinal epithelium of testes | 3 | 6.40 | -3.50 | -0.30
C0032269 | Pneumococcal Infections | 3 | 6.40 | -3.50 | -0.28
C0008149 | Chlamydia Infections | 3 | 6.40 | -3.50 | -0.26
C1961099 | Precursor T-Cell Lymphoblastic Leukemia-Lymphoma | 6 | 13.00 | -3.30 | -0.19
C0578038 | Thin lips | 3 | 6.40 | -3.30 | -0.18

Figure 6. Summary of enrichment analysis in Transcription Factor Targets.

GO | Description | Count | % | Log10(P) | Log10(q)
M29968 | FOXE1 TARGET GENES | 7 | 15.00 | -3.90 | -0.35
M30190 | TAF9B TARGET GENES | 6 | 13.00 | -3.60 | -0.30
M9902 | ELF1 Q6 | 4 | 8.50 | -3.20 | -0.15
M9431 | AP1 Q6 | 4 | 8.50 | -3.20 | -0.15
M17769 | STAT1 02 | 4 | 8.50 | -3.10 | -0.15
M5440 | AP1 Q4 | 4 | 8.50 | -3.10 | -0.15
M14686 | ELK1 01 | 4 | 8.50 | -3.10 | -0.15
M30015 | HOXC13 TARGET GENES | 3 | 6.40 | -3.00 | -0.15
M30131 | PSMB5 TARGET GENES | 4 | 8.50 | -2.90 | -0.05
M40783 | ZNF549 TARGET GENES | 5 | 11.00 | -2.60 | 0.00
M40826 | CIC TARGET GENES | 4 | 8.50 | -2.40 | 0.00
M11345 | AP4 Q6 | 3 | 6.40 | -2.30 | 0.00
M30045 | LCORL TARGET GENES | 4 | 8.50 | -2.30 | 0.00
M12298 | CEBP Q2 | 3 | 6.40 | -2.20 | 0.00
M3037 | E2F1 Q6 01 | 3 | 6.40 | -2.20 | 0.00
M5320 | HIF1 Q5 | 3 | 6.40 | -2.20 | 0.00
M17508 | USF2 Q6 | 3 | 6.40 | -2.10 | 0.00
M2146 | STAT1 03 | 3 | 6.40 | -2.10 | 0.00
M11921 | NFKB Q6 | 3 | 6.40 | -2.10 | 0.00
M30340 | ZNF528 TARGET GENES | 5 | 11.00 | -2.10 | 0.00

References

  1. Zhou et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications (2019) 10(1):1523.
  2. Zar, J.H. Biostatistical Analysis, 4th edn. Prentice Hall, NJ (1999), p. 523.
  3. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine (1990) 9:811-818.
  4. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:27-46.
  5. Shannon P. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498-2504.
  6. Szklarczyk D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2019) 47:D607-613.
  7. Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-539.
  8. Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods. (2016) 13:966-967.
  9. Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods. (2017) 14:61-64.
  10. Bader, G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics (2003) 4:2.
  11. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-15550 (2005).
  12. Pinero J, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research 45, D833-D839 (2017).