Jupyter notebook for predicting miRNA target genes
This notebook contains the main analysis for predicting potential miRNA targets.
Using miRNA and gene expression data from TCGA, we performed a correlation analysis between all miRNAs and all genes across multiple cancer types. The correlations served as features describing each miRNA-gene pair. Using existing databases of known miRNA targets, we labeled the miRNA-gene pairs and trained machine learning models to predict novel miRNA-gene relationships. Our analysis involved over 22 million miRNA-gene pairs, of which 0.12% were confirmed relationships in the existing databases and were labeled as positives. The remaining 99.88% were labeled as negatives because they were not reported in those databases. Given the highly imbalanced nature of the dataset, we downsampled the negative class. After downsampling, the data was split into 80% for training and 20% for testing, and each model’s performance was measured using the Area Under the Curve (AUC) metric. At each downsampling level, we trained 1000 models and applied them to the original negatives. The miRNA-gene pairs that were consistently predicted as positive were considered potential miRNA target relationships.
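The workflow described above (downsample the negatives, split 80/20, score by AUC, repeat many times, keep the negatives that are predicted positive consistently) can be sketched as follows. This is a minimal illustration only, not the notebook's actual code. It uses the Python standard library, a toy nearest-centroid scorer as a stand-in for whatever model the notebook trains, and synthetic feature vectors in place of the TCGA correlations. The function names, the `ratio` parameter (negatives kept per positive), and the `min_fraction` consensus threshold are all assumptions made for the example; the description above only says "consistently predicted".

```python
import random

def auc(scores_pos, scores_neg):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

def centroid(rows):
    """Column-wise mean of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(pos, neg):
    """Toy stand-in model (assumption): nearest-centroid linear scorer."""
    cp, cn = centroid(pos), centroid(neg)
    w = [a - b for a, b in zip(cp, cn)]
    bias = sum(wi * (a + b) / 2 for wi, a, b in zip(w, cp, cn))
    return w, bias

def score(model, x):
    """Positive score means x lies on the positive-centroid side."""
    w, bias = model
    return sum(wi * xi for wi, xi in zip(w, x)) - bias

def run_round(pos, neg, ratio, rng):
    """Downsample negatives, split 80/20, train, and return (model, test AUC)."""
    sampled_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    pos_shuf = pos[:]
    rng.shuffle(pos_shuf)
    cut_p, cut_n = int(0.8 * len(pos_shuf)), int(0.8 * len(sampled_neg))
    model = fit(pos_shuf[:cut_p], sampled_neg[:cut_n])
    test_auc = auc([score(model, x) for x in pos_shuf[cut_p:]],
                   [score(model, x) for x in sampled_neg[cut_n:]])
    return model, test_auc

def consensus_candidates(pos, neg, ratio=1, n_models=1000, min_fraction=0.95, seed=0):
    """Train n_models models and return indices of the original negatives that
    at least min_fraction of them predict as positive, plus the mean test AUC."""
    rng = random.Random(seed)
    votes = [0] * len(neg)
    aucs = []
    for _ in range(n_models):
        model, test_auc = run_round(pos, neg, ratio, rng)
        aucs.append(test_auc)
        for i, x in enumerate(neg):  # apply each model to ALL original negatives
            if score(model, x) > 0:
                votes[i] += 1
    hits = [i for i, v in enumerate(votes) if v >= min_fraction * n_models]
    return hits, sum(aucs) / len(aucs)

def make_toy_data(rng, n_pos=40, n_neg=400, n_hidden=5, n_features=4):
    """Synthetic correlation-like features in [-1, 1]; NOT real TCGA data.
    The first n_hidden negatives are drawn like positives (unreported targets)."""
    def vec(mean):
        return [max(-1.0, min(1.0, rng.gauss(mean, 0.1))) for _ in range(n_features)]
    pos = [vec(-0.5) for _ in range(n_pos)]
    neg = [vec(-0.5) for _ in range(n_hidden)] + [vec(0.0) for _ in range(n_neg - n_hidden)]
    return pos, neg

if __name__ == "__main__":
    pos, neg = make_toy_data(random.Random(1))
    hits, mean_auc = consensus_candidates(pos, neg, ratio=1, n_models=50)
    print("candidate negatives:", hits)
    print("mean test AUC:", round(mean_auc, 3))
```

In the toy data, each feature vector plays the role of one miRNA-gene pair's correlations across cancer types, and a few "negatives" are deliberately generated to look like positives so that the consensus step has something to recover. On real data the scorer would be replaced by a proper classifier, and both the downsampling ratios and the consensus threshold would need to be chosen and reported explicitly.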