GENNUS - Generative Approach for Nucleotide Sequences
Classifying non-coding RNA (ncRNA) sequences, particularly mirtrons, is essential for elucidating gene regulation mechanisms. However, the prevalent class imbalance in ncRNA datasets presents significant challenges, often resulting in overfitting and diminished generalization in machine learning models.
This study proposes GENerative Approaches for NUcleotide Sequences (GENNUS), a set of novel data augmentation strategies based on Generative Adversarial Networks (GANs) and the Synthetic Minority Over-sampling Technique (SMOTE) to enhance ncRNA classification performance.
Our GAN-based methods effectively generate high-quality synthetic data that capture the intricate patterns and diversity of real mirtron sequences, eliminating the need for extensive feature engineering.
Across four experiments, models trained on a combination of real and GAN-generated data achieve higher classification accuracy than models trained with traditional SMOTE augmentation or with real data alone.
Our findings reveal that GANs enhance model performance and provide a richer representation of minority classes, thus improving generalization capabilities across various machine learning frameworks. This work highlights the transformative potential of synthetic data generation in addressing data limitations in genomics, offering a pathway for more effective and scalable ncRNA classification methodologies.
Open-source code to fully reproduce the results is available at https://github.com/chiquitto/GENNUS
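
As an illustration of the SMOTE baseline mentioned above, the following is a minimal sketch of the interpolation idea behind SMOTE: each synthetic minority sample is placed on the line segment between a real minority sample and one of its nearest minority neighbors. It is not the GENNUS implementation. The function name `smote_oversample`, its parameters, and the toy feature vectors are hypothetical and chosen only for this example; it assumes sequences have already been encoded as fixed-length numeric feature vectors.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    X_min: array of shape (n_samples, n_features), minority class only.
    Hypothetical helper for illustration; not taken from the GENNUS code.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position between base and neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Toy example: 6 minority samples with 4 features, balanced against 20 majority samples.
rng = np.random.default_rng(42)
X_minority = rng.normal(loc=1.0, size=(6, 4))
X_majority = rng.normal(loc=-1.0, size=(20, 4))

X_synth = smote_oversample(X_minority, n_new=len(X_majority) - len(X_minority), k=3)

# Training set combining real and synthetic data (label 1 = minority class).
X_train = np.vstack([X_majority, X_minority, X_synth])
y_train = np.concatenate([np.zeros(len(X_majority)), np.ones(len(X_minority) + len(X_synth))])
```

Because every synthetic point is a convex combination of two real minority samples, SMOTE cannot produce samples outside the region already spanned by the minority class, which is one reason a generative model may offer more diverse augmentation.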