bpRNA-NF-15.0: an RNA secondary structure dataset for family-wise evaluation
bpRNA-NF-15.0 is an RNA sequence and secondary structure dataset that contains only families absent from bpRNA-1m.
Using bpRNA-1m as a training dataset and bpRNA-NF-15.0 as a test dataset makes it possible to assess how well a model generalizes to unseen families.
To our knowledge, bpRNA-new is the only other existing dataset designed for this purpose.
bpRNA-NF-15.0 is based on the latest Rfam version (15.0) and contains twice as many new families as bpRNA-new. It also includes longer RNA sequences, up to 951 nt, whereas bpRNA-new only contains RNA sequences shorter than 500 nt.
We also provide the Train, Validation and Test datasets described in our study. All three are built from bpRNA-1m. The Test dataset is sequence-wise with respect to the Train dataset: the CD-HIT-EST tool was used to ensure that sequence similarity between the two datasets does not exceed 80%, but they may share RNA families.
Each dataset contains 3 variables:
- rna_name: the name of this sequence, as taken from the source dataset (Rfam for bpRNA-NF-15.0, or bpRNA-1m for Train / Validation / Test).
- seq: the RNA sequence.
- struct: its secondary structure in dot-bracket notation.
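As an illustration of how the three variables fit together, the following minimal sketch checks one record and recovers its base pairs from the dot-bracket string. It assumes that `struct` has the same length as `seq` and that pseudoknots, if present, are written with the extra bracket types `[]`, `{}` and `<>`; letter-based pairs such as `Aa` are not handled. The function names and the toy record are hypothetical and are not part of the dataset or of our code.

```python
# Bracket families handled by this sketch (assumption: letter pairs are not used).
OPENERS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in OPENERS.items()}


def parse_dot_bracket(struct):
    """Return the sorted list of 0-based (i, j) base pairs encoded by struct."""
    # One stack per bracket family, so crossing (pseudoknotted) pairs are supported.
    stacks = {open_: [] for open_ in OPENERS}
    pairs = []
    for j, char in enumerate(struct):
        if char in OPENERS:
            stacks[char].append(j)
        elif char in CLOSERS:
            stack = stacks[CLOSERS[char]]
            if not stack:
                raise ValueError(f"unmatched {char!r} at position {j}")
            pairs.append((stack.pop(), j))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at position {j}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {open_!r} at position {stack[-1]}")
    return sorted(pairs)


def check_record(record):
    """Check that seq and struct are consistent and return the base pairs."""
    seq, struct = record["seq"], record["struct"]
    if len(seq) != len(struct):
        raise ValueError(f"{record['rna_name']}: seq and struct differ in length")
    return parse_dot_bracket(struct)


# Toy record, invented for illustration only.
example = {"rna_name": "toy_example", "seq": "GGGAAACCC", "struct": "(((...)))"}
pairs = check_record(example)
print(pairs)  # [(0, 8), (1, 7), (2, 6)]
```

A pseudoknotted string such as `([)]` yields the crossing pairs `(0, 2)` and `(1, 3)`, since each bracket type is matched independently.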
The bpRNA-NF-15.0 dataset was extracted from Rfam 15.0, following a procedure similar to the one that was used to build bpRNA-new.
First, RNA sequences were selected from Rfam 15.0, but only from families that are not included in Rfam 12.2.
This ensures that the dataset shares no families with bpRNA-1m, since bpRNA-1m was built from Rfam 12.2.
Utility functions were applied to resolve potential discrepancies, for example by converting sequence characters to uppercase or by ensuring an efficient bracket representation.
Non-canonical base pairs were removed.
Then, the CD-HIT-EST software was applied at an 80% similarity threshold to remove redundant sequences from the dataset.
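The family selection and cleaning steps can be sketched as follows. This is a simplified illustration, not the code used to build the dataset: it assumes that canonical pairs are the Watson-Crick pairs plus the G-U wobble pair, it handles plain parentheses only, and it expects a balanced structure. The function names and family identifiers are hypothetical. The redundancy removal step relies on the external CD-HIT-EST program and is not reproduced here.

```python
# Assumption: Watson-Crick pairs and the G-U wobble pair are treated as canonical.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def select_new_families(rfam_15_families, rfam_12_2_families):
    """Keep the families present in Rfam 15.0 but not in Rfam 12.2."""
    return sorted(set(rfam_15_families) - set(rfam_12_2_families))


def clean_record(seq, struct):
    """Uppercase the sequence and unpair every non-canonical base pair.

    Simplification: only '(' and ')' are interpreted; other characters
    are copied through unchanged.
    """
    seq = seq.upper()
    chars = list(struct)
    stack = []
    for j, char in enumerate(chars):
        if char == "(":
            stack.append(j)
        elif char == ")":
            i = stack.pop()
            if (seq[i], seq[j]) not in CANONICAL:
                # Replace both ends of the non-canonical pair by unpaired dots.
                chars[i] = chars[j] = "."
    return seq, "".join(chars)


# Hypothetical family identifiers, for illustration only.
new_families = select_new_families(
    ["family_a", "family_b", "family_c"], ["family_a"]
)
print(new_families)  # ['family_b', 'family_c']

# The outermost pair (G-A) is non-canonical and is therefore removed.
cleaned = clean_record("gGgaaaCcA", "(((...)))")
print(cleaned)  # ('GGGAAACCA', '.((...)).')
```

Removing a pair this way leaves the sequence untouched and only changes the two structure characters, so `seq` and `struct` keep the same length.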
To cite this dataset, please use:
Omnes L., Angel E., Bartet P., Tahi F. A divide-and-conquer approach based on deep learning for long RNA secondary structure prediction: focus on pseudoknots.