rEGEN-B database
The rEGEN-B (rrn operons Extracted from GENomes of Bacteria) database is dedicated to the ribosomal operon sequences of bacteria. The database contains 523,869 sequences, representing 16,217 species, with an average length of 4,580 bp. It was filtered according to three “high-confidence curation” criteria: (i) sequences come only from genomes with confident assembly levels (i.e. “chromosome” or “complete genome” status, but not “contig” or “scaffold”), (ii) only sufficiently recent genomes were retained for operon sequence extraction (nothing before 2005), and (iii) the database was curated using the DB4Q2 pipeline (Dubois et al., 2022) to discard low-quality and misidentified sequences. For users with limited computational resources, a lighter version of the database has also been compiled by extracting only the first copy of the rrn operon in each genome (see the “uniq” label in the database files). This lighter database contains 115,032 rrn operon sequences.
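As an illustration only, criteria (i) and (ii) and the “uniq” selection could be sketched in Python as follows. This is not the code used to build rEGEN-B: the `Operon` record, its field names, and the helper functions are hypothetical, the 2005 cutoff is assumed to apply to the genome release date, and the “first copy” is assumed here to be the operon with the smallest start coordinate in a genome. Criterion (iii), the DB4Q2 curation, is not reproduced.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record layout for illustration; the actual rEGEN-B build
# scripts and metadata fields may differ.
@dataclass(frozen=True)
class Operon:
    genome_id: str       # genome/assembly identifier (illustrative)
    assembly_level: str  # e.g. "Complete Genome", "Chromosome", "Scaffold", "Contig"
    release_date: date   # assumed to be the genome release date
    start: int           # operon start coordinate in the genome
    sequence: str        # rrn operon sequence

# Criterion (i): only confident assembly levels are accepted.
ACCEPTED_LEVELS = {"complete genome", "chromosome"}
# Criterion (ii): nothing before 2005 (assumed cutoff: 1 January 2005).
CUTOFF = date(2005, 1, 1)

def passes_curation(op: Operon) -> bool:
    """Return True if the operon's genome meets criteria (i) and (ii)."""
    return op.assembly_level.lower() in ACCEPTED_LEVELS and op.release_date >= CUTOFF

def first_copy_per_genome(operons):
    """Keep one operon per genome: the one with the smallest start coordinate.

    Treating the lowest start coordinate as the "first copy" is an assumption
    made for this sketch.
    """
    first = {}
    for op in operons:
        kept = first.get(op.genome_id)
        if kept is None or op.start < kept.start:
            first[op.genome_id] = op
    return list(first.values())

if __name__ == "__main__":
    # Toy records, not real database entries.
    operons = [
        Operon("G1", "Complete Genome", date(2010, 5, 1), 500, "ACGT"),
        Operon("G1", "Complete Genome", date(2010, 5, 1), 100, "TTGA"),
        Operon("G2", "Scaffold", date(2015, 1, 1), 10, "CCCC"),
    ]
    curated = [op for op in operons if passes_curation(op)]
    uniq = first_copy_per_genome(curated)
    print(len(curated), len(uniq))
```

In this sketch, the full set corresponds to `curated` (all retained operon copies) and the lighter set to `uniq` (one operon per genome).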
Database update (2025-01-15):
- rEGEN-B: 542,371 sequences, 15,903 species
- rEGEN-B_uniq: 115,727 sequences, 15,903 species
The rEGEN-B database was constructed as part of the PRONAME pipeline, which has been developed to process Nanopore metabarcoding data and to significantly increase its accuracy and usability. Thanks to an approach that combines several quality filtering steps, read clustering, error correction with a tool specifically dedicated to Nanopore data, and the use of duplex reads, the generated consensus sequences display at least 99.5% accuracy with default settings. Please refer to the project GitHub repository for detailed information: https://github.com/benn888/PRONAME
Dubois, B., Debode, F., Hautier, L., Hulin, J., Martin, G. S., Delvaux, A., et al. (2022). A detailed workflow to develop QIIME2-formatted reference databases for taxonomic analysis of DNA metabarcoding data. BMC Genom Data 23, 53. doi: 10.1186/s12863-022-01067-5