
cpn60-Classifier v10.1 (Performance testing)

dataset
posted on 2023-04-26, 20:41, authored by Janet Hill

cpn60-Classifier v10.1 

(For additional information and releases, visit HillLab on github)


This is the version of the RDP Classifier, trained on 11,001 reference cpn60 sequences, that was used for performance testing. Duplicate sequences were removed from the reference database with the rm-dupseq function of the RDP Classifier, since duplicates can inflate results during classification performance testing.
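
To illustrate what duplicate removal means here, the following is a minimal sketch that drops exact duplicate sequences from a FASTA file and keeps the first record seen. It is not the rm-dupseq implementation, which may apply different criteria. The function name dedup_fasta is hypothetical, and the sketch assumes sequences that differ only in letter case count as duplicates.

```shell
# Hypothetical helper (not part of RDPTools): print a FASTA file with
# exact duplicate sequences removed, keeping the first record seen.
dedup_fasta() {
    awk '
        # Print the buffered record unless its sequence was already seen.
        function emit(   key) {
            if (header == "") return
            key = toupper(seq)
            if (!(key in seen)) {
                seen[key] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # join wrapped sequence lines
        END  { emit() }
    ' "$1"
}
```

Because a duplicated reference sequence could be classified against its own copy during testing, removing duplicates first gives a more conservative accuracy estimate.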


(An updated release containing additional sequences has been made available since this original investigation)


The release contains training files (taxonomy table and FASTA formatted sequences) as well as the trained classifier for use with RDP Classifier.


RDPTools includes the classifier and can be installed with conda https://anaconda.org/bioconda/rdptools (Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73:5261–7).

Quick start with the trained classifier

Download cpn60-Classifier_v10_trained.tar.gz and unpack it. The resulting directory should include:

  • bergeyTrainingTree.xml
  • genus_wordConditionalProbList.txt
  • logWordPrior.txt
  • rRNAClassifier.properties
  • wordConditionalProbIndexArr.txt
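
The file list above can be checked before running the classifier. This is a minimal sketch: the function name check_trained_dir is hypothetical, and the unpacked directory name is assumed to match the tarball name.

```shell
# Hypothetical helper: report any expected classifier file that is
# missing from the given directory. Returns 0 if all are present, 1 otherwise.
check_trained_dir() {
    dir="$1"
    missing=0
    for f in bergeyTrainingTree.xml \
             genus_wordConditionalProbList.txt \
             logWordPrior.txt \
             rRNAClassifier.properties \
             wordConditionalProbIndexArr.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (directory name is an assumption):
#   tar -xzf cpn60-Classifier_v10_trained.tar.gz
#   check_trained_dir cpn60-Classifier_v10_trained && echo "all files present"
```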

A basic command to classify cpn60 sequences contained in a file called queries.fasta:


java -jar /path/to/RDPTools/classifier.jar classify -c 0.9 -f allrank -t /path/to/cpn60-Classifier_v10_trained/rRNAClassifier.properties -o output.txt queries.fasta


See the README here for more details on the RDP Classifier: https://github.com/rdpstaff/classifier

To train the Classifier

Download cpn60-Classifier_v10_training.tar.gz and unpack it. The resulting directory should include:

  • refseqs_v10.fasta
  • taxonomytable_v10.txt

Other scripts needed (from https://github.com/GLBRC-TeamMicrobiome/python_scripts, with a minor edit to addFullLineage.py to fix an error):

  • addFullLineage-jh.py
  • lineage2taxTrain.py

(If you want to generate your own taxonomy file, see https://pypi.org/project/taxonomy-ranks/)


Make ready-to-train taxonomy:


/path/to/lineage2taxTrain.py taxonomytable_v10.txt > ready2train_taxonomy.txt


Add lineages to fasta sequence definition lines:


/path/to/addFullLineage-jh.py taxonomytable_v10.txt refseqs_v10.fasta > ready2train_refseqs.fasta


Now train:


java -jar /path/to/RDPTools/classifier.jar train -o training_files -s ready2train_refseqs.fasta -t ready2train_taxonomy.txt


The resulting directory (training_files) contains the trained classifier except for one important file, rRNAClassifier.properties, which you need to add manually.
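
One way to add the missing file is sketched below. The key names are assumed from the sample rRNAClassifier.properties shipped with the RDP Classifier, so verify them against your RDPTools version. The function name write_properties is hypothetical, and the classifierVersion value is a free-text label chosen here for illustration.

```shell
# Hypothetical helper: write rRNAClassifier.properties into a training
# output directory, pointing at the four files produced by training.
write_properties() {
    dir="$1"
    cat > "$dir/rRNAClassifier.properties" <<'EOF'
# Key names assumed from the RDP Classifier sample properties file
bergeyTree=bergeyTrainingTree.xml
probabilityList=genus_wordConditionalProbList.txt
probabilityIndex=wordConditionalProbIndexArr.txt
wordPrior=logWordPrior.txt
classifierVersion=cpn60-Classifier v10.1 (locally trained)
EOF
}

# Usage:
#   write_properties training_files
```

Alternatively, since the trained release already includes an rRNAClassifier.properties that refers to the same file names, copying that file into training_files should work equally well.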
