cpn60-Classifier v10.1 (Performance testing)
(For additional information and releases, visit HillLab on github)
This is the version of the RDP Classifier trained on 11,001 reference cpn60 sequences that was used for performance testing. Duplicate sequences were removed from the reference database using the rm-dupseq function of the RDP Classifier, because duplicates can inflate the apparent accuracy during classification performance testing.
(An updated release containing additional sequences has been made available since this original investigation)
The release contains training files (taxonomy table and FASTA formatted sequences) as well as the trained classifier for use with RDP Classifier.
RDPTools includes the classifier and can be installed with conda: https://anaconda.org/bioconda/rdptools (Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73:5261–7).
Quick start with the trained classifier
Download cpn60-Classifier_v10_trained.tar.gz and unpack it. The resulting directory should include:
- bergeyTrainingTree.xml
- genus_wordConditionalProbList.txt
- logWordPrior.txt
- rRNAClassifier.properties
- wordConditionalProbIndexArr.txt
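Before classifying, it can help to confirm that the unpacked directory is complete. The following is a minimal sketch, not part of the release: check_trained_dir is a hypothetical helper name, and the directory name in the usage comment is assumed from the archive name.

```shell
# Hypothetical helper: report any of the five expected classifier files
# that is missing from the given directory. Returns 0 only if all exist.
check_trained_dir() {
    dir="$1"
    missing=0
    for f in bergeyTrainingTree.xml \
             genus_wordConditionalProbList.txt \
             logWordPrior.txt \
             rRNAClassifier.properties \
             wordConditionalProbIndexArr.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (directory name is an assumption based on the archive name):
# tar -xzf cpn60-Classifier_v10_trained.tar.gz
# check_trained_dir cpn60-Classifier_v10_trained
```

The function prints each missing file name to standard error, so it can be used as a guard before a longer classification run.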
A basic command to classify cpn60 sequences contained in a file called queries.fasta:
java -jar /path/to/RDPTools/classifier.jar classify -c 0.9 -f allrank -t /path/to/cpn60-Classifier_v10_trained/rRNAClassifier.properties -o output.txt queries.fasta
See the README here for more details on the RDP Classifier: https://github.com/rdpstaff/classifier
To train the Classifier
Download cpn60-Classifier_v10_training.tar.gz and unpack it. The resulting directory should include:
- refseqs_v10.fasta
- taxonomytable_v10.txt
Other scripts needed (from https://github.com/GLBRC-TeamMicrobiome/python_scripts, with a minor edit to addFullLineage.py to fix an error):
- addFullLineage-jh.py
- lineage2taxTrain.py
(If you want to generate your own taxonomy file, see https://pypi.org/project/taxonomy-ranks/)
Make ready-to-train taxonomy:
/path/to/lineage2taxTrain.py taxonomytable_v10.txt > ready2train_taxonomy.txt
Add lineages to fasta sequence definition lines:
/path/to/addFullLineage-jh.py taxonomytable_v10.txt refseqs_v10.fasta > ready2train_refseqs.fasta
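A quick sanity check at this point is to compare the number of FASTA records before and after adding lineages. This sketch assumes the script is expected to write one output record per input record, which the release notes do not state; count_fasta_records is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records by counting definition lines
# (lines that start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Compare counts only when both files are present in the current directory.
if [ -f refseqs_v10.fasta ] && [ -f ready2train_refseqs.fasta ]; then
    before=$(count_fasta_records refseqs_v10.fasta)
    after=$(count_fasta_records ready2train_refseqs.fasta)
    if [ "$before" -ne "$after" ]; then
        echo "record count changed: $before -> $after" >&2
    fi
fi
```

A mismatch does not necessarily mean the output is wrong, but it is worth investigating before training.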
Now train:
java -jar /path/to/RDPTools/classifier.jar train -o training_files -s ready2train_refseqs.fasta -t ready2train_taxonomy.txt
The resulting directory (training_files) contains the trained classifier except for one important file, rRNAClassifier.properties, which you must add manually.
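One way to supply the missing rRNAClassifier.properties file is to copy the one shipped with the trained classifier download. This is a sketch under the assumption that the same properties file is valid for a newly trained directory, because the other file names match; add_properties is a hypothetical helper and the paths are placeholders.

```shell
# Hypothetical helper: copy an existing properties file into a freshly
# trained directory, checking that both arguments exist first.
add_properties() {
    src="$1"    # existing rRNAClassifier.properties
    dest="$2"   # output directory from the train step
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    if [ ! -d "$dest" ]; then
        echo "no such directory: $dest" >&2
        return 1
    fi
    cp "$src" "$dest/rRNAClassifier.properties"
}

# Example (paths are placeholders):
# add_properties /path/to/cpn60-Classifier_v10_trained/rRNAClassifier.properties training_files
```

After copying, the training_files directory can be passed to the classify command through its rRNAClassifier.properties path, in the same way as the downloaded trained classifier.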