saffrontree-0.1.2.tar.gz (1.24 MB)

SaffronTree: Fast, reference-free pseudo-phylogenomic trees from reads or contigs

software

posted on 2017-05-03, 07:45 authored by Andrew Page, Martin Hunt, Torsten Seemann, Jacqueline A. Keane

When defining bacterial populations through whole genome sequencing (WGS), the samples often have unknown evolutionary histories. With the increased use of next-generation WGS in routine diagnostics, surveillance and epidemiology, a vast amount of short read data is available, with phylogenetic trees (dendrograms) used to visualise the relationships and similarities between samples. Standard reference-based and assembly-based methods can take substantial amounts of time to generate these phylogenetic relationships, with the computation time often exceeding the time taken to sequence the samples in the first place. Faster methods can loosely classify samples into known taxonomic categories; however, the loss of granularity obscures the relationships between samples. This can be the difference between ruling a sample in or out of an outbreak, which is a clinically important finding for genomic epidemiologists. Other methods [@Boratyn2014] are closed source, which prevents independent scrutiny.
SaffronTree compares the k-mer profiles of samples to rapidly construct a tree, directly from raw reads in FASTQ format or contigs in FASTA format. It supports NGS data (such as Illumina), third-generation long read data (PacBio/Nanopore) and assembled sequences (FASTA). Firstly, a k-mer count database is constructed for each sample using KMC. Next, the intersection of the k-mer databases is found for each pair of samples, with the number of k-mers in common recorded in a distance matrix. Finally, the distance matrix is used to construct a UPGMA tree in Newick format. This tree method was chosen because it is fast; however, the final result is of lower quality than that of slower methods which perform ancestral sequence reconstructions. The computational complexity of the algorithm is O(N^2), where N is the number of samples, so it is best suited to datasets of fewer than 50 samples. This can give rapid insights into small datasets in minutes, rather than hours. SaffronTree provides better granularity than MLST as it uses more of the underlying genome, can operate at low depth of coverage, is reference free, species agnostic, and has a low memory requirement.
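
The three steps described above (per-sample k-mer profiles, pairwise intersections, UPGMA tree in Newick format) can be illustrated with a minimal Python sketch. This is not SaffronTree's own code: in-memory Python sets stand in for the KMC databases, the function names are hypothetical, and the conversion from a shared k-mer count to a distance (one minus the shared fraction of the smaller profile) is an assumption made only for illustration.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Distinct k-mers of a sequence (a stand-in for a KMC database)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_distances(samples, k):
    """Pairwise distances derived from the number of shared k-mers.

    Assumption: distance = 1 - shared / size of the smaller profile.
    """
    profiles = {name: kmer_set(seq, k) for name, seq in samples.items()}
    dist = {}
    for a, b in combinations(sorted(profiles), 2):
        shared = len(profiles[a] & profiles[b])
        smaller = min(len(profiles[a]), len(profiles[b]))
        dist[frozenset((a, b))] = 1.0 - shared / smaller if smaller else 1.0
    return dist


def upgma_newick(names, dist):
    """Build a UPGMA tree from pairwise distances; return a Newick string."""
    # Each cluster label maps to (newick fragment, member count, node height).
    clusters = {n: (n, 1, 0.0) for n in names}
    dist = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by label order).
        pair = min(
            (frozenset(p) for p in combinations(sorted(clusters), 2)),
            key=lambda p: (dist[p], sorted(p)),
        )
        a, b = sorted(pair)
        (na, sa, ha), (nb, sb, hb) = clusters.pop(a), clusters.pop(b)
        height = dist[pair] / 2.0
        merged = f"({na}:{height - ha:.4f},{nb}:{height - hb:.4f})"
        label = f"{a}|{b}"
        # Size-weighted average distance to every remaining cluster.
        for c in clusters:
            da, db = dist[frozenset((a, c))], dist[frozenset((b, c))]
            dist[frozenset((label, c))] = (sa * da + sb * db) / (sa + sb)
        clusters[label] = (merged, sa + sb, height)
    (newick, _, _), = clusters.values()
    return newick + ";"


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    toy = {"A": "ACGTACGTGGA", "B": "ACGTACGTGGT", "C": "TTTTCCCCGGGG"}
    print(upgma_newick(sorted(toy), shared_kmer_distances(toy, k=4)))
```

Every pair of samples is compared once, so N samples need N(N-1)/2 intersections (1,225 for 50 samples), which is where the quadratic cost comes from.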

Funding

Wellcome Trust

Keywords

bioinformatics analysis Phylogenomics newick Bioinformatics Software

Licence

CC BY 4.0
