figshare
Browse
kmc_bioinformatics.pdf (403.54 kB)

Turtle: Identifying frequent k-mers with cache-efficient algorithms

Download (0 kB)
journal contribution
posted on 2013-09-07, 15:58 authored by Rajat RoyRajat Roy, Debashish Bhattacharya, Alexander Schleip

We present a novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high coverage libraries and large genomes such as human. Our method is designed to minimize cache-misses in a cache-efficient manner by using a Pattern-blocked Bloom filter to remove infrequent k-mers from consideration in combination with a novel sort-and-compact scheme, instead of a Hash, for the actual counting. While this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant can resort to a counting Bloom filter for even larger savings in memory at the expense of false negatives in addition to the false positives common to all Bloom filter based approaches. A comparison to the state-of-the-art shows reduced memory requirements and running times.

History

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC