Comparing algorithms to genotype short tandem repeats in next-generation sequencing data
Short tandem repeats (STRs) are short (2-6 bp) DNA sequences repeated in tandem, which make up approximately 3% of the human genome. These loci mutate frequently and are highly polymorphic. Dozens of neurological and developmental disorders have been attributed to STR expansions. STRs have also been implicated in a range of functions such as DNA replication and repair, chromatin organisation and regulation of gene expression.
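As a minimal illustration of the definition above, the following Python sketch scans a DNA string for perfect tandem repeats with a 2-6 bp unit. The function name `find_strs`, the `min_copies` threshold and the example sequence are assumptions made for illustration only; they are not taken from any of the tools compared here, and real STR callers also handle imperfect repeats and sequencing error.

```python
import re

def find_strs(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Illustrative only: finds exact repeats, ignores interruptions,
    and skips units made of a single base (homopolymer runs).
    """
    # Lazy quantifier on the unit so the shortest repeating unit is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # e.g. "AA" repeated is a homopolymer, not a 2-6 bp STR unit
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Hypothetical example sequence containing four copies of CAG
example = "TTGACAGCAGCAGCAGTTGA"
print(find_strs(example))
```

Each hit reports the 0-based start position, the repeat unit and the number of copies, which is the quantity an STR genotyper ultimately tries to estimate for each allele.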
Traditionally, STR variation has been measured using capillary gel electrophoresis. This process is time-consuming and expensive, and so has tended to limit STR analysis to a handful of loci.
Next-generation sequencing has the potential to address these problems. However, determining STR lengths using next-generation sequencing data is difficult. For example, many callers are limited by sequencing read length, and polymerase slippage during PCR amplification introduces stutter noise.
Recently, a small number of software tools have been developed to genotype STRs in next-generation sequencing data. We have performed a general comparison of the tools published to date, identifying their application domains, assumptions and limitations.
We have assessed the performance of some of the most popular STR genotyping tools on human next-generation sequencing data. When comparing STR callers, we observed drastic differences in which STR loci are identified as variant. Surprisingly, even for variant loci reported by more than one tool, concordance between the specific genotype calls is markedly low.
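One simple way to quantify this kind of agreement is sketched below in Python. It assumes that each caller's output has already been reduced to a mapping from locus to a pair of allele repeat counts; the function name `genotype_concordance`, the locus labels and the genotypes are hypothetical, and concordance is defined here as the fraction of shared loci with identical unordered genotypes, which is only one of several possible definitions and not necessarily the one used in this comparison.

```python
def genotype_concordance(calls_a, calls_b):
    """Compare two callers' genotype calls.

    calls_a, calls_b: dicts mapping a locus identifier to a pair of
    allele lengths (in repeat units). Returns (n_shared, concordance),
    where concordance is None if the callers share no loci.
    """
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0, None
    # Genotypes are unordered, so (10, 12) and (12, 10) count as a match.
    matching = sum(
        1 for locus in shared
        if sorted(calls_a[locus]) == sorted(calls_b[locus])
    )
    return len(shared), matching / len(shared)

# Hypothetical calls from two callers (illustrative data, not real results)
caller_a = {"chr1:100": (10, 12), "chr2:200": (5, 5), "chr3:300": (7, 9)}
caller_b = {"chr1:100": (12, 10), "chr2:200": (5, 6), "chr4:400": (8, 8)}

n_shared, concordance = genotype_concordance(caller_a, caller_b)
print(n_shared, concordance)
```

Reporting the number of shared loci alongside the concordance keeps the two effects separate: callers can disagree both on which loci are variant and on the genotype at loci they both report.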
Finally, we draw together our findings to comment on the considerations when choosing and running an STR genotyping tool, with an emphasis on applications to human disease.