journal.pone.0300545.pdf (1.11 MB)

A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples

journal contribution
posted on 2024-04-17, 13:29 authored by John W Oketch, Louise V Wain, Ed Hollox
Short tandem repeat (STR) variation is an often overlooked source of variation between genomes. STRs comprise about 3% of the human genome and are highly polymorphic. Some cause Mendelian disease, and others affect gene expression. Their contribution to common disease is not well understood, but recent software tools designed to genotype STRs using short-read sequencing data will help address this. Here, we compare software that genotypes common STRs and rarer STR expansions genome-wide, with the aim of applying them to population-scale genomes. By using the Genome-In-A-Bottle (GIAB) consortium and 1000 Genomes Project short-read sequencing data, we compare performance in terms of sequence length, depth, computing resources needed, genotyping accuracy and number of STRs genotyped. To ensure broad applicability of our findings, we also measure genotyping performance against a set of genomes from clinical samples with known STR expansions, and a set of STRs commonly used for forensic identification. We find that HipSTR, ExpansionHunter and GangSTR perform well in genotyping common STRs, including the CODIS 13 core STRs used for forensic analysis. GangSTR and ExpansionHunter outperform HipSTR for genotyping call rate and memory usage. ExpansionHunter denovo (EHdn), STRling and GangSTR outperform STRetch for detecting expanded STRs, and EHdn and STRling use considerably less processor time than GangSTR. Analysis on shared genomic sequence data provided by the GIAB consortium allows future performance comparisons of new software approaches on a common set of data, facilitating comparisons and allowing researchers to choose the software that best fulfils their needs.

Funding

Genomic Epidemiology and Public Health Genomics

Wellcome Trust


National Institute for Health Research (NIHR) Leicester Biomedical Research Centre

History

Citation

Oketch JW, Wain LV, Hollox EJ (2024) A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples. PLoS ONE 19(4): e0300545

Author affiliation

Population Health Sciences

Version

  • VoR (Version of Record)

Published in

PLOS ONE

Volume

19

Issue

4

Publisher

Public Library of Science (PLoS)

issn

1932-6203

eissn

1932-6203

Acceptance date

2024-02-27

Copyright date

2024

Available date

2024-04-17

Editors

Gagniuc PA

Spatial coverage

United States

Language

en

Deposited by

Professor Ed Hollox

Deposit date

2024-04-11

Data Access Statement

Code for the analyses and the full set of genotype calls at the clinical and forensic loci are available at https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, HipSTR and ExpansionHunter for the Genome in a Bottle samples are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ and https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, ExpansionHunter and HipSTR for the 1000 Genomes samples used are available at https://doi.org/10.25392/leicester.data.22041020.

Rights Retention Statement

  • No
