Hybrid Enterobacteriaceae assemblies using PacBio+Illumina or ONT+Illumina sequencing

dataset

posted on 2019-08-20, 09:24 authored by Liam Shaw, Nicola De Maio, The REHAB Consortium

Data associated with: De Maio, Shaw, et al. on behalf of the REHAB consortium (2019), Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes. bioRxiv 530824

Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods affect assembly accuracy.

In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either the Oxford Nanopore Technologies (ONT) or the Pacific Biosciences (PacBio) SMRT sequencing platform.

This set of files includes all hybrid assemblies produced using Unicycler with different sequencing approaches and strategies. Each isolate has 8 hybrid assemblies = 4 x ONT-Illumina + 4 x PacBio-Illumina. There are a total of 158 hybrid assemblies from the full data as two assemblies did not finish (8x20 - 2 = 160 - 2 = 158). Additionally, there are

Assemblies were produced from different long read preparation strategies.

Hybrid assemblies with Unicycler (n1 = 158):

• Basic: no filtering or correction of reads (i.e. all long reads available used for assembly).
• Corrected: long reads were error-corrected and subsampled (preferentially selecting longest reads) to 30-40x coverage using Canu (v1.5, https://github.com/marbl/canu) with default options.
• Filtered: long reads were filtered using Filtlong (v0.1.1, https://github.com/rrwick/Filtlong), using Illumina reads as an external reference for read quality and either removing the worst 10% of reads or retaining 500 Mbp in total, whichever resulted in fewer reads. We also removed reads shorter than 1 kb and used the --trim and --split 250 options.
• Subsampled: we randomly subsampled long reads to leave approximately 600Mbp (corresponding to a long read coverage around 100x).
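The relationship between the 600 Mbp target and ~100x coverage in the "Subsampled" strategy can be illustrated with a minimal Python sketch. This is not the tool or script used to produce the assemblies; the function names, the fixed seed, and the 6 Mbp genome size (implied by 600 Mbp at roughly 100x, not stated in this record) are assumptions made for illustration.

```python
import random


def coverage(total_bases: int, genome_size: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def subsample_reads(read_lengths, target_bases, seed=0):
    """Pick reads in random order until roughly target_bases are kept.

    Hypothetical helper: returns the indices of the kept reads. If the
    input holds fewer bases than the target, every read is kept.
    """
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)  # random order, so no preference for long reads

    kept = []
    total = 0
    for i in order:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return kept


# Assumed genome size of ~6 Mbp, consistent with 600 Mbp at ~100x.
depth = coverage(600_000_000, 6_000_000)

# Toy input: 1000 reads of 10 kb each, subsampled to a 600 kb target.
toy_lengths = [10_000] * 1000
toy_kept = subsample_reads(toy_lengths, 600_000)
```

Because reads are taken in shuffled order, the read length distribution of the subsample should resemble that of the full set, unlike the "Corrected" strategy, which preferentially keeps the longest reads.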

Long-read only assemblies (n2 = 20 x 2 x 2 = 80):

• Flye: we ran Flye (https://github.com/fenderglass/Flye) with the options --plasmids --meta, which have been shown to improve the assemblies of plasmids in bacterial genomes (see: https://github.com/rrwick/Long-read-assembler-comparison)
• Pilon: the Flye assemblies were then polished with Illumina short-reads using Pilon (https://github.com/broadinstitute/pilon).

Assembly file names have the following format:

${sample-name}_${preparation-strategy}_${long-read-sequencing}.fasta

e.g. for sample CFT073 the filtered PacBio-Illumina assembly is: CFT073_filtered_pacbio.fasta
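For scripted access to the files, the naming format above can be parsed with a short Python helper. This is a sketch under stated assumptions: the helper names are hypothetical, the split is taken from the right in case a sample name itself contains underscores, and only the example file name given above is used. The 10X file names are not handled here, since their exact form is not spelled out in this record.

```python
from pathlib import Path
from typing import NamedTuple


class AssemblyName(NamedTuple):
    sample: str
    strategy: str
    platform: str


def parse_assembly_name(filename: str) -> AssemblyName:
    """Split '<sample>_<strategy>_<platform>.fasta' into its three fields."""
    name = Path(filename).name
    suffix = ".fasta"
    if not name.endswith(suffix):
        raise ValueError(f"not a .fasta file: {filename}")
    stem = name[: -len(suffix)]
    # Split from the right so underscores inside the sample name survive.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected file name format: {filename}")
    return AssemblyName(*parts)


def build_assembly_name(sample: str, strategy: str, platform: str) -> str:
    """Inverse of parse_assembly_name."""
    return f"{sample}_{strategy}_{platform}.fasta"


parsed = parse_assembly_name("CFT073_filtered_pacbio.fasta")
rebuilt = build_assembly_name(*parsed)
```

Here `parsed` holds the sample, preparation strategy, and long-read platform as separate fields, and `rebuilt` reproduces the original file name.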

Also included are assemblies produced after subsampling long-read data to ~10X genome coverage for the following strategies: "basic" (hybrid) and long-read ("flye" and "pilon"). There are n3 = 20 x 3 x 2 = 120 of these assemblies. These have a '10X' preceding the preparation strategy.

The total number of assemblies is n1+n2+n3=158+80+120=358.
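The counts above can be re-derived in a few lines of Python, which may be useful as a sanity check after downloading the files. The variable names are illustrative; the numbers are those stated in this description.

```python
N_ISOLATES = 20
N_PLATFORMS = 2          # ONT and PacBio
N_HYBRID_STRATEGIES = 4  # basic, corrected, filtered, subsampled
N_LONG_READ_METHODS = 2  # flye, pilon
N_10X_STRATEGIES = 3     # basic, flye, pilon
N_UNFINISHED = 2         # hybrid assemblies that did not finish

# Hybrid assemblies from the full data.
n1 = N_ISOLATES * N_HYBRID_STRATEGIES * N_PLATFORMS - N_UNFINISHED
# Long-read only assemblies.
n2 = N_ISOLATES * N_LONG_READ_METHODS * N_PLATFORMS
# Assemblies from long reads subsampled to ~10X coverage.
n3 = N_ISOLATES * N_10X_STRATEGIES * N_PLATFORMS

total = n1 + n2 + n3
print(n1, n2, n3, total)
```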

Also included is a pdf of supplementary figures and an Excel spreadsheet of supplementary tables.

See the associated preprint for more details: https://doi.org/10.1101/530824 and the published article in Microbial Genomics (currently in press).

Funding

This work was funded by the Antimicrobial Resistance Cross-council Initiative supported by the seven research councils [NE/N019989/1].

The following authors of the attached preprint: Crook, George, Peto, Sheppard, Walker are all affiliated to the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England (PHE) [grant HPRU-2012-10041]. The views expressed in the attached preprint are those of the author(s) and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England.

This work was supported by the NIHR Oxford Biomedical Research Centre.

Keywords

enterobacteriaceae; hybrid assembly; pacbio; illumina; oxford nanopore; minion; long read sequencing; smrt; Microbiology; Bioinformatics

Licence

CC BY 4.0