Exploiting Public Human Genome NGS Datasets to Characterize Repetitive DNA and Recover Assembly Gaps
Thesis posted on 03.07.2018, 08:45 by Denye Nathaniel Ogeh
With the advent of Next Generation Sequencing (NGS), enormous volumes of short-read sequence data can now be generated cheaply and on short time scales. However, the quality of genome assemblies generated using NGS technologies has suffered compared with that of assemblies generated using Sanger DNA sequencing. This is largely due to the inability of short-read sequence data alone to scaffold repetitive structures, which creates gaps, inversions and rearrangements and ultimately results in assemblies that are, at best, drafts (that is, preliminary assemblies that require further work to become a more complete and accurate representation of the genome). Single molecule long-read sequencing (SMS) technologies, on the other hand, address this challenge by generating sequences with greatly increased read lengths, offering the prospect of better recovering these complex repetitive structures and thereby improving assembly quality. Following this development, we evaluate the ability of SMS data (specifically Pacific Biosciences SMRT data and Oxford Nanopore MinION data from human genomes) to recover poorly represented repetitive sequences (specifically, GC-rich human minisatellites), identify novel transposable element insertions and enable the closing of gapped regions. Our results show that poorly represented repetitive sequences (specifically, minisatellites and L1s) and other elements missing from published human genome assemblies can be characterized from single molecule long-read data using custom software that we developed, which scales to the analysis of single molecule long reads (particularly those from Pacific Biosciences’ SMRT technology). The tool is cross-platform, giving computational and non-computational biologists a straightforward, less technical approach to the local analysis of specific poorly characterized DNA sequences.