Improving Select Applications of Long-Read DNA Sequencing
Dunn, Timothy
2024
Abstract
The cost to sequence a human genome has dropped from an estimated $300 million to well under $1,000 over the past two decades. In fact, several companies, which can amortize sequencing cost by running hundreds of samples in parallel, have recently claimed to have reached the $100 human genome. Following such dramatic cost improvements, whole genome sequencing is starting to be used regularly for cancer profiling, rare genetic disease detection, agricultural breeding, pathogen detection, microbiome bacterial abundance estimation, evolutionary biology research, personalized medicine development, and much more. As sequencing costs continue to drop, as accuracy improves, and as new applications are discovered, DNA sequencing will become increasingly ubiquitous.
As sequencing costs fall, new sequencing technologies have emerged that offer greater capabilities than ever before. In particular, nanopore-based long-read sequencing has no theoretical limit on the length of a contiguous DNA sequence, or “read”, that can be measured. In comparison to the short-read sequencing technologies that have dominated the sequencing market thus far (with read lengths typically limited to a few hundred base pairs), nanopore devices have sequenced entire bacterial chromosomes in a single read. The current nanopore read length record stands at over four million bases. Longer reads cause fewer problems during read mapping and genome assembly, allowing insight into complex regions of the genome and types of genetic variation that have historically been under-studied. Although nanopore devices were limited to approximately 80% per-base accuracy when first publicly released in 2015, accuracy has risen to over 99% in recent years with the adoption of deep learning basecallers.
Nanopore-based sequencing devices are also the first to come in a portable handheld form factor and to offer real-time analysis of raw data as it is being recorded, further expanding potential use cases. Despite its promising future, long-read sequencing comes with its own set of challenges. In this thesis, I explore several different applications of long-read DNA sequencing and improve upon current methodologies in this new field. First, we present a hardware-accelerated filter that directly analyzes nanopore sequencer output in real time to filter non-viral reads, enabling cheaper detection of pathogenic viruses. Next, we introduce a novel read alignment algorithm that enables more consistent alignment of long reads in highly repetitive areas of the genome, and demonstrate that this improves recall for tandem repeat variant calling. Then, we analyze the design space for complex variant representation and present a new variant calling benchmarking tool that accurately and stably measures performance regardless of the representation of reported variants. Last, we extend this benchmarking tool to jointly evaluate small and structural variants, and demonstrate that doing so results in improved measured performance and enables more accurate phasing analyses.
Subjects
Long Read Sequencing; Nanopore Sequencing; Whole Genome Sequencing; Alignment; Variant Calling; Benchmarking
Types
Thesis