Statistical and Computational Methods for the Unified Analysis of Short Genetic Variants
Tan, Adrian
2019
Abstract
High-throughput sequencing technologies underpin the development of personalized medicine by allowing us to identify the genetic variants of an individual in a short period of time. However, many computational and statistical challenges remain before variants can be detected accurately from the tremendous amount of sequence data, which is complicated by sequencing errors and alignment artifacts. While many different variant callers for single nucleotide polymorphisms (SNPs) produce concordant results, existing variant callers for short insertions or deletions (Indels) often show high (>50%) discordance between algorithms. Two major reasons are that (1) unlike SNPs, Indels are often represented in different ways by different callers, and (2) short tandem repeats (STRs), which constitute a large fraction of Indels, are not detected and represented in a consistent way because of their multi-allelic nature and frequent inexact repeats. While a large fraction of Indels are isolated in uniquely mappable regions of the genome, another large fraction (~50%) are located in repetitive regions of the genome, mostly in the form of STRs, often with inexact repeat units. The spectrum of Indels encompasses these two extreme forms, and detection becomes progressively more difficult as the length of the Indel increases and as nearby sequence diversity decreases. To call STR-like variants effectively and robustly, an Indel caller must be aware of the inherently heterogeneous nature of Indels. The first chapter of this thesis proposes a unified representation of genetic variants. A variant can be represented in the standard Variant Call Format (VCF) in multiple ways, but no standard algorithm for a unified representation has been proposed. We propose a simple algorithm to normalize the representation of genetic variants in an unambiguous and principled way.
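The normalization idea described above can be illustrated with a minimal sketch. The abstract does not spell out the thesis algorithm, so the code below shows one common formulation of normalization (left-align the variant, then trim shared bases so the alleles are as short as possible) and should be read as an assumption, not as the author's exact method. The function name `normalize_variant`, its parameters, and the toy reference sequence are hypothetical.

```python
def normalize_variant(ref_seq, pos, alleles):
    """Left-align and trim a variant (illustrative sketch, not the thesis code).

    ref_seq : reference sequence as a string
    pos     : 1-based position of the first base of the REF allele
    alleles : [REF, ALT1, ...], all non-empty, at least two distinct
              (this precondition is assumed, not validated)
    """
    alleles = list(alleles)

    # Step 1: drop a shared rightmost base; whenever that empties an allele,
    # extend every allele one base to the left using the reference.
    while len({a[-1] for a in alleles}) == 1:
        if min(len(a) for a in alleles) == 1 and pos == 1:
            break  # nothing to the left of the first reference base
        alleles = [a[:-1] for a in alleles]
        if any(len(a) == 0 for a in alleles):
            pos -= 1
            left_base = ref_seq[pos - 1]  # convert 1-based pos to 0-based index
            alleles = [left_base + a for a in alleles]

    # Step 2: drop a shared leftmost base while every allele keeps >= 1 base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Toy reference with a CA repeat at positions 4-11 (hypothetical data).
REF_SEQ = "GGGCACACACAGGG"

# Two different descriptions of the same 2-bp deletion inside the repeat.
print(normalize_variant(REF_SEQ, 7, ["ACA", "A"]))    # (3, ['GCA', 'G'])
print(normalize_variant(REF_SEQ, 4, ["CACA", "CA"]))  # (3, ['GCA', 'G'])
```

Both inputs describe the same deletion inside the repeat, and both reduce to the same position and alleles, which is the property that lets variant calls from different callers be compared directly.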
Applying our normalization algorithm shows that 14% of non-SNP variants in dbSNP141 are unnormalized, and 4.6% of them are redundant. The second chapter describes a novel algorithm to discover genetic variants from aligned sequence data. Our algorithm introduces a repeat-aware hidden Markov model (HMM), which behaves like a hidden regular expression and allows us to robustly detect appropriate flanking sequences for Indels in the presence of inexact repeat units. We also present a suffix tree solution to detect candidate repeat motifs in the presence of STRs. The novelty of this approach lies in its top-down strategy of detecting variants from short tandem repeats to short Indels to SNPs, and in the explicit contextual filtering of variants in the presence of nearby variants. The third chapter extends the hidden regular expression model described in the second chapter to perform the genotyping of variants. Our method models sequence reads with an unspecified number of repeats for STRs, achieving robust genotyping of multi-allelic STRs, and can also be applied to isolated Indels. Our method also robustly accounts for unexpected alleles when modeling STRs. It is also computationally efficient because it requires only one HMM run per Indel, instead of aligning the reads for each possible allele separately.
Subjects
variant calling
Types
Thesis