Statistical and Computational Methods for the Unified Analysis of Short Genetic Variants

Tan, Adrian

2019

View/Open

atks_1.pdf (2.4MB PDF)

Abstract

High-throughput sequencing technologies underpin the development of personalized medicine by allowing us to identify the genetic variants of an individual in a short period of time. However, many computational and statistical challenges remain to be solved for the accurate detection of variants from the tremendous amount of sequence data, which is complicated by sequencing errors and alignment artifacts. While many different genetic variant callers for single nucleotide polymorphisms (SNPs) produce concordant results, existing variant callers for short insertions or deletions (Indels) often show high (>50%) discordance between algorithms. Two major reasons are that (1) unlike SNPs, Indels are often represented in different ways by different callers, and (2) short tandem repeats (STRs), which constitute a large fraction of Indels, are not detected and represented in a consistent way due to their multi-allelic nature with frequent inexact repeats. While a large fraction of Indels are isolated in uniquely mappable regions of the genome, another large fraction ( 50%) of Indels are located in repetitive regions of the genome, mostly in the form of STRs, often with inexact repeat units. The spectrum of Indels encompasses these two extreme forms, and detection of Indels becomes progressively more difficult as the length of Indels increases and as nearby sequence diversity decreases. To call STR-like variants effectively and robustly, a calling algorithm must be aware of the inherently heterogeneous nature of Indels.

The first chapter of this thesis proposes a unified representation of genetic variants. A variant can be represented in the standard Variant Call Format (VCF) in multiple different ways, but no standard algorithm for unified representation has been proposed. We propose a simple algorithm to normalize the representation of genetic variants in an unambiguous and principled way. Our normalization algorithm demonstrates that 14% of non-SNP variants in dbSNP141 are unnormalized, and 4.6% of them are redundant.

The second chapter describes a novel algorithm to discover genetic variants from aligned sequence data. Our algorithm introduces a novel repeat-aware hidden Markov model (HMM), which behaves like a hidden regular expression that allows us to robustly detect appropriate flanking sequences in the presence of inexact repeat units for Indels. We also present a suffix tree solution to detect candidate repeat motifs in the presence of STRs. The novelty in this approach lies in the top-down approach of detecting variants from short tandem repeats to short Indels to SNPs and the explicit contextual filtering of variants in the presence of nearby variants.

The third chapter extends the hidden regular expression model described in the second chapter to perform the genotyping of variants. Our method models sequence reads with an unspecified number of repeats for STRs, achieving robust genotyping of multi-allelic STRs, and can be applied to isolated Indels too. Our method also robustly accounts for unexpected alleles when modeling STRs. It is also computationally efficient because it requires only one HMM run per indel, instead of aligning the reads for each possible allele separately.
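
To illustrate the representation problem raised for the first chapter, the following is a minimal Python sketch of one widely used way to normalize a variant: trim shared bases on the right, extend to the left when an allele becomes empty, then trim shared bases on the left. The abstract does not spell out the thesis's algorithm, so this should be read as an assumption-laden illustration rather than the author's implementation. The function name `normalize`, its arguments, and the toy reference sequence are hypothetical.

```python
def normalize(pos, alleles, ref_seq):
    """Left-align and trim a variant (illustrative sketch only).

    pos     -- 1-based position of the first base of the alleles (VCF style)
    alleles -- allele strings, reference allele first; they must not all be equal
    ref_seq -- reference sequence as a string; position p is ref_seq[p - 1]
    """
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")

    changed = True
    while changed:
        changed = False
        # Drop the last base while every allele ends with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("variant reaches the start of the reference; "
                                 "not handled in this sketch")
            pos -= 1
            base = ref_seq[pos - 1]
            alleles = [base + a for a in alleles]
            changed = True

    # Drop the first base while all alleles share it and keep at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Toy reference with a CA repeat (hypothetical example data).
REF = "GGGCACACACAGGG"

# Three different ways of writing the same two-base deletion in the repeat.
representations = [
    (6, ["CAC", "C"]),
    (8, ["CAC", "C"]),
    (2, ["GGCA", "GG"]),
]
normalized = [normalize(p, a, REF) for p, a in representations]
print(normalized)
```

In this toy case all three inputs describe the same deletion within the CA repeat, and the sketch maps each of them to a single left-aligned, minimal-length form, which is the sense in which differently written Indels can be compared across callers.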

Subjects

variant calling

Types

Thesis

Handle

https://hdl.handle.net/2027.42/151406

Collections

Dissertations and Theses (Ph.D. and Master's)
