Statistical Methods for Analyzing Population-scale Genomic and Transcriptomic Data

Liu, Andrew

2021

Abstract

The study of genetics is integral to understanding the biology behind our complex traits and can be approached in a variety of ways. Technological advances in genomics have enabled studies of unprecedented scale, which have identified numerous statistical associations between diseases and genes. Recently, studies of gene expression have become an increasingly popular approach to understanding the biological pathways underlying these statistical associations. In this dissertation, I address specific challenges related to the study of gene expression, including meta-imputation of expression across multiple datasets when only summary-level imputation models are available, correcting for technical biases towards reference alleles in array-based expression assays, and identifying tissue-specific and population-specific regulatory variants and trait-associated loci in the context of systems genetics with whole genome sequencing, transcriptomic profiles, morphometric traits, and clinical endpoints. In Chapter 2, I develop a method which leverages multiple datasets to accurately impute tissue-specific gene expression levels. Our method, Smartly Weighted Averaging across Multiple Tissues (SWAM), does not train directly from data, but rather performs a meta-imputation by combining existing imputation models, assigning weights based on their predictive performance and similarity to the tissue of interest. I demonstrate that, when using the same set of resources, SWAM improves imputation accuracy compared to existing approaches that impute tissue-specific expression by training directly from raw data. The major benefit of the SWAM meta-imputation framework is the flexibility to combine multiple pre-trained imputation models trained from privacy-protected raw datasets. Indeed, prediction accuracy improves substantially when multiple datasets are integrated, which highlights the value of combining them.
In Chapter 3, I examine the benefits of using deep whole genome sequencing to empower and refine existing microarray-based eQTL studies. I revisited a well-known hybridization bias that arises in microarray studies and is caused by genetic polymorphisms within target probe sequences. In this chapter, I used genetic variants from whole genome sequencing to accurately identify and characterize this bias at both the probe and probeset level. I evaluated several approaches to account for hybridization bias, including methods to remove variant-overlapping probes and a novel method to adjust for hybridization bias in each probe. I demonstrate that accounting for variant-overlapping probes when quantifying expression levels reduces reference bias and false positives in cis-eQTL analyses. I also demonstrate that adjusting for hybridization bias with deeply sequenced genomes is ideal for avoiding reference bias, although leveraging publicly available variant catalogues such as the 1000 Genomes data provides comparable benefits. In Chapter 4, I performed a systems genetics study of Pima Native Americans enrolled in a diabetic nephropathy study. I integrated whole genome sequences, transcriptomic profiles, and morphometric traits derived from two micro-dissected renal compartments (glomerular and tubulointerstitial) with clinical phenotypes to identify significant associations between these molecular and complex traits. I identified thousands of eQTLs, including kidney-specific and population-specific eQTLs. I also identified many transcriptional associations with morphometric and clinical phenotypes that were enriched for kidney-specific biological pathways. Moreover, using dimensionality reduction techniques, I identified genome-wide significant genetic associations with a morphometric trait (podocyte volume) and with a composite trait representing albumin-creatinine ratio and glomerular surface volume.
Studying this unique and richly phenotyped cohort revealed many population- and tissue-specific regulatory variants, genes, and pathways implicated in renal disease progression.
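
The abstract describes SWAM as combining pre-trained imputation models with weights that reflect their predictive performance and similarity to the tissue of interest. The Python sketch below illustrates only that general weighted-averaging idea and is not the dissertation's actual weighting scheme. It assumes, purely for illustration, that each model's weight is its non-negative Pearson correlation with measured expression in the target tissue, normalized to sum to 1. The function names and toy numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def model_weights(model_preds, measured):
    """Hypothetical weighting rule (an assumption, not SWAM's own):
    non-negative correlation with measured target-tissue expression,
    normalized so the weights sum to 1."""
    scores = [max(0.0, pearson(p, measured)) for p in model_preds]
    total = sum(scores)
    if total == 0:
        # No model is informative; fall back to equal weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def meta_impute(new_preds, weights):
    """Weighted average of the per-model predictions for each new individual."""
    n = len(new_preds[0])
    return [sum(w * p[i] for w, p in zip(weights, new_preds)) for i in range(n)]


# Toy data (made up): three pre-trained models, four reference individuals
# with measured expression in the tissue of interest.
measured = [1.0, 2.0, 3.0, 4.0]
model_preds = [
    [1.1, 1.9, 3.2, 3.8],  # tracks the target tissue closely
    [2.0, 2.0, 3.0, 3.0],  # tracks it loosely
    [4.0, 3.0, 2.0, 1.0],  # anti-correlated, so it receives weight 0
]
weights = model_weights(model_preds, measured)

# Predictions from the same three models for two new individuals.
imputed = meta_impute([[2.0, 3.0], [2.5, 2.5], [1.0, 0.0]], weights)
```

Because the sketch needs only each model's predictions and never the raw training data, it mirrors the property the abstract highlights: models trained on privacy-protected datasets can still be combined.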

Deep Blue DOI

https://dx.doi.org/10.7302/3061

Subjects

statistical genetics

gene expression

Types

Thesis

Handle

https://hdl.handle.net/2027.42/170016

Collections

Dissertations and Theses (Ph.D. and Master's)
