Computational and Statistical Approaches for Large-Scale Genome-Wide Association Studies

Zhou, Wei

2018

Abstract

Over the past decade, genome-wide association studies (GWAS) have successfully shed light on the genetic variation underlying the risk of complex human diseases, knowledge that can be translated into novel preventive and therapeutic strategies. My research aims to identify novel disease-associated genetic variants through large-scale GWAS and to develop computational and statistical pipelines and methods that improve the power and accuracy of GWAS.

Bicuspid aortic valve (BAV) is a congenital heart defect characterized by fusion of two of the normal three leaflets of the aortic valve. As the most common cardiovascular malformation in humans, BAV is moderately heritable and is an important risk factor for valvulopathy and aortopathy, but its genetic origins remain elusive. In Chapter 2, we present the first large-scale GWAS to identify novel genetic variants associated with BAV. We report association with a non-coding variant 151 kb from the gene encoding the cardiac-specific transcription factor GATA4, and near-significant association for p.Ser377Gly in GATA4. We used multiple bioinformatics approaches to demonstrate that GATA4 is a plausible biological candidate. In the subsequent functional follow-up, GATA4 was disrupted using CRISPR-Cas9 in induced pluripotent stem cells from healthy donors. The disruption of GATA4 significantly impaired the transition of endothelial cells into mesenchymal cells, a critical step in heart valve development.

Genotype imputation is widely used in GWAS to perform in silico genotyping, which increases the power to identify novel genetic signals. When multiple reference panels cannot be merged because of consent restrictions, it is unclear how to combine the imputation results to optimize the power of genetic association tests.
In Chapter 3, we compared the accuracy of 9,265 Norwegian genomes imputed from three reference panels: 1000 Genomes Phase 3 (1000G), the Haplotype Reference Consortium (HRC), and a reference panel containing 2,201 Norwegian participants from the HUNT study with low-pass genome sequencing. We observed that the overall imputation accuracy from the population-specific panel was substantially higher than that from 1000G and comparable with that from HRC, despite HRC being 15-fold larger. We also evaluated different strategies for using multiple sets of imputed genotypes to increase the power of association studies. We propose that testing association for all variants imputed from any panel yields higher power to detect association than the alternative strategy of testing only the version of each genetic variant with the highest imputation quality metric.

In phenome-wide GWAS by large biobanks, most binary traits have substantially fewer cases than controls. Both widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly in the analysis of phenotypes with unbalanced case-control ratios, producing inflated type I error rates. In Chapter 4, we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It uses state-of-the-art optimization strategies to reduce the computational time and memory cost of generalized mixed models. The computational cost scales linearly with sample size, so the method is applicable to GWAS of thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data from 408,961 white British European-ancestry samples for 1,403 dichotomous phenotypes, we show that SAIGE can efficiently analyze large-sample data while controlling for unbalanced case-control ratios and sample relatedness.
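
The saddlepoint calibration mentioned in the abstract can be illustrated with a minimal sketch. This is not SAIGE's implementation: it assumes independent Bernoulli outcomes with known fitted probabilities `mu`, non-negative genotypes `g`, and no random effects or covariate adjustment. The function names (`cgf`, `spa_z`, `two_sided_p`) and the toy data are hypothetical, and the tail formula used is the standard Barndorff-Nielsen form of the saddlepoint approximation.

```python
import math


def normal_sf(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def cgf(t, g, mu):
    """Return K(t), K'(t), K''(t) for S = sum_i g_i * (y_i - mu_i),
    where y_i ~ Bernoulli(mu_i) independently (a simplifying assumption)."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        if gi == 0.0:
            continue  # samples with genotype 0 do not contribute to S
        e = math.exp(gi * t)
        d = 1.0 - mi + mi * e
        k0 += math.log(d) - t * gi * mi
        k1 += gi * mi * e / d - gi * mi
        k2 += gi * gi * mi * (1.0 - mi) * e / (d * d)
    return k0, k1, k2


def spa_z(s, g, mu, tol=1e-10):
    """Saddlepoint-adjusted z-score for S at value s (assumes all g_i >= 0)."""
    s_min = -sum(gi * mi for gi, mi in zip(g, mu))
    s_max = sum(gi * (1.0 - mi) for gi, mi in zip(g, mu))
    if s <= s_min:  # at or below the support of S: no saddlepoint exists
        return -math.inf
    if s >= s_max:  # at or above the support of S: no saddlepoint exists
        return math.inf
    # K'(t) is increasing in t, so bisection finds the root of K'(t) = s.
    lo, hi = -1.0, 1.0
    while cgf(lo, g, mu)[1] > s:
        lo *= 2.0
    while cgf(hi, g, mu)[1] < s:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    k0, _, k2 = cgf(t, g, mu)
    if abs(t) < 1e-6:
        # Near the mean the formula is numerically unstable; use the normal z.
        return s / math.sqrt(k2)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return w + math.log(v / w) / w


def two_sided_p(s, g, mu):
    """Two-sided p-values as a pair: (saddlepoint, normal approximation)."""
    a = abs(s)
    p_spa = normal_sf(spa_z(a, g, mu)) + normal_sf(-spa_z(-a, g, mu))
    variance = cgf(0.0, g, mu)[2]  # K''(0) is the null variance of S
    p_norm = 2.0 * normal_sf(a / math.sqrt(variance))
    return p_spa, p_norm


# Toy data (hypothetical): 2,000 samples with a 1% fitted case probability and
# 20 carriers of a rare variant, 3 of whom are cases. Outcomes of non-carriers
# do not enter S because their genotype is 0.
n, n_carriers, n_carrier_cases = 2000, 20, 3
mu = [0.01] * n
g = [1.0] * n_carriers + [0.0] * (n - n_carriers)
y = [1.0] * n_carrier_cases + [0.0] * (n - n_carrier_cases)
s_obs = sum(gi * (yi - mi) for gi, yi, mi in zip(g, y, mu))
p_spa, p_norm = two_sided_p(s_obs, g, mu)
print(f"SPA p = {p_spa:.3g}, normal p = {p_norm:.3g}")
```

In this toy setting the normal approximation returns a p-value several orders of magnitude smaller than the saddlepoint value, which illustrates the kind of miscalibration the abstract attributes to unbalanced case-control ratios.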

Subjects

Genome-wide association studies

Bicuspid aortic valve

Genotype imputation

Logistic mixed models

Saddle point approximation

Large biobank

Types

Thesis

Handle

https://hdl.handle.net/2027.42/144097

Collections

Dissertations and Theses (Ph.D. and Master's)
