Genetic Association and Prediction Methods for Biobank Data

Zhao, Zhangchen

2021

Abstract

With large sample sizes, population-based cohorts and biobanks provide an exciting opportunity to identify genetic components of complex traits. For example, UK Biobank provides genome-wide genotyping data for 500,000 volunteer participants, an invaluable resource for detecting genetic associations and building prediction models of genetic effects. In the first two projects, we focus on discovery-type questions and develop robust region-based tests of genetic association. In the third project, we target a translation-type question and develop a multi-ethnic prediction method.

In the first project, we propose SKAT/SKAT-O type region-based tests to account for unbalanced case-control ratios. In biobank data analysis, most binary phenotypes have unbalanced case-control ratios, which can cause inflation of type I error rates. Recently, a saddlepoint approximation (SPA) based single variant test has been developed to provide an accurate and scalable method to test for associations of such phenotypes. For gene- or region-based multiple variant tests, a few methods exist that can adjust for unbalanced case-control ratios; however, these methods are either less accurate when case-control ratios are extremely unbalanced or not scalable for large data analyses. To address these problems, we develop a robust method in which the single-variant score statistic is calibrated based on SPA and Efficient Resampling (ER). Through simulation studies, we show that the proposed method provides well-calibrated p-values. The proposed method has a computation time similar to that of the unadjusted approaches and scales to large-sample data. In our application, analysis of UK Biobank whole-exome sequence data for 45,596 unrelated European samples and 791 PheCode phenotypes identified 10 rare variant associations with p-value < 10E-7, including the associations between JAK2 and myeloproliferative disease, HOXB13 and cancer of prostate, and F11 and congenital coagulation defects.

In the second project, we extend the robust method to related samples. Here we propose a scalable generalized mixed model region-based association test that handles large sample sizes and accounts for unbalanced case-control ratios for binary traits. This method, SAIGE-GENE, utilizes state-of-the-art optimization strategies to reduce computational and memory cost, and hence is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples. Through the analysis of the HUNT study of 69,716 Norwegian samples and the UK Biobank data of 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large sample data (N > 400,000) with type I error rates well controlled.

In the third project, we propose a novel multi-ethnic polygenic risk score (PRS) using transfer learning from the machine learning literature. Because most existing GWAS were conducted in European or East Asian individuals, existing PRS models have limited transferability to minority populations such as Africans and South Asians. Although recent studies have developed multi-ethnic PRS models that linearly combine multiple PRS trained on GWAS from different ancestries, they remain under-powered. Our approach, TL-PRS, fine-tunes the potentially biased model trained with GWAS summary statistics from the majority ancestry to the target dataset of the minority ancestry. Through simulation studies, we show that TL-PRS improved the performance of PRS across a wide range of genetic architectures and cross-population genetic correlations compared to the baseline methods. In an application to 8,168 African and 10,285 South Asian individuals in UK Biobank, TL-PRS substantially improved the prediction accuracy of six quantitative and two dichotomous traits.

Deep Blue DOI

https://dx.doi.org/10.7302/3019

Subjects

unbalanced case-control ratio

rare variant test

genome-wide association study

phenome-wide association study

sample relatedness

polygenic prediction

Types

Thesis

Handle

https://hdl.handle.net/2027.42/169974

Collections

Dissertations and Theses (Ph.D. and Master's)