Statistical and Machine Learning Methods for the Analysis of Summary Statistics Derived from Large Genomic Datasets

dc.contributor.author: Liao, Kevin
dc.date.accessioned: 2024-02-13T21:18:41Z
dc.date.available: 2024-02-13T21:18:41Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.identifier.uri: https://hdl.handle.net/2027.42/192396
dc.description.abstract: Advancements in DNA sequencing over the past decade have transformed our ability to characterize genetic variation in large populations and study the genetics of many complex traits. For population geneticists, information on genetic variation alone (i.e., which sites in the genome are mutated and at what frequency) is interesting because it allows for studying aspects of a population (e.g., demographic history, natural selection, and mutation rates). For statistical geneticists and genetic epidemiologists, the availability of phenotypic information in the same set of genetically sequenced individuals allows for studying the genetic basis of a complex trait. In this dissertation, I present three separate projects that leverage genetic information originating from DNA sequencing. In the first project, I focused on analyzing genetic variation without consideration of a phenotype, as is often done in the field of population genetics to make inferences on demographic history or natural selection. A commonly used summary statistic of genetic variation for population genetics inference is the allele frequency spectrum. However, methods based on the allele frequency spectrum make a simplifying assumption: all sites are interchangeable (i.e., an A->T mutation is the same as a C->T mutation). In this project, I first extended previous literature to show that heterogeneity in the allele frequency spectrum exists across mutation types at finer levels of resolution. I then illustrated how inferences of demographic history and natural selection are impacted by the violation of this assumption. In the second project, I focused on combining phenotypic information with genetic data through genome-wide association studies (GWAS) and polygenic risk scores (PRS).
GWAS estimate per-variant genetic effects on a complex trait, which can be used to summarize an individual's genetic risk for that trait in a PRS (constructed as the GWAS-weighted sum of their risk variants). However, PRS have a portability issue: phenotype predictions worsen as the ancestry of the target sample diverges from that of the GWAS sample. In admixed individuals, the genome can be traced back to multiple ancestral populations, and ancestry lies on a continuum. Such a continuum causes an ancestry dependence of PRS performance, as the PRS performs better for samples whose ancestry better matches the external GWAS. To help resolve this issue, I developed slaPRS, a stacking-based framework that integrates GWAS from multiple ancestral populations to construct PRS in admixed individuals. In simulations and real data, slaPRS performed well and reduced the ancestry dependence compared to existing approaches. In the third project, I focused on how genetic-phenotypic associations are shared across two or more phenotypes through pleiotropy. Pleiotropy can be characterized at several resolutions: genome-wide, regional, or SNP/gene-level. One approach to studying pleiotropy is local genetic correlation (LGC), which quantifies the extent of genetic sharing in a local region through the similarity in GWAS effect sizes. However, LGC cannot identify SNP- or gene-level pleiotropy, so it cannot identify which variants or genes in a region drive an LGC signal. To resolve this issue, I developed LDSC-MIX, a Bayesian mixture-of-regressions method to infer latent groups of likely shared causal variants across two traits. In simulations and real data, LDSC-MIX identified SNP sets recovering the true LGC and tested whether genes in a region are enriched for such SNPs.
dc.language.iso: en_US
dc.subject: statistical genetics
dc.subject: population genetics
dc.subject: polygenic risk scores
dc.subject: genetic correlation
dc.subject: pleiotropy
dc.title: Statistical and Machine Learning Methods for the Analysis of Summary Statistics Derived from Large Genomic Datasets
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Biostatistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Zoellner, Sebastian K
dc.contributor.committeemember: Terhorst, Jonathan
dc.contributor.committeemember: Baladandayuthapani, Veerabhadran
dc.contributor.committeemember: Morrison, Jean
dc.contributor.committeemember: Wen, Xiaoquan William
dc.subject.hlbsecondlevel: Genetics
dc.subject.hlbsecondlevel: Public Health
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Health Sciences
dc.subject.hlbtoplevel: Science
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/192396/1/ksliao_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/22305
dc.identifier.orcid: 0000-0003-4660-2480
dc.identifier.name-orcid: Liao, Kevin; 0000-0003-4660-2480 [en_US]
dc.working.doi: 10.7302/22305 [en]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)
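
Illustrative note: the abstract describes a PRS as the GWAS-weighted sum of an individual's risk variants. The following is a minimal sketch of that general definition only, not the dissertation's slaPRS method. The function name, variant IDs, effect sizes, and dosages are all hypothetical and chosen for illustration.

```python
from typing import Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Return the weighted sum of risk-allele dosages.

    dosages: risk-allele count per variant for one individual (0, 1, or 2).
    weights: per-variant effect sizes, as estimated by a GWAS.
    """
    # Variants without a GWAS weight are skipped (an assumption of this sketch).
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Hypothetical effect sizes and genotype, for illustration only.
gwas_weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
individual = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

score = polygenic_risk_score(individual, gwas_weights)
print(score)
```

Here the variant `rs4` has no GWAS weight and therefore does not contribute to the score. In practice, weights estimated in one ancestry may transfer poorly to another, which is the portability issue the abstract raises.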

