Statistical Models for Large Scale Genomic Data

dc.contributor.author: Si, Yichen
dc.date.accessioned: 2024-02-13T21:16:09Z
dc.date.available: 2024-02-13T21:16:09Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.identifier.uri: https://hdl.handle.net/2027.42/192344
dc.description.abstract: Sequencing technologies have transformed how scientists examine biological systems at both macroscopic and microscopic levels. For example, population-scale genome sequencing has cataloged millions of common and rare genetic variants in humans and enabled genotype imputation; new spatial transcriptomics technologies can resolve the sequences and locations of individual transcripts in tissues at submicron resolution. The massive amount of data generated by these technologies poses new scientific questions and challenges. This dissertation proposes statistical models and methods motivated by observations in large-scale sequencing data. In Chapter 2, we study genotype imputation accuracy through the lens of the coalescent. We focus on rare variants, where imputation is most challenging, and investigate the theoretical upper bound of their imputation accuracy under the reference-based genotype imputation framework. We develop closed-form solutions for the probability distribution of the joint genealogy of the target and reference sequences, quantify the inevitable error rate of imputation, and evaluate the error’s impact on association studies across a range of allele frequencies and reference sample sizes. In Chapter 3, we infer population-level germline methylation using only the allele frequencies from public genetic variant catalogs. We observe that the high mutation rates at methylated CpGs distort the site frequency spectrum and leave a distinct signature in genomic regions that are hyper-methylated in the germline. Leveraging this observation, we build a Hidden Markov Model that generates a unique resource for CpG methylation status in human germline cells and overcomes limitations in whole genome bisulfite sequencing measurements. In Chapter 4, we develop a factor analysis method for submicron resolution spatial transcriptomics. We achieve segmentation-free sub-cellular resolution inference and scale our analysis to data with hundreds of millions of spatial locations. We apply our method to data from four platforms, including both spatial barcoding and sequencing-based technologies and in situ imaging-based technologies. Our method delineates cell type boundaries precisely and identifies rare cell populations even in complex tissue regions where cell segmentation fails.
dc.language.iso: en_US
dc.subject: coalescence
dc.subject: statistical genetics
dc.subject: spatial transcriptomics
dc.title: Statistical Models for Large Scale Genomic Data
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Biostatistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Kang, Hyun Min
dc.contributor.committeemember: Zoellner, Sebastian K
dc.contributor.committeemember: Terhorst, Jonathan
dc.contributor.committeemember: Boehnke, Michael Lee
dc.contributor.committeemember: Mukherjee, Bhramar
dc.subject.hlbsecondlevel: Science (General)
dc.subject.hlbtoplevel: Science
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/192344/1/ycsi_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/22253
dc.identifier.orcid: 0000-0001-5576-3054
dc.identifier.name-orcid: Si, Yichen; 0000-0001-5576-3054 [en_US]
dc.working.doi: 10.7302/22253 [en]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

