Statistical Models for Large Scale Genomic Data

Si, Yichen

dc.contributor.author	Si, Yichen
dc.date.accessioned	2024-02-13T21:16:09Z
dc.date.available	2024-02-13T21:16:09Z
dc.date.issued	2023
dc.date.submitted	2023
dc.identifier.uri	https://hdl.handle.net/2027.42/192344
dc.description.abstract	Sequencing technologies have transformed how scientists examine biological systems at both the macroscopic and microscopic levels. For example, population-scale genome sequencing has cataloged millions of common and rare genetic variants in humans and enabled genotype imputation; new spatial transcriptomics technologies can resolve the sequences and locations of individual transcripts in tissues at submicron resolution. The massive amount of data generated by these technologies poses new scientific questions and challenges. This dissertation proposes statistical models and methods motivated by observations in large-scale sequencing data. In Chapter 2, we study genotype imputation accuracy through the lens of the coalescent. We focus on rare variants, where imputation is most challenging, and investigate the theoretical upper bound of their imputation accuracy under the reference-based genotype imputation framework. We develop closed-form solutions for the probability distribution of the joint genealogy of the target and reference sequences, quantify the inevitable error rate of imputation, and evaluate the error's impact on association studies across a range of allele frequencies and reference sample sizes. In Chapter 3, we infer population-level germline methylation using only the allele frequencies from public genetic variant catalogs. We observe that the high mutation rates at methylated CpGs distort the site frequency spectrum and leave a distinct signature in genomic regions that are hyper-methylated in the germline. Leveraging this observation, we build a Hidden Markov Model that generates a unique resource for CpG methylation status in human germline cells and overcomes limitations in whole-genome bisulfite sequencing measurements. In Chapter 4, we develop a factor analysis method for submicron-resolution spatial transcriptomics. We achieve segmentation-free, sub-cellular resolution inference and scale our analysis to data with hundreds of millions of spatial locations. We apply our method to data from four platforms, including both spatial barcoding and sequencing-based technologies and in situ imaging-based technologies. Our method delineates cell type boundaries precisely and identifies rare cell populations even in complex tissue regions where cell segmentation fails.
dc.language.iso	en_US
dc.subject	coalescence
dc.subject	statistical genetics
dc.subject	spatial transcriptomics
dc.title	Statistical Models for Large Scale Genomic Data
dc.type	Thesis
dc.description.thesisdegreename	PhD
dc.description.thesisdegreediscipline	Biostatistics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Kang, Hyun Min
dc.contributor.committeemember	Zoellner, Sebastian K
dc.contributor.committeemember	Terhorst, Jonathan
dc.contributor.committeemember	Boehnke, Michael Lee
dc.contributor.committeemember	Mukherjee, Bhramar
dc.subject.hlbsecondlevel	Science (General)
dc.subject.hlbtoplevel	Science
dc.contributor.affiliationumcampus	Ann Arbor
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/192344/1/ycsi_1.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/22253
dc.identifier.orcid	0000-0001-5576-3054
dc.identifier.name-orcid	Si, Yichen; 0000-0001-5576-3054	en_US
dc.working.doi	10.7302/22253	en
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: ycsi_1.pdf
Size: 13.52MB
Format: PDF
