Decoding Regulatory Variants With Computational Methods in Non-coding Regions of the Human Genome

dc.contributor.author: Zhao, Nanxiang
dc.date.accessioned: 2023-09-22T15:35:21Z
dc.date.available: 2023-09-22T15:35:21Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.identifier.uri: https://hdl.handle.net/2027.42/177989
dc.description.abstract: Understanding the functional consequences of regulatory variants is a significant challenge in genomics. Although Genome-Wide Association Studies (GWAS) have provided valuable insights into human phenotypes by identifying genetic variations associated with diseases and complex traits, the functional implications of many of these genetic variants remain unknown, particularly for non-coding regions of the human genome, which account for over 90% of all variants. To address this challenge, my dissertation focuses on functionally characterizing regulatory elements and their variants in the human genome. Specifically, I define regulatory variants as single nucleotide polymorphisms (SNPs) that can modify the binding affinities of transcription factors (TFs) within the regulatory elements. Such alterations can impact downstream gene expression and potentially contribute to disease progression and trait development. However, characterizing regulatory variants has traditionally relied on the laborious experimental dissection of the human genome, often confined to specific cell types or tissues, making it infeasible to examine all relevant variants in their appropriate biological context. The advent of high-throughput sequencing and computational methods has substantially accelerated the discovery process. In my dissertation, I have developed a series of computational tools and methods to characterize regulatory elements and their variants end to end (Fig 6.1). In Chapter II, I developed peak-calling software, F-Seq2, to accurately define regulatory element regions from open chromatin assays and ChIP-seq assays. F-Seq2 uses kernel density estimation and a dynamic "continuous" Poisson test to account for local biases, outperforming state-of-the-art software, including MACS2, in terms of precision and recall.
Accurate peak calling is essential for downstream analysis, such as differential binding or motif analysis, and lays the foundation for the functional characterization of regulatory variants. In Chapter III, I advanced a leading regulatory variants database, RegulomeDB, to its second version. RegulomeDB allows users to query variants and obtain a comprehensive list of functional evidence for their variants of interest. The new version of RegulomeDB contains over five times more data than its previous version, providing an even more comprehensive resource for researchers. Additionally, the introduction of a suite of scoring models, namely SURF and TURF, enables accurate summaries of the likelihood that variants function as regulatory variants based on all available evidence. In Chapter IV, I developed a machine learning model, TLand, as the next version of the RegulomeDB scoring model, to annotate and prioritize regulatory variants in an organ-specific manner. TLand takes advantage of RegulomeDB-derived features and builds a flexible architecture using stacked generalization to reduce overfitting and facilitate future continuous learning. TLand outperformed state-of-the-art models when holding out cell lines or organ allele-specific binding data. By accounting for common data availability issues that often exist in sequence-based deep learning models, TLand accurately prioritized the relevant organs for approximately 2 million GWAS SNPs. In Chapter V, I introduced a pipeline, Explain-seq, to automatically train and interpret sequence-based deep learning models given genomic coordinates. I demonstrated the utility of Explain-seq by applying it to a recent STARR-seq dataset to gain insights into enhancer binding patterns in a cell-specific manner. The pipeline identified both known and de novo motifs in the K562 cell line by comparing them to the JASPAR database. 
Overall, the computational methods and tools that I developed throughout my dissertation can aid in the discovery and characterization of regulatory elements and variants in the non-coding regions of the human genome.
dc.language.iso: en_US
dc.subject: Bioinformatics
dc.subject: Genomics
dc.subject: Machine learning
dc.subject: Software development
dc.title: Decoding Regulatory Variants With Computational Methods in Non-coding Regions of the Human Genome
dc.type: Thesis
dc.description.thesisdegreename: PhD [en_US]
dc.description.thesisdegreediscipline: Bioinformatics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Boyle, Alan P
dc.contributor.committeemember: Derksen, Harm
dc.contributor.committeemember: Kitzman, Jacob
dc.contributor.committeemember: Najarian, Kayvan
dc.contributor.committeemember: Welch, Joshua
dc.subject.hlbsecondlevel: Genetics
dc.subject.hlbtoplevel: Science
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/177989/1/samzhao_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/8446
dc.identifier.orcid: 0000-0003-3124-0958
dc.identifier.name-orcid: Zhao, Nanxiang; 0000-0003-3124-0958 [en_US]
dc.working.doi: 10.7302/8446 [en]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

