Computational Methods for Single-Cell and Spatial Multimodal Data Integration
Gao, Chao
2024
Abstract
Advancements in sequencing technologies have revolutionized our ability to measure biomolecules. Single-cell single-omics sequencing allows the genome, transcriptome, and epigenome to be examined at unprecedented resolution, providing a detailed view of cellular diversity and function. It also addresses a limitation of bulk RNA sequencing, which profiles only gene expression averaged across cells and therefore masks cellular heterogeneity. Building on this, single-cell multimodal omics enables simultaneous analysis of multiple types of molecular measurements in the same cell. Such paired information has revealed genetic and epigenetic landscapes as well as the relationships between them. Further, spatial sequencing technologies provide molecular measurements localized within tissues, adding an essential dimension to our understanding of biological complexity. They have advanced research on how cells interact within their spatial context, which is crucial for understanding tissue organization, development, and disease pathology. In this dissertation, I propose three computational methods that address the challenges posed by each of these data types in identifying heterogeneity within cell populations and tissue regions, advancing our knowledge of biological systems.

Integrating diverse single-cell unimodal datasets offers tremendous opportunities for an unbiased, comprehensive, and quantitative definition of cell identities. Published single-cell data integration approaches are either not designed to integrate multiple modalities or not scalable to massive datasets, and none of them can incorporate new data without recalculating from scratch. To this end, I develop an online learning algorithm for solving integrative nonnegative matrix factorization (Online iNMF).
For cell type inference, I apply Online iNMF to integrate large-scale, continually arriving single-cell datasets of diverse molecular modalities, including gene expression, chromatin accessibility, and DNA methylation. Online iNMF converges rapidly and decouples peak memory usage from the size of the entire dataset. Its improved computational efficiency does not come at the cost of dataset alignment or cluster preservation performance. The ability of Online iNMF to incorporate data iteratively is useful in building single-cell multi-omic atlases.

Single-cell multimodal epigenomic profiling simultaneously measures multiple histone modifications and chromatin accessibility in the same cells. Such parallel measurements provide opportunities to investigate how epigenomic modalities vary together across cell populations. I propose ConvNet-VAE, a variational autoencoder comprising one-dimensional convolutional layers, for dimensionality reduction. Operating on data binned into fixed windows along the genome, ConvNet-VAE leverages the multi-track and sequential nature of these data. I apply ConvNet-VAE to integrate histone modification marks and chromatin accessibility profiled from juvenile mouse brain and human bone marrow. Compared to multimodal VAEs with only fully connected layers, ConvNet-VAE can achieve better performance in dimensionality reduction and batch correction while using significantly fewer parameters. The advantage of ConvNet-VAE increases with the number of modalities, making it a promising tool as the number of jointly profiled epigenomic modalities grows.

Multimodal spatial profiling has allowed for the simultaneous investigation of transcriptomics, proteomics, and epigenomics at the individual cell/bead/spot level in the tissue. I devise spaMVGAE, a multimodal variational autoencoder employing graph convolutional networks.
By incorporating spatial location information, spaMVGAE adapts to various modalities and learns a joint low-dimensional embedding of cells/beads/spots for domain detection. I apply spaMVGAE to spatially resolved multimodal datasets from different biological contexts, such as breast cancer, mouse bone development, and adult mouse brain. spaMVGAE accurately detects regions of interest by capturing the heterogeneous and complex molecular makeup of the cells or tissue microenvironments. spaMVGAE scales to large datasets and carries out joint integration across multiple tissue sections.
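The abstract's claim that online learning decouples peak memory usage from dataset size can be illustrated with a minimal sketch. It assumes plain mini-batch NMF on a single dataset, not the dissertation's integrative (iNMF) objective, and the function names `solve_h` and `online_nmf` are hypothetical. The point it shows is that only two fixed-size summary matrices persist between mini-batches, so new cells can be folded in without revisiting earlier ones.

```python
import numpy as np

def solve_h(x_batch, w, n_iter=50, eps=1e-10):
    """Nonnegative cell factors for one mini-batch, with gene factors w held fixed.

    Uses standard multiplicative updates for the squared-error NMF objective.
    """
    rng = np.random.default_rng(0)
    h = rng.random((x_batch.shape[0], w.shape[0])) + 0.1
    xwt = x_batch @ w.T   # (cells_in_batch, k)
    wwt = w @ w.T         # (k, k)
    for _ in range(n_iter):
        h *= xwt / (h @ wwt + eps)
    return h

def online_nmf(batches, n_genes, k, seed=0, eps=1e-10):
    """Simplified online NMF: X ~= H @ W with W of shape (k, n_genes).

    Assumption: this is an illustrative single-dataset sketch, not Online iNMF.
    Memory between batches is O(k*k + k*n_genes), independent of cell count.
    """
    rng = np.random.default_rng(seed)
    w = rng.random((k, n_genes)) + 0.1
    a = np.zeros((k, k))        # running sum of H_b^T H_b
    b = np.zeros((k, n_genes))  # running sum of H_b^T X_b
    for x_batch in batches:
        h = solve_h(x_batch, w)
        a += h.T @ h
        b += h.T @ x_batch
        # One sweep of nonnegative block coordinate descent over the rows of W,
        # minimizing tr(W^T A W) - 2 tr(W^T B) using the accumulated statistics.
        for j in range(k):
            w[j] = np.maximum(0.0, w[j] + (b[j] - a[j] @ w) / (a[j, j] + eps))
    return w
```

Here `batches` can be any iterable of nonnegative arrays of shape (cells, genes), for example chunks read sequentially from disk, so a newly arriving dataset only requires continuing the loop with the stored `a`, `b`, and `w`.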
Subjects
Single-Cell and Spatial Multimodal Data Integration; Machine Learning
Types
Thesis