Novel Statistical Learning Methods for High-Dimensional Complex Biomedical Data Analysis

Zhang, Daiwei

2021

View/Open

daiweiz_1.pdf (20MB PDF)

Abstract

Over the past decades, biomedical data have grown rapidly in both dimension and complexity. Traditional statistical models often lack the power to detect the nonlinear associations underlying complex high-dimensional biomedical data. Machine learning (ML) methods, on the other hand, have been successful in solving challenging problems in some applications. However, because of its "black box" nature, standard ML neither elucidates the data-generating mechanism nor quantifies model-fitting uncertainty, which has largely limited its usefulness in biomedical studies. Furthermore, the sample sizes required by sophisticated ML approaches, such as deep neural networks, for analyzing large-scale data, such as those commonly found in imaging genetics and spatial transcriptomics, are not widely affordable in typical medical studies. These difficulties have contributed to the relatively scant success of ML in biomedical applications. To address these challenges, this dissertation develops several novel approaches that combine traditional statistical models with ML algorithms to analyze large-scale complex biomedical data efficiently and effectively.

In the first project, we develop a robust and fast method based on principal component analysis (PCA) for predicting population stratification (PS) from genotypes. PS is a major confounder in genome-wide association studies that can lead to false positive associations. Although PCA-based methods have been widely adopted for PS adjustment, existing methods are either biased toward the null or computationally expensive for large reference sets. In response, we propose two alternative approaches that estimate the asymptotic shrinkage bias using random matrix theory and reduce the computational cost with online singular value decomposition (SVD). The proposed methods are evaluated in extensive simulation studies and applied to data from the UK Biobank and the 1000 Genomes Project. We show that, compared with existing methods, our methods are unbiased and have significantly lower computational cost.

In the second project, we propose a novel image-on-scalar regression (ISR) model to study the association between imaging measurements and scalar covariates. Statistical inference for medical ISR is challenging due to high imaging dimensionality, a limited number of images, complex spatial correlations, and heterogeneous noise. To address these challenges, we use deep neural networks to model the spatially varying coefficient functions of the main effects, individual effects, and noise variance in the ISR model (NNISR). Compared with existing methods, NNISR is more flexible in capturing complex spatial patterns, more straightforward to interpret, and more accurate for small numbers of high-resolution images. We develop computationally efficient and scalable algorithms for parameter estimation and activation region selection. Theoretical analysis establishes the estimation and selection consistency of the proposed method. The superiority of NNISR is further demonstrated through extensive simulations and analyses of brain fMRI data.

In the third project, we focus on modeling the conditional distribution of the response given predictors via deep neural networks. Standard neural network regression predicts the response using the conditional mean and often assumes a simple homoscedastic error distribution. To better quantify prediction uncertainty, we develop a novel Bayesian hierarchical neural network model by introducing latent variables at each hidden layer, which induces high flexibility in modeling the predictive distribution of the response. In light of the special structure of the proposed model, we develop a scalable and accurate Gibbs sampler for posterior computation. We illustrate the proposed method via simulations and analysis of neuroimaging data.

Deep Blue DOI

https://dx.doi.org/10.7302/3890

Subjects

statistical learning

Types

Thesis

Handle

https://hdl.handle.net/2027.42/171378

Collections

Dissertations and Theses (Ph.D. and Master's)
