
Statistical Methods of Data Integration, Model Fusion, and Heterogeneity Detection in Big Biomedical Data Analysis

dc.contributor.author: Tang, Lu
dc.date.accessioned: 2018-10-25T17:39:01Z
dc.date.available: NO_RESTRICTION
dc.date.available: 2018-10-25T17:39:01Z
dc.date.issued: 2018
dc.date.submitted:
dc.identifier.uri: https://hdl.handle.net/2027.42/145885
dc.description.abstract: Interesting and challenging methodological questions arise from the analysis of Big Biomedical Data, where viable solutions are sought with the help of modern computational tools. In this dissertation, I look at problems in biomedical studies concerning data integration, data heterogeneity, and the associated statistical learning algorithms. The overarching strategy throughout the dissertation research is to treat individual datasets, rather than individual subjects, as the elements of focus. Thus, I generalize some traditional subject-level methods to tailor them to the development of Big Data methodologies. Following an introductory overview in the first chapter, Chapter II concerns the development of fusion learning for model heterogeneity in data integration via a regression coefficient clustering method. The statistical learning procedure is built for generalized linear models and enforces an adjacent fusion penalty on ordered parameters (Wang et al., 2016). This is an adaptation of the fused lasso (Tibshirani et al., 2005) and an extension of the homogeneity pursuit (Ke et al., 2015), which considers only a single dataset. Using this method, we can identify regression coefficient heterogeneity across sub-datasets and fuse homogeneous subsets, greatly simplifying the regression model and thus improving statistical power. The proposed fusion learning algorithm (published as Tang and Song (2016)) allows the integration of a large number of sub-datasets, a clear advantage over traditional methods based on stratum-covariate interactions or random effects. The method is also useful for clustering treatment effects, so that outlying studies may be detected. We demonstrate the method on datasets from the Panel Study of Income Dynamics and from the Early Life Exposures in Mexico to Environmental Toxicants study. The method has also been extended to the Cox proportional hazards model to handle time-to-event responses.
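The adjacent-fusion idea described above can be illustrated with a minimal sketch: sort the dataset-specific coefficient estimates and merge neighbors that are closer than a threshold, replacing each merged cluster by its mean. This greedy toy version is not the penalized estimator developed in the dissertation (which solves a regularized GLM with a fusion penalty); the function `fuse_adjacent` and the threshold `tau` are illustrative names introduced here.

```python
import numpy as np

def fuse_adjacent(coefs, tau):
    """Greedy sketch of adjacent fusion: sort the per-dataset
    coefficient estimates, merge neighbors whose gap is below tau,
    and replace each cluster by its mean."""
    coefs = np.asarray(coefs, dtype=float)
    order = np.argsort(coefs)
    sorted_c = coefs[order]
    clusters = [[sorted_c[0]]]
    for c in sorted_c[1:]:
        if c - clusters[-1][-1] < tau:
            clusters[-1].append(c)   # fuse with the previous cluster
        else:
            clusters.append([c])     # start a new cluster
    fused_sorted = np.concatenate([[np.mean(cl)] * len(cl) for cl in clusters])
    fused = np.empty_like(fused_sorted)
    fused[order] = fused_sorted      # undo the sort
    return fused

# five sub-dataset slopes collapse into two fused groups
est = [0.48, 0.52, 0.50, 1.01, 0.99]
print(fuse_adjacent(est, tau=0.2))   # → [0.5 0.5 0.5 1.  1. ]
```

Fusing the five estimates into two shared values reduces the number of free parameters from five to two, which is the source of the power gain the abstract mentions.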
Chapter III, under the assumption of a homogeneous generalized linear model, focuses on the development of a divide-and-combine method for extremely large datasets that may be stored on distributed file systems. Using the means of confidence distributions (Fisher, 1956; Efron, 1993), I develop a procedure to combine results from different sub-datasets, where the lasso is used to reduce model size in order to achieve numerical stability. The algorithm fits the MapReduce paradigm and may be perfectly parallelized. To deal with the estimation bias incurred by lasso regularization, a de-biasing step is invoked so that the proposed method yields valid inference. The method is conceptually simple, computationally scalable, and fast, as illustrated by numerical comparisons with the benchmark maximum likelihood estimator based on the full data and with other competing divide-and-combine methods. We apply the method to a large public dataset from the National Highway Traffic Safety Administration to identify risk factors of accident injury. In Chapter IV, I generalize the fusion learning algorithm of Chapter II and develop a coefficient clustering method for correlated data in the context of generalized estimating equations. The motivation for this generalization is to assess model heterogeneity in the pattern mixture modeling approach (Little, 1993), where models are stratified by missing data patterns. This is one of the primary strategies in the literature for dealing with informative missing data mechanisms. My method aims to simplify the pattern mixture model by fusing homogeneous parameters within the generalized estimating equations (GEE; Liang and Zeger (1986)) framework.
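The divide-and-combine strategy of Chapter III can be sketched in its simplest form: fit each sub-dataset separately (the map step), then combine the per-block estimates by inverse-variance weighting (the reduce step), which is what combining normal confidence distributions reduces to. This toy uses a plain OLS slope per block in place of the chapter's de-biased lasso estimator; `fit_block` and the simulation setup are illustrative, not the dissertation's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_block(y, x):
    """OLS slope and its variance on one sub-dataset (a stand-in
    for the per-block de-biased regularized estimator)."""
    xc = x - x.mean()
    yc = y - y.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    var = (resid @ resid) / (len(y) - 2) / (xc @ xc)
    return beta, var

# simulate one large dataset and split it into 10 blocks (map step)
beta_true = 2.0
x = rng.normal(size=10000)
y = beta_true * x + rng.normal(size=10000)
blocks = [fit_block(yk, xk) for yk, xk in zip(np.split(y, 10), np.split(x, 10))]

# combine (reduce step): inverse-variance weighting, as arises from
# combining normal confidence distributions across blocks
w = np.array([1.0 / v for _, v in blocks])
betas = np.array([b for b, _ in blocks])
beta_comb = (w * betas).sum() / w.sum()
print(beta_comb)
```

Because each block is fitted independently, the map step parallelizes perfectly; only the scalar summaries (estimate, variance) travel to the combine step, which is what makes the scheme attractive for data on distributed file systems.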
dc.language.iso: en_US
dc.subject: Data integration
dc.subject: Fusion learning
dc.subject: Distributed computing
dc.subject: Regularized regression
dc.subject: Missing data
dc.subject: Longitudinal data
dc.title: Statistical Methods of Data Integration, Model Fusion, and Heterogeneity Detection in Big Biomedical Data Analysis
dc.type: Thesis (en_US)
dc.description.thesisdegreename: PhD (en_US)
dc.description.thesisdegreediscipline: Biostatistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Song, Peter Xuekun
dc.contributor.committeemember: Peterson, Karen Eileen
dc.contributor.committeemember: Kang, Jian
dc.contributor.committeemember: Sanchez, Brisa N
dc.subject.hlbsecondlevel: Public Health
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Health Sciences
dc.subject.hlbtoplevel: Science
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/145885/1/lutang_1.pdf
dc.identifier.orcid: 0000-0001-6143-9314
dc.identifier.name-orcid: Tang, Lu; 0000-0001-6143-9314 (en_US)
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

