Renewable Estimation and Incremental Inference with Streaming Health Datasets
Luo, Lan
2020
Abstract
The overarching objective of my dissertation is to develop a new methodology that allows to sequentially update parameter estimates and their standard errors along with data streams. The key technical novelty pertains to the fact that the proposed estimation method, termed as renewable estimation in my dissertation, uses current data and summary statistics of historical data, but no use of any historical subject-level data. To implement renewable estimation, I utilize the powerful Lambda architecture in Apache Spark to design a new paradigm that includes an inference layer in addition to the existing speed layer. This expanded architecture is named as the Rho architecture, which accommodates inference-related statistics and to facilitate sequential updating of quantities involved in estimation and inference. The first project focuses on the renewable estimation in the setting of generalized linear models (RenewGLM) in which I develop a new sequential updating algorithm to calculate numerical solutions of parameter estimates and related inferential quantities. The proposed procedure aggregates both score functions and information matrices over streaming data batches through some summary statistics. I show that the resulting estimation is asymptotically equivalent to the maximum likelihood estimation (MLE) obtained with the entire data once. I demonstrate this new methodology on the analysis of the National Automotive Sampling System-Crashworthiness Data System (NASS CDS) to evaluate the effectiveness of graduated driver licensing (GDL) in the USA. The second project focuses on a substantial extension of the first project to analyze streaming datasets with correlated outcomes, such as clustered data and longitudinal data. I establish the theoretical guarantees for the proposed renewable quadratic inference function (RenewQIF) for dependent outcomes and implement it within the Rho architecture. Furthermore, I relax the homogeneous assumption in the first project and consider regime-switching regression models with a structural change-point. I propose a real-time hypothesis testing procedure based on a goodness-of-fit test statistic that is shown to achieve both proper type I error control and desirable change-point detection power. The third project concerns data streams that involve both inter-data batch correlation and dynamic heterogeneity, arising typically from various types of electronic health records (EHR) and mobile health data. This project is built in the framework of state space models in which the observed data stream is driven by a latent state process that may incorporate trend, seasonal, or time-varying covariate effects. In this setting, calculating the online MLE is challenge due to the involvement of high-dimensional integrals and complex covariance structures. In this project, I develop a Kalman filter to facilitate a multivariate online regression analysis (MORA) in the context of linear state space mixed models. MORA enables to renew both point estimates and standard errors of the fixed effects. We also apply the MORA method to analyze an EHR data example, adjusting for some heterogeneous batch-specific effects.Subjects
Streaming data analysis Stochastic gradient descent Incremental inference
Types
Thesis
Metadata
Show full item recordCollections
Remediation of Harmful Language
The University of Michigan Library aims to describe its collections in a way that respects the people and communities who create, use, and are represented in them. We encourage you to Contact Us anonymously if you encounter harmful or problematic language in catalog records or finding aids. More information about our policies and practices is available at Remediation of Harmful Language.
Accessibility
If you are unable to use this file in its current format, please select the Contact Us link and we can modify it to make it more accessible to you.