Statistical Learning for Large-Scale and Complex-Structured Data
Tang, Weijing
2022
Abstract
Our modern era has seen an explosion in the amount of valuable information stored in large and complex datasets. The growing scale, diversity of data structures, and incomplete observations in these datasets pose new challenges for statistical learning. Motivated by these challenges, this dissertation addresses the following three problems. (I) The first part of the dissertation shows how ordinary differential equations (ODEs) can be used in a novel way to enhance modeling flexibility and computational efficiency in survival analysis for complex and incomplete censored data. Despite a rich literature on survival analysis, most existing statistical models and estimation methods still suffer from practical limitations such as restricted model capacity and a lack of scalability for large-scale studies. We introduce a unified ODE framework for survival analysis that allows flexible modeling and enables a statistically efficient procedure for estimation and inference. In particular, the proposed estimation procedure is computationally efficient, easy to implement, and applicable to a wide range of survival models. Moreover, to accommodate data in diverse formats, we extend the ODE framework by leveraging deep neural networks for powerful prediction. (II) The second part of the dissertation focuses on statistical models for signed networks. Statistical network models are useful for understanding the underlying formation mechanism and characteristics of complex networks. However, statistical models for signed networks have been largely unexplored. Signed networks contain both positive (e.g., like, trust) and negative (e.g., dislike, distrust) edges, which are commonly seen in real-world scenarios. The positive and negative edges in signed networks lead to unique structural patterns, which pose challenges for statistical modeling.
In this part, we introduce a novel latent space approach for modeling signed networks and accommodating the well-known balance theory in social science, i.e., "the enemy of my enemy is my friend" and "the friend of my friend is my friend". The proposed approach treats both edges and their signs as random variables, and characterizes the balance theory with a novel and natural notion of population-level balance. This approach guides us towards building a class of balanced inner-product models, and towards developing scalable algorithms via projected gradient descent to estimate the latent variables. We also establish non-asymptotic error rates for the estimates. (III) The third part of the dissertation focuses on applications of statistical machine learning to healthcare. In particular, quick and accurate prediction of disease progression can give clinicians valuable information for providing appropriate care in a timely manner. The success of prediction models often relies on the availability of a large amount of labeled training data. However, in many healthcare settings, only a small minority of available data is accurately labeled while unlabeled data is abundant. Further, input variables such as clinical events in the medical records are usually of a complex, longitudinal nature, which poses additional challenges. Motivated by the scarcity of annotated data, we propose a new semi-supervised joint learning method for classifying clinical events data, which requires less labeled training data while maintaining the same prediction performance as the supervised method.
Subjects
Survival Analysis; Network Analysis; Semi-supervised Learning
Types
Thesis