Models and Algorithms for Large-Scale Time-to-Event Data
Luo, Lingfeng
2024
Abstract
Large-scale disease registries, such as the Surveillance, Epidemiology and End Results Registry (SEER), hold a wealth of untapped information about the determinants of patient survival. However, many existing statistical methods are not designed to handle such large-scale data with complex structures. For example, the widely used proportional hazards model assumes that each risk factor has a constant effect on the hazard over time. In practice, associations between survival outcomes and risk factors can be complex and involve time-varying effects and interactions, especially with long follow-up periods. This thesis introduces new statistical methods and computational algorithms that leverage large cancer registry databases to perform estimation, variable selection, and prediction. Such tools will accelerate new discoveries and generate new hypotheses about the determinants of survival outcomes. In Chapter II, we develop statistical methods to estimate time-varying effects in survival data. Fitting a time-varying effect model by maximizing the partial likelihood on data of this scale is not feasible with most existing software. Moreover, estimating time-varying coefficients using spline-based approaches requires a moderate number of knots, which may lead to unstable estimation and over-fitting. Adding a penalty term greatly helps the estimation, but selecting the penalty smoothing parameters is difficult in this time-varying setting. We propose modified information criteria to determine the smoothing parameter and a parallelized Newton-based algorithm for estimation. Simulations demonstrate the effectiveness of our approach in reducing the mean squared error of the estimated time-varying coefficients.
We apply the method to SEER head-and-neck, colon, prostate, and pancreatic cancer data and detect the time-varying nature of various risk factors. In Chapter III, we focus on variable selection for high-dimensional survival data: distinguishing time-varying effects from time-independent effects and identifying important interaction terms. We propose a coordinate ascent-based gradient boosting procedure using discrete failure time models. This method achieves variable selection in high-dimensional scenarios and ensures the appropriate inclusion of interaction terms, adhering to either strong or weak hierarchy restrictions. Recognizing the limitations of traditional boosting stopping criteria, we propose an information-criterion-based stopping rule with a well-defined notion of degrees of freedom. When applied to simulations and SEER melanoma data, our approach isolates key risk factors, differentiates time-varying and time-independent effects, and identifies critical interactions. In Chapter IV, we develop a flexible deep learning framework for integrating external risk models with internal time-to-event data. Deep learning methods have shown promise in capturing complex relationships in time-to-event data, but their effectiveness often depends on large datasets. When implemented on moderate or small datasets, they often suffer from severe problems, such as insufficient training data and overfitting. To address these issues, we propose a flexible integration-based deep learning framework named NNCoxKL for prediction in time-to-event data. Our approach accommodates both homogeneous and heterogeneous settings, allows for the integration of risk scores derived from various external sources, and leverages the power of deep neural networks to capture complex non-linear relationships. We demonstrate the improved predictive accuracy of our framework through extensive simulations and real-world applications.
This methodology offers a valuable tool to improve prognostic prediction in survival analysis, particularly in settings with limited sample sizes.
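As a small illustration of the discrete failure time models mentioned for Chapter III, the sketch below shows the standard person-period expansion, in which each subject contributes one binary row per interval at risk. This is a generic textbook construction and not the thesis's own implementation; the function names `person_period` and `discrete_hazards` and the toy data are hypothetical.

```python
from typing import List, Tuple


def person_period(times: List[int], events: List[int]) -> List[Tuple[int, int, int]]:
    """Expand (time, event) pairs into person-period rows.

    times[i]  : 1-based index of the last discrete interval subject i was observed in
    events[i] : 1 if subject i failed in that interval, 0 if censored
    Returns rows (subject, interval, y), with y = 1 only in the failure interval.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    rows = []
    for subject, (t, d) in enumerate(zip(times, events)):
        for interval in range(1, t + 1):
            y = 1 if (d == 1 and interval == t) else 0
            rows.append((subject, interval, y))
    return rows


def discrete_hazards(rows: List[Tuple[int, int, int]], n_intervals: int) -> List[float]:
    """Life-table hazard per interval: failures divided by subjects at risk."""
    at_risk = [0] * (n_intervals + 1)
    failures = [0] * (n_intervals + 1)
    for _, interval, y in rows:
        at_risk[interval] += 1
        failures[interval] += y
    return [
        failures[k] / at_risk[k] if at_risk[k] else float("nan")
        for k in range(1, n_intervals + 1)
    ]


# Hypothetical toy data: four subjects, three discrete intervals.
times = [1, 2, 2, 3]
events = [1, 0, 1, 1]
rows = person_period(times, events)
hazards = discrete_hazards(rows, n_intervals=3)
```

In this layout a binary regression on `y` with interval-specific intercepts gives a discrete-time hazard model, and a time-varying effect corresponds to an interaction between a covariate and a function of `interval`. How the thesis parameterizes these terms is not stated in the abstract.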
Subjects
time varying effects, high dimensional variable selection, data integration, statistical machine learning and deep learning
Types
Thesis