Statistical Methods to Incorporate External Summary-level Information into a Current Study
Gu, Tian
2021
Abstract
In the era of big data, it is becoming increasingly common for researchers to consider incorporating external information from large studies to improve the accuracy of statistical inference instead of relying on a modestly sized dataset collected internally. We consider a general statistical problem where there are some known regression models or risk calculators to predict an outcome of interest from a set of commonly used predictors. Different types of summary information are available for these external models. An internal modest-sized dataset containing individual-level data for the variables in the known models and some new variables is available for our current analysis. In all three chapters below, we consider different settings to achieve the same goal: to build an improved prediction model that includes the new variables, using both the internal individual-level data and summary information obtained from the known external model(s). In Chapter 2, we focus on the simple case where there is only one large, well-characterized previous study from the external population. We propose a synthetic data approach, which first converts the external information into synthetic data, and then analyzes a combined dataset consisting of the observed internal data and the synthetic data. A theoretical justification and extensive simulation studies establish the efficiency gain and improved prediction performance of the proposed data integration method. We also illustrate that even under less restrictive requirements on the information that is available externally, the combined estimates have the same asymptotic properties as an alternative constrained maximum likelihood estimation approach. In Chapter 3, we consider a more complicated but quite plausible situation where several external prediction models are available to aid inference and prediction for the internal study.
We assume that each of the external studies developed a prediction model for the same outcome but may use a slightly different set of covariates. We propose a meta-inference framework using an empirical Bayes estimation approach, which adaptively combines the estimates from the external models. This adaptive approach diminishes the influence of information that is less compatible with the internal data while balancing the bias-variance trade-off. The proposed estimators are more efficient than the naive analysis of the internal data. In Chapter 4, we first extend the synthetic data method from Chapter 2 to accommodate the situation with multiple external prediction models, and further allow for heterogeneity of covariate effects across the external populations. Each external model could potentially be built on slightly different subsets of covariates that are measured in the internal study. The proposed approach generates synthetic outcome data in each population, uses stacked multiple imputation to create a long dataset with complete covariate information, and finally analyzes the imputed data with weighted regression. Leveraging multiple sources of auxiliary information from a broad class of externally fitted predictive models or established risk calculators based on parametric regression or machine learning methods, this new strategy can make statistical inference more accurate for both the internal population and the external populations. We evaluate the proposed methods through extensive simulations and apply them to improve models for predicting the risk of high-grade prostate cancer.
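The single-external-model idea described for Chapter 2 can be illustrated with a small simulation. The following is a minimal sketch only, not the thesis's implementation: it assumes a binary outcome with a logistic model, one established predictor `x`, one new predictor `b`, and an external model for `y` given `x` whose coefficients are the only external information used. The simulated data, the helper names (`fit_logistic`, `simulate`), the synthetic sample size `m`, and the single stochastic regression imputation of `b` (a simplification standing in for proper multiple imputation) are all assumptions made for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=100, tol=1e-8):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def simulate(n, rng):
    """Hypothetical population: y depends on both x and the new variable b."""
    x = rng.normal(size=n)
    b = 0.5 * x + rng.normal(size=n)
    y = rng.binomial(1, expit(-0.5 + 1.0 * x + 0.8 * b))
    return x, b, y

rng = np.random.default_rng(2021)

# Stand-in for a published external model: y ~ x fitted on a large study.
# Only the fitted coefficients are carried forward (summary-level information).
x_ext, _, y_ext = simulate(20000, rng)
external_coef = fit_logistic(np.column_stack([np.ones(x_ext.size), x_ext]), y_ext)

# Modest internal study with individual-level data on y, x and b.
n = 200
x, b, y = simulate(n, rng)
X_int = np.column_stack([np.ones(n), x, b])

# Step 1: convert the external model into synthetic (x, y) records.
m = 1000  # assumed synthetic sample size
x_syn = rng.choice(x, size=m, replace=True)
y_syn = rng.binomial(1, expit(external_coef[0] + external_coef[1] * x_syn))

# Step 2: b is unobserved in the synthetic records, so impute it from a
# working model b ~ x + y estimated on the internal data (single stochastic
# imputation here; multiple imputation would repeat this and pool results).
Z_int = np.column_stack([np.ones(n), x, y])
gamma, *_ = np.linalg.lstsq(Z_int, b, rcond=None)
resid_sd = np.std(b - Z_int @ gamma, ddof=Z_int.shape[1])
Z_syn = np.column_stack([np.ones(m), x_syn, y_syn])
b_syn = Z_syn @ gamma + rng.normal(scale=resid_sd, size=m)

# Step 3: analyze the combined dataset with the full model y ~ x + b.
X_comb = np.vstack([X_int, np.column_stack([np.ones(m), x_syn, b_syn])])
y_comb = np.concatenate([y, y_syn])

beta_internal = fit_logistic(X_int, y)    # internal data only
beta_combined = fit_logistic(X_comb, y_comb)  # internal + synthetic data
print("internal only:", beta_internal)
print("combined     :", beta_combined)
```

Imputing `b` conditional on both `x` and `y` matters in this sketch: copying or drawing `b` without regard to the synthetic outcome would pull its estimated coefficient toward zero in the combined fit.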
Subjects
Data integration; Prediction models; Regression inference
Types
Thesis