Synthetic Data Sharing and Estimation of Viable Dynamic Treatment Regimes with Observational Data

dc.contributor.author: Zhou, Nina
dc.date.accessioned: 2021-02-04T16:37:34Z
dc.date.available: 2021-02-04T16:37:34Z
dc.date.issued: 2020
dc.date.submitted: 2020
dc.identifier.uri: https://hdl.handle.net/2027.42/166113
dc.description.abstract: There is significant public demand for rapid, data-driven scientific investigations using observational data, especially in personalized healthcare. This dissertation addresses three complementary challenges of analyzing complex observational data in biomedical research. The ethical challenge reflects regulatory policies and social norms regarding data privacy, which tend to emphasize data security at the expense of effective data sharing. This results in fragmentation and scarcity of available research data. In Chapter 2, we propose the DataSifter approach, which mediates this challenge by facilitating the generation of realistic synthetic data from sensitive datasets containing static and time-varying variables. The DataSifter method relies on robust imputation methods, including missForest and an iterative imputation technique for time-varying variables using the Generalized Linear Mixed Model (GLMM) and the Random Effects-Expectation Maximization tree (RE-EM tree). Applications demonstrate that, under a moderate level of obfuscation, the DataSifter guarantees sufficient per-subject perturbations of time-invariant data and preserves the joint distribution and the energy of the entire data archive, which ensures high utility and analytical value of the time-varying information. This promotes accelerated innovation by enabling secure sharing among data governors and researchers. Once sensitive data can be securely shared, effective analytical tools are needed to provide viable individualized data-driven solutions. Observational data is an important data source for estimating dynamic treatment regimes (DTR) that guide personalized treatment decisions. The second natural challenge regards the viability of optimal DTR estimation, which may be affected by observed treatment combinations that are not applicable to future patients for clinical or economic reasons.
In Chapter 3, we develop restricted Tree-based Reinforcement Learning to accommodate restrictions on feasible treatment combinations in observational studies by truncating possible treatment options based on patient history in a multi-stage, multi-treatment setting. The proposed method provides optimal treatment recommendations for patients considering only viable treatment options and utilizes all valid observations in the dataset to avoid selection bias and improve efficiency. In addition to structured data, unstructured data, such as free text or voice notes, have rapidly become an essential component of many biomedical studies based on clinical and health data, including electronic health records (EHR), providing extra patient information. The last two chapters of my dissertation (Chapter 4 and Chapter 5) expand the methods developed in the previous two projects by utilizing novel natural language processing (NLP) techniques to address the third challenge of handling unstructured data elements. In Chapter 4, we construct a text data anonymization tool, DataSifterText, which generates synthetic free-text data to protect sensitive unstructured data, such as personal health information. In Chapter 5, we propose to enhance the precision of optimal DTR estimation by acquiring additional information contained in clinical notes with information extraction (IE) techniques. Simulation studies and an application to blood pressure management in intensive care units demonstrated that the IE techniques can provide extra patient information and more accurate counterfactual outcome modeling, because of the potentially enhanced sample size and a wider pool of candidate tailoring variables for optimal DTR estimation. The statistical methods presented in this thesis provide theoretical and practical solutions for privacy-aware, utility-preserving large-scale data sharing and clinically meaningful optimal DTR estimation.
The general theoretical formulation of the methods leads to the design of tools and direct applications that are expected to go beyond the biomedical and health analytics domains.
dc.language.iso: en_US
dc.subject: Data Masking
dc.subject: Viable Dynamic Treatment Regime Estimation
dc.title: Synthetic Data Sharing and Estimation of Viable Dynamic Treatment Regimes with Observational Data
dc.type: Thesis
dc.description.thesisdegreename: PhD [en_US]
dc.description.thesisdegreediscipline: Biostatistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Dinov, Ivaylo D
dc.contributor.committeemember: Wang, Lu
dc.contributor.committeemember: Xi, Chuanwu
dc.contributor.committeemember: Almirall, Daniel
dc.contributor.committeemember: Wu, Zhenke
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Science
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/166113/1/zhounina_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/36
dc.identifier.orcid: 0000-0002-1649-3458
dc.identifier.name-orcid: Zhou, Nina; 0000-0002-1649-3458 [en_US]
dc.working.doi: 10.7302/36 [en]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

