Statistical Methods and Privacy Preserving Protocols for Combining Genetic Data with Electronic Health Records

Zhao, Xutong

dc.contributor.author	Zhao, Xutong
dc.date.accessioned	2020-01-27T16:24:44Z
dc.date.available	NO_RESTRICTION
dc.date.available	2020-01-27T16:24:44Z
dc.date.issued	2019
dc.date.submitted
dc.identifier.uri	https://hdl.handle.net/2027.42/153415
dc.description.abstract	In recent years, electronic health records (EHR) have been combined with genetic data to uncover disease biology and accelerate generation of hypotheses for drug development and treatment strategies. The goal of this dissertation is to develop novel statistical models that can address the challenges of analyzing ‘imperfect’ EHR data and to propose privacy-preserving methods that enable sensitive individual-level data sharing across EHR studies and other large genetic studies. In Chapter II, we propose a statistical method to address misclassified clinical outcomes, a common challenge in EHR data. One essential step of EHR-based genome-wide association studies is constructing a cohort of cases and controls for a specific disease from billing codes and other clinical or administrative data. Nearly always, a perfect strategy for deriving disease phenotypes from billing codes is not available, resulting in some incorrect case/control labels. Here, we propose a method to estimate the misclassification of case/control status by examining genotype information of dozens of disease-associated loci. Through simulation and application to the Michigan Genomics Initiative data, we demonstrate that the method enables the evaluation of new EHR-based phenotype definition schemes and provides accurate estimates of disease association measures when phenotypes are misclassified. In Chapters III and IV, we focus on identifying overlapping samples between studies, a common challenge when aggregating information across datasets. We particularly focus on identifying duplicate or related samples when sharing the underlying individual-level genetic data is restricted. We propose methods that do not require disclosure of individual identities but that can still identify genetic relatives across datasets.
In Chapter III, we show that by grouping genotypes into segments and calculating summary statistics within each segment, we are able to obscure and encode individual-level genetic information. Relatives can be inferred with the coded genotypes using a likelihood model. Simulation and application to the Trans-Omics for Precision Medicine (TOPMed) program data demonstrate the utility and security of the method. In Chapter IV, we extend the method further, with a strategy that guarantees stronger encryption and is expected to work across heterogeneous populations. This secure protocol can infer genetic relatives among people of diverse ethnic backgrounds. The method works by combining a cryptographic technique, homomorphic encryption, with the robust relationship inference method previously described by Manichaikul et al. (2010). Through simulations, we show that our method's performance is identical to that of implementations that use the original unencrypted genotypes. Our protocol scales well in computing time and is protected from several possible attacks. The secure protocol was again applied to the TOPMed dataset. Securely identifying related samples will facilitate combination of results across datasets when there are restrictions to sharing the underlying individual-level data. In conclusion, the methods developed here will enhance the use of EHR data and genome data to improve accuracy of case/control status as well as decrease inclusion of relatives across studies when desired.
dc.language.iso	en_US
dc.subject	statistical genetics
dc.subject	electronic health records
dc.subject	genome-wide association studies
dc.subject	misclassification
dc.subject	data security
dc.title	Statistical Methods and Privacy Preserving Protocols for Combining Genetic Data with Electronic Health Records
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Biostatistics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Abecasis, Goncalo
dc.contributor.committeemember	Peyser, Patricia A
dc.contributor.committeemember	Kang, Hyun Min
dc.contributor.committeemember	Lee, Seunggeun Shawn
dc.subject.hlbsecondlevel	Public Health
dc.subject.hlbtoplevel	Health Sciences
dc.description.bitstreamurl	https://deepblue.lib.umich.edu/bitstream/2027.42/153415/1/xtzhao_1.pdf
dc.identifier.orcid	0000-0003-3179-7369
dc.identifier.name-orcid	Zhao, Xutong; 0000-0003-3179-7369	en_US
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name:: xtzhao_1.pdf
Size:: 1.976MB
Format:: PDF