Statistical Latent Space Models for International Classification of Diseases (ICD) Codes

dc.contributor.author: Ma, Cheng
dc.date.accessioned: 2024-05-22T17:27:39Z
dc.date.available: 2024-05-22T17:27:39Z
dc.date.issued: 2024
dc.date.submitted: 2024
dc.identifier.uri: https://hdl.handle.net/2027.42/193418
dc.description.abstract: The increasingly widespread use of Electronic Health Records (EHRs) offers significant opportunities to improve patient care insights and inspire extensive healthcare research. The International Classification of Diseases (ICD) codes, a crucial component of EHR data, have attracted significant research interest due to their potential to improve clinical decision-making. This dissertation focuses on developing novel statistical models for ICD code embedding and exploring their applications. Most existing healthcare research borrows word embedding techniques from natural language processing (NLP) and applies them to ICD codes. However, ICD codes differ significantly from natural language words in structure, meaning, and usage, making word embedding methods not entirely suitable for modeling ICD codes. The first part of this dissertation proposes a new latent space zero-inflated Poisson model to characterize the co-occurrence of ICD codes. This model associates each ICD code with a latent vector and assumes the co-occurrence of two ICD codes depends on the relative positions of the corresponding latent vectors. By utilizing a zero-inflated Poisson distribution, the proposed model effectively addresses the abundant zeros commonly observed in practice. Theoretically, we establish error bounds for the estimation of the latent vectors. Furthermore, we demonstrate the effectiveness of our model using the MIMIC-III EHR dataset, showing that the learned latent vectors are useful predictors for downstream tasks. This indicates that our model has the potential to improve patient outcome predictions and advance EHR-based research. Designed in the 1970s, the Ninth Revision of ICD (ICD-9) no longer meets the medical needs of healthcare providers and patients. In October 2015, hospitals in the United States transitioned from ICD-9 to ICD-10 codes. Consequently, the healthcare domain faces challenges in transferring and merging historical data and applications to this new system. Addressing this, the second part of the dissertation proposes a joint latent space zero-inflated Poisson model that learns embeddings for both versions of ICD codes simultaneously, as well as a transformation that maps ICD-9 codes to the newer system. To demonstrate the practical value of the model, we design an ICD code translation task using the Nationwide Readmissions Database (NRD) and show that our proposed model outperforms all existing approaches in this task. Although the latent space zero-inflated Poisson model has proven effective in modeling ICD codes, focusing solely on pairwise co-occurrences may overlook higher-order information among ICD codes. In the third project, we treat ICD codes in EHRs as a hypergraph, where the set of ICD codes in a medical record forms a hyperedge. Specifically, we consider a latent space model based on the determinantal point process for hypergraphs. Direct estimation of parameters using the likelihood function is, however, numerically unstable. To overcome this, we develop an algorithm based on a mixture of the ordinary likelihood and the pseudo-likelihood. This proposed algorithm is shown to be more stable than using the ordinary likelihood alone, particularly when the number of hyperedges is not large. We apply this approach to a readmission prediction task on the MIMIC-III EHR dataset and demonstrate its practical value. We also establish theoretical guarantees for the consistency of the mixed likelihood estimation.
dc.language.iso: en_US
dc.subject: Network Analysis
dc.subject: Hypergraph
dc.subject: ICD Code Embedding
dc.subject: Latent Space Model
dc.title: Statistical Latent Space Models for International Classification of Diseases (ICD) Codes
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Statistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Zhu, Ji
dc.contributor.committeemember: Jin, Judy
dc.contributor.committeemember: Levina, Liza
dc.contributor.committeemember: Shedden, Kerby A
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Science
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/193418/1/chengmc_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/23063
dc.identifier.orcid: 0009-0006-0558-1686
dc.working.doi: 10.7302/23063 [en]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)
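
Illustrative note on the abstract: the first model is described only in words (each ICD code has a latent vector, co-occurrence depends on the vectors' relative positions, and a zero-inflated Poisson distribution absorbs the excess zeros). The sketch below shows what such a likelihood could look like. It is not the dissertation's specification. The rate form `exp(alpha - squared distance)`, the single shared zero-inflation probability `pi`, the intercept `alpha`, and the function name `zip_log_likelihood` are all assumptions made here for illustration.

```python
import math


def zip_log_likelihood(counts, z, pi, alpha=0.0):
    """Log-likelihood of pairwise co-occurrence counts under a
    zero-inflated Poisson (ZIP) latent space model.

    ASSUMPTIONS (not taken from the dissertation):
      - the Poisson rate for codes i and j is exp(alpha - ||z_i - z_j||^2),
      - one zero-inflation probability `pi` is shared by all pairs,
      - pairs are treated as independent and only i < j is counted.

    counts: n x n symmetric nested list of non-negative integers
    z:      list of n latent vectors (lists of floats)
    pi:     probability of a structural zero, 0 <= pi < 1
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between the two latent vectors.
            dist2 = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            log_lam = alpha - dist2
            lam = math.exp(log_lam)
            y = counts[i][j]
            if y == 0:
                # A zero is either structural or a Poisson zero.
                total += math.log(pi + (1.0 - pi) * math.exp(-lam))
            else:
                # A positive count can only come from the Poisson part.
                total += (math.log1p(-pi) + y * log_lam - lam
                          - math.lgamma(y + 1.0))
    return total


# Hypothetical toy data: three codes embedded in two dimensions.
toy_counts = [[0, 3, 0], [3, 0, 1], [0, 1, 0]]
toy_z = [[0.0, 0.0], [0.1, 0.0], [1.5, 1.0]]
toy_ll = zip_log_likelihood(toy_counts, toy_z, pi=0.3, alpha=1.0)
```

Under these assumptions, codes whose latent vectors are close get a higher expected co-occurrence count, and setting `pi` to 0 reduces the expression to an ordinary Poisson log-likelihood.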

