Statistical Latent Space Models for International Classification of Diseases (ICD) Codes
dc.contributor.author | Ma, Cheng | |
dc.date.accessioned | 2024-05-22T17:27:39Z | |
dc.date.available | 2024-05-22T17:27:39Z | |
dc.date.issued | 2024 | |
dc.date.submitted | 2024 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/193418 | |
dc.description.abstract | The increasingly widespread use of Electronic Health Records (EHRs) offers significant opportunities to improve patient care insights and inspire extensive healthcare research. The International Classification of Diseases (ICD) codes, a crucial component of EHR data, have attracted significant research interest due to their potential to improve clinical decision-making. This dissertation focuses on developing novel statistical models for ICD code embedding and exploring their applications. Most existing healthcare research borrows word embedding techniques from natural language processing (NLP) and applies them to ICD codes. However, ICD codes differ significantly from natural language words in structure, meaning, and usage, so word embedding methods are not entirely suitable for modeling ICD codes. The first part of this dissertation proposes a new latent space zero-inflated Poisson model to characterize the co-occurrence of ICD codes. This model associates each ICD code with a latent vector and assumes the co-occurrence of two ICD codes depends on the relative positions of the corresponding latent vectors. By utilizing a zero-inflated Poisson distribution, the proposed model effectively addresses the abundant zeros commonly observed in practice. Theoretically, we establish error bounds for the estimation of the latent vectors. Furthermore, we demonstrate the effectiveness of our model using the MIMIC-III EHR dataset, showing that the learned latent vectors are useful predictors for downstream tasks. This indicates that our model has the potential to improve patient outcome predictions and advance EHR-based research. Designed in the 1970s, the Ninth Revision of ICD (ICD-9) no longer meets the medical needs of healthcare providers and patients. In October 2015, hospitals in the United States transitioned from ICD-9 to ICD-10 codes.
Consequently, the healthcare domain faces challenges in transferring and merging historical data and applications to this new system. Addressing this, the second part of the dissertation proposes a joint latent space zero-inflated Poisson model that learns embeddings for both versions of ICD codes simultaneously, as well as a transformation that maps ICD-9 codes to the newer system. To demonstrate the practical value of the model, we design an ICD code translation task using the Nationwide Readmissions Database (NRD) and show that our proposed model outperforms all existing approaches in this task. Although the latent space zero-inflated Poisson model has proven effective in modeling ICD codes, focusing solely on pairwise co-occurrences may overlook higher-order information among ICD codes. In the third part, we treat ICD codes in EHRs as a hypergraph, where the set of ICD codes in a medical record forms a hyperedge. Specifically, we consider a latent space model based on the determinantal point process for hypergraphs. Direct estimation of parameters using the likelihood function is, however, numerically unstable. To overcome this, we develop an algorithm based on a mixture of the ordinary likelihood and the pseudo-likelihood. This proposed algorithm is shown to be more stable than using the ordinary likelihood alone, particularly when the number of hyperedges is not large. We apply this approach to a readmission prediction task on the MIMIC-III EHR dataset and demonstrate its practical value. We also establish theoretical guarantees for the consistency of the mixed likelihood estimation. | |
dc.language.iso | en_US | |
dc.subject | Network Analysis | |
dc.subject | Hypergraph | |
dc.subject | ICD Code Embedding | |
dc.subject | Latent Space Model | |
dc.title | Statistical Latent Space Models for International Classification of Diseases (ICD) Codes | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Statistics | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Zhu, Ji | |
dc.contributor.committeemember | Jin, Judy | |
dc.contributor.committeemember | Levina, Liza | |
dc.contributor.committeemember | Shedden, Kerby A | |
dc.subject.hlbsecondlevel | Statistics and Numeric Data | |
dc.subject.hlbtoplevel | Science | |
dc.contributor.affiliationumcampus | Ann Arbor | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/193418/1/chengmc_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/23063 | |
dc.identifier.orcid | 0009-0006-0558-1686 | |
dc.working.doi | 10.7302/23063 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |