Scalable Classification Methods with Applications to Healthcare Claims and Automotive Dealership Data
Wu, Wenyi
2019
Abstract
With recent advances in technology, sensing and media storage capabilities have enabled the generation of enormous amounts of information, often in the form of large data sets in scientific fields such as biology, marketing and medicine. As this vast amount of data has opened a wealth of opportunities for data analysis, computationally scalable methods have become increasingly important for statistical modeling. This thesis focuses on developing scalable classification methods and applying them to automotive dealership and healthcare problems. The first project studies parameter estimation of customers' and dealerships' consumption preferences in the automotive market, which determine the manufacturers' profits. Most existing methods assume that dealerships are rational and hence aim to maximize profits, an assumption that conflicts with observations. We propose a structural Bayesian model of customers' and dealerships' preferences in which each is assumed to maximize a flexible utility function. We further develop an MCMC algorithm that uses parallel computing to estimate the model parameters. The model is calibrated to data from a manufacturer, and the estimates are used in a simulation model to design optimal financial incentive offers that maximize profits. The second project focuses on the two-class classification problem based on the area under the receiver operating characteristic curve (AUC), which is often considered a comprehensive measure of the performance of a classifier. Maximizing the empirical AUC directly, however, is computationally challenging, as naive computation of the AUC requires quadratic time. Moreover, the optimization involves indicator functions and is NP-hard. In this project, we propose a non-convex differentiable surrogate function for the AUC and develop a scalable algorithm to optimize this surrogate loss function.
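As a small illustration of the quadratic cost mentioned above, the sketch below computes the empirical AUC in two ways: by comparing every positive-negative pair, and by a sort-based rank sum (the Mann-Whitney U statistic) that gives the same value in O(n log n) time. This is not the thesis's algorithm, which optimizes a surrogate loss; the function names `auc_naive` and `auc_sorted` are hypothetical, and counting tied pairs as one half is an assumed convention.

```python
from typing import Sequence


def auc_naive(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical AUC by comparing every positive-negative pair: O(n_pos * n_neg)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5  # assumed convention: a tied pair counts one half
    return total / (len(pos) * len(neg))


def auc_sorted(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Same quantity via the rank sum of the positives: O(n log n) from the sort."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = 0.0
    i = 0
    while i < len(order):
        # Find the block of tied scores occupying 1-based ranks i+1 .. j.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # tied scores share their average rank
        for k in range(i, j):
            if labels[order[k]] == 1:
                rank_sum_pos += avg_rank
        i = j
    # Mann-Whitney U statistic for the positive class.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

On a toy input such as `scores = [0.1, 0.4, 0.35, 0.8]` with `labels = [0, 0, 1, 1]`, both functions return 0.75, since three of the four positive-negative pairs are ordered correctly.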
The proposed algorithm takes advantage of the selection tree data structure and uses a truncated Newton strategy so that the computational cost of the optimization scales in quasilinear time. In the setting of linear classification, we also show that the estimated coefficients are asymptotically consistent. Finally, we evaluate the performance of the proposed method using both simulation studies and two data sets, and show that it outperforms the support vector machine (SVM) in terms of the AUC. The last project aims to predict midterm mortality of patients using International Classification of Diseases, Ninth Revision (ICD-9) codes, a task relevant to healthcare and clinical research. The ICD-9 contains a list of standard alphanumeric codes recording useful clinical information. However, the number of ICD-9 codes in a specific study is often large, and the dependence structure among ICD-9 codes is complicated, both of which pose statistical challenges. To address these challenges, we develop a supervised embedding method that combines an unsupervised criterion for learning latent representations of ICD-9 codes with a Deep Set neural network for classification. The proposed method has the advantage of simultaneously modeling the interrelationships among ICD-9 codes and the nonlinear relationship between the codes and the outcome variable, and it extends naturally to the semi-supervised learning setting. The model is trained using stochastic gradient descent (SGD), which makes the proposed method suitable for analyzing large data sets. We applied the proposed method to 1-year mortality prediction using the Medical Information Mart for Intensive Care III (MIMIC-III) database and achieved superior performance in comparison with several benchmark models.
Subjects
Discrete Choice Model; Classification; Deep Learning; Mortality Prediction
Types
Thesis