An Accurate and Scalable Approach to Classifying High-Dimensional Data With Dense Latent Structure
Payne, Nora Yujia
2022
Abstract
The primary aim of a classification analysis is to learn the relationship between a set of features and a discrete variable of primary interest so that good predictive accuracy is achieved on new out-of-sample observations. On many large-scale datasets, this task is complicated by the high-dimensionality of the data, as well as the presence of unobserved variables besides the variable of primary interest. Frequently, these unobserved variables induce variation across a large proportion of the features, resulting in variation that is both dense and latent. This variation presents both challenges and opportunities. Some of these unobserved variables may be partially correlated with the class label, and thus may be useful for learning the predictive relationship between the features and the class label. However, others may be uncorrelated with the class label and merely contribute additional noise. If the effects stemming from the variable of primary interest are sparse or weak, as they are thought to be in many applications, then the dense effects may obscure them when they should ideally be captured by the classifier. To address the challenges posed by latent variables while leveraging any benefits they may confer, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. Numerical experiments comparing the CRC with popular methods used to classify high-dimensional genomic data demonstrate that our method of separating and reintegrating the latent variables can offer substantial gains in accuracy. 
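The residualize-then-reintegrate idea described above can be sketched in a few lines. The following is a minimal illustration, not the thesis's actual estimator: it stands in a truncated SVD for the latent-variation estimate and nearest-centroid classifiers for the base learners, and combines the residual-based and factor-based scores by simple averaging. All function names and the two-class setup are assumptions made for illustration.

```python
import numpy as np

def nearest_centroid_scores(Ztr, y, Zte):
    """Signed score: distance to the class-0 centroid minus distance to the
    class-1 centroid, so larger values favor class 1."""
    c0 = Ztr[y == 0].mean(axis=0)
    c1 = Ztr[y == 1].mean(axis=0)
    d0 = np.linalg.norm(Zte - c0, axis=1)
    d1 = np.linalg.norm(Zte - c1, axis=1)
    return d0 - d1

def crc_sketch(Xtr, y, Xte, n_factors=2):
    """Illustrative cross-residualization sketch (not the thesis's algorithm):
    estimate dense latent variation, residualize it out, train on the
    residuals, then reintegrate the latent factors in an ensemble."""
    mu = Xtr.mean(axis=0)
    Xc, Xte_c = Xtr - mu, Xte - mu
    # Step 1: estimate dense latent variation via a truncated SVD (a stand-in
    # for whatever factor estimator one prefers).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_factors].T                  # p x k factor loadings
    Ftr, Fte = Xc @ V, Xte_c @ V          # n x k factor scores
    # Step 2: residualize the latent variation out of the features.
    Rtr, Rte = Xc - Ftr @ V.T, Xte_c - Fte @ V.T
    # Step 3: one classifier on the residuals, one on the latent factors,
    # reintegrated here by a simple average of their scores.
    score = 0.5 * (nearest_centroid_scores(Rtr, y, Rte)
                   + nearest_centroid_scores(Ftr, y, Fte))
    return (score > 0).astype(int)
```

Because the factor-based classifier re-enters the ensemble, any latent variables that happen to correlate with the class label still contribute to the final prediction rather than being discarded with the residualization step.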
Applying high-dimensional classifiers like the CRC in practice requires scalable implementations that can accommodate both the size and high-dimensionality of large-scale datasets. Not all classifier implementations are equipped to handle data of this nature, either because they take a substantial amount of time to run when the number of features is large or because they have large memory requirements that cannot be easily accommodated by the typical user (e.g., requiring the data to be stored in memory). Furthermore, many classification analysis workflows involve the use of resampling techniques such as cross-validation to select tuning parameters or to estimate out-of-sample accuracy rates. These techniques only exacerbate existing runtime and memory challenges. We develop strategies to address these challenges in the context of the CRC, which is intended for large-scale data of this nature but involves extensive resampling steps. Specifically, we propose a new, computationally-efficient way to estimate and residualize out the latent variation, in addition to a computationally-efficient feature selection procedure that is free of any tuning parameters and does not require the data to be stored in memory. These contributions address two of the most time-consuming and memory-intensive parts of the CRC algorithm. Furthermore, they enable the CRC to be computed via serial reads of the data, facilitating its application to large-scale datasets, particularly those that cannot be stored in memory. These contributions not only improve the scalability of the CRC, but they also improve its statistical properties, which we explore. Numerical experiments on simulated and genomic data illustrate these computational and statistical gains.
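The serial-read strategy described above rests on a general pattern: per-feature statistics that depend only on running sums can be accumulated one chunk of rows at a time, so the full data matrix never needs to reside in memory. The sketch below illustrates that pattern with a streaming two-sample t-like score for feature screening; the function name, the chunked interface, and the specific statistic are illustrative assumptions, not the tuning-free procedure proposed in the thesis.

```python
import numpy as np

def streaming_t_scores(chunks, n_features):
    """One serial pass over (X_chunk, y_chunk) pairs, accumulating per-class
    counts, sums, and sums of squares so that per-feature two-sample
    statistics can be computed without loading X into memory.
    Assumes binary labels coded 0/1; illustrative only."""
    counts = np.zeros(2)
    sums = np.zeros((2, n_features))
    sumsq = np.zeros((2, n_features))
    for Xc, yc in chunks:                      # each chunk: rows of X + labels
        for c in (0, 1):
            rows = Xc[yc == c]
            if rows.shape[0] == 0:
                continue
            counts[c] += rows.shape[0]
            sums[c] += rows.sum(axis=0)
            sumsq[c] += (rows ** 2).sum(axis=0)
    means = sums / counts[:, None]
    # Unbiased per-class variances recovered from the streamed sums.
    variances = (sumsq - counts[:, None] * means ** 2) / (counts[:, None] - 1)
    se = np.sqrt(variances[0] / counts[0] + variances[1] / counts[1])
    return (means[1] - means[0]) / se          # Welch-style t score per feature
```

Since each chunk is touched exactly once, this kind of accumulator composes naturally with resampling workflows: the same serial pass can update statistics for several cross-validation folds simultaneously.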
Subjects
latent variables; high-dimensional classification
Types
Thesis