Statistics in the Modern Era: High Dimensions, Decision-Making, and Privacy
Roy, Saptarshi
2024
Abstract
High-dimensional data analysis has become increasingly common and important in diverse fields, including the sciences, engineering, genomics, and machine learning (ML), and it has introduced new challenges in problems ranging from variable selection, distributed learning, and computational efficiency to data privacy and online decision-making. To address these emerging statistical problems in data science, my research focuses on the intersection of high-dimensional statistics and these modern ML problems. The first chapter studies the problem of exact support recovery for high-dimensional sparse linear regression when the signals are weak, rare, and possibly heterogeneous. Specifically, we broaden the theoretical understanding of the model selection accuracy of best subset selection (BSS) and marginal screening (MS) under independent Gaussian design. Furthermore, to overcome the computational bottleneck of BSS, we propose an efficient two-stage algorithm called "Estimate Then Screen" (ETS), which shares exactly the same asymptotic optimality in terms of exact recovery as BSS. The second chapter follows up on the first by considering correlated features, and again studies the model selection accuracy of BSS. We show that, apart from the separation margin between the true and noise variables, the complexity of residualized signals and the complexity of projections of spurious features also play intricate roles in characterizing the model consistency of BSS. We demonstrate the interplay between the separation margin and these two complexities through a simple margin condition, which further helps in understanding the theoretical properties of BSS. In the third chapter, we propose a differentially private algorithm for model selection in the high-dimensional sparse regression setup. We adopt the well-known exponential mechanism to design a sampling scheme that can identify the true set of features under desirable conditions on the signal. In fact, in the low-privacy regime, we show that the minimum signal strength requirement exactly matches the requirement in the non-private setting. Moreover, because the exponential mechanism is computationally intractable, we design a Metropolis-Hastings chain that mixes quickly to the target distribution to generate private estimates of the model. In the final chapter, we propose a novel Thompson sampling algorithm for the high-dimensional sparse linear contextual bandit. Specifically, we use a sparsity-inducing prior for Thompson sampling that exploits the low-dimensional structure of the problem, and we show theoretically that our algorithm enjoys a desirable regret bound. Furthermore, for computational speed-up, we adopt a variational inference framework and demonstrate the superior performance of our algorithm over its competitors on both simulated and real data.
Subjects
Best Subset Selection; Differential Privacy; High-dimensional Statistics; Online Bandits
Types
Thesis