Bridging Data and Hardware Gap for Efficient Machine Learning Model Scaling
Zheng, Haizhong
2024
Abstract
Recent research in deep learning has achieved astonishing progress in domains such as image classification, text generation, and image generation. With the exponential growth of model sizes and data volumes, AI models demonstrate stronger capabilities than ever before and have been deployed in many real-world scenarios, such as chatbots and autonomous driving, marking a watershed moment in artificial intelligence. Despite our hope for further performance gains from scaling up model and dataset sizes, today's large models are reaching scaling limits in two respects. First, large-model training is data-hungry: the cost of creating large volumes of high-quality human feedback data is prohibitively expensive, creating a bottleneck in scaling up large-model training. Second, naive reliance on more powerful hardware is inadequate, as hardware improves at a much slower rate than the growth in model size demands. It is therefore more important than ever to design more efficient algorithms and models for data and inference efficiency. Fortunately, recent research shows that both datasets and models exhibit substantial redundancy, providing opportunities to optimize performance and reduce computational cost. This dissertation aims to bridge the gap between the rapid scaling of models and the slower scaling of high-quality data and hardware. For data efficiency, this dissertation designs several novel coreset selection and data condensation algorithms that select or synthesize a small but representative dataset for training, reducing redundancy in the data. First, we show that coresets with better coverage of the underlying data distribution lead to better training performance; based on this observation, we propose a coverage-centric coreset selection algorithm that significantly improves coreset selection performance. Second, we propose another coreset selection algorithm, ELFS, which targets the label-free coreset selection scenario.
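The coverage-centric idea above can be sketched as stratified sampling over per-example difficulty scores, so that the coreset spans the whole distribution rather than only the hardest examples. This is a minimal illustration, not the dissertation's exact algorithm; the function name, the number of strata, and the equal-budget-per-stratum rule are assumptions made here for clarity.

```python
import numpy as np

def coverage_coreset(difficulty, budget, n_bins=10, seed=0):
    """Stratified selection over difficulty scores so the coreset
    covers the full data distribution instead of only hard examples.

    difficulty : 1-D array of per-example difficulty scores
    budget     : total number of examples to keep
    """
    rng = np.random.default_rng(seed)
    # Split examples into equal-mass difficulty strata.
    edges = np.quantile(difficulty, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    strata = np.digitize(difficulty, edges)  # bin index in [0, n_bins)
    per_bin = budget // n_bins
    selected = []
    for b in range(n_bins):
        members = np.where(strata == b)[0]
        take = min(per_bin, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.array(selected)
```

With 1,000 examples and a budget of 100, each difficulty stratum contributes roughly 10 examples, so both easy and hard regions of the distribution stay represented.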
Human labeling is one of the major bottlenecks in data collection; given a limited human labeling budget, ELFS identifies a more representative subset for labeling. Beyond coreset selection, we also explore other techniques for improving data efficiency. In our third data-efficiency project, we study how to improve data condensation performance. Instead of selecting a data subset, data condensation synthesizes a small synthetic dataset that captures the knowledge of a natural dataset. We propose a novel data container structure, HMN, which exploits the hierarchical structure of the classification system to store information more efficiently. In our final data-efficiency project, we propose a novel adversarial training algorithm that significantly reduces the overhead of the data augmentation phase of adversarial training. For inference efficiency, this dissertation focuses primarily on building hardware-friendly contextual sparse models that activate only the neurons necessary for inference, reducing memory-bandwidth overhead. To this end, we propose LTE, an efficiency-aware training algorithm for hardware-friendly contextual sparse models that accelerates inference without sacrificing model performance. Finally, we conclude and discuss future research directions and opportunities to further enhance data and inference efficiency.
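To make the contextual-sparsity idea concrete, here is a minimal forward-pass sketch. The function name and the top-k gating rule are illustrative assumptions, not LTE's actual predictor or training procedure; the point is only that skipping inactive neurons lets inference avoid reading their weights from memory.

```python
import numpy as np

def sparse_mlp_forward(x, W1, b1, W2, k):
    """One MLP block where only the k neurons with the largest
    pre-activations are carried through the second layer.

    Skipping the other rows of W2 avoids loading their weights,
    which is the memory-bandwidth saving that contextual
    sparsity targets at inference time.
    """
    pre = x @ W1 + b1                      # hidden pre-activations
    active = np.argsort(pre)[-k:]          # indices of the k "needed" neurons
    h = np.maximum(pre[active], 0.0)       # ReLU on active neurons only
    return h @ W2[active]                  # touch only active rows of W2
```

When k equals the hidden width this reduces exactly to the dense ReLU MLP; smaller k trades a small accuracy loss for proportionally less weight traffic.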
Subjects
Efficient Machine Learning; Scaling Law; Data Efficiency
Types
Thesis