Efficient Resource Management for Deep Learning Clusters
dc.contributor.author | Gu, Juncheng | |
dc.date.accessioned | 2021-09-24T19:23:33Z | |
dc.date.available | 2021-09-24T19:23:33Z | |
dc.date.issued | 2021 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/169955 | |
dc.description.abstract | Deep Learning (DL) is rapidly gaining popularity in domains such as computer vision and speech recognition. To meet this growing demand, large clusters have been built for developing DL models (i.e., data preparation and model training). DL jobs have unique features, ranging from their hardware requirements to their execution patterns. However, the resource management techniques used in existing DL clusters have not been adapted to these features, which leads to resource inefficiency and hurts the performance of DL jobs. We observe three major challenges posed by DL jobs. First, data preparation jobs, which build training datasets from large volumes of raw data, are memory-intensive. DL clusters often over-allocate memory to these jobs to protect their performance, causing memory underutilization across the cluster. Second, the execution time of a DL training job is typically unknown before the job completes. Without this information, existing cluster schedulers cannot minimize the average Job Completion Time (JCT) of such jobs. Third, model aggregations in Distributed Deep Learning (DDL) training are often assigned a fixed group of CPUs, yet a large portion of those CPUs sits idle because the bursty aggregation traffic cannot keep them saturated. In this thesis, we propose a suite of techniques to eliminate these mismatches between DL jobs and resource management in DL clusters. First, we apply the idea of memory disaggregation to improve the memory utilization of DL clusters: unused memory in data preparation jobs is exposed as remote memory to other machines that are running out of local memory. Second, we design a two-dimensional attained-service-based scheduler that optimizes the average JCT of DL training jobs; it accounts for both the temporal and spatial characteristics of these jobs and can schedule them efficiently without knowing their execution times (see the sketch following this record). Third, we introduce a shared model aggregation service to reduce the CPU cost of DDL training; with this service, model aggregations from different DDL training jobs are carefully packed together and share the same group of CPUs in a time-sharing manner. With these techniques, we demonstrate that substantial improvements in resource efficiency and job performance can be obtained when a cluster's resource management matches the features of DL jobs. | |
dc.language.iso | en_US | |
dc.subject | resource management | |
dc.subject | deep learning | |
dc.subject | memory disaggregation | |
dc.subject | job scheduling | |
dc.subject | RDMA | |
dc.title | Efficient Resource Management for Deep Learning Clusters | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | en_US |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Chowdhury, N M Mosharaf Kabir | |
dc.contributor.committeemember | Shin, Kang Geun | |
dc.contributor.committeemember | Ying, Lei | |
dc.contributor.committeemember | Kasikci, Baris | |
dc.contributor.committeemember | Madhyastha, Harsha | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/169955/1/jcgu_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/3000 | |
dc.identifier.orcid | 0000-0002-3315-5784 | |
dc.identifier.name-orcid | Gu, Juncheng; 0000-0002-3315-5784 | en_US |
dc.working.doi | 10.7302/3000 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |
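The two-dimensional attained-service scheduler summarized in the abstract is described there only at a high level. Below is a minimal, hypothetical Python sketch of the core idea — prioritizing jobs by GPU-seconds consumed (GPUs held times time executed), so that no execution-time estimates are needed. The names (Job, schedule, attained_service) are illustrative assumptions, not identifiers from the thesis.

class Job:
    """A DL training job whose total execution time is unknown in advance."""
    def __init__(self, job_id, num_gpus):
        self.job_id = job_id          # identifier
        self.num_gpus = num_gpus      # spatial dimension: GPUs requested
        self.executed_time = 0.0      # temporal dimension: seconds run so far

    @property
    def attained_service(self):
        # Two-dimensional attained service: GPUs held x time executed.
        return self.num_gpus * self.executed_time

def schedule(jobs, total_gpus):
    """Grant GPUs to jobs with the least 2D attained service first.

    Favoring jobs that have consumed the fewest GPU-seconds approximates
    shortest-remaining-time-first behavior without execution-time estimates.
    """
    running, free = [], total_gpus
    for job in sorted(jobs, key=lambda j: j.attained_service):
        if job.num_gpus <= free:
            running.append(job)
            free -= job.num_gpus
    return running

# Example: "a" has already consumed 4 GPUs x 100 s, so fresh jobs go first.
jobs = [Job("a", 4), Job("b", 2), Job("c", 8)]
jobs[0].executed_time = 100.0
print([j.job_id for j in schedule(jobs, 8)])   # -> ['b', 'a'] ("c" does not fit)

A production scheduler would also need preemption, starvation avoidance, and placement constraints, all of which this sketch omits.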