Efficient Resource Management for Deep Learning Clusters

Gu, Juncheng

dc.contributor.author	Gu, Juncheng
dc.date.accessioned	2021-09-24T19:23:33Z
dc.date.available	2021-09-24T19:23:33Z
dc.date.issued	2021
dc.identifier.uri	https://hdl.handle.net/2027.42/169955
dc.description.abstract	Deep Learning (DL) is rapidly gaining popularity in domains such as computer vision and speech recognition. To meet the growing demand, large clusters have been built to develop DL models (i.e., for data preparation and model training). DL jobs have unique features, ranging from their hardware requirements to their execution patterns. However, the resource management techniques used in existing DL clusters have not yet been adapted to those features, which leads to resource inefficiency and hurts the performance of DL jobs. We observe three major challenges brought by DL jobs. First, data preparation jobs, which build training datasets from large volumes of raw data, are memory intensive. DL clusters often over-allocate memory to those jobs to protect their performance, which causes memory underutilization. Second, the execution time of a DL training job is often unknown before the job completes. Without that information, existing cluster schedulers cannot minimize the average Job Completion Time (JCT) of those jobs. Third, model aggregations in Distributed Deep Learning (DDL) training are often assigned a fixed group of CPUs. However, a large portion of those CPUs is wasted because the bursty model aggregations cannot keep them saturated all the time. In this thesis, we propose a suite of techniques to eliminate the mismatches between DL jobs and resource management in DL clusters. First, we apply memory disaggregation to improve the memory utilization of DL clusters. The unused memory of data preparation jobs is exposed as remote memory to other machines that are running out of local memory. Second, we design a two-dimensional attained-service-based scheduler to optimize the average JCT of DL training jobs. This scheduler takes the temporal and spatial characteristics of DL training jobs into account and can schedule them efficiently without knowing their execution time. Third, we define a shared model aggregation service to reduce the CPU cost of DDL training. With this service, model aggregations from different DDL training jobs are carefully packed together and use the same group of CPUs in a time-sharing manner. With these techniques, we demonstrate that large improvements in resource efficiency and job performance can be obtained when the cluster's resource management matches the features of DL jobs.
dc.language.iso	en_US
dc.subject	resource management
dc.subject	deep learning
dc.subject	memory disaggregation
dc.subject	job scheduling
dc.subject	RDMA
dc.title	Efficient Resource Management for Deep Learning Clusters
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Computer Science & Engineering
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Chowdhury, N M Mosharaf Kabir
dc.contributor.committeemember	Shin, Kang Geun
dc.contributor.committeemember	Ying, Lei
dc.contributor.committeemember	Kasikci, Baris
dc.contributor.committeemember	Madhyastha, Harsha
dc.subject.hlbsecondlevel	Computer Science
dc.subject.hlbtoplevel	Engineering
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/169955/1/jcgu_1.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/3000
dc.identifier.orcid	0000-0002-3315-5784
dc.identifier.name-orcid	Gu, Juncheng; 0000-0002-3315-5784	en_US
dc.working.doi	10.7302/3000	en
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: jcgu_1.pdf
Size: 3.977 MB
Format: PDF

