Dynamic Resource Allocation for Distributed TensorFlow Training in Kubernetes Cluster
Surya R.Y., Imam Kistijantoro A.
Institut Teknologi Bandung, Indonesia
Abstract
© 2019 IEEE. Distributed deep learning training today uses static resource allocation. Under the parameter server architecture, training is carried out by a number of parameter server (ps) nodes and worker nodes that remains constant throughout the run. A training job cannot use additional resources in the middle of its run, even when free resources are available in the cluster, so the job runs more slowly than it could; and if those free resources stay idle for a long time, the cluster's resources become underutilized. In this research, dynamic resource allocation is designed and implemented for TensorFlow-based training jobs running on top of a Kubernetes cluster. The implementation introduces a component called the Config Manager (CM), whose role is to know the cluster's resource availability at any given time and to add more ps and worker nodes to a training job once free resources exist. Experiments show that training with dynamic resource allocation via the Config Manager outperforms static resource allocation on the following metrics: resource utilization, epoch time, and total training time. Total training time can be reduced by more than 50%, while the cluster's resource utilization is kept high.

Author keywords
Dynamic resource allocation, Resource utilization, Resource information, Server architecture, Training time, Worker nodes

Indexed keywords
distributed training, dynamic resource allocation, Kubernetes, TensorFlow

DOI
https://doi.org/10.1109/ICoDSE48700.2019.9092758
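The abstract describes the Config Manager as watching free cluster resources and adding ps/worker replicas when capacity appears. The paper does not give its algorithm; as a minimal illustrative sketch, the core scaling decision might look like the following (the function name, parameters, and the one-CPU-quantum policy are assumptions for illustration, not the authors' implementation):

```python
# Hypothetical sketch of a Config Manager scaling decision:
# given the free CPUs reported for the cluster, decide how many
# extra worker replicas can be added to a running training job.

def extra_workers(free_cpus: int,
                  cpus_per_worker: int,
                  max_workers: int,
                  current_workers: int) -> int:
    """Number of worker replicas to add without oversubscribing the cluster
    or exceeding the job's configured worker ceiling."""
    if cpus_per_worker <= 0:
        raise ValueError("cpus_per_worker must be positive")
    by_capacity = free_cpus // cpus_per_worker   # workers the free CPUs can host
    by_limit = max_workers - current_workers     # headroom left in the job spec
    return max(0, min(by_capacity, by_limit))


# Example: 8 free CPUs, 2 CPUs per worker, job capped at 10 workers,
# 3 workers currently running -> 4 more workers fit.
print(extra_workers(8, 2, 10, 3))
```

In a real deployment this decision would be driven by a control loop that polls the Kubernetes API for node allocatable resources and then patches the training job's replica count; those details depend on the operator (e.g. a TFJob controller) and are outside this sketch.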