Elastic Deep Learning in high performance multitenant cluster
Learn how the Elastic Deep Learning capabilities of IBM Watson Machine Learning Accelerator are designed for large-scale distributed deep learning workloads
The Elastic Deep Learning capabilities of IBM Watson® Machine Learning Accelerator are designed for large-scale distributed deep learning workloads. It transforms static monolithic training into a dynamic process that is resilient to failures and automatically scales GPU allocation while training.
Data scientists, deep learning developers, and administrators can use Elastic Deep Learning capabilities to simplify production deployment, improve run time efficiency, and deliver on service level agreements (SLAs).
Elastic Deep Learning
Training your deep learning models on frameworks such as TensorFlow and PyTorch takes a long time. Accelerator technologies such as GPUs help reduce training time, but as models and data sets get bigger, they can still require days or weeks to train. Training is also an iterative and resource-intensive process – you run many rounds of training, tweak parameters, run more training, tweak parameters, and so on, all towards the goal of high model accuracy.
Scaling out in a cluster and running distributed training is a typical approach to improve training time, but there are still many challenges. The current approaches to distributed training tend to be complex, inflexible, and monolithic. Requiring developers to figure out cluster and network topology, data ingest, and parallel execution plans, and hand-coding it into the model increases complexity and development time. Intermingling the model logic with run time control creates portability challenges when moving from development to production and out to the cloud.
Here’s where Elastic Deep Learning from IBM Watson Machine Learning Accelerator can help. It’s a technology designed to resolve many of these distributed training challenges, including:
Dynamic distributed run time management: Replace the framework-specific data ingest and run time configuration with a scheduler that determines data partitions and number of batches, and dynamically allocates resources for every iteration. A high-efficiency scheduling engine ensures the necessary throughput to fully utilize all allocated resources.
Automatic and dynamic model scaling (up and down): Transparent to model code and without disrupting the deep learning jobs in progress, this function is a dynamic resource allocation that adds resources when they become available, providing training speed-up, and removes them when needed by other jobs. Hyperparameter-aware scheduling ensures that the adjustment of resources does not affect the training convergence and accuracy.
Resiliency to failures and environment changes: When a host fails or cloud resources disappear, the running tasks are automatically rerun on another host with the latest parameters and weights.
Automatic quality of service (QoS): Dynamic rebalancing of resource allocations after every iteration based on policy and job priority. Existing workloads continue to run without interruption.
Decouple model logic from run time control: Hides the complexity of resource allocation, data ingest, configuration, and run time control from model developers, reducing the need to have knowledge of the underlying cluster or cloud resources, availability, and topology.
Support native models (graph, neural net definition) in TensorFlow and PyTorch: The data scientist or model developer can focus on the model itself and not deployment and production run time configuration.
Optimized training with accuracy: By using available resources and not just a fixed number, the training jobs finish faster. By considering model constraints and infrastructure limitations (for example, GPU memory or network bandwidth), training is optimized by scaling for both speed and convergence.
Better return on investment (ROI): By fully utilizing your resources (for example, CPUs, GPUs, and network bandwidth) and making sure none are idle you get the most out of your investment. Enabling multitenancy allows multiple developers and line of businesses to share common resources, eliminating the need for dedicated clusters.
When comparing an Elastic Deep Learning job to a typical training job, the typical job requires a static allocation of resources without interruption (for example, 16 GPUs on 4 machines running for 2 weeks). This means that typical jobs cannot take advantage of free resources and cannot reduce the number of allocated resources (without killing entire jobs) to accommodate higher priority workloads. A deep learning framework uses MPI-like communications, so a single hardware failure or loss of rank causes the entire job to fail. This is especially painful if the job has been running for days or weeks. It is also unrealistic to assume zero changes or failures in cloud and commodity-based clusters.
Instead of using a static resource allocation strategy, Elastic Deep Learning uses resilient, dynamic, fine-grained controls for the training process. Watson Machine Learning Accelerator is responsible for distributing the training tasks (iterations), coordinating synchronization among tasks, and dynamically allocating the resources.
When a data scientist submits a training job, they can specify the maximum number of GPUs that the job can use. While executing, Watson Machine Learning Accelerator allocates as many GPUs as available to the job, up to the maximum number specified, or removes GPUs as other policies dictate. Regardless, the job continues to execute until complete. With Elastic Deep Learning, you can use QoS policies to define how the system behaves during GPU allocation and job execution. Watson Machine Learning Accelerator 1.2 includes the following policies that are utilized by Elastic Deep Learning:
- Preemption: Enables the running task to complete the iteration and grants synchronization before assigning the resource to another job
- Fairshare: Sets the policy scheduler to share all available resources among all submitted jobs as equally as possible
- Priority: Allows the higher priority job to use more resources than the other jobs
In this example, illustrated using the following images, Job 1 is submitted at 8:09 and runs on 8 GPUs. At 8:10, Job 2 is submitted and the Fairshare policy ensures that the GPUs are shared between all jobs. Therefore, 4 Job 1 tasks are preempted and allowed to complete, then the GPUs are allocated to Job 2. Just past 8:13, the priority of Job 2 is increased, a Job 1 task gets preempted and completes, then the GPU is allocated to Job 2. Three GPUs are allocated to Job 1 and 5 are allocated to Job 2. Just past 8:17, Job 1 runs to completion, freeing up 3 GPUs, which then get allocated to Job 2, which enjoys faster training time to completion because it runs on 8 GPUs.
In the following images, you see the complete tasks lists and the slope differences that indicate the change of iteration completion rate. The completed tasks continue to rise, demonstrating dynamic scaling and speed-up. Here, each task represents 1 training iteration. Monitoring this convergence trend in the following chart, you see there was no interruption on the job, the loss curve kept going down during the scaling-up and down. The scheduler manages the training process in consideration of hyperparameters, which enables continued convergence trend during dynamic scaling.
Learn more about IBM Watson Machine Learning Accelerator.