The Elastic Deep Learning capabilities of IBM Watson® Machine Learning Accelerator are designed for large-scale distributed deep learning workloads. These capabilities transform static, monolithic training into a dynamic process that is resilient to failures and automatically scales GPU allocation during training.

Data scientists, deep learning developers, and administrators can use Elastic Deep Learning capabilities to simplify production deployment, improve run time efficiency, and deliver on service level agreements (SLAs).

Elastic Deep Learning

Training your deep learning models on frameworks such as TensorFlow and PyTorch takes a long time. Accelerator technologies such as GPUs help reduce training time, but as models and data sets get bigger, training can still require days or weeks. Training is also an iterative, resource-intensive process: you run many rounds of training, tweak parameters, run more training, tweak parameters again, and so on, all in pursuit of high model accuracy.

Scaling out across a cluster and running distributed training is the typical approach to improving training time, but it brings its own challenges. Current approaches to distributed training tend to be complex, inflexible, and monolithic. Requiring developers to figure out cluster and network topology, data ingest, and parallel execution plans, and then hand-code them into the model, increases complexity and development time. Intermingling model logic with run time control creates portability challenges when moving from development to production and out to the cloud.
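To make that concrete, here is a minimal sketch of the kind of rendezvous and topology boilerplate that conventional distributed training pushes into the model script (PyTorch shown; the hostname, port, and environment variables are illustrative, and a launcher would have to export RANK and WORLD_SIZE):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Rendezvous and cluster topology details, hard-coded next to the model logic
os.environ.setdefault("MASTER_ADDR", "host-0.example.com")  # illustrative host
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),               # the launcher must export these
    world_size=int(os.environ["WORLD_SIZE"]),
)

model = DDP(torch.nn.Linear(128, 10).cuda())    # wrapped for a fixed world size
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)           # data ingest partitioned by static rank
```

None of this code is about the model itself, and all of it assumes a fixed cluster size and topology. That coupling is what Elastic Deep Learning removes.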

Here’s where Elastic Deep Learning from IBM Watson Machine Learning Accelerator can help. It’s a technology designed to resolve many of these distributed training challenges, including:

  • Dynamic distributed run time management: Replace the framework-specific data ingest and run time configuration with a scheduler that determines data partitions and number of batches, and dynamically allocates resources for every iteration. A high-efficiency scheduling engine ensures the necessary throughput to fully utilize all allocated resources.

  • Automatic and dynamic model scaling (up and down): Transparent to model code and without disrupting deep learning jobs in progress, this capability dynamically adds resources when they become available, speeding up training, and removes them when they are needed by other jobs. Hyperparameter-aware scheduling ensures that adjusting resources does not affect training convergence or accuracy (see the sketch after this list).

  • Resiliency to failures and environment changes: When a host fails or cloud resources disappear, the running tasks are automatically rerun on another host with the latest parameters and weights (also illustrated in the sketch after this list).

  • Automatic quality of service (QoS): Dynamic rebalancing of resource allocations after every iteration based on policy and job priority. Existing workloads continue to run without interruption.

  • Decouple model logic from run time control: Hides the complexity of resource allocation, data ingest, configuration, and run time control from model developers, reducing the need for knowledge of the underlying cluster or cloud resources, their availability, and topology.

  • Support native models (graph, neural net definition) in TensorFlow and PyTorch: The data scientist or model developer can focus on the model itself rather than on deployment and production run time configuration.

  • Optimized training with accuracy: By using whatever resources are available rather than a fixed number, training jobs finish faster. By considering model constraints and infrastructure limitations (for example, GPU memory or network bandwidth), scaling is tuned for both speed and convergence.

  • Better return on investment (ROI): By fully utilizing your resources (for example, CPUs, GPUs, and network bandwidth) and making sure none sit idle, you get the most out of your investment. Enabling multitenancy lets multiple developers and lines of business share common resources, eliminating the need for dedicated clusters.
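To make the scaling and resiliency items above concrete, here is a minimal sketch of an elastic training loop. It is not the Watson Machine Learning Accelerator implementation; the scheduler and iteration helpers are hypothetical stand-ins. The idea is to hold the effective (global) batch size constant as workers come and go, and to rerun an iteration from the latest checkpoint when a host fails:

```python
from dataclasses import dataclass, field

GLOBAL_BATCH_SIZE = 256            # hyperparameter fixed by the data scientist

@dataclass
class Checkpoint:
    """Latest parameters and weights, captured after every iteration."""
    iteration: int = 0
    weights: dict = field(default_factory=dict)

def per_worker_batch(num_workers: int) -> int:
    # Hyperparameter-aware scaling: splitting one fixed global batch across
    # however many workers exist right now leaves convergence unaffected.
    return max(1, GLOBAL_BATCH_SIZE // num_workers)

def train(total_iterations, scheduler, run_iteration):
    ckpt = Checkpoint()
    while ckpt.iteration < total_iterations:
        workers = scheduler.allocate()          # may grow or shrink each pass
        batch = per_worker_batch(len(workers))
        try:
            ckpt = run_iteration(ckpt, workers, batch)  # returns a new checkpoint
        except RuntimeError:
            # A host failed or a resource was reclaimed mid-iteration: rerun
            # from the latest checkpoint on whatever the next allocation is.
            continue
```

Note that the model code never sees the topology: adding or removing a GPU changes only what scheduler.allocate() returns on the next iteration.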

Compare an Elastic Deep Learning job to a typical training job: the typical job requires a static, uninterrupted allocation of resources (for example, 16 GPUs on 4 machines running for 2 weeks). This means typical jobs cannot take advantage of free resources, and cannot give up allocated resources to accommodate higher priority workloads without being killed outright. Because deep learning frameworks use MPI-like communication, a single hardware failure or the loss of a rank causes the entire job to fail, which is especially painful when the job has been running for days or weeks. It is also unrealistic to assume zero changes or failures in cloud and commodity-based clusters.

Instead of using a static resource allocation strategy, Elastic Deep Learning uses resilient, dynamic, fine-grained controls for the training process. Watson Machine Learning Accelerator is responsible for distributing the training tasks (iterations), coordinating synchronization among tasks, and dynamically allocating the resources.
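As a rough illustration of that per-iteration allocation (again a sketch under assumed names and an assumed proportional-share policy, not the product's scheduler), GPUs can be redivided among running jobs by priority without killing any of them:

```python
def rebalance(total_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Redivide GPUs among jobs in proportion to priority (illustrative policy)."""
    total_priority = sum(jobs.values())
    shares = {name: (total_gpus * prio) // total_priority
              for name, prio in jobs.items()}
    # Hand any remainder to the highest-priority jobs, one GPU each.
    leftover = total_gpus - sum(shares.values())
    for name in sorted(jobs, key=jobs.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares

# A high-priority job arriving mid-run shrinks, but does not kill, the other job:
print(rebalance(16, {"prod-retrain": 3, "dev-experiment": 1}))
# {'prod-retrain': 12, 'dev-experiment': 4}
```

Running a policy like this between iterations, at a natural synchronization point, is what lets existing workloads keep running while allocations shift underneath them.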

Learn more about IBM Watson Machine Learning Accelerator.