Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously

IBM Watson™ Machine Learning Accelerator is a software solution for quickly and easily building an end-to-end deep learning environment for your organization. Formerly known as IBM PowerAI Enterprise, it bundles IBM Spectrum Conductor, IBM Spectrum Conductor Deep Learning Impact, and support from IBM for the whole stack, including the open source deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end deep learning platform for data scientists. This includes the complete lifecycle management from installation and configuration; data ingest and preparation; building, optimizing, and distributing the training model to moving the model into production. Watson Machine Learning Accelerator truly shines when you expand your deep learning environment to include multiple compute nodes. There’s even a free evaluation available; see the Prerequisites from our first tutorial: “Classify images with Watson Machine Learning Accelerator.”

In this tutorial, you will run multiple model trainings simultaneously with the Watson Machine Learning Accelerator Elastic Distributed Training feature (EDT). With EDT, we can distribute model training across multiple GPUs and compute nodes. The distribution of training jobs is elastic, which means GPUs are dynamically allocated: GPUs can be added or removed while executing and without having to kill the job. Since the scheduler dynamically allocates the GPUs, you don’t need to code the GPU topology into the model (for example, GPU01 out of node 1 and GPU01 out of node 2). You can take a model built on a stand-alone system, EDT does the distribution, and it’s transparent to the end user.

Learning objectives

After completing this tutorial, you will:

  • Be able to build a single node model to run distributed training with the EDT feature
  • Show how GPUs are dynamically allocated across two parallel trainings by adjusting the priority

Estimated time

The tutorial takes approximately three hours, which includes about 30 minutes of model training, plus installation and configuration, as well as driving the model through the GUI.


The tutorial requires access to a GPU-accelerated IBM Power Systems server model AC922 or S822LC. In addition to acquiring a server, there are multiple options to access Power Systems servers listed on the developer portal.


Step 1. Download, install, and configure the Watson Machine Learning Accelerator

  1. Download the IBM Watson Machine Learning Accelerator evaluation software from the IBM software repository. This is a 4.9 GB download and requires an IBM ID.
  2. Install and configure Watson Machine Learning Accelerator using the instructions listed in the IBM Knowledge Center or the POWER-Up User Guide.

Step 2. Configure operating system user

  1. At the operating system level, as root, on all nodes, create an operating system group and user for the operating system execution user:

    1. groupadd egoadmin
    2. useradd -g egoadmin -m egoadmin
  2. The GID and UID of the created user and group must be the same on all nodes.

Step 3. Configure the resource groups

  1. Log on as the cluster Admin user.
  2. Open the Resource Group configuration.

    Resource group configuration

  3. Select the ComputeHosts resource group.

    Resource group selection window

  4. Properly configure the number of slots to a value.

    Number of slots window

  5. Optionally, but recommended; change the resource selection method to static, and select only the servers that will provide computing power (processor power) to the cluster.

    Number of slots window

  6. Click Apply to commit the changes.

  7. Create a new resource group.

    Creating a new resource group window

  8. Name the resource group ComputeHostsGPU.

    Naming resource group

  9. The number of slots should use the advanced formula equal to the number of GPUs on the systems by using ngpus. Optionally, but recommended: change the resource selection method to static and select the nodes that are GPU-capable.

    Defining resource selections

  10. Under the Members Host column, click Preferences and select the attribute ngpus to be displayed.

    Selecting ngpus

  11. Click Apply and validate that the Members Host column now displays ngpus.

    Member hosts list

  12. Finish the creation of the resource group by clicking Create.

Step 4. Create Spark Instance group

  1. Finish the creation of the resource group by clicking Create.

    Creating new spark instance group

  2. Click New.

    Click new to create new instance group

  3. Select Template.

    Selecting the template

  4. Select dli-sig-template-2-2-0.

    Selecting dli template

  5. Enter the following three values:

    • Instance group: sig-elastic
    • Spark deployment directory: /home/egoadmin/sig-elastic
    • Execution user for instance group: egoadmin

      Selecting dli template

  6. Click Configuration, modify Spark parameters, and set fairshare as the value for the SPARK_EGO_APP_SCHEDULE_POLICY variable.


    • Set true as the value for the SPARK_EGO_ENABLE_PREEMPTION variable.


  7. Scroll down and select the ComputeHostsGPU resource group for the Spark executors (GPU slots) you created in previous screens. Do not change any other configuration there.


    1. Click the Create and Deploy Instance group.
    2. Click Continue to Instance Group.
    3. Watch as your instance group gets deployed.

Step 5. Download the instrumented PyTorch mnist model from Model Zoo

Download all of the files in dli-1.2.3-pytorch-samples.

Updates to models

Optional for an advanced user: Update your neural network model to run EDT

  1. Import and load the Elastic Distributed Training Model Library

     path=os.path.join(os.getenv("FABRIC_HOME"), "libs", "")
     print('> fabric loaded from %s'%path)
     from fabric_model import FabricModel
  2. Load user input (entered at Watson Machine Learning Accelerator UI) into model definition, including location of data set, optimizer and epoch:

     import pth_parameter_mgr
     global train_data_dir, test_data_dir
     train_data_dir = pth_parameter_mgr.getTrainData(False)
     test_data_dir = pth_parameter_mgr.getTestData(False)
     optimizer = pth_parameter_mgr.getOptimizer(model)
     epochs = pth_parameter_mgr.getEpoch()
     train_batch_size = pth_parameter_mgr.getTrainBatchSize()
  3. Instantiate the Elastic Distributed Training Model instance, along with the following parameters:

     edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
     # model = model definition to be trained
     # getDatasets = function name defined in model definition to retrieve dataset at each learner node
     # F.cross_entropy = loss function
  4. Start Elastic Distribute Training, along with the following parameters retrieved in step 2:

     edt_m.train(epochs, train_batch_size , engines_number)
     # engines_number = maximum number of GPUs requested by this training

Step 6. Download the data sets

For this tutorial, we’re going to use a cifar-10-batches-py; you can download cifar10 Python version from

Create two directories, train_db and test_db, and copy cifar-10-batches-py folder into each of these directories:

[root@node1 cifar10]# ls -lrt
total 8
total 8
drwxr-xr-x. 3 root root 4096 Apr 26 08:33 train_db
drwxr-xr-x. 3 root root 4096 Apr 26 08:59 test_db
[root@node1 train_db]# ls -lrt
total 4
drwxr-xr-x. 2 root root 4096 Apr 26 08:57 cifar-10-batches-py
[root@node1 test_db]# ls -lrt
total 4
drwxr-xr-x. 2 root root 4096 Apr 26 08:59 cifar-10-batches-py
[root@node1 cifar-10-batches-py]# ls -lrt
total 181868
-rwxrwxrwx. 1 root root 31035623 Apr  1 10:20 data_batch_5
-rwxrwxrwx. 1 root root 31035526 Apr  1 10:20 test_batch
-rwxrwxrwx. 1 root root 31035704 Apr  1 10:20 data_batch_1
-rwxrwxrwx. 1 root root 31035696 Apr  1 10:20 data_batch_4
-rwxrwxrwx. 1 root root 31035999 Apr  1 10:20 data_batch_3
-rwxrwxrwx. 1 root root 31035320 Apr  1 10:20 data_batch_2

Step 7. Load data into Watson Machine Learning Accelerator

Associate the images with Watson Machine Learning Accelerator by creating a new data set.

Spark deep learning cluster

  1. In the Datasets tab, select New.

    Select new database

  2. Click Any. When presented with a dialog box, provide a unique name (for example, ‘cifar10EDTpytorch’) and select the folder that contains the binary files obtained in the previous step. When you’re ready, click Create.

    Create new database

Step 8. Build the model

  1. Select the Models tab and click New.

    Create new model

  2. Select Add Location.

    Add model location

  3. Select PyTorch as the Framework and provide a unique name (for example, PytorchResnet), and select the folder that contains the ResNet model obtained in previous steps. When you’re ready, click Add.

    Select framework

  4. Select PytorchResnet for your new model, and click Next.

    Select framework

  5. Ensure that the Training engine is set to Elastic Distributed Training and that the data set points to the one you just created, then click Add..

    Select framework Select framework

Step 9: Run training

  1. Submit Training No. 1:

    1. Back at the Models tab, select Train to view the models you can train, then select the model you created in the previous step.

    2. Submit the first training with eight max workers.

      Select max workers

    3. This cluster has two POWER9™ servers with a total of eight GPUs available in the resource grid. The scheduler takes all eight GPUs available from the resource grid and assigns to Training #1.

      Select max workers

  2. Submit training No. 2:

    1. Back at the Models tab, select Train to view the models you can train, then select the model you created in the previous step.

      Select max workers

    2. Submit the second training with eight max workers.

      Max workers to 8

    3. Fairshare Policy enables equal sharing of available resources among running training with equal priority. With the Fairshare Policy, the scheduler pre-empties and reclaims four GPUs from training No. 1. The scheduler will wait for the current iteration of training to complete, including uploading gradient sync across workers. Afterward, the scheduler reclaims four GPUs from Training No. 1 and allocates them to Training No. 2.

      Showing submission dates Showing GPU slots

    4. Each training gets its deserved portion of resources according to priority. Training priority could be adjusted to reallocate GPU resources on the fly, without any interruption of running training and losing the latest progress. We will adjust Training No. 2 priority from 5,000 (default value) to 10,000.

      Application priority

    5. With the priority update, the scheduler reassigns GPUs based on priority ratio. Training No. 2 GPU allocation becomes five (10000/15000 -> 2/3) out of a total of eight GPUs, while Training No. 1 GPU allocation becomes three (5000/15000 -> 1/3). The scheduler pre-empties and reclaims one GPU from Training No. 1 and assigns to Training No. 2.

      Priority updates

    6. Training No. 1 eventually completes at 6:04 and returns three GPUs back to the resource grid. The scheduler always looks for available GPUs from the resource grid and will assign these three GPUs to Training No. 2, dynamically scaling up to eight GPUs and speeding up overall iteration running time.

      Priority updates

Following is the resource usage chart for Training No. 1 and Training No. 2.

Resource usage chart


IBM Watson Machine Learning Accelerator Elastic Distributed Training technology provides dynamic and fine-grained resource allocations and controls for trainings run simultaneously. The scheduler is responsible for distributing the training tasks (iterations), coordinating synchronization among tasks, and dynamically allocating resources (GPUs) based on policies defined by the end user. The scheduler kicks off a training task with a minimal resource request (1 GPU) and constantly looks for available GPUs from the resource grid. Elastic Distributed Training is fault-tolerant and resilient. When a host fails or the environment changes (e.g., loss of a cloud host) the scheduler re-queues and reruns the training job tasks on another host using the latest parameters and weights. This key feature accelerates execution time, as well as driving data scientist productivity.