IBM PowerAI Enterprise includes IBM PowerAI Distributed Deep Learning (DDL). Using DDL tools, you can run TensorFlow and Caffe models on multiple nodes and multiple GPUs. To run DDL tasks, users usually specify one or more hosts and the number of GPUs to be used on each host.
In large clusters with multiple users, IBM Spectrum Conductor Deep Learning Impact can help to manage resources, automatically allocate hosts and GPUs, and run DDL tasks. This blog details how to set up and run IBM PowerAI 5.2 Distributed Deep Learning tasks with IBM Spectrum Conductor Deep Learning Impact 1.1.0 command line interface (CLI).

Prerequisites:

Install IBM PowerAI Enterprise 1.1. Before trying to run with IBM Spectrum Conductor Deep Learning Impact, make sure that you can run the IBM PowerAI Distributed Deep Learning samples on multiple hosts using the same execution user, to confirm that all networking, OS permission, and similar issues are resolved.
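
For example, a minimal sanity check (a sketch only; host2 is a placeholder for one of your own hosts, and it assumes passwordless SSH is configured for the execution user) is to confirm that each host can be reached and that its GPUs are visible:

$ ssh host2 hostname       # should return the remote host name without prompting for a password
$ ssh host2 nvidia-smi -L  # should list the GPUs on the remote host
$ nvidia-smi -L            # should list the GPUs on the local host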

Running IBM PowerAI Distributed Deep Learning requires a specific setup in IBM Spectrum Conductor with Spark; therefore, this blog is divided into 2 parts: configuring IBM Spectrum Conductor for IBM PowerAI Distributed Deep Learning and running IBM PowerAI Distributed Deep Learning with IBM Spectrum Conductor Deep Learning Impact CLI.

Configuring IBM Spectrum Conductor with Spark for IBM PowerAI Distributed Deep Learning

  1. Log in to the cluster management console as an Administrator.
  2. From the cluster management console, navigate to Resources > Resource Planning (Slot) > Resource Groups. Under Global Actions, click Create a Resource Group. Create a resource group called gpus as follows:
  3. Navigate to Resources > Consumers and select a consumer, such as SampleApplications. Click Consumer Properties, select the resource group that was just created, and click Apply.
  4. Navigate to Resources > Resource Planning (Slot) > Resource Plan. Select Resource Group: gpus and select Exclusive as shown below, to indicate that when IBM Spectrum Conductor with Spark allocates resources from this resource group, it uses all free slots on a host. For example, assuming there are 4 GPUs on a host, a request for 1, 2, 3, or 4 GPUs takes the whole host.
  5. Navigate to Workload > Spark > Spark Instance Groups and click New to create an instance group. Click dli-sig-template to prepopulate the instance group with deep learning values, as follows:

    In addition to the prepopulated values, make sure to set the following:

    • Specify an instance group name. For the purposes of this blog, the name of the instance group is set to myddl.
    • Specify an execution user in the Execution user for instance group field. Make sure to enter an operating system execution user. Hint: A good practice is to run IBM PowerAI Distributed Deep Learning samples on multiple hosts using the same execution user before trying to run using IBM Spectrum Conductor Deep Learning Impact to ensure that you have resolved all networking, OS permission issues, and so on.
    • In the field Spark executors (GPU slots), select gpus – the resource group created in step 2.
    • Click Configuration to open the configuration window and edit the Spark parameters:
    • Under All Parameters, select Session Scheduler, locate the SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK parameter, and change its value from 1,2,4 to 1.
    • After the instance group is configured, click Create and Deploy Instance Group. Verify that there is only one master (as opposed to 3 masters):
  6. Now, increase the number of masters. From the cluster management console, navigate to System & Services > EGO Services > Service Profiles, select Other Services, and click the name of the service, such as myddl-sparkms-batch (assuming that you named the instance group myddl).
  7. Search for the sc:MaxInstancesPerHost field and click its Value field to change the value. Change the value to any number less than 100. This value is the total number of deep learning tasks that can run or wait for resources. For example, if you choose 10, there may be 6 deep learning tasks running and 4 tasks in the queue waiting to run. If you start an 11th task, an error is returned, indicating that you need to submit the task at a later time. Note: After making changes to the service profile, modifying the Spark instance group is not supported.
  8. Because this service runs within the ComputeHosts resource group, make sure that there is a sufficient number of free slots in this resource group by navigating to Resources > Resource Planning (Slot) > Resource Groups and clicking ComputeHosts. Check the number of free slots and adjust if necessary.
  9. Navigate back to the instance group details page and select Services:
  10. Click Scale.
  11. Update the limits to the value you set for sc:MaxInstancesPerHost in step 7, for example, 10:
  12. Click Scale, then verify that the number of instances has scaled up:
  13. Verify that the number of masters has scaled by clicking Overview and ensuring that the number of masters has changed (in this case, 10 masters are now available):

Now, the instance group is set up to handle IBM PowerAI Distributed Deep Learning workloads and is ready to run DDL tasks with the IBM Spectrum Conductor Deep Learning Impact CLI.

Running IBM PowerAI Distributed Deep Learning with IBM Spectrum Conductor Deep Learning Impact CLI

To run DDL tasks with IBM Spectrum Conductor Deep Learning Impact CLI, the following is assumed for the remainder of the blog:

  • IBM Spectrum Conductor is installed under /opt/ibm/spectrumcomputing
  • IBM Spectrum Conductor Deep Learning Impact is installed under /opt/ibm/spectrumcomputing/dli
  • The following variables are set:
    • $ export DLPD_HOME=/opt/ibm/spectrumcomputing/dli/dlpd
    • $ export dlicmd=$DLPD_HOME/bin/dlicmd.py
    • $ export masterHost=
    • $ export username=
    • $ export password=
    • $ export ig=myddl
  • Each host used in this blog has 4 GPUs (a quick check is sketched after this list).
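
To confirm this assumption on your own hosts (a minimal sketch; it assumes nvidia-smi is installed with the NVIDIA driver), count the GPUs on each host:

$ nvidia-smi --list-gpus | wc -l   # expected to print 4 in this example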

Complete the following steps to run DDL tasks:

  1. Show IBM Spectrum Conductor Deep Learning Impact CLI usage by running:
    $ python $dlicmd
  2. Log in to the CLI:
    $ python $dlicmd --logon --master-host $masterHost --username $username --password $password
  3. List the supported DL frameworks:
    $ python $dlicmd --dl-frameworks --master-host $masterHost

TensorFlow example with 1 learner (or GPU)

To run a TensorFlow model called mnist-env.py with one learner (or GPU), as shown below, set the number of workers (--numWorker) to 1 so that IBM Spectrum Conductor with Spark allocates 1 GPU, and set the number of accelerators (--accelerators) to 1 to run 1 learner per host. This way, one host is allocated and 1 learner runs on that host:
$ n=1
$ acc=1
$ python $dlicmd --exec-start ddlTensorFlow --exclusive --master-host $masterHost --ig $ig --numWorker $n --mpiarg --allow-run-as-root --accelerators $acc python --model-main mnist-env.py

Verify that the GPU was allocated:

  • From the cluster management console, navigate to Workload > Spark > My Applications & Notebooks to see that one GPU is allocated:
  • On the execution host, you can check that one learner runs using 1 GPU, as sketched below:
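
A minimal way to do this from a shell on the execution host (a sketch; the exact process name depends on your model script) is to inspect the nvidia-smi process table:

$ nvidia-smi   # the Processes section should show one python process bound to one GPU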

TensorFlow example with 4 learners (or GPUs)

To run a TensorFlow model called mnist-env.py with 4 learners (or GPUs), set the number of workers (--numWorker) to 4 so that IBM Spectrum Conductor with Spark allocates 4 GPUs. Since the whole host is used and each host has 4 GPUs, set the number of accelerators (--accelerators) to 4 to run 4 learners on that host. Note that because there is a maximum of 4 GPUs per host in this example, the maximum value for --accelerators is 4. This way, one host is allocated, and 4 learners run on that host.

$ n=4
$ acc=4
$ python $dlicmd --exec-start ddlTensorFlow --exclusive --master-host $masterHost --ig $ig --numWorker $n --mpiarg --allow-run-as-root --accelerators $acc python --model-main mnist-env.py

Verify that the GPUs were allocated:

  • From the cluster management console, navigate to Workload > Spark > My Applications & Notebooks to see that 4 GPUs are allocated:
  • On the execution host, you can check that 4 GPUs are used:

TensorFlow example with 8 learners (or GPUs)

To run a TensorFlow model called mnist-env.py with 8 learners (or GPUs), set the number of workers (--numWorker) to 8 so that IBM Spectrum Conductor allocates 8 GPUs. Since the whole host is used and each host has 4 GPUs, set the number of accelerators (--accelerators) to 4. This way, 2 hosts are allocated, each running 4 learners:
$ n=8
$ acc=4
$ python $dlicmd --exec-start ddlTensorFlow --exclusive --master-host $masterHost --ig $ig --numWorker $n --mpiarg --allow-run-as-root --accelerators $acc python --model-main mnist-env.py

Verify that the GPUs were allocated:

  • From the cluster management console, navigate to Workload > Spark > My Applications & Notebooks to see that 8 GPUs are allocated across 2 hosts:

  • On each execution host, you can check that 4 GPUs are used:
  • Assuming that you have scaled up to 10 masters as shown above, if you start more tasks than there are masters (for example, by starting many tasks rapidly one after another), you will see the following message: Error 400: No batch master available to start in exclusive mode. You can try later.
