With the 1.5.2 release of PowerAI, IBM’s Distributed Deep Learning (DDL) framework has been updated to include many usability improvements. One of these improvements comes in the form of ddlrun.

ddlrun is a utility which provides a streamlined interface to launch programs which use the DDL framework. Created in response to customer feedback, ddlrun automates many of the tedious and error-prone steps required to run a DDL program.

Example Using ddlrun With TensorFlow

We will start by walking through an example of using ddlrun to launch a topology-aware distributed TensorFlow job.

In the following example, ddlrun is used to run the mnist TensorFlow example across a cluster of 4 hosts.


Before using ddlrun or PowerAI TensorFlow, some setup is required. These steps assume that PowerAI has already been installed.

  1. First, source the ddl-tensorflow-activate script to set up both the DDL and PowerAI TensorFlow environments:
    $ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
  2. Then use the ddl-tensorflow-install-samples script to copy the PowerAI TensorFlow scripts to a user directory:
    $ ddl-tensorflow-install-samples ~/PowerAI_Scripts/

Using ddlrun

The basic usage of ddlrun is similar to that of mpirun.

The following command generates the necessary rankfile and launches a distributed instance of the mnist TensorFlow example on each host, using the DDL framework for all communication:

$ ddlrun -H host1,host2,host3,host4 python ~/PowerAI_Scripts/mnist/mnist-env.py

The -H option passes in the comma-separated list of host names to run across. For more information on the ddlrun parameters, see /opt/DL/ddl/doc/README.md or run ddlrun -h.

ddlrun Overview

Noteworthy ddlrun features

The overall goal of ddlrun is to improve the user experience for DDL users.
To this end, the primary features of ddlrun are:

  1. Error Checking/Configuration Verification
  2. Automatic Topology Detection and Rankfile generation
  3. Automatic mpirun option handling

1. Error Checking/Configuration Verification

MPI error messages can be unclear, so ddlrun performs a fair amount of troubleshooting itself in order to provide the user with relevant and useful error messages. The following are some examples of error messages from checks that ddlrun performs. The -v flag can be used for more verbose output.

The following is the result of ddlrun not being able to find the mpirun executable. This is commonly caused by forgetting to source the /opt/DL/ddl/bin/ddl-activate script.

[ERROR DDL-2-15] Cannot find 'mpirun' executable. Do you need to run an activate script?
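
A check of this kind can be approximated in a few lines of Python. This is only a sketch of the idea, not ddlrun's actual implementation: the function name check_mpirun is hypothetical, and the message simply mirrors the output shown above without the DDL error code.

```python
import shutil


def check_mpirun(executable="mpirun"):
    """Return the full path to `executable`, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of ddlrun.
    """
    path = shutil.which(executable)
    if path is None:
        # Wording modeled on the ddlrun message shown above.
        print(f"[ERROR] Cannot find '{executable}' executable. "
              "Do you need to run an activate script?")
    return path
```

If check_mpirun() returns None, sourcing the activate script and retrying is the likely fix.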

The following is the result of passing an unreachable host to ddlrun:

$ ddlrun -v -H unreachable1 python ...
[ERROR DDL-2-18] An error was encountered trying to ssh to unreachable1.
ssh: Could not resolve hostname unreachable1: Name or service not known
Continuing to test host connectivity.
ERROR: An ssh connection failed.

The following is the result of using a host that does not have ssh keys correctly configured. Here we see that ddlrun prints out exactly which machine has the key error.

$ ddlrun -v -H nokeyhost,goodkeyhost python ...
[ERROR DDL-2-18] An error was encountered trying to ssh to nokeyhost.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Continuing to test host connectivity.
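
The two connectivity checks above can be sketched in Python as a loop that runs a no-op command over ssh on every host and keeps going after a failure, so that every bad host is reported in one pass. This is an assumption about the general approach, not ddlrun's code: the function name check_hosts and the injectable run parameter are hypothetical, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```python
import subprocess


def check_hosts(hosts, run=subprocess.run, timeout=5):
    """Try a no-op ssh command on each host; return the hosts that failed.

    Hypothetical helper for illustration; not part of ddlrun.
    `run` is injectable so the logic can be exercised without a network.
    """
    failed = []
    for host in hosts:
        cmd = [
            "ssh",
            "-o", "BatchMode=yes",               # fail instead of prompting for a password
            "-o", f"ConnectTimeout={timeout}",   # do not hang on unreachable hosts
            host,
            "true",                              # no-op remote command
        ]
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"[ERROR] An error was encountered trying to ssh to {host}.")
            print(result.stderr.strip())
            failed.append(host)                  # keep testing the remaining hosts
    return failed
```

Calling check_hosts(["host1", "host2"]) returns an empty list when every host accepts a key-based login.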

The following is the result of attempting to run across nodes that have different hardware configurations. Here we see that ddlrun prints out the configurations of each machine to facilitate troubleshooting.

$ ddlrun -v -H 4gpuhost,2gpuhost python ...
[ERROR DDL-2-13] All machine configurations are not the same. Please verify that the correct hosts are being used.
Host: 4gpuhost    Accelerators: 4    Sockets: 2
Host: 2gpuhost    Accelerators: 2    Sockets: 2
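
The comparison behind this check can be illustrated with a short Python sketch. It assumes the per-host accelerator and socket counts have already been collected; how ddlrun gathers them is not covered here. HostConfig, find_mismatches, and report are hypothetical names, and the first host is treated as the reference.

```python
from collections import namedtuple

# Hypothetical record of what was discovered on each host.
HostConfig = namedtuple("HostConfig", ["accelerators", "sockets"])


def find_mismatches(configs):
    """Return the hosts whose configuration differs from the first host."""
    hosts = list(configs)
    reference = configs[hosts[0]]
    return [host for host in hosts[1:] if configs[host] != reference]


def report(configs):
    """Print every host's configuration if any differ; return True if all match."""
    mismatched = find_mismatches(configs)
    if mismatched:
        print("[ERROR] All machine configurations are not the same.")
        for host, cfg in configs.items():
            print(f"Host: {host}    Accelerators: {cfg.accelerators}"
                  f"    Sockets: {cfg.sockets}")
    return not mismatched
```

For the two hosts in the example above, find_mismatches flags 2gpuhost because its accelerator count differs from that of 4gpuhost.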

In some environments, the error checks performed by ddlrun may be more extensive than needed. To skip them, use the --skipchecks flag.

2. Automatic Topology Detection and Rankfile generation

Another common source of frustration when getting started with DDL is the generation of the rankfile.
With this version of ddlrun, the topology is inferred from the host list and a rankfile is generated automatically: ddlrun discovers the configuration of the first host in the host list and verifies that all other hosts have the same configuration.

$ ddlrun -H host1,host2,host3,host4 python ...

This command will automatically generate and use the following rankfile:

#host = host1,host2,host3,host4
#aisles = 1
#racks = 1
#nodes = 4
#accelerators = 4
#sockets = 2
#cores = 16

rank 0=host1	   slot=0:0-7
rank 4=host1	   slot=0:8-15
rank 8=host1	   slot=1:0-7
rank 12=host1	   slot=1:8-15

rank 1=host2	   slot=0:0-7
rank 5=host2	   slot=0:8-15
rank 9=host2	   slot=1:0-7
rank 13=host2	   slot=1:8-15

rank 2=host3	   slot=0:0-7
rank 6=host3	   slot=0:8-15
rank 10=host3	   slot=1:0-7
rank 14=host3	   slot=1:8-15

rank 3=host4	   slot=0:0-7
rank 7=host4	   slot=0:8-15
rank 11=host4	   slot=1:0-7
rank 15=host4	   slot=1:8-15
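
The layout of this rankfile follows a regular pattern: ranks are interleaved across hosts, and each rank is pinned to an equal share of one socket's cores. The Python sketch below reproduces the rank and slot lines of the example under two assumptions that are read off the example rather than taken from the DDL documentation: the cores value counts cores per socket, and accelerators are divided evenly among sockets. The function build_rankfile is hypothetical, and it separates fields with a single space instead of the tab used above.

```python
def build_rankfile(hosts, accelerators=4, sockets=2, cores=16):
    """Return rank lines laid out like the example rankfile.

    Hypothetical helper for illustration; not ddlrun's algorithm.
    Assumes `cores` is per socket and accelerators split evenly across sockets.
    """
    per_socket = accelerators // sockets   # ranks sharing one socket
    span = cores // per_socket             # cores assigned to each rank
    lines = []
    for host_index, host in enumerate(hosts):
        for local in range(accelerators):
            # Consecutive ranks land on different hosts.
            rank = local * len(hosts) + host_index
            socket = local // per_socket
            first = (local % per_socket) * span
            last = first + span - 1
            lines.append(f"rank {rank}={host} slot={socket}:{first}-{last}")
    return lines


for line in build_rankfile(["host1", "host2", "host3", "host4"]):
    print(line)
```

With the defaults, host1 receives ranks 0, 4, 8, and 12, matching the first group in the rankfile above.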

There are options that can be used to specify a different topology, including --accelerators, --sockets, and --racks. For more information about these options and a complete list, see /opt/DL/ddl/doc/README.md. E.g.:

$ ddlrun --racks 2 -H host1,host2,host3,host4 python ...

3. Automatic mpirun option handling

Quite a few options have to be passed to mpirun every time a job is launched, and others are needed only for certain versions of MPI or certain environment setups.
ddlrun now handles these options automatically and displays the fully constructed mpirun command it used. E.g.:

$ ddlrun -H host1,host2,host3,host4 python /mnist/mnist-env.py
+ mpirun -x PATH -x LD_LIBRARY_PATH -x DDL_OPTIONS -gpu --rankfile /tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE -n 16 python /mnist/mnist-env.py
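
To make the structure of that command easier to see, the Python sketch below assembles the same argument list piece by piece. It only mirrors the command printed above and is not how ddlrun is implemented; build_mpirun_command is a hypothetical name, and the set of exported variables and the -gpu flag are copied from the example output.

```python
import shlex


def build_mpirun_command(rankfile, n_ranks, program):
    """Assemble an mpirun argument list shaped like the one ddlrun printed.

    Hypothetical helper for illustration; not part of ddlrun.
    """
    cmd = ["mpirun"]
    # Forward the environment variables seen in the example to every rank.
    for var in ("PATH", "LD_LIBRARY_PATH", "DDL_OPTIONS"):
        cmd += ["-x", var]
    cmd += ["-gpu", "--rankfile", rankfile, "-n", str(n_ranks)]
    cmd += list(program)   # the user's program and its arguments come last
    return cmd


command = build_mpirun_command(
    "/tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE", 16, ["python", "/mnist/mnist-env.py"]
)
print("+ " + shlex.join(command))
```

The rank count of 16 is the number of hosts (4) multiplied by the accelerators per host (4).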

If there’s ever a need to pass additional options to mpirun, the --mpiarg option can be used. E.g.:

$ ddlrun --mpiarg "-pami_noib" -H host1,host2,host3,host4 python /mnist/mnist-env.py
