With the 1.5.2 release of PowerAI, IBM’s Distributed Deep Learning (DDL) framework has been updated to include many usability improvements. One of these improvements comes in the form of ddlrun, a utility that provides a streamlined interface for launching programs that use the DDL framework. Created in response to customer feedback, ddlrun automates many of the tedious and error-prone steps required to run a DDL program.
ddlrun With Tensorflow
We will start by walking through an example of using ddlrun to launch a topology-aware distributed Tensorflow job.
In the following example, ddlrun is used to run the mnist Tensorflow example across a cluster of 4 hosts.
Before using ddlrun or PowerAI Tensorflow, some setup has to be done. These steps assume that PowerAI has already been installed.
- First we need to source the ddl-tensorflow-activate script to set up both the DDL and PowerAI Tensorflow environments:
$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
- The ddl-tensorflow-install-samples script should then be used to copy the PowerAI Tensorflow scripts to a user directory:
$ ddl-tensorflow-install-samples ~/PowerAI_Scripts/
The basic usage of ddlrun is similar to that of mpirun. The following command will generate the necessary rankfile and launch a distributed instance of the mnist Tensorflow example on each host, using the DDL framework for all communication.
$ ddlrun -H host1,host2,host3,host4 python ~/PowerAI_Scripts/mnist/mnist-env.py
The -H option passes in the comma-separated list of host names to run across. For more information on the ddlrun parameters, please see /opt/DL/ddl/doc/README.md or run
The overall goal of ddlrun is to improve the user experience for DDL users. To this end, the primary features of ddlrun are:
- Error Checking/Configuration Verification
- Automatic Topology Detection and Rankfile generation
- Automatic mpirun option handling
1. Error Checking/Configuration Verification
MPI error messages can be unclear, so ddlrun encapsulates a fair amount of troubleshooting in order to provide the user with relevant and useful error messages. The following are some examples of error messages from checks that ddlrun performs. The -v flag can be used for more verbose output.
The following is the result of ddlrun not being able to find the mpirun executable. This is commonly caused by forgetting to source the activate script.
[ERROR DDL-2-15] Cannot find 'mpirun' executable. Do you need to run an activate script?
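As a rough illustration of this kind of check (not ddlrun's actual implementation), the lookup can be sketched in Python with the standard library's `shutil.which`. The function name `check_executable` is hypothetical; only the wording of the message is taken from the output above.

```python
import shutil


def check_executable(name="mpirun"):
    """Return an error message if `name` is not on PATH, otherwise None.

    Hypothetical helper: it mimics the wording of the DDL-2-15 message,
    but it is not how ddlrun itself is written.
    """
    if shutil.which(name) is None:
        return (f"Cannot find '{name}' executable. "
                "Do you need to run an activate script?")
    return None


if __name__ == "__main__":
    # Prints the hint when mpirun is missing, or None when it is found.
    print(check_executable("mpirun"))
```

Because sourcing an activate script mainly changes PATH and related variables, a missing executable is a reasonable first symptom to test for before anything is launched.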
The following is the result of passing an unreachable host to ddlrun:
$ ddlrun -v -H unreachable1 python ...
[ERROR DDL-2-18] An error was encountered trying to ssh to unreachable1.
ssh: Could not resolve hostname unreachable1: Name or service not known
Continuing to test host connectivity.
ERROR: An ssh connection failed.
The following is the result of using a host that does not have ssh keys correctly configured. Here we see that ddlrun prints out exactly which machine has the key error.
$ ddlrun -v -H nokeyhost,goodkeyhost python ...
[ERROR DDL-2-18] An error was encountered trying to ssh to nokeyhost.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Continuing to test host connectivity.
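The two ssh failures above share a pattern: every host is probed, and a failure on one host does not stop the remaining hosts from being tested. A minimal sketch of that pattern follows, assuming a non-interactive probe through the OpenSSH client with `BatchMode=yes`. The names `ssh_probe` and `check_hosts` are hypothetical, and how ddlrun itself performs the probe is not documented here.

```python
import subprocess


def ssh_probe(host, timeout=10):
    """Attempt a non-interactive ssh login; return (ok, detail)."""
    cmd = ["ssh", "-o", "BatchMode=yes",
           "-o", f"ConnectTimeout={timeout}", host, "true"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return proc.returncode == 0, proc.stderr.strip()


def check_hosts(hosts, probe=ssh_probe):
    """Probe every host and collect failures instead of stopping early."""
    failures = {}
    for host in hosts:
        ok, detail = probe(host)
        if not ok:
            failures[host] = detail  # keep going: test the remaining hosts
    return failures
```

Passing the probe in as a parameter keeps the loop testable without network access, and the returned dictionary names exactly which machine failed and why.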
The following is the result of attempting to run across nodes that have different hardware configurations. Here we see that ddlrun prints out the configurations of each machine to facilitate troubleshooting.
$ ddlrun -v -H 4gpuhost,2gpuhost python ...
[ERROR DDL-2-13] All machine configurations are not the same. Please verify that the correct hosts are being used.
Host: 4gpuhost Accelerators: 4 Sockets: 2
Host: 2gpuhost Accelerators: 2 Sockets: 2
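The homogeneity check can be pictured as comparing one small record per host. The sketch below assumes the per-host accelerator and socket counts have already been collected; the function name `find_config_mismatch` and the input dictionary are hypothetical, and the report only imitates the layout of the message above.

```python
def find_config_mismatch(configs):
    """Return report lines if host configurations differ, else an empty list.

    `configs` maps a host name to an (accelerators, sockets) tuple.
    Hypothetical helper, not ddlrun's actual code.
    """
    if len(set(configs.values())) <= 1:
        return []  # zero or one distinct configuration: nothing to report
    lines = ["All machine configurations are not the same."]
    for host, (accelerators, sockets) in configs.items():
        lines.append(
            f"Host: {host} Accelerators: {accelerators} Sockets: {sockets}")
    return lines


if __name__ == "__main__":
    for line in find_config_mismatch({"4gpuhost": (4, 2), "2gpuhost": (2, 2)}):
        print(line)
```

Listing every host rather than only the odd one out lets the user see at a glance which machines belong together.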
It’s possible that the error checks performed by ddlrun are more extensive than needed. To skip the checks performed by ddlrun, use the
2. Automatic Topology Detection and Rankfile generation
Another common source of frustration when getting started with DDL is the generation of the rankfile.
With this version of ddlrun, the topology is inferred from the host list and a rankfile is generated automatically: ddlrun discovers the configuration of the first host in the host list and verifies that all other hosts have the same configuration.
$ ddlrun -H host1,host2,host3,host4 python ...
This command will automatically generate and use the following rankfile:
#host = host1,host2,host3,host4
#aisles = 1
#racks = 1
#nodes = 4
#accelerators = 4
#sockets = 2
#cores = 16
rank 0=host1 slot=0:0-7
rank 4=host1 slot=0:8-15
rank 8=host1 slot=1:0-7
rank 12=host1 slot=1:8-15
rank 1=host2 slot=0:0-7
rank 5=host2 slot=0:8-15
rank 9=host2 slot=1:0-7
rank 13=host2 slot=1:8-15
rank 2=host3 slot=0:0-7
rank 6=host3 slot=0:8-15
rank 10=host3 slot=1:0-7
rank 14=host3 slot=1:8-15
rank 3=host4 slot=0:0-7
rank 7=host4 slot=0:8-15
rank 11=host4 slot=1:0-7
rank 15=host4 slot=1:8-15
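The numbering pattern in this rankfile can be reproduced with a few lines of Python. The sketch below is inferred purely from the example above and is not ddlrun's implementation: it assumes that `#cores` counts cores per socket, that accelerators are split evenly across sockets, that each rank gets an equal share of its socket's cores, and that there is a single aisle and rack. The function name `make_rankfile` is hypothetical.

```python
def make_rankfile(hosts, accelerators=4, sockets=2, cores=16):
    """Build rankfile text for identical hosts (single aisle, single rack)."""
    nodes = len(hosts)
    per_socket = accelerators // sockets   # accelerators attached to a socket
    width = cores // per_socket            # cores assigned to each rank
    lines = [
        f"#host = {','.join(hosts)}",
        "#aisles = 1",
        "#racks = 1",
        f"#nodes = {nodes}",
        f"#accelerators = {accelerators}",
        f"#sockets = {sockets}",
        f"#cores = {cores}",
    ]
    for h, host in enumerate(hosts):
        for local in range(accelerators):
            rank = local * nodes + h       # ranks are striped across hosts
            socket = local // per_socket
            first = (local % per_socket) * width
            lines.append(
                f"rank {rank}={host} slot={socket}:{first}-{first + width - 1}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_rankfile(["host1", "host2", "host3", "host4"]))
```

Under these assumptions, consecutive ranks land on different hosts (rank 0 on host1, rank 1 on host2, and so on), while the ranks on one host are bound to disjoint core ranges on the socket nearest their accelerator.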
There are options that can be used to specify a different topology, including --racks. For more information about these options, and a complete list of them, see /opt/DL/ddl/doc/README.md.
$ ddlrun --racks 2 -H host1,host2,host3,host4 python ...
3. Automatic mpirun option handling
Quite a few options have to be passed to mpirun every time a job is launched, and others are needed only for particular versions of MPI or particular environment setups. ddlrun now handles these options automatically and displays the fully constructed mpirun command it used. E.g.:
$ ddlrun -H host1,host2,host3,host4 python /mnist/mnist-env.py
...
+ mpirun -x PATH -x LD_LIBRARY_PATH -x DDL_OPTIONS -gpu --rankfile /tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE -n 16 python /mnist/mnist-env.py
If there’s ever a need to pass additional options to mpirun, the --mpiarg option can be used. E.g.:
ddlrun --mpiarg "-pami_noib" -H host1,host2,host3,host4 python
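To make the option handling concrete, here is a small Python sketch that assembles an argument list matching the mpirun command shown earlier. It is illustrative only: the function name `build_mpirun` is hypothetical, the fixed options are copied from the example output, and the position at which extra `--mpiarg` values are spliced in is an assumption.

```python
import shlex


def build_mpirun(rankfile, nranks, program, mpiargs=()):
    """Return the mpirun argument list for a DDL job (hypothetical helper)."""
    cmd = ["mpirun",
           "-x", "PATH", "-x", "LD_LIBRARY_PATH", "-x", "DDL_OPTIONS",
           "-gpu"]
    for arg in mpiargs:
        # Each value may hold several tokens, e.g. "-x FOO"; split like a shell.
        cmd.extend(shlex.split(arg))
    cmd += ["--rankfile", rankfile, "-n", str(nranks)]
    cmd += list(program)
    return cmd


if __name__ == "__main__":
    argv = build_mpirun("/tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE", 16,
                        ["python", "/mnist/mnist-env.py"],
                        mpiargs=["-pami_noib"])
    print("+ " + " ".join(argv))
```

Building the command as a list rather than a single string avoids quoting problems when it is eventually handed to a process launcher.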