Horovod is a popular distributed training framework for TensorFlow, Keras, and PyTorch. This blog post explains how to use the efficient PowerAI DDL communication library with Horovod. DDL uses the hierarchical topology of the network to minimize the communication cost.
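
To make the idea of exploiting a hierarchical topology concrete, here is a toy sketch in plain Python. It is not DDL's implementation: the function name, the two-level layout (GPUs inside a node, nodes inside a cluster), and the use of scalar gradients are all assumptions made for illustration.

```python
from typing import List


def hierarchical_allreduce(values: List[List[float]]) -> List[List[float]]:
    """Toy two-level allreduce (illustrative only, not DDL's algorithm).

    values[n][g] is the scalar "gradient" held by GPU g on node n.
    """
    # Step 1: sum within each node, using only the fast intra-node links.
    node_sums = [sum(node) for node in values]
    # Step 2: sum across nodes; only one value per node crosses the network.
    total = sum(node_sums)
    # Step 3: hand the global sum back to every GPU on every node.
    return [[total for _ in node] for node in values]


if __name__ == "__main__":
    # Two nodes with two GPUs each.
    print(hierarchical_allreduce([[1.0, 2.0], [3.0, 4.0]]))
```

The point of the sketch is step 2: the slower inter-node network carries one partial sum per node rather than one value per GPU.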

Minimum requirements:

  • IBM PowerAI 1.5.2 (1.5.3 to use Horovod with Python 3)
  • Horovod v0.13.11
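
If you want to check an installed Horovod version against this minimum from a script, a small comparison helper is enough. This is a minimal sketch: the helper names are hypothetical, and it assumes purely numeric version strings such as 0.13.11 (a suffix like "rc1" would need extra handling).

```python
MIN_HOROVOD = "0.13.11"  # minimum version listed in this post


def parse_version(text):
    """Turn '0.13.11' into (0, 13, 11) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum=MIN_HOROVOD):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # A plain string comparison would rank "0.13.2" above "0.13.11";
    # comparing integer tuples avoids that mistake.
    for candidate in ("0.13.2", "0.13.11", "0.14.0"):
        print(candidate, meets_minimum(candidate))
```

On Python 3.8 or later, the installed version string can be obtained with importlib.metadata.version("horovod") and passed to meets_minimum.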

Setting up Horovod and DDL

The following setup steps need to be executed on all the machines that the distributed run will use.

  1. Install PowerAI, either from the PowerAI docker image or by following the Ordering information.
    You can skip the next 2 steps if you use the docker container.
  2. Install the deep learning framework(s) you want to use (TensorFlow, PyTorch). In this tutorial, we will focus on TensorFlow.
  3. Install DDL and its header files
    RHEL: sudo yum install ddl ddl-dev
  4. Run the activation scripts for the deep learning framework(s) and DDL
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  5. Install Horovod with DDL backend
    HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
    Note: Horovod needs to be reinstalled to use a different backend

Training a model with Horovod+DDL

We will use the TensorFlow framework with the High-Performance Models as an example.

  1. First, copy the model scripts to your current directory (repeat on each machine if the filesystem is not distributed)
    /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
  2. Run the activation scripts for the deep learning framework(s) and DDL
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  3. Use ddlrun to execute the distributed run
    ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod

Note: HOROVOD_FUSION_THRESHOLD=16777216 (16 MiB, expressed in bytes) is recommended; it improves performance by better overlapping communication with computation.
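
If you launch runs from a script, the same command line can be assembled in Python. The sketch below is only a convenience wrapper: build_ddlrun_command is a hypothetical helper, and it uses only the ddlrun options shown above. It also makes explicit that 16777216 is 16 * 1024 * 1024 bytes.

```python
import shlex

# 16 MiB expressed in bytes, the value recommended in this post.
FUSION_THRESHOLD_BYTES = 16 * 1024 * 1024


def build_ddlrun_command(hosts, script_args,
                         fusion_threshold_bytes=FUSION_THRESHOLD_BYTES):
    """Build the ddlrun argument list used in this post (hypothetical helper)."""
    mpi_arg = "-x HOROVOD_FUSION_THRESHOLD={}".format(fusion_threshold_bytes)
    return (["ddlrun", "-H", ",".join(hosts), "-mpiarg", mpi_arg, "python"]
            + list(script_args))


if __name__ == "__main__":
    command = build_ddlrun_command(
        ["host1", "host2", "host3", "host4"],
        ["hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
         "--model", "resnet50", "--batch_size", "64",
         "--variable_update=horovod"],
    )
    # Print a shell-quoted form for inspection; to launch the run,
    # pass the list to subprocess.run(command) instead.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Passing the arguments as a list avoids quoting problems with the -mpiarg value, which contains a space.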

The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
total images/sec: 5682.34
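
To confirm from a script that DDL was actually used and to record the throughput, you can scan the captured output for these two lines. This is a rough sketch: summarize_run and the regular expressions are assumptions based only on the sample output above, so the patterns may need adjusting for other DDL or benchmark versions.

```python
import re

# Patterns derived from the sample output shown in this post.
BANNER_RE = re.compile(r"IBM Corp\. DDL (\S+)")
RATE_RE = re.compile(r"total images/sec:\s*([0-9.]+)")


def summarize_run(output):
    """Extract the DDL version and total images/sec from captured run output."""
    banner = BANNER_RE.search(output)
    rate = RATE_RE.search(output)
    return {
        # None means the DDL banner was not found in the output.
        "ddl_version": banner.group(1) if banner else None,
        "images_per_sec": float(rate.group(1)) if rate else None,
    }


if __name__ == "__main__":
    sample = (
        "I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] "
        "==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====\n"
        "total images/sec: 5682.34\n"
    )
    print(summarize_run(sample))
```

A missing banner (ddl_version is None) is a hint that Horovod was not built with the DDL backend.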

For more information on how to integrate your model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod
