Horovod is a popular distributed training framework for TensorFlow, Keras, and PyTorch. This blog post explains how to use the efficient PowerAI DDL communication library with Horovod. DDL uses the hierarchical topology of the network to minimize the communication cost.
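To make that last point concrete, here is a minimal pure-Python sketch (an illustration only, not DDL's actual implementation) of why a hierarchical allreduce helps: gradients are first averaged among the GPUs inside each node over the fast local interconnect, and only one reduced value per node then crosses the slower inter-node network, yet the final result matches a flat allreduce over all workers.

```python
def flat_allreduce_mean(values):
    """Baseline: every worker's value averaged in one global step."""
    return sum(values) / len(values)

def hierarchical_allreduce_mean(values, workers_per_node):
    """Two-level averaging: reduce within each node first, then
    average the per-node results (all nodes equal-sized here)."""
    nodes = [values[i:i + workers_per_node]
             for i in range(0, len(values), workers_per_node)]
    node_means = [sum(n) / len(n) for n in nodes]
    # Inter-node phase: one number per node instead of one per worker.
    return sum(node_means) / len(node_means)

# 2 nodes x 4 GPUs, one scalar "gradient" per GPU.
grads = [0.1, 0.4, 0.2, 0.3, 0.5, 0.7, 0.6, 0.8]
print(flat_allreduce_mean(grads))             # global average
print(hierarchical_allreduce_mean(grads, 4))  # same value (up to float rounding)
```

The inter-node phase moves one value per node rather than one per worker, which is the communication saving that a topology-aware allreduce exploits.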

Minimum requirements:

  • IBM PowerAI 1.5.2 (1.5.3 if using Horovod with Python 3)
  • Horovod v0.13.11

If you are using a version of IBM PowerAI below 1.6.0, go to Setting up Horovod and DDL (for PowerAI versions below 1.6.0).

Setting up Horovod and DDL

The following setup steps need to be executed on all the machines that the distributed run will use.

  1. Download PowerAI using the PowerAI Docker image, or use the Anaconda channel below.
    You can skip steps 2-4 if you use the Docker container.
  2. Add the PowerAI Anaconda channel (requires Anaconda to be installed)
    conda config --add channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
  3. Create and activate a conda environment with your preferred python version
    conda create -n yourenvname python=3.6; conda activate yourenvname
  4. Install DDL and the deep learning framework(s) you want to use (TensorFlow, PyTorch). In this tutorial, we will focus on TensorFlow.
    conda install ddl tensorflow
  5. Install the packages to build Horovod
    conda install gxx_linux-ppc64le=7.3.0 cffi cudatoolkit-dev
  6. Install Horovod with DDL backend
    HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
    Note: Horovod needs to be reinstalled to use a different backend

Training a model with Horovod and DDL

We will use the TensorFlow framework with the high-performance benchmarks (tf_cnn_benchmarks) as an example.

  1. First, install and copy the model scripts to your current directory (repeat on each machine if the filesystem is not distributed)
    conda install tf_cnn_benchmarks; install_tf_cnn_benchmarks hpms
  2. Use ddlrun to execute the distributed run
ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod

Note: Setting HOROVOD_FUSION_THRESHOLD=16777216 is recommended to improve performance by better overlapping communication with computation. If running as root, the argument -mpiarg --allow-run-as-root needs to be added to ddlrun.
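For reference, the threshold is a byte count: 16777216 bytes is a 16 MiB fusion buffer. Horovod batches small gradient tensors together until a buffer of roughly this size fills, so one large allreduce replaces many small ones. The sketch below (the `fuse` helper is hypothetical, and Horovod's actual policy also uses a cycle-time window) only illustrates the size-based packing:

```python
# The fusion threshold is a byte count: 16777216 bytes = 16 MiB.
FUSION_THRESHOLD = 16 * 1024 * 1024

def fuse(tensor_sizes, threshold):
    """Greedy sketch of tensor fusion: pack consecutive gradient
    tensors into batches of at most `threshold` bytes, so many small
    allreduce calls become a few large ones."""
    batches, current, used = [], [], 0
    for size in tensor_sizes:
        if current and used + size > threshold:
            batches.append(current)  # batch full: flush it
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches

# Four gradient tensors (bytes): the first three fit in one 16 MiB batch.
sizes = [4 * 1024 * 1024, 6 * 1024 * 1024, 5 * 1024 * 1024, 12 * 1024 * 1024]
print([len(b) for b in fuse(sizes, FUSION_THRESHOLD)])  # [3, 1]
```

A larger threshold means fewer, bigger allreduce operations, which amortizes per-message latency but delays the first communication; 16 MiB is the value recommended above for this workload.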

The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------

For more information on how to integrate your model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod
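The core of that integration is wrapping your optimizer so that, after each backward pass, the per-GPU gradients are averaged with an allreduce (hvd.DistributedOptimizer; in this build the allreduce runs over DDL). The pure-Python sketch below, which needs neither Horovod nor TensorFlow installed, shows why averaging works: for equal-size data shards, the average of per-worker gradients equals the gradient computed on the full batch.

```python
# Data-parallel sketch: each "worker" computes the gradient of a
# least-squares loss on its own shard; an allreduce-style average of
# the shard gradients equals the gradient on the full batch.

def grad(w, shard):
    """d/dw of mean((w*x - y)^2) over one shard of (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# Two workers, two examples each (equal-size shards).
shards = [data[:2], data[2:]]
per_worker = [grad(w, s) for s in shards]
averaged = sum(per_worker) / len(per_worker)   # the allreduce step

full = grad(w, data)                           # single-process gradient
print(averaged, full)  # both print -22.0
```

Because the averaged gradient is identical to the single-process one, each worker can apply it locally and all model replicas stay in sync step after step.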

Setting up Horovod and DDL (for PowerAI versions below 1.6.0)

The following setup steps need to be executed on all the machines that the distributed run will use.

  1. Download PowerAI using the PowerAI Docker image, or by following the Ordering information.
    You can skip the next two steps if you use the Docker container.
  2. Install the deep learning framework(s) you want to use (TensorFlow, PyTorch). In this tutorial, we will focus on TensorFlow.
  3. Install DDL and its header files
    RHEL: sudo yum install ddl ddl-dev
  4. Run the deep learning framework(s) and DDL activation scripts
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  5. Install Horovod with DDL backend
    HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
    Note: Horovod needs to be reinstalled to use a different backend

Training a model with Horovod and DDL (for PowerAI versions below 1.6.0)

We will use the Tensorflow framework with the High-Performance Models as an example.

  1. First, copy the model scripts to your current directory (repeat on each machine if the filesystem is not distributed)
    /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
  2. Run the deep learning framework(s) and DDL activation scripts
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  3. Use ddlrun to execute the distributed run
ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod

Note: HOROVOD_FUSION_THRESHOLD=16777216 is recommended to increase performance by better overlapping communication with computation.

The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------

For more information on how to integrate your model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod

2 comments on "Distributed deep learning with Horovod and PowerAI DDL"

  1. Hello, I just found a small bug that is easy to correct. When running DDL on machines with other local languages (French, for example), the script rank_gen.py cannot parse the output of lscpu because it is in another language. To correct this, just call lscpu with a fixed language:

    rank_gen.py line 406:
    _, lscpu_results, _ = utilities.run_command_capture("LANG=en_US.utf8 lscpu")

    • Nicolas Castet May 17, 2019

      Thank you Angelo for this great find. We will have it fixed in the upcoming PowerAI 1.6.1 release.
      If you find any other bugs or have improvement ideas, you can open an issue at https://github.com/IBM/powerai
      Are you using Horovod and DDL at the Université de Reims?
