Skill Level: Any Skill Level

IBM has developed an approach called Distributed Deep Learning (DDL) that helps deep learning models train at scale. It can significantly reduce the training time of deep neural networks, from days to hours. This recipe shows how to configure and run DDL on Power Systems.


Beginner-level knowledge of TensorFlow

Basic Linux installation skills


  1. Install CUDA

    Download and install NVIDIA CUDA 9.2 from https://developer.nvidia.com/cuda-downloads

  2. Install cuDNN

    Download and install NVIDIA cuDNN for CUDA 9.2 from https://developer.nvidia.com/cudnn

  3. Install Anaconda

    Download and install Anaconda. The installer prompts for acceptance of the license agreement, the install location (default is $HOME/anaconda2), and permission to modify the PATH environment variable (via .bashrc).

    $ wget https://repo.continuum.io/archive/Anaconda2-5.0.0-Linux-ppc64le.sh

    $ bash Anaconda2-5.0.0-Linux-ppc64le.sh

    $ source ~/.bashrc

  4. Install IBM PowerAI package

    The IBM PowerAI deep learning packages for the Power AC922, including TensorFlow, are distributed as an rpm file that is available from the PowerAI download site (Release 5.1) (https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/). Installing the rpm creates an installation repository on the local machine.

    $ sudo rpm -ihv mldl-repo-*.rpm

  5. Install the R5.1 framework all at once

    $ sudo yum install power-mldl

  6. Install the PowerAI Distributed Deep Learning packages

    $ sudo yum install power-ddl

  7. Accept the license agreement for PowerAI

    $ sudo IBM_POWERAI_LICENSE_ACCEPT=yes /opt/DL/license/bin/accept-powerai-license.sh

  8. Install the TensorFlow dependencies

    $ /opt/DL/tensorflow/bin/install_dependencies

  9. Set up the InfiniBand adapter

    Make sure every participating node has a working Mellanox InfiniBand (IB) adapter.

    Download and install the matching drivers (MLNX_OFED_LINUX-4.3), and make sure the adapters can ping each other over IB.

    Make sure the 'ibstat' command shows the port state as 'Active'.
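The port check can be scripted so that it runs the same way on every node. This is a minimal sketch, assuming `ibstat` prints one `State: Active` line per active port; the function name `count_active_ib_ports` is hypothetical.

```shell
# Count the InfiniBand ports that ibstat reports as Active.
# Assumption: each port section contains a line of the form "State: Active".
# "Physical state:" lines are not counted because the pattern is anchored
# on a line that starts with "State:".
count_active_ib_ports() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable in scripts that run with "set -e".
    ibstat | grep -c -E '^[[:space:]]*State:[[:space:]]*Active' || true
}
```

Running `count_active_ib_ports` on a node should print at least 1 before that node joins a DDL run.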

  10. Enable the performance governor on the Power system

    $ sudo yum install kernel-tools

    $ sudo cpupower -c all frequency-set -g performance
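To confirm that the governor change took effect, the cpufreq entries in sysfs can be inspected. The sketch below lists every CPU whose governor is not `performance`. The function name and the `CPUFREQ_ROOT` override are hypothetical and exist only to keep the example self-contained.

```shell
# Print the sysfs governor file of every CPU that is not set to "performance".
# Assumption: governors are exposed under /sys/devices/system/cpu/cpuN/cpufreq.
# CPUFREQ_ROOT is a hypothetical override, useful for trying the function
# against a scratch directory.
list_non_performance_cpus() {
    local root="${CPUFREQ_ROOT:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue    # skip unreadable entries or an unmatched glob
        if [ "$(cat "$f")" != "performance" ]; then
            echo "$f"
        fi
    done
}
```

An empty result from `list_non_performance_cpus` means every CPU that exposes a governor is using `performance`.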

  11. Enable GPU persistence mode

    $ sudo systemctl enable nvidia-persistenced

    $ sudo systemctl start nvidia-persistenced
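Persistence mode can be verified per GPU with an `nvidia-smi` query. This is a sketch, assuming the query prints one line per GPU containing either `Enabled` or `Disabled`; the function name is hypothetical.

```shell
# Count the GPUs whose persistence mode is not reported as "Enabled".
# Assumption: nvidia-smi prints one line per GPU for this query.
count_gpus_without_persistence() {
    # "|| true" keeps the exit status at 0 when the count is 0.
    nvidia-smi --query-gpu=persistence_mode --format=csv,noheader \
        | grep -c -v 'Enabled' || true
}
```

A result of 0 from `count_gpus_without_persistence` indicates that all GPUs on the node are in persistence mode.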

  12. Set the SMT mode

    For TensorFlow with DDL, set the SMT mode to 2

    $ sudo ppc64_cpu --smt=2
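The current SMT setting can be read back by running `ppc64_cpu --smt` without a value. The sketch below assumes the command answers in the form `SMT=2`; the helper name `smt_is` is hypothetical.

```shell
# Succeed only when the current SMT mode equals the requested value.
# Assumption: "ppc64_cpu --smt" prints a single line such as "SMT=2".
smt_is() {
    [ "$(ppc64_cpu --smt)" = "SMT=$1" ]
}
```

For example, `smt_is 2 || echo "SMT mode is not 2"` flags a node that still needs the setting applied.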

  13. Configure SSH across nodes

    Set up passwordless SSH between all the nodes participating in the DDL run
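One common way to do this is with `ssh-keygen` and `ssh-copy-id`. The following is a sketch rather than the only approach: the function names are hypothetical, the key path `$HOME/.ssh/id_rsa` is an assumption, and the node names passed in are placeholders for your own hosts.

```shell
# Create a key pair once (if none exists) and copy the public key to each node.
# Assumption: the same user account exists on every node.
setup_passwordless_ssh() {
    local key="$HOME/.ssh/id_rsa"
    local node
    [ -f "$key" ] || ssh-keygen -t rsa -b 4096 -N "" -f "$key"
    for node in "$@"; do
        ssh-copy-id -i "$key.pub" "$node"    # prompts for the password one last time
    done
}

# Report, per node, whether a login works without any password prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
    local node
    local failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "$node ok"
        else
            echo "$node FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run both functions from every node with the full node list, since each node needs to reach all the others.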

  14. Set up the bash script

    Add the following line to the .bashrc of the user that runs the training, so that the DDL TensorFlow environment is activated on login

    $ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
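When this step is scripted across several nodes, it helps to add the line only if it is missing, so that repeated runs do not duplicate it. A minimal sketch follows; the function name and the optional file argument are hypothetical.

```shell
# Append the DDL TensorFlow activation line to a bashrc file exactly once.
# The first argument is the file to edit and defaults to ~/.bashrc.
add_ddl_activation() {
    local rc="${1:-$HOME/.bashrc}"
    local line='source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate'
    touch "$rc"
    # -x matches the whole line, -F treats the line as a fixed string.
    grep -qxF "$line" "$rc" || echo "$line" >> "$rc"
}
```

Calling `add_ddl_activation` a second time leaves the file unchanged.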

  15. Rank file creation

    Create a rank file with the contents below for a scenario with 4 nodes and 4 GPUs per node. Adjust the ranks, host names and slots for your environment

    rank 0=<<Node1>>      slot=0:0-9

    rank 4=<<Node1>>      slot=0:10-19

    rank 8=<<Node1>>      slot=1:0-9

    rank 12=<<Node1>>     slot=1:10-19

    rank 1=<<Node2>>      slot=0:0-9

    rank 5=<<Node2>>      slot=0:10-19

    rank 9=<<Node2>>      slot=1:0-9

    rank 13=<<Node2>>     slot=1:10-19

    rank 2=<<Node3>>      slot=0:0-9

    rank 6=<<Node3>>      slot=0:10-19

    rank 10=<<Node3>>     slot=1:0-9

    rank 14=<<Node3>>     slot=1:10-19

    rank 3=<<Node4>>      slot=0:0-9

    rank 7=<<Node4>>      slot=0:10-19

    rank 11=<<Node4>>     slot=1:0-9

    rank 15=<<Node4>>     slot=1:10-19
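The rank file above follows a regular pattern, so it can be generated instead of typed by hand. The sketch below reproduces that pattern under assumptions read off the example: rank = GPU index x number of nodes + node index, 2 GPUs per socket, and 10 cores per GPU. The function name and the three environment variables are hypothetical; adjust the defaults to your hardware.

```shell
# Print a rank file for the given node names, one line per GPU.
# Assumptions inferred from the 4-node example (not taken from DDL docs):
#   GPUS_PER_NODE=4, GPUS_PER_SOCKET=2, CORES_PER_GPU=10.
make_rankfile() {
    local gpus_per_node="${GPUS_PER_NODE:-4}"
    local gpus_per_socket="${GPUS_PER_SOCKET:-2}"
    local cores_per_gpu="${CORES_PER_GPU:-10}"
    local num_nodes="$#"
    local node_idx=0
    local node gpu rank socket first last
    for node in "$@"; do
        gpu=0
        while [ "$gpu" -lt "$gpus_per_node" ]; do
            rank=$((gpu * num_nodes + node_idx))                # interleave ranks across nodes
            socket=$((gpu / gpus_per_socket))                   # socket that hosts this GPU
            first=$(((gpu % gpus_per_socket) * cores_per_gpu))  # first core of the slot
            last=$((first + cores_per_gpu - 1))                 # last core of the slot
            printf 'rank %d=%s slot=%d:%d-%d\n' "$rank" "$node" "$socket" "$first" "$last"
            gpu=$((gpu + 1))
        done
        node_idx=$((node_idx + 1))
    done
}
```

For example, `make_rankfile node1 node2 node3 node4 > rankfile` (with placeholder host names) writes 16 lines in the same order as the listing above.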

  16. Setting up the environment for running distributed training

    Change to the example directory, which contains the High-Performance Models (HPM) code integrated with the DDL operators

    $ cd /opt/DL/tensorflow-performance-models/scripts/tf_cnn_benchmarks


    Make sure you have the ImageNet 2012 dataset. Before starting a TensorFlow run, convert the raw images to TFRecord files and place them on a storage mount point of your choice.


    To start using the HPM with the DDL operator, run the following command with attributes of your choice

    $ mpirun -gpu -x LD_LIBRARY_PATH -x PATH -n <<Number of ranks>> -rf <<rank file>> python tf_cnn_benchmarks.py --model=<<model name>> --batch_size=<<batch size>> --num_batches=<<Num of batches>> --num_gpus=<<Num of GPUs>> --data_dir=<<Directory to TF records>> --data_name=imagenet --variable_update=<<mode of variable update>> --ddl_options=<<DDL options>>

    To replicate the results published in the POWER9 collateral for a 4-node cluster (https://developer.ibm.com/linuxonpower/perfcol/perfcol-mldl/#tab_tensorflow4), run the following command (add '--use_fp16=True' to the command if you would like to exercise the fp16 capability)

    $ DDL_OPTIONS="-mode b:4x4 -dev_sync 2 -dump_iter 100 -dbg_level 1" mpirun --allow-run-as-root -gpu -x DDL_OPTIONS -x LD_LIBRARY_PATH -x PATH -n 16 -rf minsik_4x4_paiws3_4_5_7 python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --num_batches=5000 --num_gpus=1 --data_dir=/data/TF_records/ --data_name=imagenet --variable_update="ddl"
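The value passed to `-n` has to agree with the rank file given to `-rf`. A quick pre-flight check can catch a mismatch before the job is launched. This is a sketch under the assumption that each rank line starts with `rank <number>=`; the function name `validate_rankfile` is hypothetical.

```shell
# Succeed only when the rank file defines exactly the ranks 0..N-1, each once.
# Usage: validate_rankfile <rank file> <expected number of ranks>
validate_rankfile() {
    local file="$1"
    local expected="$2"
    local ranks want
    # Extract the rank numbers, sort them numerically, and join them with spaces.
    ranks="$(sed -n -E 's/^[[:space:]]*rank[[:space:]]+([0-9]+)=.*/\1/p' "$file" | sort -n | tr '\n' ' ')"
    # Build the sequence that a complete rank file would contain.
    want="$(seq 0 $((expected - 1)) | tr '\n' ' ')"
    [ "$ranks" = "$want" ]
}
```

For the 4-node run above, `validate_rankfile <<rank file>> 16` should succeed before `mpirun` is started with `-n 16`.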
