IBM recently announced a technical preview of Distributed Deep Learning (DDL) for TensorFlow and Caffe in the IBM PowerAI 4.0.0 distribution. IBM Research has demonstrated close-to-ideal scaling with the DDL software, achieving record-low communication overhead and 95% scaling efficiency on the Caffe deep learning framework across 256 NVIDIA GPUs in 64 IBM Power systems. If you would like to try out DDL, the NIMBIX public cloud can get you started quickly. NIMBIX lets you run Dockerized deep learning jobs in GPU-attached containers on one or more POWER systems connected by RDMA-based Mellanox switches.

Follow these steps:

  1. Bring up at least two nodes in NIMBIX using these steps (the cluster comes up in a few minutes):
    1. Launch the IBM PowerAI: ML/DL and DDL app from Categories -> IBM POWER.

    2. Choose the Machine Type “128 thread POWER8, 512GB RAM, 4xP100 GPU w/NVLink (np8g4)” and increase Cores: to 256 to deploy a two-node cluster.

  2. After the two-node cluster is launched, note the Address and Password fields.
  3. ssh into the master container as nimbix@<host> using <password>.
  4. Locate the slave IP addresses by running more /etc/hosts.
  5. Install the TensorFlow DDL samples by running the following commands:
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-install-samples ddl-samples
  6. Download the flowers dataset with this command:
    python /home/nimbix/ddl-samples/examples/slim/ --dataset_name=flowers --dataset_dir=/home/nimbix/data
  7. Copy the dataset and the sample application to the shared filesystem for all NIMBIX containers, which is mounted at /data in each container:
    cp -r data/ /data/flowers-data; cp -r ddl-samples/ /data
  8. Run TensorFlow with DDL on all the containers using MPI, for example:
    mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH \
      -n `grep JARVICENAE /etc/hosts | wc -l` \
      -host `grep JARVICENAE /etc/hosts | awk '{print $1}' | paste -sd ","` \
      python /data/ddl-samples/examples/slim/ --train_dir=/tmp/train --num_readers=4 \
      --dataset_dir=/data/flowers-data --dataset_name=flowers --max_number_of_steps=502 \
      --num_preprocessing_threads=4 --num_clones=1 --batch_size=32 --optimizer=sgd

  9. You should see training proceed to 500 steps, with most steps running at about 0.5 sec/step.
  10. The trained model is stored in the /tmp/train directory on each container.
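The host-list derivation in step 8 can be wrapped in a pair of small shell helpers. This is only a sketch: it assumes, as the mpirun command above does, that NIMBIX lists each cluster container in /etc/hosts under a hostname containing JARVICENAE.

```shell
#!/bin/bash
# Sketch of the host discovery used by the mpirun command in step 8.
# Assumes NIMBIX writes one JARVICENAE* entry per container to /etc/hosts.

# Print the comma-separated IPs of all cluster containers,
# suitable for mpirun's -host option.
ddl_host_list() {
  grep JARVICENAE "${1:-/etc/hosts}" | awk '{print $1}' | paste -sd ","
}

# Count the containers, i.e. the number of MPI ranks to launch (-n).
ddl_num_nodes() {
  grep -c JARVICENAE "${1:-/etc/hosts}"
}
```

With these helpers, the launch in step 8 becomes `mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -n $(ddl_num_nodes) -host $(ddl_host_list) python ...` with the same script arguments as above.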

These steps do not specify an MPI rankfile for a topology-aware deployment; watch for a separate blog post on that topic.
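For context, a topology-aware deployment pins each MPI rank to a specific host and processor location via an Open MPI rankfile, passed to mpirun with `--rankfile`. A minimal illustrative sketch for two nodes with four ranks each (the hostnames and slot bindings below are placeholders, not values from this cluster):

```
# Hypothetical rankfile: 8 ranks across two nodes, one rank per GPU.
# "slot" values are socket:core bindings and must match your topology.
rank 0=node1 slot=0:0
rank 1=node1 slot=0:1
rank 2=node1 slot=1:0
rank 3=node1 slot=1:1
rank 4=node2 slot=0:0
rank 5=node2 slot=0:1
rank 6=node2 slot=1:0
rank 7=node2 slot=1:1
```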
