IBM PowerAI Distributed Deep Learning (DDL) can be deployed directly into your enterprise private cloud with IBM Cloud Private (ICP). This blog post explains how to do that using TCP or InfiniBand communication between the worker nodes. We will use the command-line interface, although the web interface could also be used for most of the steps.

Minimum requirements

Before you begin

You need the Kubernetes CLI (kubectl) and the Helm CLI (helm) installed to deploy your application from the command line. After installing the CLIs, add the IBM Helm Chart repository:

    helm repo add ibm-charts https://raw.githubusercontent.com/IBM/charts/master/repo/stable/

Deploying IBM PowerAI DDL with TCP cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy the PowerAI Helm Chart with DDL enabled.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
  3. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.
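Because DDL claims every GPU on a node, the pod count is simply the requested GPU total divided by the per-node GPU count. A minimal sketch of that arithmetic, assuming 4 GPUs per worker node (an assumption; the actual count depends on your hardware):

```shell
# Illustrative only: how the pod count relates to the GPU request.
# GPUS_PER_NODE=4 is an assumption about the worker nodes in this cluster.
REQUESTED_GPUS=8    # matches --set resources.gpu=8 above
GPUS_PER_NODE=4
echo $(( REQUESTED_GPUS / GPUS_PER_NODE ))   # → 2, matching the two pods listed above
```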

  4. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.

    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  5. Train the model with DDL.
    ddlrun --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0' --tcp --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
    • --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0': Specify the network interface to use for MPI and NCCL; eth0 is the interface that connects the nodes
    • --tcp: Use TCP for cross-node communication
    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod
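For reference, the hostfile lists one entry per pod. Here is a hypothetical sketch of what /powerai/config/hostfile might contain for the two-pod deployment above; the actual names and format are generated by the chart, so treat these entries as purely illustrative:

```
# hypothetical contents; the real file is autogenerated by the Helm Chart
ddl-instance-ibm-powerai-0.ddl-instance-ibm-powerai
ddl-instance-ibm-powerai-1.ddl-instance-ibm-powerai
```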

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2284.62
    ----------------------------------------------------------------
    
  6. Delete your deployment.
    helm delete ddl-instance --purge --tls

Using host network on container

The Helm Chart provides the option to use the host network for communication, which can improve performance. The potential disadvantage of this option is that all of the host's network interfaces become visible inside the container. An SSH port other than 22 must be chosen so that the containers do not interfere with the host's SSH daemon. Here is an example of deploying with the host network:

    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useHostNetwork=true --set ddl.sshPort=2200

Deploying IBM PowerAI DDL with InfiniBand cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy the InfiniBand device plugin.
    kubectl -n kube-system apply -f https://raw.githubusercontent.com/nimbix/k8s-rdma-device-plugin/deploy-bionic/rdma-device-plugin.yml
  3. Install the latest Mellanox OFED (MOFED) user-space drivers in a PowerAI Docker container.
    • Download the latest MOFED archive into the container.
    • Install the needed packages, extract the archive, and run the installer.
      sudo apt-get update; sudo apt-get install -y lsb-release
      tar -xzvf MLNX_OFED_LINUX-*
      sudo MLNX_OFED_LINUX-*-ppc64le/mlnxofedinstall --user-space-only --without-fw-update --all -q
  4. Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
  5. Deploy the PowerAI Helm Chart with InfiniBand communication.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useInfiniBand=true --set image.repository=my_docker_repo --set image.tag=powerai-mofed
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
    • --set ddl.useInfiniBand=true: Use InfiniBand for communication
    • --set image.repository=repo: Repository containing the PowerAI image with MOFED installed
    • --set image.tag=tag: Tag of the PowerAI image with MOFED installed
  6. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.

  7. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.

    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  8. Restart a login session to get the correct ulimit settings.
    sudo su - $USER

    Note: Alternatively, you can modify the default ulimit by adding --default-ulimit memlock=-1 to the Docker daemon on all the worker nodes.
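To confirm that the new login session picked up the right settings, you can check the max locked-memory limit inside the pod; InfiniBand memory registration generally requires it to be unlimited (the value printed depends on your environment):

```shell
# Print the max locked-memory (memlock) limit for the current shell.
# For RDMA/InfiniBand workloads this should ideally report "unlimited",
# which corresponds to the memlock=-1 setting mentioned in the note above.
ulimit -l
```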

  9. Train the model with DDL using InfiniBand.
    ddlrun --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl

    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2855.78
    ----------------------------------------------------------------
    
  10. Delete your deployment.
    helm delete ddl-instance --purge --tls
