This is the first part of a two-part article describing TensorFlow deployment for training using Docker and a Kubernetes cluster running on OpenPower servers with NVIDIA Tesla P100 GPUs.

We’ll make use of the PowerAI offering. PowerAI is a software distribution for deep learning from IBM. Pre-built binaries for popular machine learning frameworks and their dependencies are available for the PowerPC architecture. More details are available at: http://ibm.biz/powerai

If you are looking to leverage the community TensorFlow code directly, then have a look at the following article: https://goo.gl/mQM1Gp

In the second part of the article, we’ll look at inference. A special shout out to my team member Abhishek Dasgupta without whom this article would not have been possible.

What is TensorFlow?

TensorFlow is an open-source software library for designing, building, and training deep learning models.

A typical end-to-end workflow with TensorFlow looks like this:

The first step is training, which can run on either GPU- or CPU-based systems. The trained model is then made available (exported) to applications via TensorFlow Serving. Exporting a model for inference is like deploying any other application, with the usual application-specific concerns such as scaling and availability. Once the model is exported, any application can use it for inference. Inference can also run on either GPU- or CPU-based systems.

Both training and inference can run on the same Kubernetes cluster or on different clusters. For example, training can happen on an on-prem cluster, whereas inference using the trained model can happen off-prem for test/dev applications.

Prerequisites

This section describes the various prerequisites that are required when planning to deploy TensorFlow with Docker and Kubernetes on OpenPower servers.

Kubernetes

At a minimum, you’ll need Kubernetes 1.6, which adds support for multiple GPUs. Kubernetes binaries for Power are available from the project release page. For more information about setting up Kubernetes on OpenPower servers, see Managing docker containers with orchestration.
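
To confirm that your cluster meets this requirement and that Kubernetes can see the GPUs, a quick check from any machine with kubectl configured might look like the following (in Kubernetes 1.6, NVIDIA GPUs are typically exposed through the alpha.kubernetes.io/nvidia-gpu resource):

    $ kubectl version --short
    $ kubectl describe nodes | grep -i nvidia-gpu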

NVIDIA software

The following software stack is used on the OpenPower servers (IBM S822LC for HPC) in our setup:

  • Ubuntu 16.04
  • CUDA 8.0 toolkit (Download CUDA 8.0). On Ubuntu 16.04, the CUDA 8.0 packages are available under /usr/local/cuda-8.0/ and /usr/lib/powerpc64le-linux-gnu/ after installation.
  • cuDNN (Download cuDNN). Ensure cuDNN is extracted into /usr/local/cuda-8.0.
  • NVIDIA 375 (nvidia-375) driver (Download the Nvidia driver). The nvidia-375 driver is installed on the host under /usr/lib/nvidia-375/.

The above-mentioned paths are used in the steps below. Ensure you use the correct paths for the CUDA toolkit and NVIDIA driver based on your specific environment.
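
Before moving on, it is worth verifying that the driver and toolkit are functional on the host. A minimal sanity check using the standard NVIDIA utilities could be:

    $ nvidia-smi
    $ /usr/local/cuda-8.0/bin/nvcc --version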

Setup Instructions

  1. Build a TensorFlow Docker image using PowerAI binaries

    We’ll leverage the TensorFlow binaries shipped as part of PowerAI distribution to build a Docker image for training. We have used the example described in How to Fine-Tune a Pre-Trained Model on a New Task.

    The following instructions will build the Docker image for training:

    $ git clone https://github.com/ai-infra/tensorflow-automated-training.git tf-training
    $ cd tf-training/powerai

    Run the following command to build the Docker image:

    $ docker build -t ppc64le/tf-train-flowers -f Dockerfile-powerai.ppc64le .
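
    If the build completes successfully, the new image should show up in your local image list; for example:

    $ docker images | grep tf-train-flowers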

  2. Start training using standalone Docker

    The following command starts the training. The trained model will be available on the host at /root/runs.

    $ docker run -it --privileged \
        -v /usr/local/cuda-8.0/:/usr/local/cuda-8.0/ \
        -v /usr/lib/powerpc64le-linux-gnu/:/usr/lib/powerpc64le-linux-gnu/ \
        -v /usr/lib/nvidia-375/:/usr/lib/nvidia-375/ \
        -v /root/runs:/flowers-train ppc64le/tf-train-flowers \
        /bin/bash -c \
        "source /opt/DL/bazel/bin/bazel-activate && \
        source /opt/DL/tensorflow/bin/tensorflow-activate && \
        ./run-trainer.sh 10000 && \
        rsync -ah flowers_train/ flowers-train/"
    $ ls /root/runs/
    checkpoint
    events.out.tfevents.1490704717.jarvis
    model.ckpt-30000.data-00000-of-00001
    model.ckpt-30000.index
    model.ckpt-30000.meta
    model.ckpt-35000.data-00000-of-00001
    model.ckpt-35000.index
    model.ckpt-35000.meta
    model.ckpt-40000.data-00000-of-00001
    model.ckpt-40000.index
    model.ckpt-40000.meta
    model.ckpt-45000.data-00000-of-00001
    model.ckpt-45000.index
    model.ckpt-45000.meta
    model.ckpt-49999.data-00000-of-00001
    model.ckpt-49999.index
    model.ckpt-49999.meta
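
    While the trainer is running, you can keep an eye on GPU utilization from the host. A simple way to do this (using the standard nvidia-smi utility installed with the driver) is:

    $ watch -n 2 nvidia-smi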
  3. Start training by deploying on a Kubernetes cluster

    Once the Docker image is ready, deploying it on a Kubernetes cluster is a breeze. An example YAML file is available in the repo.

    $ kubectl create -f https://raw.githubusercontent.com/ai-infra/tensorflow-automated-training/master/powerai/tf-inception-trainer-flowers-powerai.yaml
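
    Once the pod is running, you can follow the training progress with standard kubectl commands. The pod name below is illustrative; use the name reported by kubectl get pods:

    $ kubectl get pods
    $ kubectl logs -f <trainer-pod-name>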

Let us know if you come across any issues when using TensorFlow with Docker or Kubernetes on OpenPower.

Comments

  1. Hi Pradipta

    Nice article 🙂
    I started a Helm chart to operate scalable TensorFlow on K8s; I’d be interested in your feedback. A mini tutorial / post is available here: https://medium.com/intuitionmachine/kubernetes-gpus-tensorflow-8696232862ca and the chart is at https://github.com/madeden/charts

    The Canonical Distribution of Kubernetes is planned for July on ppc64le, so you can’t do the deployment with it yet. However, the chart should work provided you build Tiller (Helm) for your architecture. This would make it very easy to deploy apps on top of K8s. Have you given some thought to this?

    Also, I can’t help but share this (which comes from one of my readers): it seems it’s no longer required to be in a privileged context to run GPU workloads. You might want to experiment with a standard security context.

    • Pradipta_Kumar April 24, 2017

      Hi Samuel, Thanks.
      I have used a Helm chart for TF Serving but haven’t used one for training. I will take a look at your Helm chart and give it a try. Helm charts are a great way to deploy K8s apps. No second thoughts!!
      Good to know that the Canonical distribution of Kubernetes for ppc64le is planned for July. I have been looking forward to it for quite some time 🙂
      Regarding not requiring ‘privileged’ access for GPUs any more, I’m not sure if this applies only when using nvidia-docker. I’ll check and update this thread.


