We’ll make use of the PowerAI offering. PowerAI is a software distribution for deep learning from IBM. Pre-built binaries for popular machine learning frameworks and their dependencies are available for the PowerPC architecture. More details are available at: http://ibm.biz/powerai
If you are looking to leverage the community TensorFlow code directly, then have a look at the following article: https://goo.gl/mQM1Gp
In the second part of this article, we’ll look at inference. A special shout-out to my team member Abhishek Dasgupta, without whom this article would not have been possible.
What is TensorFlow?
TensorFlow is an open-source software library for designing, building, and training deep learning models.
A typical end-to-end workflow with TensorFlow looks like this:
The first step is training, which can run on either GPU- or CPU-based systems. The trained model is then made available (exported) to applications via TensorFlow Serving. Exporting a model for inference is like deploying any application, and involves handling application-specific concerns such as scaling and availability. Once the model is exported, any application can use it for inference. Inference, too, can run on either GPU- or CPU-based systems.
Both training and inference can run on the same Kubernetes cluster or on different clusters. For example, training can run on an on-premises cluster, while inference using the trained model happens off-premises for test/dev applications.
This section describes the various prerequisites that are required when planning to deploy TensorFlow with Docker and Kubernetes on OpenPower servers.
At a minimum, you’ll need Kubernetes 1.6, which adds support for multiple GPUs. Kubernetes binaries for Power are available from the project release page. For more information about setting up Kubernetes on OpenPower servers, see Managing docker containers with orchestration.
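To sketch what multi-GPU support looks like at the pod level: in Kubernetes 1.6, a container requests GPUs through the alpha resource name (this later changed to nvidia.com/gpu with the device-plugin mechanism). The pod and image names below are placeholders for illustration:

```shell
# Write a minimal GPU pod spec (hypothetical names; 1.6 alpha resource syntax).
cat <<'EOF' > gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda
    image: ppc64le/cuda-test   # placeholder image name
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1   # request one GPU (1.6 alpha syntax)
EOF
# kubectl create -f gpu-pod.yaml   # run this against your cluster
```

The commented `kubectl create` line would submit the pod to your cluster once the file looks right for your environment.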
The following software stack is used on the OpenPower servers (IBM S822LC for HPC) in our setup:
- CUDA 8.0 toolkit
In Ubuntu 16.04, the CUDA 8.0 packages will be available under /usr/lib/powerpc64le-linux-gnu/ after installation.
- cuDNN
Ensure cuDNN is extracted in
- Nvidia 375 (nvidia-375) driver
The nvidia-375 driver is installed on the host on
The above-mentioned paths are used in the steps below. Ensure you use the correct paths for the CUDA toolkit and Nvidia driver based on your specific environment.
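Before wiring these paths into Docker volume mounts, a quick sanity check helps catch mismatches early. A minimal sketch, assuming the default install locations from our setup (adjust the list for your environment):

```shell
# Check that the host paths used in the -v mounts below actually exist.
# The paths listed are from our setup and may differ on yours.
checked=0; missing=0
for p in /usr/local/cuda-8.0 \
         /usr/lib/powerpc64le-linux-gnu \
         /usr/lib/nvidia-375; do
  checked=$((checked + 1))
  if [ -d "$p" ]; then
    echo "found:   $p"
  else
    echo "missing: $p  (fix the corresponding -v mount)"
    missing=$((missing + 1))
  fi
done
echo "$checked paths checked, $missing missing"
```

Any path reported missing here needs to be corrected in the `docker run` commands that follow.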
Setup Instructions
- Build a TensorFlow Docker image using PowerAI binaries
We’ll leverage the TensorFlow binaries shipped as part of the PowerAI distribution to build a Docker image for training. We use the example described in How to Fine-Tune a Pre-Trained Model on a New Task.
The following instructions will build the Docker image for training:
$ git clone https://github.com/ai-infra/tensorflow-automated-training.git tf-training
$ cd tf-training/powerai
Run the following command to build the Docker image:
$ docker build -t ppc64le/tf-train-flowers -f Dockerfile-powerai.ppc64le .
- Start training using standalone Docker
The following command starts the training. The trained model will be available on the host at /root/runs:
$ docker run -it --privileged -v /usr/local/cuda- \
    -v /usr/lib/powerpc64le-linux-gnu/:/usr/lib/powerpc64le-linux-gnu/ \
    -v /usr/lib/nvidia-375/:/usr/lib/nvidia-375/ \
    -v /root/runs:/flowers-train ppc64le/tf-train-flowers \
    /bin/bash -c \
    "source /opt/DL/bazel/bin/bazel-activate && \
     source /opt/DL/tensorflow/bin/tensorflow-activate && \
     ./run-trainer.sh 10000 && \
     rsync -ah flowers_train/ flowers-train/"
$ ls /root/runs/
00000-of-00001 model.ckpt-40000.meta model.ckpt-
- Start training by deploying on a Kubernetes cluster
Once the Docker image is ready, deploying it to a Kubernetes cluster is a breeze. An example YAML file is available in the repo.
$ kubectl create -f https://raw.githubusercontent.com/ai-
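If you prefer to keep the manifest locally rather than pointing kubectl at a URL, the deployment can be sketched as a batch Job. This is a simplified, hypothetical manifest: the hostPath volume mirrors the /root/runs mount used with standalone Docker above, and the CUDA/driver mounts are omitted for brevity:

```shell
# Write a minimal training Job spec (hypothetical names, simplified;
# add the CUDA and driver hostPath volumes your environment needs).
cat <<'EOF' > tf-train-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-train-flowers
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: ppc64le/tf-train-flowers
        securityContext:
          privileged: true
        volumeMounts:
        - name: runs
          mountPath: /flowers-train   # trained model lands here
      volumes:
      - name: runs
        hostPath:
          path: /root/runs            # matches the standalone Docker mount
EOF
# kubectl create -f tf-train-job.yaml   # submit to your cluster
```

As with the standalone Docker run, the trained model ends up under /root/runs on the node that ran the Job.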
Let us know if you come across any issues when using TensorFlow with Docker or Kubernetes on OpenPower.