This blog builds on my previous blog, “Compiling OpenMPI with IBM Spectrum LSF in a Docker container image”, and extends the concept to include TensorFlow and Horovod. It is written specifically for the IBM Power server platform.
With the advent of Docker container support in IBM Spectrum LSF 10.1, it has become much easier to build and maintain environments for containerized workloads. This blog will explore building a custom NVIDIA Docker container that allows running the TensorFlow benchmark with Horovod across multiple servers and multiple GPUs.
This blog assumes you have installed IBM Spectrum LSF on the Power Little Endian platform (linux3.10-glibc2.17-ppc64le) and that Docker, NVIDIA Docker, and CUDA are installed and running on the nodes in your cluster. To start, you will need the following:
| Software | Version | Edition |
| --- | --- | --- |
| Red Hat Linux Server | 7.6 | Enterprise |
| IBM Spectrum LSF | 10.1.0.8+ | Standard Edition or Suite |
| Docker | 17.03+ | Community or Enterprise Edition |
Verify your Docker Engine version with this command:
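The standard docker version command reports both the client and server versions; check that the server reports 17.03 or later:

```shell
# Requires a running Docker daemon on the node
docker version
```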
Build a new TensorFlow and Horovod Docker container with Open MPI compiled with LSF support
Log in as a user with the ability to run docker commands. The steps below assume your working directory (pwd) remains the same throughout.
Prepare minimal LSF files for Open MPI compile
The goal is to prepare the minimal files from your LSF environment necessary to compile Open MPI with LSF inside a Docker container. Copy the script below and paste into a file called mktmplsf.sh. This script will generate a directory called “lsf” with the LSF libraries, include files and configuration file. The files in the “lsf” directory will be used in the next step.
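The original script is not reproduced here; the following is a minimal sketch of what such a script does, assuming LSF_TOP and LSF_ENVDIR are set in your environment (for example, by sourcing profile.lsf) and that your installation uses the linux3.10-glibc2.17-ppc64le binary type. Adjust paths for your site:

```shell
#!/bin/bash
# mktmplsf.sh (sketch): collect the minimal LSF files needed to
# compile Open MPI with LSF support inside a Docker container.
# Assumes LSF_TOP and LSF_ENVDIR are set (e.g., via profile.lsf).
set -e

LSF_LIBDIR="$LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/lib"
LSF_INCDIR="$LSF_TOP/10.1/include"

mkdir -p lsf/lib lsf/include/lsf lsf/conf

# LSF batch and base libraries needed when linking Open MPI's LSF components
cp "$LSF_LIBDIR"/libbat.* "$LSF_LIBDIR"/liblsf.* lsf/lib/

# LSF header files referenced by Open MPI's configure checks
cp "$LSF_INCDIR"/lsf/lsbatch.h "$LSF_INCDIR"/lsf/lsf.h lsf/include/lsf/

# LSF configuration file
cp "$LSF_ENVDIR/lsf.conf" lsf/conf/

echo "Created the following files:"
find lsf -type f
```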
Here are the steps to run the script and see the directories and files created:
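Assuming the script was saved as mktmplsf.sh in the current directory, the steps might look like this (the exact file listing will vary with your LSF installation):

```shell
chmod +x mktmplsf.sh
./mktmplsf.sh

# Inspect the generated directory tree
find lsf -type f
```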
Create a Dockerfile
Copy the text below and paste into a file called Dockerfile.
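The Dockerfile itself is not reproduced here; below is a hedged sketch of its likely shape. The base image tag, Open MPI version (3.1.4), and install paths are assumptions; the --with-lsf and --with-lsf-libdir configure options and the HOROVOD_WITH_TENSORFLOW build flag are the standard mechanisms for this kind of build:

```dockerfile
# Sketch only: base image tag and versions are assumptions.
FROM ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3

# Minimal LSF files generated by mktmplsf.sh
COPY lsf /tmp/lsf

# Build Open MPI with LSF support using the copied LSF files
RUN cd /tmp && \
    wget -q https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.4.tar.gz && \
    tar xzf openmpi-3.1.4.tar.gz && \
    cd openmpi-3.1.4 && \
    ./configure --prefix=/usr/local \
        --with-lsf=/tmp/lsf \
        --with-lsf-libdir=/tmp/lsf/lib && \
    make -j"$(nproc)" install && \
    ldconfig

# Build Horovod against TensorFlow and the Open MPI just installed
RUN HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod
```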
Build a new Docker container
Use the command below to build the new container image. It will take several minutes to perform all the steps; the resulting image will be called “docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod”. Note that both the Dockerfile and the lsf directory must be in your current working directory.
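The build command, run from the directory containing the Dockerfile and the lsf directory:

```shell
docker build -t docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod .
```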
Now, run the docker images command; if the docker build command succeeded, your new container image should be listed.
You can repeat the above docker build process on every NVIDIA Docker-enabled compute node in your LSF cluster, or you can distribute the image by other means, such as publishing it to your internal Docker registry or using the docker save and docker load commands.
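For example, using docker save and docker load (the archive file name is arbitrary):

```shell
# On the build host: export the image to a compressed archive
docker save docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod | gzip > tf-horovod.tar.gz

# Copy tf-horovod.tar.gz to each compute node, then load it there:
gunzip -c tf-horovod.tar.gz | docker load
```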
Setting up LSF with Docker
1) Prepare IBM Spectrum LSF to run jobs in Docker containers by following the LSF Docker integration instructions.
2) Configure an LSF Docker application profile for the new Docker container image by adding the following lines (changing LSF_TOP to your LSF top directory location) to the end of the lsb.applications file, and then run badmin reconfig or badmin mbdrestart on the LSF master:
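The profile from the original post is not reproduced here; a hedged sketch follows. The profile name tfhorovod is a hypothetical choice and the docker options are illustrative, while the CONTAINER and EXEC_DRIVER keywords and the docker-starter.py/docker-control.py/docker-monitor.py scripts are part of the standard LSF Docker integration. Replace LSF_TOP with your LSF top directory:

```
Begin Application
NAME = tfhorovod
DESCRIPTION = TensorFlow + Horovod container jobs
CONTAINER = docker[image(docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod) \
    options(--rm --net=host --ipc=host -v LSF_TOP:LSF_TOP)]
EXEC_DRIVER = context[user(default)] \
    starter[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-starter.py] \
    controller[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-control.py] \
    monitor[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-monitor.py]
End Application
```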
Testing the new container with LSF
Testing the new container with MPI Hello World
Make sure MPI is working as expected before attempting to run the TensorFlow benchmark across nodes.
Example of running MPI hello world on a single node with 1 message
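A hedged example of such a job, assuming an LSF application profile named tfhorovod (hypothetical name) and an MPI hello world binary compiled into the image at /usr/local/bin/mpi_hello_world (hypothetical path):

```shell
# One job slot on one node: one hello-world message
bsub -app tfhorovod -I -n 1 mpirun /usr/local/bin/mpi_hello_world
```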
Example of running MPI hello world on a single node with 2 job slots or 2 messages.
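A hedged example, with the same hypothetical tfhorovod profile and hello-world binary path; span[ptile=2] keeps both slots on one node:

```shell
# Two job slots on a single node: two hello-world messages
bsub -app tfhorovod -I -n 2 -R "span[ptile=2]" mpirun /usr/local/bin/mpi_hello_world
```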
Example of running MPI hello world on 2 nodes with 1 message per node.
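A hedged example, with the same hypothetical tfhorovod profile and hello-world binary path; span[ptile=1] places one slot per node:

```shell
# One job slot on each of two nodes: one message per node
bsub -app tfhorovod -I -n 2 -R "span[ptile=1]" mpirun /usr/local/bin/mpi_hello_world
```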
Testing the new container with requests for GPUs
Example job requesting 1 GPU and showing nvidia-smi output
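A hedged example, again assuming the hypothetical tfhorovod application profile; the -gpu option is standard LSF syntax:

```shell
# Request one GPU in exclusive-process mode and show it with nvidia-smi
bsub -app tfhorovod -I -gpu "num=1:mode=exclusive_process" nvidia-smi
```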
Example job requesting 2 GPUs and showing nvidia-smi output
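The same hedged example with two GPUs requested:

```shell
# Request two GPUs in exclusive-process mode and show them with nvidia-smi
bsub -app tfhorovod -I -gpu "num=2:mode=exclusive_process" nvidia-smi
```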
Testing the new container with TensorFlow benchmark on a single compute node
Example TensorFlow benchmark with 1 GPU on a single compute node
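A hedged sketch, assuming the hypothetical tfhorovod profile and that the tf_cnn_benchmarks scripts live at /root/benchmarks/scripts/tf_cnn_benchmarks inside the image (hypothetical path); --num_gpus, --model, --batch_size, and --num_batches are standard tf_cnn_benchmarks options:

```shell
# Single-node benchmark on 1 GPU; %J expands to the LSF job ID
bsub -app tfhorovod -o stdout%J.txt -e stderr%J.txt \
  -gpu "num=1:mode=exclusive_process" \
  python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=1 --model=resnet50 --batch_size=64 --num_batches=100
```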
Example TensorFlow benchmark with 4 GPU on a single compute node
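The same hedged sketch scaled to 4 GPUs on one node (profile name and benchmark path are hypothetical, as above):

```shell
# Single-node benchmark on 4 GPUs
bsub -app tfhorovod -o stdout%J.txt -e stderr%J.txt \
  -gpu "num=4:mode=exclusive_process" \
  python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=4 --model=resnet50 --batch_size=64 --num_batches=100
```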
A few notes on the above examples:
1) The above was tested with NVIDIA Tesla V100 GPUs with 16 GB of RAM. You will likely need to decrease the batch size parameter value if your GPUs have less RAM, or you can try increasing it if they have more.
2) The number of batches is intentionally small in the above examples for testing. Increase the num_batches value to have the benchmark run for a longer period of time.
3) If the above jobs have problems running, check the standard error file, stderr<JOBID>.txt.
Testing the new container with TensorFlow benchmark with Horovod
A few notes on the examples below:
1) For the benchmark, use your fastest network, which should be 10 Gb or faster, or potentially InfiniBand. The examples use a 40 Gb Ethernet network. In the btl_tcp_if_include and HOROVOD_GLOO_IFACE parameter values, replace my network interface, “enP48p1s0f0”, with the fastest network interface available on your compute nodes.
2) The mpirun command has several debugging options enabled.
3) If the jobs below have problems running, check the standard error file, stderr<JOBID>.txt.
4) If you only have 1 GPU per node, change the LSF bsub parameters.
In the first example (on a single compute node) below, change:
-n 4 -R "span[ptile=4]" -gpu "num=4:mode=exclusive_process"
to:
-n 1 -R "span[ptile=1]" -gpu "num=1:mode=exclusive_process"
In the second example (on two compute nodes) below, change:
-n 4 -R "span[ptile=2]" -gpu "num=2:mode=exclusive_process"
to:
-n 2 -R "span[ptile=1]" -gpu "num=1:mode=exclusive_process"
Example TensorFlow benchmark using Horovod with 4 GPUs on a single compute node
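A hedged sketch of this job. The tfhorovod profile name and benchmark path are hypothetical; the bsub resource parameters match those discussed above; replace enP48p1s0f0 with your own network interface. NCCL_DEBUG, plm_base_verbose, and --tag-output are examples of the kind of debugging options mentioned:

```shell
# 4 slots, all on one node, one GPU per slot
bsub -app tfhorovod -o stdout%J.txt -e stderr%J.txt \
  -n 4 -R "span[ptile=4]" -gpu "num=4:mode=exclusive_process" \
  mpirun -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -x NCCL_DEBUG=INFO \
    --mca btl_tcp_if_include enP48p1s0f0 \
    --mca plm_base_verbose 10 --tag-output \
    python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
      --model=resnet50 --batch_size=64 --num_batches=100 \
      --variable_update=horovod
```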
Example TensorFlow benchmark using Horovod with 4 GPUs (2 GPUs per node) on two compute nodes
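The same hedged sketch spread across two nodes: span[ptile=2] places 2 of the 4 slots on each node, with 2 GPUs requested per node (profile name, benchmark path, and interface name are assumptions, as above):

```shell
# 4 slots across two nodes, 2 GPUs per node
bsub -app tfhorovod -o stdout%J.txt -e stderr%J.txt \
  -n 4 -R "span[ptile=2]" -gpu "num=2:mode=exclusive_process" \
  mpirun -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -x NCCL_DEBUG=INFO \
    --mca btl_tcp_if_include enP48p1s0f0 \
    --mca plm_base_verbose 10 --tag-output \
    python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
      --model=resnet50 --batch_size=64 --num_batches=100 \
      --variable_update=horovod
```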
Now you have a new container image that is ready to run TensorFlow with Horovod across multiple nodes in an LSF cluster. Please leave comments or feedback on the above information, and let me know if you would like the article to include x86_64 equivalents.