Improve system throughput by running multiple containers with H2O Driverless AI on IBM Power Systems

Introduction

When running H2O Driverless AI on IBM® Power Systems™ servers, each job can be run in a container. Each job trains a machine learning model on a data set. Running these jobs serially, one at a time, takes a long time to complete. Instead, you can reduce the total time significantly by running all the jobs simultaneously, independent of each other.

This tutorial explains how to set up and train multiple H2O Driverless AI containers in parallel. In the example provided in this tutorial, for simplicity, only one data set was used. The scripts can be changed to use multiple data sets.

Prerequisites

To set up and train multiple H2O Driverless AI containers in parallel, you should be familiar with Docker containers and H2O Driverless AI. Familiarity with the NVIDIA Docker container is helpful, but not essential, to follow the instructions in this tutorial for creating multiple containers that each run H2O Driverless AI. However, any customization, such as implementing a separate file structure within each container or modifying the scripts to run a model other than the one used in this tutorial, requires deeper familiarity with Docker containers and H2O Driverless AI.

You can find the list of prerequisites at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/install/ibm-docker.html

If GPUs are available, follow the steps for GPU. Otherwise, use the steps for CPU. The data set used in the example below is allyears, which is about 15 GB in size and can be found at https://s3.amazonaws.com/h2o-public-test-data/airlines/allyears.1987.2013.csv

The requirement for disk space for each container depends on factors such as the model size and the number of experiments that need to be saved for later review. The total requirement for multiple containers can be estimated by simply adding up the requirements for each container. For general sizing information about each H2O Driverless AI experiment, refer to the document at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/sizing-requirements.html
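That additive estimate can be sketched as simple arithmetic. The per-container figure below is an assumed placeholder, not an H2O recommendation; use the sizing guide linked above for real numbers.

```shell
# Estimate total disk for N containers by summing per-container needs.
# 50 GB per container is an assumed placeholder value.
PER_CONTAINER_GB=50
NUM_CONTAINERS=8
TOTAL_GB=$((PER_CONTAINER_GB * NUM_CONTAINERS))
echo "Estimated total disk: ${TOTAL_GB} GB"
```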

Estimated time

It will take approximately 2 hours to perform the steps described in this tutorial.

Steps

To set up H2O Driverless AI and run multiple containers in parallel, you need to perform the tasks described in the following sections.

Step 1. Install Docker, H2O Driverless AI Docker image, and NVIDIA container libraries (if using GPUs)

For detailed instructions to perform these operations, refer to the document at: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/install/ibm-docker.html. The instructions in that document are grouped under two subheadings:

  • Install on IBM with GPUs – follow steps 1 through 9
  • Install on IBM with CPUs – follow steps 1 through 4

You can use the instructions relevant to your system installation.

After completing the setup, you should see output similar to the following for these commands:

Command: rpm -qa | grep docker

Output:

docker-common-1.13.1-75.git8633870.el7_5.ppc64le
docker-rhel-push-plugin-1.13.1-75.git8633870.el7_5.ppc64le
docker-1.13.1-75.git8633870.el7_5.ppc64le
docker-client-1.13.1-75.git8633870.el7_5.ppc64le

Command: systemctl status docker

Output:

docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-08-17 16:01:42 EDT; 3 weeks 2 days ago
     Docs: http://docs.docker.com
 Main PID: 5989 (dockerd-current)
    Tasks: 277
   Memory: 849.2M
   CGroup: /system.slice/docker.service
           ├─  5989 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization...
           ├─  6118 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2...
           ├─  7850 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 30275 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 34143 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 36910 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 46455 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 84683 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 84791 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─112222 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─116806 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─117948 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─120774 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─125084 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─129060 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─152654 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           └─159145 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...

If using GPU:

Command: rpm -qa | grep nvidia-container

Output:

libnvidia-container-tools-1.1.1-1.ppc64le
libnvidia-container1-1.1.1-1.ppc64le
nvidia-container-toolkit-1.1.2-2.ppc64le

Step 2. Set up the H2O Docker image and directory structure

You can find all the scripts discussed in this tutorial at the following repository: https://github.com/lilianrom/multijob. Download, extract the ZIP file, and change to the multijob directory.

  • The multijob directory contains the following scripts, which create, remove, and resume containers, load data into the H2O Driverless AI instance running in each container, and run the training model in each container:

    • multi-job-setup.sh
    • multi-job-stop.sh
    • multi-job-resume.sh
    • multi-run-db.sh
    • multi-run-model.sh
    • make.sh
    • workloads.txt
  • The job-template directory is used to create a separate working directory for each container. It assumes that the data sets are located at /h2o/databackup

  • The scripts directory contains the scripts to run the training model and configuration files.

After all the prerequisites are installed, verify that the Docker image is installed by running the following command:

Command: docker image ls

Output:

REPOSITORY                  TAG                   IMAGE ID            CREATED             SIZE
h2oai/dai-centos7-ppc64le   1.9.0.2-cuda10.0.11   424df3891bd3        8 days ago          17.7 GB

In this example, note that the REPOSITORY:TAG information for h2oai is:
h2oai/dai-centos7-ppc64le:1.9.0.2-cuda10.0.11

Later, this information is used as one of the parameters to create the Docker containers running H2O Driverless AI.

Step 3. Create working directories for each container and start containers

Before setting up multiple containers, modify the make.sh script located in the top-level working directory and add the REPOSITORY:TAG information.

For each model that needs to be trained, create a directory under the job-template/scripts directory. In the example provided, the directory used is allyears. Use the files provided as a reference to create your own set of files:

<YOURMODEL-model-1.9.0.gpu.py>, <YOURMODEL-model-1.9.0.cpu.py>, <YOURMODEL-database.py>

For example, if a new data set, higgs, needs to be added, create a directory for it at job-template/scripts/higgs.

The workloads.txt file contains the list of models to run on each container and each line corresponds to a container. This file is used for both loading data sets and running the models in a container. For example, if a user wants to run workload higgs on container 1, the file workloads.txt will have higgs in line 1.
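For example, a three-container workloads.txt might look like the following. The higgs and allyears names follow the examples in this tutorial; the particular mix shown is hypothetical.

```shell
# One workload name per line; the name on line N is used by container N.
cat > workloads.txt <<'EOF'
higgs
allyears
allyears
EOF
# Container 1 would run the workload named on line 1:
sed -n '1p' workloads.txt
```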

The job-template/data directory is a symbolic link that points to the directory where the training data sets are located. Change the link to point to the location of your new data sets. For example, use the command:

ln -sf /h2o/databackup data

Where /h2o/databackup is the directory that contains the new data sets.

Add the H2O license file as job-template/license/license.sig.

To set up multiple containers, run the following command:

multi-job-setup.sh <beginning container #> <ending container #>

This script uses job-template to create the directory structure needed for each container. This script takes the beginning and the end container as the input. For example:

./multi-job-setup.sh 1 1 (this will create 1 container)

If more containers are needed, you can run the script again, starting with the next container number and ending with the last container needed. For example, if seven more containers are needed, use:

./multi-job-setup.sh 2 8

The log file (for example, setup-job-1-log for container 1) is located under the job-<#>/job directory.

Note: This script starts H2O Driverless AI, and the initial port number to connect to the web interface is 12345. For each additional container, the port number is incremented by 1. In this case, with eight containers, the port of the first container is 12345 and the port of the last container is 12352.
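The port numbering can be sketched as follows: container N maps to port 12345 + N - 1 (host-ip-address is a placeholder for your server's address).

```shell
# Print the web interface URL for each container, using the base port 12345
# assigned by the setup script.
BASE_PORT=12345
for N in 1 2 3 4 5 6 7 8; do
  echo "container $N -> https://host-ip-address:$((BASE_PORT + N - 1))"
done
```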

After the container is started, you can access H2O Driverless AI running in that container by using the following URL:

https://host-ip-address:<port assigned to the container>

For example, the H2O Driverless AI instances running in containers 1 and 3 can be accessed using the following URLs:

https://host-ip-address:12345

https://host-ip-address:12347

Only one URL can be accessed at a time. This is for monitoring the progress of loading the model and the experiment, retrieving the results, and so on. Do not use the web interface to perform any updates.

Each job-NN directory contains the directory structure shown in the following output:

ls job-1
data job license log scripts

Notice that data, job, license, log, scripts, and tmp are created.

  • data: Contains a link to the location of the data set. Example: data -> /h2o/databackup
  • job: Is an empty directory. This is used to store any output from running the scripts
  • license: Contains the H2O Driverless AI license file. Example: license.sig
  • log: Is initially empty. It contains the H2O Driverless AI logs.
  • scripts: Contains the Python client scripts and run scripts to load the database and run the model in the container. The scripts directory contains the allyears subdirectory, which contains the Python scripts to load the data set, a script to train and run the model (run-model.sh), and a configuration template (allyears-model-1.9.0-cpu.py).
  • tmp: Stores the data and results generated during the training.
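The layout above can be reproduced by hand as a rough sketch. The real structure is created by multi-job-setup.sh; the data set path below is the one assumed throughout this tutorial.

```shell
# Recreate the per-container skeleton for container 1.
N=1
mkdir -p "job-$N/job" "job-$N/license" "job-$N/log" "job-$N/scripts" "job-$N/tmp"
ln -sfn /h2o/databackup "job-$N/data"   # symbolic link to the data set location
ls "job-$N"
```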

multi-job-setup.sh uses make.sh to create a container.

After completing this step, run the docker ps command to list the running containers. The output will look similar to this:

docker ps
CONTAINER ID        IMAGE                                          COMMAND             CREATED             STATUS              PORTS                                NAMES
22609cf9ec7a        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12352->12345/tcp   wizardly_pasteur
661b8f469e54        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12351->12345/tcp   objective_mestorf
e5d5774bc4e3        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12350->12345/tcp   quizzical_brown
bacd0b6ad231        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12348->12345/tcp   stupefied_torvalds
376b56f1d246        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12349->12345/tcp   wizardly_davinci
4e7c8147c42d        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12347->12345/tcp   frosty_bose
f3eaf5bd386c        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12345->12345/tcp   hungry_shockley
73846e4b1a54        h2oai/dai-centos7-ppc64le:1.9.0.1-cuda10.0.8   "./run.sh"          4 minutes ago       Up 4 minutes        8888/tcp, 0.0.0.0:12346->12345/tcp   confident_agnesi

Step 4. Add a data set to the H2O Driverless AI database

Before performing this step make sure that the workloads.txt file has been modified.

To add a data set to the database, run the multi-run-db.sh script as follows:

./multi-run-db.sh <beginning container> <end container>

For example, to add a data set to the first container, run the following command:

./multi-run-db.sh 1 1

To add data sets to containers 2 through 8, run the following command:

./multi-run-db.sh 2 8

Verify the log file for any errors. For example, to check the log file in container 1, navigate to the job-1/job directory and view the run-db-upload-log file.
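A quick way to scan a log for errors is sketched below. The log path follows the tutorial's layout; the sample log line is fabricated so that the sketch is self-contained.

```shell
# Create a stand-in log file so the check below has something to scan.
mkdir -p job-1/job
echo "data set allyears uploaded" > job-1/job/run-db-upload-log

# Report whether the upload log for container 1 mentions any errors.
if grep -qi "error" job-1/job/run-db-upload-log; then
  echo "container 1: errors found"
else
  echo "container 1: upload OK"
fi
```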

Step 5. Submit and train the job

To train a job, use the multi-run-model.sh script as follows:

multi-run-model.sh <beginning container> <end container> <cpu|gpu>

For example, to train a single data set using cpu:

./multi-run-model.sh 1 1 cpu

Verify the log file for any errors. For example, to check the log file in container 1, navigate to the job-1/job directory and view the run-model-log file.

The following screen captures show the web interface when a job is running:

[Screen capture 1: H2O Driverless AI web interface while a job is running]

[Screen capture 2: H2O Driverless AI web interface while a job is running]

Step 6. Stop and resume containers

The containers are initially created using the multi-job-setup.sh script. After the directory structure is created, the databases are loaded, and the training models are run, the containers can be deleted and re-created. The job-NN working directories are preserved and reused when a new container NN is created. This is helpful because the database need not be reloaded into the container, as the data is already stored in the job-NN working directory from the previous runs. Also, because the working directory of the previous container is not destroyed, you can view the logs of previous experiments.

You can use scripts in the following format to stop and start the containers:

./multi-job-stop.sh <beginning container> <ending container>
./multi-job-resume.sh <beginning container> <ending container>

Summary

This tutorial describes how to set up multiple Docker containers, each running a copy of H2O Driverless AI, load the database into each container, and start model training in all the containers simultaneously. This results in efficient utilization of system resources and a significant reduction in the total completion time for all the submitted jobs. In lab experiments, when 16 jobs were run simultaneously, the performance gains were in the 8x range.
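The arithmetic behind that observation can be sketched with an assumed per-job time T. Only the 8x factor comes from the lab runs; T is a hypothetical value for illustration.

```shell
# If one job takes T minutes, 16 serial jobs take 16*T; an 8x gain
# brings the same 16 jobs down to roughly (16*T)/8 when run in parallel.
T=60                        # assumed minutes per job
SERIAL_MIN=$((16 * T))
PARALLEL_MIN=$((SERIAL_MIN / 8))
echo "serial: ${SERIAL_MIN} min, parallel (8x): ${PARALLEL_MIN} min"
```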