When running H2O Driverless AI on IBM® Power Systems™ servers, each job can be run in a container. Each job consists of training a machine learning model on a data set. If these jobs are run serially, one at a time, they take longer to complete. Instead, you can reduce the total time significantly by running all the jobs simultaneously, independent of each other.
This tutorial explains how to set up and train multiple H2O Driverless AI containers in parallel. In the example provided in this tutorial, for simplicity, only one data set was used. The scripts can be changed to use multiple data sets.
To set up and train multiple H2O Driverless AI containers in parallel, you should be familiar with Docker containers and H2O Driverless AI. Familiarity with NVIDIA Docker containers is helpful but not essential for following the instructions in this tutorial, which create multiple containers and run H2O Driverless AI in each one. However, customization, such as implementing a separate file structure within each container or modifying the scripts to run a model other than the one used in this tutorial, requires more familiarity with Docker containers and H2O Driverless AI.
You can find the list of prerequisites at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/install/ibm-docker.html
If GPUs are available, follow the steps for GPU. Otherwise use the steps for CPU. The data set used in the example below is allyears with a size of about 15 GB and can be found at https://s3.amazonaws.com/h2o-public-test-data/airlines/allyears.1987.2013.csv
The requirement for disk space for each container depends on factors such as the model size and the number of experiments that need to be saved for later review. The total requirement for multiple containers can be estimated by simply adding up the requirements for each container. For general sizing information about each H2O Driverless AI experiment, refer to the document at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/sizing-requirements.html
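The additive estimate described above can be sketched in a few lines of Python. The per-container figures here are hypothetical placeholders, not measurements; substitute estimates derived from the sizing guide for your own workloads.

```python
# Hypothetical per-container disk requirements in GB (placeholders, not
# measured values); replace them with estimates for your own workloads.
per_container_gb = {
    "job-1": 60,
    "job-2": 60,
    "job-3": 85,
}


def total_disk_gb(requirements):
    """Return the total disk space needed by summing each container's estimate."""
    return sum(requirements.values())


print(f"Estimated total disk space: {total_disk_gb(per_container_gb)} GB")
```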
It will take approximately 2 hours to perform the steps described in this tutorial.
To set up H2O Driverless AI and run multiple containers in parallel, you need to perform the tasks described in the following sections.
Step 1. Install Docker, H2O Driverless AI Docker image, and NVIDIA container libraries (if using GPUs)
For detailed instructions to perform these operations, refer to the document at: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/install/ibm-docker.html. The instructions in that document are grouped under two subheadings:
- Install on IBM with GPUs – follow steps 1 through 9
- Install on IBM with CPUs – follow steps 1 through 4
You can use the instructions relevant to your system installation.
After completing the setup, you should see output similar to the following for these commands:
rpm -qa | grep docker
docker-common-1.13.1-75.git8633870.el7_5.ppc64le
docker-rhel-push-plugin-1.13.1-75.git8633870.el7_5.ppc64le
docker-1.13.1-75.git8633870.el7_5.ppc64le
docker-client-1.13.1-75.git8633870.el7_5.ppc64le
systemctl status docker
docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-08-17 16:01:42 EDT; 3 weeks 2 days ago
     Docs: http://docs.docker.com
 Main PID: 5989 (dockerd-current)
    Tasks: 277
   Memory: 849.2M
   CGroup: /system.slice/docker.service
           ├─ 5989 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization...
           ├─ 6118 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2...
           ├─ 7850 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 30275 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 34143 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 36910 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 46455 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 84683 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─ 84791 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─112222 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─116806 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─117948 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─120774 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─125084 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─129060 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           ├─152654 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
           └─159145 /usr/bin/docker-containerd-shim-current e79d35e55cb7776a80c607e76a6dcf39b0111a08cd2fac26f3615786ad861950 /var/run/docker/libcontainerd/e...
If using GPU:
rpm -qa | grep nvidia-container
libnvidia-container-tools-1.1.1-1.ppc64le
libnvidia-container1-1.1.1-1.ppc64le
nvidia-container-toolkit-1.1.2-2.ppc64le
Step 2. Install H2O Docker image and directory structure
You can find all the scripts discussed in this tutorial at the following repository: https://github.com/lilianrom/multijob. Download, extract the ZIP file, and change to the multijob directory.
The multijob directory contains scripts to create, remove, and resume containers, to load data into H2O Driverless AI running in each container, and to run the training model in each container. It also includes the following directories:
- The job-template directory is used to create a separate working directory for each container. It assumes that the data sets are located at /h2o/databackup.
- The scripts directory contains the scripts to run the training model and configuration files.
After all the prerequisites are installed, verify that the Docker image is installed by running the following command:
docker image ls
REPOSITORY                  TAG                         IMAGE ID       CREATED      SIZE
h2oai/dai-centos7-ppc64le   18.104.22.168-cuda10.0.11   424df3891bd3   8 days ago   17.7 GB
In this example, note the REPOSITORY:TAG information for the image: h2oai/dai-centos7-ppc64le:18.104.22.168-cuda10.0.11. Later, this information is used as one of the parameters to create a Docker container with H2O Driverless AI.
Step 3. Create working directories for each container and start containers
Before setting up multiple containers, modify the make.sh script located in the top-level working directory and add the REPOSITORY:TAG information for your Docker image.
For each model that needs to be trained, create a directory under the job-template/scripts directory. In the example provided, the directory used is allyears. Use the files provided as a reference to create your own set of files:
<YOURMODEL-model-1.9.0.gpu.py>, <YOURMODEL-1.9.0.cpu.py>, <YOURMODEL-database.py>
For example, if a new data set named higgs needs to be added, create the directory job-template/scripts/higgs.
The workloads.txt file contains the list of models to run on each container and each line corresponds to a container. This file is used for both loading data sets and running the models in a container. For example, if a user wants to run workload higgs on container 1, the file workloads.txt will have higgs in line 1.
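The line-to-container mapping described above can be illustrated with a short Python sketch. It assumes that workloads.txt holds exactly one workload name per line with no comments or extra fields, and that a blank line means no workload is assigned. Both are assumptions for illustration, and the tutorial's own scripts may read the file differently.

```python
def parse_workloads(text):
    """Map container numbers (1-based line numbers) to workload names."""
    workloads = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        name = line.strip()
        if name:  # assumption: a blank line means no workload is assigned
            workloads[line_number] = name
    return workloads


# Example contents of workloads.txt: higgs on container 1, allyears on 2 and 3.
example = "higgs\nallyears\nallyears\n"
for container, workload in parse_workloads(example).items():
    print(f"container {container}: {workload}")
```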
The job-template/data entry is a symbolic link that points to the directory where the data sets are located. Change the link to point to the location of your new data sets. For example, use the command:
ln -sf /h2o/databackup data
Here, /h2o/databackup is the directory that contains the new data sets.
Add the H2O Driverless AI license file as job-template/license/license.sig.
To set up multiple containers, run the following command:
multi-job-setup.sh <beginning container #> <ending container #>
This script uses job-template to create the directory structure needed for each container. This script takes the beginning and the end container as the input. For example:
./multi-job-setup.sh 1 1 (this will create 1 container)
If more containers are needed, you can run the script starting with the new container number and ending with the end container. For example, if seven more containers are needed, use:
./multi-job-setup.sh 2 8
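Both arguments are inclusive, so the range 2 to 8 covers seven containers. The following Python sketch only illustrates that arithmetic and is not part of the multijob scripts. The job-<#> naming is taken from the working directories referenced elsewhere in this tutorial, and the function name is hypothetical.

```python
def job_directories(first, last):
    """Return working directory names for containers first..last (inclusive)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return [f"job-{number}" for number in range(first, last + 1)]


print(job_directories(1, 1))       # a single container
print(len(job_directories(2, 8)))  # number of additional containers
```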
The log file, setup-job-1-log, is located under the job-<#>/job directory.
Note: This script starts H2O Driverless AI. The port number for connecting to the web interface of the first container is 12345, and the port number increases by 1 for each additional container. In this case, with eight containers in total (the first container plus seven more), the port of the first container is 12345 and the port of the last container is 12352.
After the container is started, you can access H2O Driverless AI running in that container by using the following URL:
https://host-ip-address:<port assigned to the container>
For example, H2O Driverless AI running in containers 1 and 3 can be accessed using the following URLs:
https://host-ip-address:12345
https://host-ip-address:12347
Only one URL can be accessed at a time. This is for monitoring the progress of loading the model and the experiment, retrieving the results, and so on. Do not use the web interface to perform any updates.
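The port numbering can be expressed as a small Python helper. This is a sketch based on the numbering described in this step (container 1 on port 12345, each later container one port higher). The function names are hypothetical, and host-ip-address is a placeholder for the address of your server.

```python
BASE_PORT = 12345  # port of container 1, as described in this step


def container_port(container_number):
    """Return the host port for a 1-based container number."""
    if container_number < 1:
        raise ValueError("container numbers start at 1")
    return BASE_PORT + (container_number - 1)


def container_url(host, container_number):
    """Build the web interface URL for a container."""
    return f"https://{host}:{container_port(container_number)}"


# "host-ip-address" is a placeholder for the address of your server.
for number in (1, 3, 8):
    print(number, container_url("host-ip-address", number))
```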
Each job-NN directory has the directory structure shown in the following output:
data job license log scripts
Notice that data, job, license, log, scripts, and tmp are created.
- data: Contains a link to the location of the data set. Example: data -> /h2o/databackup
- job: Is initially an empty directory. It is used to store any output from running the scripts.
- license: Contains the H2O Driverless AI license file. Example: license.sig
- log: Is initially empty. It contains the H2O Driverless AI logs.
- scripts: Contains the Python client scripts and run scripts to load the database and run the model in the container. The scripts directory contains the allyears subdirectory, which contains the Python scripts to load the data set, a script to train and run the model (run-model.sh), and a configuration template (allyears-model-1.9.0-cpu.py).
- tmp: Stores the data and results generated during the training.
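To confirm that a working directory has the layout listed above, a check along the following lines may help. It is a minimal sketch and not part of the multijob scripts. The function name is hypothetical, and job-1 is assumed to be a path relative to the current directory.

```python
from pathlib import Path

# Subdirectories expected in each job-NN working directory, as listed above.
EXPECTED_SUBDIRS = ("data", "job", "license", "log", "scripts", "tmp")


def missing_subdirs(job_dir):
    """Return the expected subdirectory names that do not exist under job_dir."""
    root = Path(job_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).exists()]


# An empty list means that every expected entry is present.
print(missing_subdirs("job-1"))
```

Because `Path.exists()` follows symbolic links, a data link that points to a missing directory is also reported as missing.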
The multi-job-setup.sh script uses make.sh to create a container.
After completing this step, run the docker ps command to list the running containers. The output will look similar to this:
docker ps
CONTAINER ID   IMAGE                                                 COMMAND      CREATED         STATUS         PORTS                                NAMES
22609cf9ec7a   h2oai/dai-centos7-ppc64le:22.214.171.124-cuda10.0.8   "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12352->12345/tcp   wizardly_pasteur
661b8f469e54   h2oai/dai-centos7-ppc64le:126.96.36.199-cuda10.0.8    "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12351->12345/tcp   objective_mestorf
e5d5774bc4e3   h2oai/dai-centos7-ppc64le:188.8.131.52-cuda10.0.8     "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12350->12345/tcp   quizzical_brown
bacd0b6ad231   h2oai/dai-centos7-ppc64le:184.108.40.206-cuda10.0.8   "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12348->12345/tcp   stupefied_torvalds
376b56f1d246   h2oai/dai-centos7-ppc64le:220.127.116.11-cuda10.0.8   "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12349->12345/tcp   wizardly_davinci
4e7c8147c42d   h2oai/dai-centos7-ppc64le:18.104.22.168-cuda10.0.8    "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12347->12345/tcp   frosty_bose
f3eaf5bd386c   h2oai/dai-centos7-ppc64le:22.214.171.124-cuda10.0.8   "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12345->12345/tcp   hungry_shockley
73846e4b1a54   h2oai/dai-centos7-ppc64le:126.96.36.199-cuda10.0.8    "./run.sh"   4 minutes ago   Up 4 minutes   8888/tcp, 0.0.0.0:12346->12345/tcp   confident_agnesi
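The PORTS column shows which host port is mapped to port 12345 inside each container. As an illustration, the following Python sketch extracts that host port from a PORTS value in the format shown above. The function name is hypothetical, and the pattern only covers TCP mappings written as host-port->container-port/tcp.

```python
import re

# Matches a published TCP port such as "0.0.0.0:12352->12345/tcp".
PUBLISHED_PORT = re.compile(r":(\d+)->(\d+)/tcp")


def host_ports(ports_column, container_port=12345):
    """Return the sorted host ports that map to the given container port."""
    matches = PUBLISHED_PORT.findall(ports_column)
    return sorted({int(host) for host, inner in matches if int(inner) == container_port})


print(host_ports("8888/tcp, 0.0.0.0:12352->12345/tcp"))
```

Duplicate mappings for the same host port are collapsed, so a value that lists the same port more than once still yields one entry.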
Step 4. Add a data set to the H2O Driverless AI database
Before performing this step make sure that the workloads.txt file has been modified.
To add a data set to the database, run the multi-run-db.sh script as follows:
./multi-run-db.sh <beginning container> <end container>
For example, to add a data set to the first container, run the following command:
./multi-run-db.sh 1 1
To add data sets to containers 2 through 8, run the following command:
./multi-run-db.sh 2 8
Check the log file for errors. For example, to check the log file for container 1, navigate to the job-1/job directory and view the run-db-upload-log file.
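When several containers are involved, scanning the logs by hand becomes tedious. The following Python sketch flags lines that contain common error keywords. The keyword list and the sample log text are assumptions for illustration only, because the actual log format is not described in this tutorial.

```python
def error_lines(log_text, keywords=("error", "traceback", "exception")):
    """Return (line_number, line) pairs whose text contains any keyword."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line))
    return hits


# Hypothetical log text for illustration; real log contents will differ.
sample = "Connecting to server\nERROR: upload failed\nDone\n"
print(error_lines(sample))
```

To scan a real log, read the file first, for example the contents of job-1/job/run-db-upload-log, and pass the text to error_lines.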
Step 5. Submit and train the job
To train a job, use the multi-run-model.sh script as follows:
multi-run-model.sh <beginning container> <end container> <cpu|gpu>
For example, to train a single data set using CPUs:
./multi-run-model.sh 1 1 cpu
Verify the log file for any errors. For example, to check the log file in container 1, navigate to the job-1/job directory and view the run-
The following screen captures show the web interface when a job is running:
Step 6. Stop and resume containers
The containers are initially created using the multi-job-setup.sh script. After the directory structure is created, the databases are loaded, and the training models are run, the containers can be deleted and re-created. The job-NN working directories are preserved and reused when a new container NN is created. This is helpful because the database need not be reloaded into the container: the data is already stored in the job-NN working directory from the previous runs. Also, because the working directory of the previous container is not destroyed, you can still view the logs of previous experiments.
You can use scripts in the following format to stop and start the containers:
./multi-job-stop.sh <beginning container> <ending container>
./multi-job-resume.sh <beginning container> <ending container>
This tutorial described how to set up multiple Docker containers, each running a copy of H2O Driverless AI, load the database into each container, and train the models in all the containers simultaneously. This results in efficient utilization of system resources and a significant reduction in the total completion time for all the submitted jobs. In lab experiments, it was observed that when 16 jobs were run simultaneously, the performance gains were in the 8x range.