
Drive higher GPU utilization and throughput with Watson Machine Learning Accelerator

GPUs are designed and sized to run some of the most complex deep learning models, such as ResNet, NMT, Transformer, DeepSpeech, and NCF. Most enterprise models being trained or deployed, however, use only a fraction of the GPU's compute and memory capacity. So, how do you reclaim this compute and memory headroom and get the most out of your GPU investment? Watson Machine Learning Accelerator provides facilities to share GPU resources across multiple small jobs, which maximizes return on investment for IT teams in enterprises where GPUs are in high demand. Sharing a GPU across multiple jobs also helps when jobs are queued waiting for GPU resources, or when distributed jobs spread across GPUs can be stacked onto as few GPUs as possible to reduce their execution footprint. Running multiple jobs in parallel without resource conflicts multiplies the overall job throughput of the cluster across tenants and users. GPU sharing improves throughput for training, inferencing, technical, and matrix math-heavy quantitative workloads.

Learning objectives

In this tutorial, we explain how to use the Watson Machine Learning Accelerator advanced scheduler to accelerate multiple deep learning training jobs by batching and running four jobs on a single GPU. We submit 16 jobs to 4 GPUs, and each GPU runs 4 jobs in parallel. By default, Watson Machine Learning Accelerator assigns one execution slot to a GPU; we enable multiple execution slots per GPU so that the scheduler dispatches multiple jobs to each GPU.

Estimated time

It should take you approximately 2 hours to complete this tutorial, which includes approximately 30 minutes of model training, plus installation, configuration, and driving the model through the GUI.

Installation and configuration

Download, install, and configure Watson Machine Learning Accelerator

  1. Download the Watson Machine Learning Accelerator evaluation software from the IBM software repository. This is a 4.9 GB download and requires an IBM ID.
  2. Install and configure Watson Machine Learning Accelerator by using the instructions listed in the IBM Knowledge Center.

  3. Configure the operating system user.

    1. At the operating system level, as root on all nodes, create an operating system group and user for the operating system execution user:

      groupadd egoadmin
      useradd -g egoadmin -m egoadmin
  4. Create the GPU Resource Group ComputeHostGPT_multijob, and click Create to complete the creation of the resource group. Note that Advanced Formula is set to ngpus*4, which gives each GPU four scheduling slots and allows multiple jobs (4 in this example) to share the same GPU. For example, on a host with 4 GPUs, ngpus*4 yields 16 slots, so the scheduler can dispatch up to 16 jobs concurrently, 4 per GPU. This way, if you have many lightweight workloads that fit together in GPU memory, you can stack them and run them in parallel.

    Resource group dashboard

  5. Create the Spark Instance Group dli-multi.

    Creating Spark Instance Group

    1. Click New.

      Spark Instance groups for all consumers

    2. Select Templates.

      New Spark Instance Group dashboard

    3. Select dli-sig-template-2-2-0.

      Template dashboard

    4. Enter the following three values:

      • Instance group: dli-multi
      • Spark deployment directory: /home/egoadmin/dli-multi
      • Execution user for instance group: egoadmin
    5. Select ComputeHostGPT_multijob in Spark executors (GPU slots).

  6. Deploy the Spark Instance Group.

    1. Click Create and Deploy Instance group.
    2. Click Continue to Instance Group.
    3. Watch as your instance group gets deployed.

Launch 16 jobs on 4 GPUs with the bring-your-own-framework feature

  1. Download the Credit Card Fraud data set from Kaggle.

    1. Place the data set under $DLI_DATA_FS. In our case, we set DLI_DATA_FS=/dlidata/ and store the creditcard.csv file under /dlidata/dataset/multijob/creditcard.csv. You can optionally sanity-check the download with the snippet below.
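
    This sketch is not part of the original tutorial; it assumes the standard Kaggle credit card fraud schema, in which the Class column marks fraudulent transactions:

      import pandas as pd

      # Quick sanity check of the downloaded data set (assumes the standard
      # Kaggle schema: columns Time, V1..V28, Amount, and Class).
      df = pd.read_csv('/dlidata/dataset/multijob/creditcard.csv')
      print(df.shape)                    # expected: (284807, 31)
      print(df['Class'].value_counts())  # 0 = legitimate, 1 = fraud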
  2. Download the model file.

    1. Place the model under $DL_NFS_PATH. In our case, we set $DL_NFS_PATH = /dlishared, and we store the file under /dlishared/autotest/examples/multijob/fc_model.

    2. By default, TensorFlow pre-allocates the entire memory of the GPU card. We use the config option per_process_gpu_memory_fraction, a value between 0 and 1 that specifies what fraction of the available GPU memory to pre-allocate for each process (1 pre-allocates all of it). In our case, we set the value to 0.2, so each process allocates approximately 20% of the available GPU memory.

      import tensorflow as tf
      config = tf.ConfigProto()
      config.gpu_options.per_process_gpu_memory_fraction = 0.2
      session = tf.Session(config=config)
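
    As an alternative to reserving a fixed fraction, TensorFlow 1.x can also grow its GPU memory allocation on demand. This sketch is not part of the original tutorial; it shows the allow_growth option, which starts with a small allocation and expands it as the job needs more memory:

      import tensorflow as tf

      # Alternative: grow the GPU memory allocation on demand instead of
      # pre-allocating a fixed fraction up front.
      config = tf.ConfigProto()
      config.gpu_options.allow_growth = True
      session = tf.Session(config=config)

    Note that allow_growth avoids reserving memory a job never uses, but unlike per_process_gpu_memory_fraction it does not cap a job's footprint, so a memory-hungry job can still crowd out its neighbors on a shared GPU.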
    3. Update the model with the path of the data set.

      import pandas as pd
      df = pd.read_csv('/dlidata/dataset/multijob/creditcard.csv')
  3. Create an environment file called dlicmd-env.txt with contents similar to the following. The dlicmd variable points to the dlicmd.py command-line client so that it can be invoked as python $dlicmd.

     export PATH=/opt/anaconda3/bin:$PATH   # WML CE installation path
     export EGO_TOP=${EGO_TOP}
     export DLPD_HOME=${EGO_TOP}/dli/1.2.3/dlpd
     export dlicmd=$DLPD_HOME/bin/dlicmd.py
     export masterHost=$Master_Host
     export username=$username
     export password=$password
     export ig=dli-multi
     export dlirestport=9280
     export BYOF_model_top=/dlishared/autotest/examples/multijob/fc_model
  4. Source the environment: source ./dlicmd-env.txt.

  5. Log on to the master host using dlicmd.

     python $dlicmd --logon --master-host $masterHost --dli-rest-port $dlirestport --username $username --password $password
  6. Submit 16 training jobs.

    1. Create a shell script with the following content. The --model-main option takes the model's main Python file; here we assume it is named fc_model.py (adjust this to your model's entry script).

      for ((i=1; i<=16; i++))
      do
        python $dlicmd --exec-start tensorflow --master-host $masterHost --dli-rest-port $dlirestport --ig $ig --model-dir ${BYOF_model_top} --model-main fc_model.py --cs-datastore-meta type=fs --debug-level=debug
        echo $i
      done
    2. Run the script.

      [root@colonia04 prashant]# ./
      Copying files and directories ...
      Exec id Admin-304068407245025-498756644 created
      Copying files and directories ...
      Exec id Admin-304070357812332-369823390 created
      Copying files and directories ...
      Exec id Admin-304071905599067-1955804738 created
      Copying files and directories ...
      Exec id Admin-304073609133798-402707217 created
      Copying files and directories ...
      Exec id Admin-304075411972223-1414479207 created
      Copying files and directories ...
      Exec id Admin-304077274965883-1968820453 created
      Copying files and directories ...
      Exec id Admin-304078953068401-1159214027 created
      Copying files and directories ...
      Exec id Admin-304080731467734-1788381765 created
      Copying files and directories ...
      Exec id Admin-304082727464962-730766126 created
      Copying files and directories ...
      Exec id Admin-304084633178436-1947332630 created
      Copying files and directories ...
      Exec id Admin-304086929287836-1777346414 created
      Copying files and directories ...
      Exec id Admin-304089188990888-1423786338 created
      Copying files and directories ...
      Exec id Admin-304092951501272-1827537637 created
      Copying files and directories ...
      Exec id Admin-304095523336369-5713970 created
      Copying files and directories ...
      Exec id Admin-304097780378443-1739096834 created
      Copying files and directories ...
      Exec id Admin-304099849311454-674376479 created
  7. Log in to Watson Machine Learning Accelerator and monitor the 16 TensorFlow jobs running in parallel.

    First 8 jobs running
    Second 8 jobs running

  8. Run nvidia-smi to monitor the 16 jobs running across 4 GPUs.

     [root@colonia04 prashant]# nvidia-smi -l
     Thu Jan  9 22:31:21 2020
     | NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |   0  Tesla P100-SXM2... On   | 00000002:01:00.0 Off |                    0 |
     | N/A   37C    P0    57W / 300W |  14494MiB / 16280MiB |     90%      Default |
     |   1  Tesla P100-SXM2... On   | 00000003:01:00.0 Off |                    0 |
     | N/A   34C    P0    56W / 300W |  14494MiB / 16280MiB |     88%      Default |
     |   2  Tesla P100-SXM2... On   | 00000006:01:00.0 Off |                    0 |
     | N/A   37C    P0    53W / 300W |  14494MiB / 16280MiB |     82%      Default |
     |   3  Tesla P100-SXM2... On   | 00000007:01:00.0 Off |                    0 |
     | N/A   35C    P0    53W / 300W |  14494MiB / 16280MiB |     85%      Default |
     | Processes:                                                       GPU Memory |
     |  GPU       PID   Type   Process name                             Usage      |
     |    0     23665      C   python                                      3621MiB |
     |    0     29378      C   python                                      3621MiB |
     |    0     37582      C   python                                      3621MiB |
     |    0     39954      C   python                                      3621MiB |
     |    1     21378      C   python                                      3621MiB |
     |    1     29020      C   python                                      3621MiB |
     |    1     36054      C   python                                      3621MiB |
     |    1     38589      C   python                                      3621MiB |
     |    2     18656      C   python                                      3621MiB |
     |    2     27194      C   python                                      3621MiB |
     |    2     33068      C   python                                      3621MiB |
     |    2     38638      C   python                                      3621MiB |
     |    3     15624      C   python                                      3621MiB |
     |    3     25161      C   python                                      3621MiB |
     |    3     29849      C   python                                      3621MiB |
     |    3     38154      C   python                                      3621MiB |
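
For a more compact view than the full nvidia-smi table, a short polling script can log per-GPU utilization and memory through nvidia-smi's CSV query interface. This sketch is not part of the original tutorial; adjust the polling interval and queried fields as needed.

      import subprocess
      import time

      # Poll per-GPU utilization and memory using nvidia-smi's CSV query
      # interface; prints one line per GPU every five seconds.
      QUERY = [
          "nvidia-smi",
          "--query-gpu=index,utilization.gpu,memory.used,memory.total",
          "--format=csv,noheader,nounits",
      ]

      while True:
          out = subprocess.check_output(QUERY, text=True)
          for line in out.strip().splitlines():
              idx, util, used, total = [f.strip() for f in line.split(",")]
              print(f"GPU {idx}: {util}% utilization, {used}/{total} MiB")
          time.sleep(5)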


Enterprises are seeing large demand for GPUs from data scientists and programmers looking to accelerate compute-heavy training, inferencing, technical, and quantitative workloads. Most enterprise jobs use only a fraction of the available GPU compute and memory capacity, and by default, deep learning frameworks and schedulers do not let additional jobs share a GPU. This tutorial showed how to reclaim that unused compute and memory through Watson Machine Learning Accelerator's slot-based scheduling on the GPU, improving job throughput and delivering productivity benefits to data scientists and enterprise IT professionals.

You can find more tutorials in the Watson Machine Learning Accelerator series.

Prashantha Subbarao
Raj Krishnamurthy
Kelvin Lui