Drive higher GPU utilization and throughput

This article is part of the Learning path: Get started with Watson Machine Learning Accelerator series.

Introduction

GPUs are designed to run some of the most complex deep learning models, such as ResNet, NMT, Transformer, DeepSpeech, and NCF. Yet most enterprise models, whether in training or deployment, use only a fraction of the GPU compute and memory capacity. How can you reclaim this compute and memory headroom to get the most out of your GPU investment?

IBM Watson® Machine Learning Accelerator provides facilities to share GPU resources across multiple small jobs, maximizing return on investment for IT teams in enterprises where GPUs are in high demand. Sharing a GPU across multiple jobs helps when jobs would otherwise wait for resources, or when distributed jobs running across several GPUs can be stacked on top of one another. Running multiple jobs in parallel without resource conflicts has a multiplicative effect on the overall job throughput of the cluster across tenants and users. GPU job sharing improves throughput for training, inference, and math-heavy technical and quantitative workloads.

Learning objectives

In this tutorial, learn how to use the Watson Machine Learning Accelerator advanced scheduler to accelerate multiple deep learning training jobs by batching and running four jobs on a single GPU. We submit 16 jobs to four GPUs, with each GPU distributing its resources evenly to run four jobs in parallel. By default, Watson Machine Learning Accelerator assigns one execution slot per GPU; we enable multiple execution slots per GPU so that the scheduler can dispatch multiple jobs to each GPU.

Estimated time

It should take you approximately two hours to complete this tutorial, including approximately 30 minutes of installation, configuration, and model training, as well as driving the model through the GUI.

Prerequisites

  1. Download the Watson Machine Learning Accelerator evaluation software. This is a 4.9 GB download and requires an IBM ID.
  2. Install and configure Watson Machine Learning Accelerator by using the instructions listed in the IBM Knowledge Center.
  3. Configure the operating system user. As root on all nodes, create a group and user for the operating system execution user.

     groupadd egoadmin                 # group for the execution user
     useradd -g egoadmin -m egoadmin   # execution user with a home directory
    
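     To confirm that the group and user exist on each node, here is a minimal verification sketch using Python's standard grp and pwd modules:

     ```
     import grp
     import pwd

     # Raises KeyError if the group or user is missing on this node.
     print(grp.getgrnam("egoadmin"))  # group entry
     print(pwd.getpwnam("egoadmin"))  # user entry, including home directory
     ```
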
  4. Create the GPU Resource Group ComputeHostGPT_multijob. Note that Advanced Formula is set to ngpus*4, which allows multiple jobs (four in this example) to share the same GPU resource. This way, if you have many lightweight workloads that fit together in GPU memory, you can stack them and run them in parallel (see the short sketch below). Click Create to complete the creation of the resource group. (Figure: Resource group dashboard)
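
     A quick check of the slot arithmetic behind this experiment (plain Python; the variable names are ours):

     ```
     # The Advanced Formula ngpus*4 turns each physical GPU into four
     # scheduler slots, so the scheduler can dispatch four jobs per GPU.
     ngpus = 4                     # GPUs on the host
     slots_per_gpu = 4             # the "*4" in the Advanced Formula
     print(ngpus * slots_per_gpu)  # 16 slots -> 16 parallel jobs
     ```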

  5. Create the Spark Instance Group dli-multi. (Figure: Creating Spark Instance Group)

    a. Click New. (Figure: Spark Instance Groups for all consumers)

    b. Select Templates. (Figure: New Spark Instance Group dashboard)

    c. Select dli-sig-template-2-2-0. (Figure: Template dashboard)

    d. Enter the following three values:

    • Instance group: dli-multi
    • Spark deployment directory: /home/egoadmin/dli-multi
    • Execution user for instance group: egoadmin

    e. Select ComputeHostGPT_multijob as the resource group for the Spark executors (GPU slots).

  6. Deploy the Spark Instance Group by clicking Create and Deploy Instance group, then click Continue to Instance Group and watch as your instance group deploys.

Launch the jobs

To test Watson Machine Learning Accelerator, you first train a model (or use an existing one) and connect it to a data set to solve a specific problem. After you configure the environment, submit multiple training jobs by running the execution script. Then you can log on to the Watson Machine Learning Accelerator GUI to monitor the jobs running across the GPUs and see how the GPU resources are distributed evenly across the runs.

  1. Place the data set under $DLI_DATA_FS. In this case, we set $DLI_DATA_FS to /dlidata/ and store the creditcard.csv file under /dlidata/dataset/multijob/creditcard.csv.

  2. Download the model fc_model.py file.

    a. Place the model under $DL_NFS_PATH. In this case, we set $DL_NFS_PATH to /dlishared and store the fc_model.py file under /dlishared/autotest/examples/multijob/fc_model.

    b. By default, TensorFlow pre-allocates the entire memory of the GPU card. We use the per_process_gpu_memory_fraction configuration option to limit this. A value between 0 and 1 indicates what fraction of the available GPU memory to pre-allocate for each process; 1 means pre-allocate all of the GPU memory. In our case, we set the value to 0.2, so each process allocates approximately 20% of the available GPU memory.

     import tensorflow as tf         # TensorFlow 1.x API (tf.Session, ConfigProto)
     from keras import backend as K  # Keras with the TensorFlow backend
     config = tf.ConfigProto()
     config.gpu_options.per_process_gpu_memory_fraction = 0.2
     session = tf.Session(config=config)
     K.set_session(session)
    
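
     Choosing this fraction is a balance between the number of jobs per GPU and headroom for CUDA context overhead. A minimal sketch of the arithmetic (the helper name and headroom value are ours, not part of TensorFlow or Watson Machine Learning Accelerator):

     ```
     def per_process_fraction(jobs_per_gpu, headroom=0.2):
         """Fraction of GPU memory each process may pre-allocate,
         leaving `headroom` unallocated for CUDA context overhead."""
         return round((1.0 - headroom) / jobs_per_gpu, 2)

     print(per_process_fraction(4))  # 0.2, the value used above
     ```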

    c. Update the model with the path of the data set.

     df = pd.read_csv('/dlidata/dataset/multijob/creditcard.csv')
    
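    Before submitting jobs, it can help to sanity-check the data set path and schema. A minimal sketch, assuming the Kaggle credit card fraud data set layout (the Class label column is an assumption):

     ```
     import pandas as pd

     df = pd.read_csv('/dlidata/dataset/multijob/creditcard.csv')
     print(df.shape)                    # rows and columns loaded
     print(df['Class'].value_counts())  # label balance; 'Class' is assumed
     ```
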
  3. Create an environment file called dlicmd-env.txt with contents similar to the following.

     export PATH=/opt/anaconda3/bin:$PATH   # WML-CE installation path
     export EGO_TOP=${EGO_TOP}
     export DLPD_HOME=${EGO_TOP}/dli/1.2.3/dlpd
     export dlicmd=$DLPD_HOME/bin/dlicmd.py
     export masterHost=$Master_Host
     export username=$username
     export password=$password
     export ig=dli-multi
     export dlirestport=9280
     export BYOF_model_top=/dlishared/autotest/examples/multijob/fc_model
    
  4. Source the environment: source ./dlicmd-env.txt.

  5. Log on to the master host using dlicmd.

     python $dlicmd --logon --master-host $masterHost --dli-rest-port $dlirestport --username $username --password $password
    
  6. Submit 16 training jobs.

    a. Create an execution_script.sh script with the following content. (A Python alternative appears after the sample output below.)

     ```
     #!/bin/bash
     # Submit 16 TensorFlow training jobs through dlicmd.
     for ((i=1; i<=16; i++))
     do
       python $dlicmd --exec-start tensorflow --master-host $masterHost --dli-rest-port $dlirestport --ig $ig --model-dir ${BYOF_model_top} --model-main fc_model.py --cs-datastore-meta type=fs --debug-level=debug
       echo $i
     done
     ```
    

    b. Run the script.

     ```
     [root@colonia04 prashant]# ./execution_script.sh
     Copying files and directories ...
     Exec id Admin-304068407245025-498756644 created
     1
     Copying files and directories ...
     Exec id Admin-304070357812332-369823390 created
     2
     Copying files and directories ...
     Exec id Admin-304071905599067-1955804738 created
     3
     Copying files and directories ...
     Exec id Admin-304073609133798-402707217 created
     4
     Copying files and directories ...
     Exec id Admin-304075411972223-1414479207 created
     5
     Copying files and directories ...
     Exec id Admin-304077274965883-1968820453 created
     6
     Copying files and directories ...
     Exec id Admin-304078953068401-1159214027 created
     7
     Copying files and directories ...
     Exec id Admin-304080731467734-1788381765 created
     8
     Copying files and directories ...
     Exec id Admin-304082727464962-730766126 created
     9
     Copying files and directories ...
     Exec id Admin-304084633178436-1947332630 created
     10
     Copying files and directories ...
     Exec id Admin-304086929287836-1777346414 created
     11
     Copying files and directories ...
     Exec id Admin-304089188990888-1423786338 created
     12
     Copying files and directories ...
     Exec id Admin-304092951501272-1827537637 created
     13
     Copying files and directories ...
     Exec id Admin-304095523336369-5713970 created
     14
     Copying files and directories ...
     Exec id Admin-304097780378443-1739096834 created
     15
     Copying files and directories ...
     Exec id Admin-304099849311454-674376479 created
     16
     ```
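
    If you prefer to drive submissions from Python rather than a shell loop, here is a minimal sketch that shells out to dlicmd. It assumes the variables from dlicmd-env.txt are exported, and staggers submissions slightly so the four jobs per GPU do not all pre-allocate memory at the same instant:

     ```
     import os
     import subprocess
     import time

     env = os.environ  # expects dlicmd, masterHost, dlirestport, ig, BYOF_model_top
     cmd = [
         "python", env["dlicmd"], "--exec-start", "tensorflow",
         "--master-host", env["masterHost"],
         "--dli-rest-port", env["dlirestport"],
         "--ig", env["ig"],
         "--model-dir", env["BYOF_model_top"],
         "--model-main", "fc_model.py",
         "--cs-datastore-meta", "type=fs",
         "--debug-level=debug",
     ]
     for i in range(1, 17):
         subprocess.run(cmd, check=True)  # one training job per call
         print(i)
         time.sleep(2)  # brief stagger between submissions
     ```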
    
  7. Log in to Watson Machine Learning Accelerator, and monitor the 16 TensorFlow jobs running in parallel.

    (Figures: First 8 jobs running; Second 8 jobs running)

  8. Run nvidia-smi to monitor the 16 jobs running across four GPUs. Note that each process holds roughly 3621 MiB: about 0.2 × 16280 MiB (the fraction we configured) plus CUDA context overhead, so four jobs fit comfortably on each 16 GB card.

     [root@colonia04 prashant]# nvidia-smi -l
     Thu Jan  9 22:31:21 2020
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  Tesla P100-SXM2... On   | 00000002:01:00.0 Off |                    0 |
     | N/A   37C    P0    57W / 300W |  14494MiB / 16280MiB |     90%      Default |
     +-------------------------------+----------------------+----------------------+
     |   1  Tesla P100-SXM2... On   | 00000003:01:00.0 Off |                    0 |
     | N/A   34C    P0    56W / 300W |  14494MiB / 16280MiB |     88%      Default |
     +-------------------------------+----------------------+----------------------+
     |   2  Tesla P100-SXM2... On   | 00000006:01:00.0 Off |                    0 |
     | N/A   37C    P0    53W / 300W |  14494MiB / 16280MiB |     82%      Default |
     +-------------------------------+----------------------+----------------------+
     |   3  Tesla P100-SXM2... On   | 00000007:01:00.0 Off |                    0 |
     | N/A   35C    P0    53W / 300W |  14494MiB / 16280MiB |     85%      Default |
     +-------------------------------+----------------------+----------------------+
    
     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID   Type   Process name                             Usage      |
     |=============================================================================|
     |    0     23665      C   python                                      3621MiB |
     |    0     29378      C   python                                      3621MiB |
     |    0     37582      C   python                                      3621MiB |
     |    0     39954      C   python                                      3621MiB |
     |    1     21378      C   python                                      3621MiB |
     |    1     29020      C   python                                      3621MiB |
     |    1     36054      C   python                                      3621MiB |
     |    1     38589      C   python                                      3621MiB |
     |    2     18656      C   python                                      3621MiB |
     |    2     27194      C   python                                      3621MiB |
     |    2     33068      C   python                                      3621MiB |
     |    2     38638      C   python                                      3621MiB |
     |    3     15624      C   python                                      3621MiB |
     |    3     25161      C   python                                      3621MiB |
     |    3     29849      C   python                                      3621MiB |
     |    3     38154      C   python                                      3621MiB |
     +-----------------------------------------------------------------------------+
    
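
    For a more compact view than the full nvidia-smi table, you can poll per-GPU utilization and memory through nvidia-smi's CSV query interface; a minimal sketch in Python:

     ```
     import subprocess

     # Query per-GPU stats using standard nvidia-smi options.
     out = subprocess.run(
         ["nvidia-smi",
          "--query-gpu=index,utilization.gpu,memory.used,memory.total",
          "--format=csv,noheader,nounits"],
         capture_output=True, text=True, check=True,
     ).stdout
     for line in out.strip().splitlines():
         idx, util, used, total = [f.strip() for f in line.split(",")]
         print(f"GPU {idx}: {util}% utilization, {used}/{total} MiB")
     ```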

Conclusion

Enterprises are seeing large demand for GPUs from data scientists and programmers looking to accelerate compute-heavy training, inference, technical, and quantitative workloads. Most enterprise jobs use only a fraction of the available GPU compute and memory capacity. By default, several deep learning frameworks prevent additional jobs from sharing a GPU. This tutorial described how the unused GPU compute and memory can be put to work through Watson Machine Learning Accelerator's slot-based scheduling on the GPU, improving job throughput and providing productivity benefits to data scientists and enterprise IT professionals.

You can find more tutorials in the Watson Machine Learning Accelerator series.