This article is part of the Learning path: Get started with Watson Machine Learning Accelerator series.
| Topic | Type |
| --- | --- |
| An introduction to Watson Machine Learning Accelerator | Article |
| Accelerate your deep learning and machine learning | Article + notebook |
| Elastic Distributed Training in Watson Machine Learning Accelerator | Article + notebook |
| Expedite retail price prediction with Watson Machine Learning Accelerator hyperparameter optimization | Tutorial |
| Drive higher GPU utilization and throughput | Tutorial |
| Classify images with Watson Machine Learning Accelerator (Optional) | Article + notebook |
Introduction
GPUs are designed to run some of the most complex deep learning models, such as ResNet, NMT, Transformer, DeepSpeech, and NCF. Yet most enterprise models are trained or deployed using only a fraction of the GPU's compute and memory capacity. How can you reclaim this memory and compute headroom to get the most out of your GPU investment?
IBM Watson® Machine Learning Accelerator provides facilities to share GPU resources across multiple small jobs, maximizing return on investment for IT teams in enterprises where GPUs are in high demand. Sharing a GPU across multiple jobs also helps when jobs are waiting for resources, or when distributed jobs running across GPUs can be stacked on top of each other. Running multiple jobs in parallel without resource conflicts has a multiplicative effect on the overall job throughput of the cluster across multiple tenants and users. GPU job sharing improves throughput for training, inferencing, technical, and math-heavy quantitative workloads.
Learning objectives
In this tutorial, learn how to use the Watson Machine Learning Accelerator advanced scheduler to accelerate multiple deep learning training jobs by batching four jobs on a single GPU. We submit 16 jobs to four GPUs, with each GPU distributing its resources evenly to run four jobs in parallel. By default, Watson Machine Learning Accelerator assigns one execution slot to a GPU; we enable multiple execution slots per GPU so that the scheduler dispatches multiple jobs to each GPU.
Estimated time
It should take you approximately two hours to complete this tutorial, including approximately 30 minutes of model training, installation, and configuration, as well as driving the model through the GUI.
Prerequisites
- Download the Watson Machine Learning Accelerator evaluation software. This is a 4.9 GB download and requires an IBM ID.
- Install and configure Watson Machine Learning Accelerator by using the instructions listed in the IBM Knowledge Center.
Configure the operating system user. At the operating system level, as root on all nodes, create a group and user for the operating system execution user.
```
groupadd egoadmin
useradd -g egoadmin -m egoadmin
```
Create the GPU Resource Group
Create the resource group ComputeHostGPT_multijob. Note that Advanced Formula is set to ngpus*4. This allows multiple jobs (four in this example) to share the same GPU resource. This way, if you have many lightweight workloads that fit into GPU memory, you can stack them and run them in parallel. Click Create to complete the creation of the resource group.
Create the Spark Instance Group
Create the Spark instance group dli-multi:
a. Click New.
b. Select Templates.
c. Select dli-sig-template-2-2-0.
d. Enter the following three values:
- Instance group: dli-multi
- Spark deployment directory: /home/egoadmin/dli-multi
- Execution user for instance group: egoadmin
e. Select ComputeHostGPT_multijob in the Spark executors (GPU slots) field.
Deploy the Spark Instance Group by clicking Create and Deploy Instance group > Continue to Instance Group, and watch as your instance group deploys.
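The effect of the ngpus*4 advanced formula can be sketched with a small simulation. This is a hypothetical Python illustration, not part of the product (the real placement is done by the Watson Machine Learning Accelerator scheduler): with four GPUs and four slots per GPU, the 16 jobs submitted later fill the cluster exactly, four per GPU.

```python
# Sketch of the slot-based scheduling implied by the Advanced Formula ngpus*4.
# Hypothetical illustration only; actual dispatch is handled by the
# Watson Machine Learning Accelerator scheduler.
ngpus = 4
slots_per_gpu = 4                      # from the formula: ngpus * 4 slots in total
total_slots = ngpus * slots_per_gpu

jobs = list(range(1, 17))              # the 16 training jobs submitted later

# Place jobs round-robin onto GPUs, one slot at a time
placement = {gpu: [] for gpu in range(ngpus)}
for i, job in enumerate(jobs):
    placement[i % ngpus].append(job)

for gpu, assigned in placement.items():
    print(f"GPU {gpu}: jobs {assigned}")
```

With 16 slots and 16 jobs, no job waits in the queue; a 17th job would be pending until a slot frees up.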
Launch the jobs
To test the Watson Machine Learning Accelerator, you must first train a model or use an existing one and connect it to a data set to solve a specific problem. After you configure the environment, submit multiple training jobs and run the execution script. Then, you can log on to the Watson Machine Learning Accelerator GUI to monitor the jobs running across the GPUs and how the GPU resources are distributed evenly for the runs.
Place the data set under $DLI_DATA_FS. In this case, we set DLI_DATA_FS: /dlidata/, and we store the creditcard.csv file under /dlidata/dataset/multijob/creditcard.csv.
Download the model fc_model.py file.
a. Place the model under $DL_NFS_PATH. In this case, we set $DL_NFS_PATH = /dlishared, and we store the fc_model.py file under /dlishared/autotest/examples/multijob/fc_model.
b. By default, TensorFlow pre-allocates the entire memory of the GPU card. We use the per_process_gpu_memory_fraction configuration option. A value between 0 and 1 indicates what fraction of the available GPU memory to pre-allocate for each process; 1 indicates pre-allocation of all of the GPU memory. In our case, we set the value to 0.2, which means that each process allocates approximately 20% of the available GPU memory.

```
config.gpu_options.per_process_gpu_memory_fraction = 0.2
session = tf.Session(config=config)
K.set_session(session)
```

c. Update the model with the path of the data set.

```
df = pd.read_csv('/dlidata/dataset/multijob/creditcard.csv')
```
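As a quick sanity check on that fraction (a back-of-envelope sketch, not part of the tutorial's code): on a Tesla P100 with 16280 MiB of usable memory, a fraction of 0.2 budgets roughly 3256 MiB per process, so four stacked jobs stay comfortably within the card. The slightly higher per-process figure reported by nvidia-smi later in this tutorial (3621 MiB) likely also reflects CUDA context and framework overhead.

```python
# Back-of-envelope check of per_process_gpu_memory_fraction = 0.2
# on a Tesla P100 (16280 MiB usable, per the nvidia-smi output in this tutorial).
total_mib = 16280
fraction = 0.2

budget_per_process = total_mib * fraction          # memory TensorFlow pre-allocates per job
budget_four_processes = 4 * budget_per_process     # four jobs stacked on one GPU

print(f"per-process budget:  {budget_per_process:.0f} MiB")
print(f"four-process budget: {budget_four_processes:.0f} MiB of {total_mib} MiB")
```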
Create an environment file called dlicmd-env.txt with contents similar to the following information.

```
export PATH=/opt/anaconda3/bin:$PATH   # WML CE installation path
export EGO_TOP=${EGO_TOP}
export DLPD_HOME=${EGO_TOP}/dli/1.2.3/dlpd
export dlicmd=$DLPD_HOME/bin/dlicmd.py
export masterHost=$Master_Host
export username=$username
export password=$password
export ig=dli-multi
export dlirestport=9280
export BYOF_model_top=/dlishared/autotest/examples/multijob/fc_model
```
Source the environment.

```
source ./dlicmd-env.txt
```

Log on to the master host using dlicmd.

```
python $dlicmd --logon --master-host $masterHost --dli-rest-port $dlirestport --username $username --password $password
```
Submit 16 training jobs.
a. Create an execution_script.sh script with the following content.

```
#!/bin/bash
for ((i=1; i<=16; i++))
do
    python $dlicmd --exec-start tensorflow --master-host $masterHost --dli-rest-port $dlirestport --ig $ig --model-dir ${BYOF_model_top} --model-main fc_model.py --cs-datastore-meta type=fs --debug-level=debug
    echo $i
done
```
b. Run the script.
```
[root@colonia04 prashant]# ./execution_script.sh
Copying files and directories ...
Exec id Admin-304068407245025-498756644 created
1
Copying files and directories ...
Exec id Admin-304070357812332-369823390 created
2
Copying files and directories ...
Exec id Admin-304071905599067-1955804738 created
3
Copying files and directories ...
Exec id Admin-304073609133798-402707217 created
4
Copying files and directories ...
Exec id Admin-304075411972223-1414479207 created
5
Copying files and directories ...
Exec id Admin-304077274965883-1968820453 created
6
Copying files and directories ...
Exec id Admin-304078953068401-1159214027 created
7
Copying files and directories ...
Exec id Admin-304080731467734-1788381765 created
8
Copying files and directories ...
Exec id Admin-304082727464962-730766126 created
9
Copying files and directories ...
Exec id Admin-304084633178436-1947332630 created
10
Copying files and directories ...
Exec id Admin-304086929287836-1777346414 created
11
Copying files and directories ...
Exec id Admin-304089188990888-1423786338 created
12
Copying files and directories ...
Exec id Admin-304092951501272-1827537637 created
13
Copying files and directories ...
Exec id Admin-304095523336369-5713970 created
14
Copying files and directories ...
Exec id Admin-304097780378443-1739096834 created
15
Copying files and directories ...
Exec id Admin-304099849311454-674376479 created
16
```
Log in to Watson Machine Learning Accelerator, and monitor the 16 multiple TensorFlow jobs running in parallel.
Run nvidia-smi to monitor the 16 jobs running across four GPUs.
```
[root@colonia04 prashant]# nvidia-smi -l
Thu Jan  9 22:31:21 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |  14494MiB / 16280MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000003:01:00.0 Off |                    0 |
| N/A   34C    P0    56W / 300W |  14494MiB / 16280MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000006:01:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |  14494MiB / 16280MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 00000007:01:00.0 Off |                    0 |
| N/A   35C    P0    53W / 300W |  14494MiB / 16280MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23665      C   python                                      3621MiB |
|    0     29378      C   python                                      3621MiB |
|    0     37582      C   python                                      3621MiB |
|    0     39954      C   python                                      3621MiB |
|    1     21378      C   python                                      3621MiB |
|    1     29020      C   python                                      3621MiB |
|    1     36054      C   python                                      3621MiB |
|    1     38589      C   python                                      3621MiB |
|    2     18656      C   python                                      3621MiB |
|    2     27194      C   python                                      3621MiB |
|    2     33068      C   python                                      3621MiB |
|    2     38638      C   python                                      3621MiB |
|    3     15624      C   python                                      3621MiB |
|    3     25161      C   python                                      3621MiB |
|    3     29849      C   python                                      3621MiB |
|    3     38154      C   python                                      3621MiB |
+-----------------------------------------------------------------------------+
```
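The per-GPU memory figures line up with the per-process ones. A quick cross-check (hypothetical Python, using the numbers reported by nvidia-smi above) shows the four python processes at 3621 MiB each accounting for nearly all of the 14494 MiB in use on each GPU:

```python
# Cross-check of the nvidia-smi figures above:
# four stacked jobs per GPU, each holding 3621 MiB.
per_process_mib = 3621
jobs_per_gpu = 4

accounted = per_process_mib * jobs_per_gpu   # memory held by the four python processes
reported_used = 14494                        # per-GPU usage reported by nvidia-smi
remainder = reported_used - accounted        # small remainder (driver/context bookkeeping)

print(f"{accounted} MiB of {reported_used} MiB accounted for; {remainder} MiB remainder")
```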
Conclusion
Enterprises are seeing large demand for GPUs from data scientists and programmers looking to accelerate compute-heavy training, inferencing, technical, and quantitative workloads. Most enterprise jobs use only a fraction of the available GPU compute and memory capacity, yet by default, several deep learning frameworks prevent additional jobs from sharing the GPU. This tutorial described how the unused GPU compute and memory can be put to work through Watson Machine Learning Accelerator slot-based scheduling on the GPU, improving job throughput and providing productivity benefits to data scientists and enterprise IT professionals.
You can find more tutorials in the Watson Machine Learning Accelerator series.