Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously
Use the Watson Machine Learning Accelerator Elastic Distributed Training feature to distribute model training across multiple GPUs and compute nodes
IBM Watson™ Machine Learning Accelerator is a software solution for quickly and easily building an end-to-end deep learning environment for your organization. Formerly known as IBM PowerAI Enterprise, it bundles IBM Spectrum Conductor, IBM Spectrum Conductor Deep Learning Impact, and support from IBM for the whole stack, including the open source deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end deep learning platform for data scientists. This includes the complete lifecycle management from installation and configuration; data ingest and preparation; building, optimizing, and distributing the training model to moving the model into production. Watson Machine Learning Accelerator truly shines when you expand your deep learning environment to include multiple compute nodes. There’s even a free evaluation available; see the Prerequisites from our first tutorial: “Classify images with Watson Machine Learning Accelerator.”
In this tutorial, you will run multiple model trainings simultaneously with the Watson Machine Learning Accelerator Elastic Distributed Training (EDT) feature. With EDT, we can distribute model training across multiple GPUs and compute nodes. The distribution of training jobs is elastic, which means GPUs are dynamically allocated: GPUs can be added or removed while a job is executing, without having to kill the job. Because the scheduler dynamically allocates the GPUs, you don’t need to code the GPU topology into the model (for example, GPU01 out of node 1 and GPU01 out of node 2). You can take a model built on a stand-alone system and let EDT handle the distribution, which is transparent to the end user.
After completing this tutorial, you will be able to:
- Build a single-node model to run distributed training with the EDT feature
- Show how GPUs are dynamically allocated across two parallel trainings by adjusting the priority
The tutorial takes approximately three hours, which includes about 30 minutes of model training, plus installation and configuration, as well as driving the model through the GUI.
The tutorial requires access to a GPU-accelerated IBM Power Systems server model AC922 or S822LC. In addition to acquiring a server, there are multiple options to access Power Systems servers listed on the developer portal.
Step 1. Download, install, and configure the Watson Machine Learning Accelerator
- Download the IBM Watson Machine Learning Accelerator evaluation software from the IBM software repository. This is a 4.9 GB download and requires an IBM ID.
- Install and configure Watson Machine Learning Accelerator using the instructions listed in the IBM Knowledge Center or the POWER-Up User Guide.
Step 2. Configure operating system user
At the operating system level, as root, on all nodes, create an operating system group and user for the operating system execution user (the group must exist before useradd can assign it with -g):
groupadd egoadmin
useradd -g egoadmin -m egoadmin
The GID and UID of the created user and group must be the same on all nodes.
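One way to check the matching-ID requirement is to collect the numeric IDs from every node and compare them. The following is a minimal sketch, not part of the product: the helper names, host names, and ID values are hypothetical, and it assumes you gather each node's values yourself (for example, by running `id egoadmin` on each node).

```python
import pwd


def local_ids(username):
    """Return (uid, gid) for a local user, or None if the user does not exist."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None
    return (entry.pw_uid, entry.pw_gid)


def find_mismatched_hosts(ids_by_host):
    """Return the hosts whose (uid, gid) pair differs from the first host's pair."""
    hosts = list(ids_by_host)
    if not hosts:
        return []
    reference = ids_by_host[hosts[0]]
    return [host for host in hosts[1:] if ids_by_host[host] != reference]


# Hypothetical values, as might be collected from `id egoadmin` on each node.
collected = {
    "node1": (1001, 1001),
    "node2": (1001, 1001),
    "node3": (1002, 1001),
}
print(find_mismatched_hosts(collected))
```

Any host reported by `find_mismatched_hosts` would need its user or group recreated with explicit IDs so that it matches the other nodes.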
Step 3. Configure the resource groups
- Log on as the cluster Admin user.
Open the Resource Group configuration.
Select the ComputeHosts resource group.
Set the number of slots to a value appropriate for your environment.
Optional but recommended: change the resource selection method to static, and select only the servers that will provide computing power (processor power) to the cluster.
Click Apply to commit the changes.
Create a new resource group.
Name the resource group ComputeHostsGPU.
The number of slots should use the advanced formula and equal the number of GPUs on the systems, by using ngpus.
Optional but recommended: change the resource selection method to static and select the nodes that are GPU-capable.
Under the Members Host column, click Preferences and select the attribute ngpus to be displayed.
Click Apply and validate that the Members Host column now displays ngpus.
Finish the creation of the resource group by clicking Create.
Step 4. Create Spark Instance group
Enter the following three values:
- Instance group: sig-elastic
- Spark deployment directory: /home/egoadmin/sig-elastic
- Execution user for instance group: egoadmin
Click Configuration, modify Spark parameters, and set fairshare as the value for the SPARK_EGO_APP_SCHEDULE_POLICY variable.
Set true as the value for the SPARK_EGO_ENABLE_PREEMPTION variable.
Scroll down and select the ComputeHostsGPU resource group for the Spark executors (GPU slots) you created in previous screens. Do not change any other configuration there.
- Click Create and Deploy Instance Group.
- Click Continue to Instance Group.
- Watch as your instance group gets deployed.
Step 5. Download the instrumented PyTorch mnist model from Model Zoo
Download all of the files in dli-1.2.3-pytorch-samples.
Optional for an advanced user: Update your neural network model to run EDT
Import and load the Elastic Distributed Training Model Library fabric.zip:
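import os
import sys
# Locate fabric.zip under FABRIC_HOME and put it at the front of the import path
path = os.path.join(os.getenv("FABRIC_HOME"), "libs", "fabric.zip")
print('> fabric loaded from %s' % path)
sys.path.insert(0, path)
from fabric_model import FabricModel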
Load user input (entered at Watson Machine Learning Accelerator UI) into model definition, including location of data set, optimizer and epoch:
import pth_parameter_mgr
global train_data_dir, test_data_dir
train_data_dir = pth_parameter_mgr.getTrainData(False)
test_data_dir = pth_parameter_mgr.getTestData(False)
optimizer = pth_parameter_mgr.getOptimizer(model)
epochs = pth_parameter_mgr.getEpoch()
train_batch_size = pth_parameter_mgr.getTrainBatchSize()
Instantiate the Elastic Distributed Training Model instance, along with the following parameters:
edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
# model = model definition to be trained
# getDatasets = function name defined in model definition to retrieve dataset at each learner node
# F.cross_entropy = loss function
Start Elastic Distributed Training, passing the parameters retrieved when you loaded the user input:
edt_m.train(epochs, train_batch_size, engines_number)
# engines_number = maximum number of GPUs requested by this training
Step 6. Download the data sets
For this tutorial, we use the CIFAR-10 data set in its Python version (cifar-10-batches-py), which you can download from https://www.cs.toronto.edu/~kriz/cifar.html.
Create two directories, train_db and test_db, and copy the cifar-10-batches-py folder into each of these directories:
[root@node1 cifar10]# ls -lrt
total 8
drwxr-xr-x. 3 root root 4096 Apr 26 08:33 train_db
drwxr-xr-x. 3 root root 4096 Apr 26 08:59 test_db
[root@node1 train_db]# ls -lrt
total 4
drwxr-xr-x. 2 root root 4096 Apr 26 08:57 cifar-10-batches-py
[root@node1 test_db]# ls -lrt
total 4
drwxr-xr-x. 2 root root 4096 Apr 26 08:59 cifar-10-batches-py
[root@node1 cifar-10-batches-py]# ls -lrt
total 181868
-rwxrwxrwx. 1 root root 31035623 Apr 1 10:20 data_batch_5
-rwxrwxrwx. 1 root root 31035526 Apr 1 10:20 test_batch
-rwxrwxrwx. 1 root root 31035704 Apr 1 10:20 data_batch_1
-rwxrwxrwx. 1 root root 31035696 Apr 1 10:20 data_batch_4
-rwxrwxrwx. 1 root root 31035999 Apr 1 10:20 data_batch_3
-rwxrwxrwx. 1 root root 31035320 Apr 1 10:20 data_batch_2
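The directory layout shown above can also be created with a short script. This is a minimal sketch assuming the archive has already been extracted to a cifar-10-batches-py folder; the function name build_dataset_dirs is hypothetical, and the demo runs against a temporary directory with a placeholder file rather than the real data set.

```python
import shutil
import tempfile
from pathlib import Path


def build_dataset_dirs(source, dest_root):
    """Copy the extracted data folder into dest_root/train_db and dest_root/test_db."""
    source = Path(source)
    created = []
    for name in ("train_db", "test_db"):
        target = Path(dest_root) / name / source.name
        target.parent.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok (Python 3.8+) lets the script be rerun safely
        shutil.copytree(source, target, dirs_exist_ok=True)
        created.append(target)
    return created


# Demo in a throwaway directory with a placeholder batch file.
with tempfile.TemporaryDirectory() as workdir:
    fake_source = Path(workdir) / "cifar-10-batches-py"
    fake_source.mkdir()
    (fake_source / "data_batch_1").write_bytes(b"placeholder")
    for path in build_dataset_dirs(fake_source, Path(workdir) / "cifar10"):
        print(path.relative_to(workdir))
```

For the real data, you would call build_dataset_dirs with the path of the extracted cifar-10-batches-py folder and the parent directory you want to hold train_db and test_db.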
Step 7. Load data into Watson Machine Learning Accelerator
Associate the images with Watson Machine Learning Accelerator by creating a new data set.
In the Datasets tab, select New.
Click Any. When presented with a dialog box, provide a unique name (for example, ‘cifar10EDTpytorch’) and select the folder that contains the binary files obtained in the previous step. When you’re ready, click Create.
Step 8. Build the model
Select the Models tab and click New.
Select Add Location.
Select PyTorch as the Framework and provide a unique name (for example, PytorchResnet), and select the folder that contains the ResNet model obtained in previous steps. When you’re ready, click Add.
Select PytorchResnet for your new model, and click Next.
Ensure that the Training engine is set to Elastic Distributed Training and that the data set points to the one you just created, then click Add.
Step 9. Run training
Submit Training No. 1:
Back at the Models tab, select Train to view the models you can train, then select the model you created in the previous step.
Submit the first training with eight max workers.
This cluster has two POWER9™ servers with a total of eight GPUs available in the resource grid. The scheduler takes all eight available GPUs from the resource grid and assigns them to Training No. 1.
Submit Training No. 2:
Submit the second training with eight max workers.
The Fairshare Policy shares available resources equally among running trainings of equal priority. Under this policy, the scheduler preempts Training No. 1 to reclaim four GPUs: it waits for the current training iteration to complete, including gradient synchronization across workers, then reclaims four GPUs from Training No. 1 and allocates them to Training No. 2.
Each training gets its deserved portion of resources according to its priority. Training priority can be adjusted to reallocate GPU resources on the fly, without interrupting a running training or losing its latest progress. We will adjust the priority of Training No. 2 from 5,000 (the default value) to 10,000.
With the priority update, the scheduler reassigns GPUs based on the priority ratio. The allocation for Training No. 2 becomes five of the eight GPUs (10000/15000 = 2/3 of 8, about 5.3), while the allocation for Training No. 1 becomes three (5000/15000 = 1/3 of 8, about 2.7). The scheduler preempts Training No. 1, reclaims one GPU from it, and assigns that GPU to Training No. 2.
Training No. 1 eventually completes at 6:04 and returns its three GPUs to the resource grid. The scheduler always looks for available GPUs in the resource grid, so it assigns these three GPUs to Training No. 2, which dynamically scales up to eight GPUs and shortens the overall iteration running time.
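The GPU counts in this walkthrough can be reproduced with a small calculation. The sketch below is only an illustration of proportional sharing, not the scheduler's actual algorithm: the function name is hypothetical, and the largest-remainder rounding rule is an assumption chosen because it matches the numbers reported above.

```python
def fairshare_allocation(total_gpus, trainings):
    """Split total_gpus across trainings in proportion to their priority.

    trainings maps a training name to (priority, max_workers).
    Rounding uses the largest-remainder rule (an assumption, see lead-in).
    """
    total_priority = sum(priority for priority, _ in trainings.values())
    if total_priority == 0:
        return {name: 0 for name in trainings}
    exact = {
        name: total_gpus * priority / total_priority
        for name, (priority, _) in trainings.items()
    }
    # Start from the whole-number part of each share
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_gpus - sum(alloc.values())
    # Hand the remaining GPUs to the largest fractional remainders
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    # Never exceed the max workers requested by a training
    return {name: min(alloc[name], trainings[name][1]) for name in alloc}


# The three situations described in this step, on an eight-GPU cluster.
print(fairshare_allocation(8, {"training1": (5000, 8)}))
print(fairshare_allocation(8, {"training1": (5000, 8), "training2": (5000, 8)}))
print(fairshare_allocation(8, {"training1": (5000, 8), "training2": (10000, 8)}))
```

With equal priorities the split is four and four; raising the priority of the second training to 10,000 gives it five GPUs and leaves three for the first, as in the walkthrough.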
Following is the resource usage chart for Training No. 1 and Training No. 2.
IBM Watson Machine Learning Accelerator Elastic Distributed Training technology provides dynamic and fine-grained resource allocations and controls for trainings that run simultaneously. The scheduler is responsible for distributing the training tasks (iterations), coordinating synchronization among tasks, and dynamically allocating resources (GPUs) based on policies defined by the end user. The scheduler kicks off a training task with a minimal resource request (1 GPU) and constantly looks for available GPUs in the resource grid. Elastic Distributed Training is fault-tolerant and resilient. When a host fails or the environment changes (for example, loss of a cloud host), the scheduler re-queues and reruns the training job tasks on another host using the latest parameters and weights. These capabilities shorten execution time and improve data scientist productivity.