Large Model Support (LMS) technology enables training of large deep neural networks that would otherwise exhaust GPU memory during training.
PyTorch is a relatively new and popular Python-based open source deep learning framework built by Facebook for faster prototyping and production deployment. With its more Pythonic nature and gentler learning curve compared to other frameworks, it has seen rapid adoption. One of the primary features of PyTorch is its support for dynamic computation graphs, which are valuable in situations where the amount of computation is not known ahead of time. Given the growing popularity of PyTorch and the fact that current neural networks are often too large to fit in GPU memory, there is a clear case for a technology that supports large models in PyTorch and lets them run with limited GPU memory.
PyTorch with IBM® Watson™ Machine Learning Community Edition (WML CE) 1.6.1 comes with LMS to enable large PyTorch models. In this article, we capture the benefits of using PyTorch LMS on DeepLabv3+ along with the PASCAL Visual Object Classes (VOC) 2012 data set.
DeepLabv3+ and PASCAL data set
DeepLabv3+ is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign a semantic label (such as person, dog, or cat) to every pixel in the input image. The model was open sourced by Google in 2016, and multiple improvements have been made to it since, the latest being DeepLabv3+. The DeepLabv3+ model has an encoding phase and a decoding phase. The encoding phase extracts the essential information from the image using a convolutional neural network (CNN), whereas the decoding phase reconstructs an output of the appropriate dimensions based on the information obtained from the encoding phase. The decoder module was added to give better segmentation results along object boundaries. DeepLab supports the following network backbones: MobileNetv2, Xception, ResNet, PNASNet, and Auto-DeepLab. We use the Xception network backbone while training the model. The trained model has reportedly been used in Google's Pixel smartphone for various image segmentation tasks.
We use the PASCAL VOC 2012 data set, which comes from the PASCAL VOC challenge. The goal of this challenge is to recognize objects from several visual object classes in realistic scenes (that is, not pre-segmented objects). The segmentation training data set contains 1464 images.
PyTorch LMS usage
A PyTorch program enables LMS by calling
torch.cuda.set_enabled_lms(True) prior to model creation.
The following LMS tunables are provided to limit the amount of swapping and the kind of tensors that are chosen to be swapped:
Defines the soft limit in bytes on GPU memory allocated for tensors (default: 0)
Defines the minimum tensor size in bytes that is eligible for LMS swapping (default: 1 MB)
For more information, refer to the Getting started page for PyTorch.
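As a rough illustration of the usage described above, the following sketch wraps the LMS calls in a helper that can be invoked before model creation. Only `torch.cuda.set_enabled_lms` is named in this article; the helper name `configure_lms` is hypothetical, and the two tunable setter names (`set_limit_lms`, `set_size_lms`) are assumptions that should be checked against the WML CE documentation. The helper skips any call that the installed build does not provide, and a stand-in object replaces `torch.cuda` so that the sketch runs without an LMS-enabled build.

```python
from types import SimpleNamespace


def configure_lms(cuda, limit_bytes=None, min_size_bytes=None):
    """Enable LMS on a torch.cuda-like module and apply optional tunables.

    Returns the names of the setters that were actually called.
    """
    applied = []
    # set_enabled_lms is the call named in the article. Stock PyTorch
    # builds do not ship it, so check before calling.
    if not hasattr(cuda, "set_enabled_lms"):
        return applied
    cuda.set_enabled_lms(True)
    applied.append("set_enabled_lms")
    # ASSUMPTION: the tunable setters are named set_limit_lms (soft limit
    # in bytes) and set_size_lms (minimum swappable tensor size in bytes).
    for name, value in (("set_limit_lms", limit_bytes),
                        ("set_size_lms", min_size_bytes)):
        setter = getattr(cuda, name, None)
        if value is not None and callable(setter):
            setter(value)
            applied.append(name)
    return applied


# Stand-in for torch.cuda that records calls (it lacks set_size_lms on purpose).
calls = []
fake_cuda = SimpleNamespace(
    set_enabled_lms=lambda flag: calls.append(("enabled", flag)),
    set_limit_lms=lambda n: calls.append(("limit", n)),
)
applied = configure_lms(fake_cuda, limit_bytes=8 * 1024**3,
                        min_size_bytes=1024**2)
```

With an LMS-enabled build, the same helper would be called as `configure_lms(torch.cuda, ...)` before the model is constructed.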
DeepLab with LMS and IBM POWER9 NVLINK advantage
DeepLabv3+ is a large model with a large number of parameters to train, and as we move to higher-resolution images and larger batch sizes, the model can no longer be trained within the limited GPU memory. For instance, we observe that we can go up to a resolution of 500 with a batch size of 16 on a 32 GB GPU. If DeepLab is used with even higher resolution images, such as satellite or medical images, LMS becomes necessary to train the model. PyTorch LMS provides this large model capability by modifying the CUDA caching allocator algorithm to swap inactive tensors.
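To make the idea of swapping inactive tensors concrete, here is a toy model written for this article. It is not the PyTorch LMS implementation: the class `ToySwapAllocator`, its largest-first eviction order, and the unit-less sizes are all illustrative assumptions. It only shows how moving inactive tensors to host memory lets a new allocation succeed on a device with fixed capacity.

```python
class ToySwapAllocator:
    """Toy model of a GPU allocator that swaps inactive tensors to host."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.on_gpu = {}     # tensor name -> size, resident on the device
        self.on_host = {}    # tensor name -> size, swapped out to host
        self.active = set()  # names of tensors currently in use

    def used(self):
        return sum(self.on_gpu.values())

    def allocate(self, name, size):
        # Candidate victims: resident tensors that are not in use,
        # largest first (an arbitrary policy chosen for this sketch).
        victims = sorted((n for n in self.on_gpu if n not in self.active),
                         key=lambda n: self.on_gpu[n], reverse=True)
        for victim in victims:
            if self.used() + size <= self.capacity:
                break
            self.on_host[victim] = self.on_gpu.pop(victim)
        if self.used() + size > self.capacity:
            raise MemoryError(f"cannot fit {name} ({size} units)")
        self.on_gpu[name] = size
        self.active.add(name)

    def release(self, name):
        """Mark a tensor inactive; it stays resident until space is needed."""
        self.active.discard(name)


alloc = ToySwapAllocator(capacity=10)
alloc.allocate("act1", 6)
alloc.release("act1")      # act1 is now inactive but still on the device
alloc.allocate("act2", 6)  # does not fit, so act1 is swapped to host
```

After the last call, `act2` is resident on the device and `act1` sits in host memory. A further request that cannot be satisfied by evicting inactive tensors raises `MemoryError`, which mirrors running out of GPU memory.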
When training with LMS at such resolutions and batch sizes, the amount of data transferred between the CPU and the GPU is high, and the link bandwidth between them becomes a bottleneck for training speed. IBM POWER9™ with NVLink 2.0 offers a unidirectional bandwidth of 75 GBps, which allows faster data transfer than Intel® Xeon® x86 processor-based servers that use Peripheral Component Interconnect Express (PCIe) with a bandwidth of 16 GBps.
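The effect of the two link bandwidths can be estimated with simple arithmetic. The sketch below uses the 75 GBps and 16 GBps figures quoted above. The 4 GB swap volume is a hypothetical value picked for illustration, GBps is assumed to mean 10^9 bytes per second, and the estimate ignores latency and any overlap of transfers with computation.

```python
def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal time to move num_bytes over a link at the given peak bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


NVLINK2_GBPS = 75.0  # unidirectional NVLink 2.0 figure quoted in the text
PCIE_GBPS = 16.0     # PCIe figure quoted in the text

swap_bytes = 4e9     # hypothetical: 4 GB of tensors swapped in one iteration

nvlink_time = transfer_seconds(swap_bytes, NVLINK2_GBPS)
pcie_time = transfer_seconds(swap_bytes, PCIE_GBPS)
ratio = pcie_time / nvlink_time  # 75 / 16, independent of swap_bytes

print(f"NVLink 2.0: {nvlink_time:.3f} s, PCIe: {pcie_time:.3f} s, "
      f"ratio: {ratio:.2f}x")
```

The ratio depends only on the two bandwidths, so under these assumptions the same swap traffic takes about 4.7 times longer over PCIe regardless of the volume chosen.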
This section lists the hardware and the software used in the experimental setup.
|IBM Power System AC922|2x Intel Xeon E5-2698|
|---|---|
|40 cores (two 20c chips), POWER9 with NVLink 2.0|40 cores (two 20c chips)|
|3.8 GHz, 1 TB memory|2.40 GHz, 768 GB memory|
|Four Tesla V100 GPUs, 16 GB per GPU|Eight Tesla V100 GPUs, 16 GB per GPU|
|Red Hat® Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1|Ubuntu 18.04.2 with CUDA 10.1.168 / cuDNN 7.5.1|
|nvidia-driver 418.67|nvidia-driver 418.67|
|Software: IBM PyTorch (POWER9), WML CE 1.6.1 PyTorch 1.1.0|Software: WML CE 1.6.1 PyTorch 1.1.0|
You can find the source code for the PyTorch-based DeepLabv3+ with LMS at: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
We use PyTorch 1.1.0 and the IBM distributed deep learning (DDL) library available with WML CE 1.6.1 on both platforms. DDL is used for the multi-GPU runs.
PyTorch LMS parameters:
Note: The results are based on IBM internal measurements from runs of 1000 iterations.
PyTorch LMS makes it possible to go from a resolution of 900^2 to 2600^2 with a batch size of 2, which is about an 8.35x increase in pixel count. At a fixed resolution of 900^2, PyTorch LMS makes it possible to go from a batch size of 2 to a batch size of 21, which is about a 10.5x increase in the batch size.
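The scaling factors quoted above follow directly from the reported numbers. The short check below recomputes them, assuming that a resolution written as N^2 means an N x N pixel image. The function name `pixel_ratio` is introduced here for illustration only.

```python
def pixel_ratio(side_before, side_after):
    """Ratio of pixel counts between two square image resolutions."""
    return (side_after ** 2) / (side_before ** 2)


resolution_gain = pixel_ratio(900, 2600)  # 6,760,000 / 810,000 pixels
batch_gain = 21 / 2                       # batch size 21 versus 2

print(f"resolution gain: {resolution_gain:.2f}x")  # prints 8.35x
print(f"batch size gain: {batch_gain:.1f}x")       # prints 10.5x
```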
Figure 1. Maximum resolution attainable on DeepLabv3+ using PyTorch LMS
Figure 2. Maximum batch size attainable on DeepLabv3+ using PyTorch LMS
In PyTorch LMS, the CUDA caching allocator swaps inactive tensors out of GPU memory to make room for other tensors that request allocation, which allows larger models to be trained. The link bandwidth between the GPU and the CPU plays a key role in reducing the swapping overhead. The following two charts showcase the benefits of training on an IBM Power® System AC922 server with NVIDIA NVLink 2.0 versus an Intel Xeon x86 server with PCIe Gen3.
Figure 3. Throughput comparison of Power AC922 four GPU versus Xeon x86 four GPU
Figure 4. Throughput comparison of Power AC922 four GPU versus Xeon x86 eight GPU
We observe that the POWER9 processor-based server exhibits 2.92x and 2.4x better throughput than the x86 server with four GPUs and eight GPUs, respectively.
With PyTorch LMS, you can train at higher resolutions and use larger batch sizes for a given resolution. With LMS enabled at very high resolutions, the IBM Power AC922 server also trains faster because of its high-speed NVLink 2.0 connection.