Large Model Support (LMS) technology enables training of large deep neural networks that would otherwise exhaust GPU memory during training.

PyTorch is a relatively new and popular Python-based open source deep learning framework built by Facebook for faster prototyping and production deployment. With its Pythonic style and a gentler learning curve than other frameworks, it has seen rapid adoption. One of the primary features of PyTorch is its support for dynamic computation graphs, which are valuable when the amount of computation is not known ahead of time. Given the growing popularity of PyTorch and the fact that many current neural networks are too large to fit in GPU memory, there is a clear case for a technology that supports large models in PyTorch and runs them with limited GPU memory.

PyTorch in IBM’s WML-CE 1.6.1 includes LMS to enable large PyTorch models. In this blog, we capture the benefits of using PyTorch LMS with DeepLabv3+ [3] and the PASCAL Visual Object Classes (VOC) 2012 data set [4].

DeepLabv3+ and PASCAL data set

DeepLabv3+ is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign a semantic label (such as person, dog, or cat) to every pixel in the input image. Google open sourced DeepLab in 2016, and the model has been improved several times since, with DeepLabv3+ being the latest version [5]. The DeepLabv3+ model has an encoding phase and a decoding phase. The encoding phase extracts the essential information from the image using a convolutional neural network (CNN), whereas the decoding phase reconstructs an output of the appropriate dimensions from the information produced by the encoding phase [6]. The decoder module was added to give better segmentation results along object boundaries. DeepLab supports the following network backbones: MobileNetv2, Xception, ResNet, PNASNet, Auto-DeepLab. We use the Xception network backbone while training the model. The trained model is reported to have been used in Google’s Pixel smartphones for various image segmentation tasks [7].

We use the PASCAL Visual Object Classes (VOC) 2012 data set, which comes from the PASCAL VOC challenge. The goal of this challenge is to recognize objects from several visual object classes in realistic scenes (that is, not pre-segmented objects). The segmentation training data set contains 1464 images.
Interested readers can find TensorFlow Large Model Support (TFLMS) studies on other models at [8] and [9].

PyTorch LMS usage

A PyTorch program enables LMS by calling torch.cuda.set_enabled_lms(True) prior to model creation.

The following LMS tunables are provided to limit the amount of swapping and the kind of tensors that are chosen to be swapped:

  • torch.cuda.set_limit_lms(limit)
    Defines the soft limit in bytes on GPU memory allocated for tensors (default: 0)
  • torch.cuda.set_size_lms(size)
    Defines the minimum tensor size in bytes that is eligible for LMS swapping (default: 1 MB)

For more information, refer to the Getting started page for PyTorch.
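
The calls listed above can be combined into a small setup helper. The following is a minimal sketch, assuming a PyTorch build that ships LMS (such as the one in WML-CE 1.6.1). The helper name enable_lms and its parameter names are hypothetical and not part of any library. On a build without the LMS functions, the helper does nothing and returns False.

```python
try:
    import torch
    cuda = torch.cuda
except ImportError:  # PyTorch is not installed; the helper below then returns False
    cuda = None

MB = 1024 * 1024


def enable_lms(cuda_module, limit_bytes=0, min_swap_bytes=1 * MB):
    """Hypothetical helper: enable LMS if the given torch.cuda module supports it.

    The defaults mirror the documented defaults (limit 0, minimum size 1 MB).
    Returns True if LMS was enabled, False otherwise.
    """
    needed = ("set_enabled_lms", "set_limit_lms", "set_size_lms")
    if cuda_module is None or not all(hasattr(cuda_module, n) for n in needed):
        return False  # builds without LMS lack these functions
    cuda_module.set_enabled_lms(True)            # switch LMS on
    cuda_module.set_limit_lms(limit_bytes)       # soft limit on GPU memory for tensors
    cuda_module.set_size_lms(min_swap_bytes)     # smaller tensors are never swapped
    return True


# Call this before the model is created, as the text above requires.
lms_active = enable_lms(cuda)
print("LMS enabled:", lms_active)
```

Checking for the functions first keeps the same training script usable on PyTorch builds with and without LMS.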

DeepLab with LMS and IBM POWER9 NVLINK advantage

DeepLabv3+ is a large model with a large number of parameters to train, and as image resolution and batch size increase, the model can no longer be trained within the limited GPU memory. For instance, in [5], we observe that we can go up to a resolution of 500 with a batch size of 16 on a 32 GB GPU. If DeepLab were used with higher resolution images, such as satellite or medical images, LMS would be needed to train the model. PyTorch LMS provides this large model capability by modifying the CUDA caching allocator algorithm to swap inactive tensors out of GPU memory.
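
To make the idea of swapping inactive tensors concrete, here is a toy model in plain Python. It is not the LMS implementation and does not touch a GPU. The class names ToyTensor and ToyAllocator, the hard capacity, and the largest-first eviction order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024


@dataclass
class ToyTensor:
    name: str
    size: int            # size in bytes
    active: bool = True  # an active tensor is in use and must stay on the GPU
    on_gpu: bool = True


@dataclass
class ToyAllocator:
    capacity: int                 # total "GPU" bytes in this toy model
    min_swap_size: int = 1 * MB   # smaller tensors are never swapped
    tensors: list = field(default_factory=list)

    def used(self):
        """Bytes currently resident on the toy GPU."""
        return sum(t.size for t in self.tensors if t.on_gpu)

    def allocate(self, name, size):
        # Candidates: resident, inactive, and large enough to be eligible.
        candidates = sorted(
            (t for t in self.tensors
             if t.on_gpu and not t.active and t.size >= self.min_swap_size),
            key=lambda t: t.size,
            reverse=True,
        )
        # Swap candidates out (largest first) until the request fits.
        for victim in candidates:
            if self.used() + size <= self.capacity:
                break
            victim.on_gpu = False  # stands in for a copy to host memory
        if self.used() + size > self.capacity:
            raise MemoryError(f"cannot fit {name} even after swapping")
        tensor = ToyTensor(name, size)
        self.tensors.append(tensor)
        return tensor


alloc = ToyAllocator(capacity=10 * MB)
a = alloc.allocate("a", 6 * MB)
a.active = False                  # "a" is no longer needed right now
b = alloc.allocate("b", 6 * MB)   # does not fit next to "a", so "a" is swapped out
print(a.on_gpu, b.on_gpu, alloc.used() // MB)
```

Without swapping, the second allocation would fail because 12 MB does not fit in 10 MB. With swapping, the inactive tensor moves to host memory and the request succeeds, at the cost of a CPU-GPU transfer.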

In the above scenarios, where the amount of data transferred between the CPU and the GPU is high, the link bandwidth between the CPU and the GPU becomes the bottleneck for training speed. IBM® POWER9™ with NVLink 2.0 offers a unidirectional bandwidth of 75 GBps, which allows faster data transfer than Intel Xeon x86 processor-based servers that connect the GPU over Peripheral Component Interconnect Express (PCIe) with a bandwidth of 16 GBps.
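
A back-of-the-envelope calculation shows what these two bandwidth figures mean for swapping. The sketch below assumes that transfers run at the full quoted bandwidth, which real workloads will not reach, and the 16 GB payload is a hypothetical amount of swapped tensor data.

```python
# Unidirectional link bandwidths quoted in the text, in GB per second.
NVLINK2_GB_PER_S = 75.0
PCIE_GB_PER_S = 16.0


def transfer_seconds(gigabytes, bandwidth_gb_per_s):
    """Ideal time to move the given amount of data over a link."""
    return gigabytes / bandwidth_gb_per_s


swapped_gb = 16.0  # hypothetical amount of tensor data swapped in one direction

nvlink_time = transfer_seconds(swapped_gb, NVLINK2_GB_PER_S)
pcie_time = transfer_seconds(swapped_gb, PCIE_GB_PER_S)
bandwidth_ratio = pcie_time / nvlink_time

print(f"NVLink 2.0: {nvlink_time:.3f} s, PCIe: {pcie_time:.3f} s, "
      f"ratio: {bandwidth_ratio:.2f}x")
```

The ratio depends only on the two bandwidths, so under these idealized assumptions the same factor applies to any amount of swapped data.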

Experimental setup

This section lists the hardware and the software used in the experimental setup.

IBM Power System AC922:

  • 40 cores (two 20c chips), POWER9 with NVLink 2.0
  • 3.8 GHz, 1 TB memory
  • Four Tesla V100 GPUs, 16 GB per GPU
  • Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / CUDNN 7.5.1
  • nvidia-driver 418.67
  • Software: IBM PyTorch (POWER9), WML-CE 1.6.1 PyTorch 1.1.0

2x Intel Xeon E5-2698:

  • 40 cores (two 20c chips)
  • 2.40 GHz, 768 GB memory
  • Eight Tesla V100 GPUs, 16 GB per GPU
  • Ubuntu 18.04.2 with CUDA 10.1.168 / CUDNN 7.5.1
  • nvidia-driver 418.67
  • Software: WML-CE 1.6.1 PyTorch 1.1.0

 

You can find the source code for the PyTorch-based DeepLabv3+ with LMS at: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models

We use PyTorch 1.1.0 and the IBM distributed deep learning (DDL) library available with WML-CE 1.6.1 on both platforms. DDL is used for the multi-GPU runs.

PyTorch LMS parameters: limit_lms=0, size_lms=1MB

Note: The results are based on IBM internal measurements for runs of 1000 iterations.

Results

Figure 1. Maximum resolution attainable on DeepLabv3+ using PyTorch LMS

PyTorch LMS helps to go from a resolution of 900^2 to 2600^2 with a batch size of 2, which is roughly an 8.35x increase in the number of pixels per image.

Figure 2. Maximum batch size attainable on DeepLabv3+ using PyTorch LMS

PyTorch LMS helps to go from a batch size of 2 to a batch size of 21 at a resolution of 900^2. This is about a 10.5x increase in the batch size.
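
The scaling factors for the two results above follow from simple arithmetic on the reported numbers. The short sketch below reproduces them. The function name pixel_ratio is only illustrative.

```python
def pixel_ratio(side_before, side_after):
    """Ratio of pixel counts for square images with the given side lengths."""
    return (side_after ** 2) / (side_before ** 2)


# Resolution result: 900^2 without LMS versus 2600^2 with LMS, at batch size 2.
resolution_gain = pixel_ratio(900, 2600)

# Batch size result: 2 without LMS versus 21 with LMS, at a resolution of 900^2.
batch_gain = 21 / 2

print(f"pixels per image: {resolution_gain:.2f}x, batch size: {batch_gain:.1f}x")
```

Note that the resolution gain is measured in pixels per image, not in side length: the side length itself grows by a factor of about 2.9.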

Competitive comparison

In PyTorch LMS, the CUDA caching allocator swaps inactive tensors out to make room for other tensors that request allocation, which allows larger models to be trained. The link bandwidth between the GPU and the CPU plays a key role in reducing the swapping overhead. The following two charts showcase the benefits of training on a Power AC922 server with an NVIDIA NVLink 2.0 system versus an Intel Xeon x86 server with a PCIe Gen3 system.

Figure 3. Throughput comparison of Power AC922 four GPU versus Xeon x86 four GPU


Figure 4. Throughput comparison of Power AC922 four GPU versus Xeon x86 eight GPU


We observe that the POWER9 processor-based server delivers 2.92x and 2.4x higher throughput than the x86 server with four GPUs and eight GPUs, respectively.

Conclusion

With PyTorch LMS, you can train at higher resolutions and use larger batch sizes at a given resolution. At very high resolutions with LMS enabled, the IBM Power AC922 server also trains faster because of its high-speed NVLink 2.0 connection between the CPU and the GPU.

[1] PyTorch

[2] PyTorch – favorite deep learning tool article

[3] DeepLabv3+

[4] PASCAL VOC 2012 Dataset

[5] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

[6] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

[7] Semantic Segmentation: Introduction to the Deep Learning Technique Behind Google Pixel’s Camera!

[8] Performance results with TensorFlow Large Model Support v2

[9] Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support
