Large model support (LMS) technology enables training of large deep neural networks that do not fit into GPU memory. In this blog, we showcase the advantages of using IBM's WML-CE 1.6.1 TensorFlow Large Model Support (TFLMS) on the DeepLabv3+ model and perform a competitive comparison to highlight the IBM® POWER9™ processor's NVLink 2.0 advantages when training such large neural networks.

TFLMS rewrites the computational graph, introducing swap-in and swap-out operations according to formal rules [1] [2]. The initial release of TFLMS used a breadth-first search approach and chose tensors on the links between the forward and the backward phases as swap candidates. We had to specify a scope for the optimizer, specify a starting point in the graph, and manually tune the LMS parameters. The latest TFLMS takes a different approach and also provides better opportunities for swapping.

The latest TFLMS uses the topological-sort distance between two operations to determine which tensors can be swapped. We no longer need to specify a starting point or identify the operations that belong to the backward phase. Additionally, the latest TFLMS comes with a tuning simulator that can automatically search for the parameters that give the best performance. It also provides synchronization modes to synchronize data transfer and computation on the GPU, as well as a serialization feature that serializes operations at the same level of the topological sort; both of these help fit large models into GPU memory.
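
To make the distance-based selection concrete, here is a simplified, stand-alone Python sketch of the idea (an illustration only, not the actual TFLMS implementation): operations are ordered topologically, and a tensor becomes a swap candidate when the topological distance between its producer and its consumer exceeds a threshold, analogous to the swapout_threshold parameter used later in this blog.

from collections import deque

def topological_order(graph):
    """Return a topological ordering of a DAG given as {op: [consumer ops]}."""
    indegree = {op: 0 for op in graph}
    for consumers in graph.values():
        for consumer in consumers:
            indegree[consumer] += 1
    ready = deque(op for op, degree in indegree.items() if degree == 0)
    order = {}
    while ready:
        op = ready.popleft()
        order[op] = len(order)
        for consumer in graph[op]:
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                ready.append(consumer)
    return order

def swap_candidates(graph, threshold):
    """Yield (producer, consumer) pairs whose topological distance exceeds the threshold."""
    order = topological_order(graph)
    for producer, consumers in graph.items():
        for consumer in consumers:
            if order[consumer] - order[producer] > threshold:
                yield producer, consumer

# Toy forward/backward graph: conv1's output is needed again late in the backward phase.
toy_graph = {
    "conv1": ["conv2", "conv1_grad"],
    "conv2": ["loss"],
    "loss": ["conv2_grad"],
    "conv2_grad": ["conv1_grad"],
    "conv1_grad": [],
}
print(list(swap_candidates(toy_graph, threshold=2)))   # [('conv1', 'conv1_grad')]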

For more detailed information on these technologies, refer to the WML-CE 1.6.1 documentation: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_tflmsv2.html.

DeepLabv3+ and PASCAL data set

DeepLabv3+ is a state-of-the-art deep learning model for semantic image segmentation [3], where the goal is to assign semantic labels (such as person, dog, or cat) to every pixel in the input image. Since Google open sourced the model back in 2016, multiple improvements have been made, the latest being DeepLabv3+ [4]. The DeepLabv3+ model has an encoding phase and a decoding phase. The encoding phase extracts the essential information from the image using a convolutional neural network (CNN), whereas the decoding phase reconstructs an output of the appropriate dimensions from the information produced by the encoder phase [5]. The decoder module was added to give better segmentation results along object boundaries. DeepLab supports the following network backbones: MobileNetv2, Xception, ResNet, PNASNet, and Auto-DeepLab. We use the Xception network backbone for training the DeepLab model. The trained model is reported to have been used in Google's Pixel smartphone for various image segmentation tasks [6].
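
For intuition only, the following tf.keras sketch mimics the encoder-decoder pattern described above: strided and atrous (dilated) convolutions extract features at reduced resolution, and the decoder upsamples back to per-pixel class predictions. It is a toy stand-in with made-up layer sizes, not the Xception-based DeepLabv3+ model we actually train.

import tensorflow as tf

def toy_encoder_decoder(input_shape=(512, 512, 3), num_classes=21):
    """Toy encoder-decoder in the spirit of DeepLabv3+ (not the real model)."""
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder: strided convolutions reduce resolution; an atrous (dilated)
    # convolution enlarges the receptive field without further downsampling.
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(128, 3, dilation_rate=2, padding="same", activation="relu")(x)
    # Decoder: upsample back to the input resolution and predict a class per pixel.
    x = tf.keras.layers.UpSampling2D(size=4, interpolation="bilinear")(x)
    outputs = tf.keras.layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = toy_encoder_decoder()
model.summary()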

We use the PASCAL Visual Object Classes (VOC) 2012 data set from the PASCAL VOC challenge. The goal of this challenge is to recognize objects from several visual object classes in realistic scenes (that is, not pre-segmented objects). The segmentation training data set contains 1464 images [7].

DeepLabv3+ is a large model with a large number of parameters to train, and as we move to higher image resolutions and larger batch sizes, the model no longer fits in the limited GPU memory. For instance, in [8], we observe that we can go up to a resolution of 500 with a batch size of 16 on a 32 GB GPU. If DeepLab were used with higher resolution images, such as satellite or medical images, we would definitely need LMS to train the model.
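
As a back-of-the-envelope illustration of why memory runs out so quickly (the 256-channel feature map below is a hypothetical size, chosen only to show the scaling), activation memory grows quadratically with resolution and linearly with batch size:

def feature_map_gib(height, width, channels=256, batch_size=1, bytes_per_element=4):
    """Memory of a single float32 feature map, in GiB."""
    return height * width * channels * batch_size * bytes_per_element / 2**30

for resolution in (500, 1400, 3200):
    print(f"{resolution}x{resolution}: {feature_map_gib(resolution, resolution):.1f} GiB")
# 500x500: ~0.2 GiB, 1400x1400: ~1.9 GiB, 3200x3200: ~9.8 GiB -- and a real network
# keeps many such activations alive at once for the backward pass.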

Interested readers can find TFLMS studies on other models at [8] and [9].

IBM POWER9 and NVIDIA NVLink

In scenarios where the amount of data transferred between the CPU and GPU is high, the link bandwidth between the CPU and the GPU becomes a bottleneck for faster training. IBM POWER9, with NVLink 2.0 providing a unidirectional bandwidth of 75 GBps, allows faster data transfer than Intel Xeon x86 processor-based servers, where the GPUs are attached over Peripheral Component Interconnect Express (PCIe) with a bandwidth of 16 GBps.
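
To put those bandwidths in perspective, here is a rough, idealized calculation (it ignores latency, protocol overhead, and compute/transfer overlap, and the 2 GB tensor size is hypothetical):

tensor_gb = 2.0                       # hypothetical swapped tensor size, in GB
nvlink_gbps, pcie_gbps = 75.0, 16.0   # unidirectional bandwidths quoted above

t_nvlink = tensor_gb / nvlink_gbps    # ~0.027 s
t_pcie = tensor_gb / pcie_gbps        # ~0.125 s
print(f"NVLink 2.0: {t_nvlink * 1000:.0f} ms, PCIe Gen3: {t_pcie * 1000:.0f} ms, "
      f"ratio: {t_pcie / t_nvlink:.1f}x")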

Experimental setup

This section lists the hardware and the software used in the experimental setup.

IBM Power System AC922 (POWER9 with NVLink 2.0)
- 40 cores (two 20-core chips), 3.8 GHz, 1 TB memory
- Four Tesla V100 GPUs, 16 GB GPU memory each
- Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1
- NVIDIA driver 418.67
- Software: IBM TFLMS (POWER9), TFLMSv2 - WML-CE 1.6.1 tensorflow-large-model-support 2.0.1

x86 server (2x Intel Xeon E5-2698)
- 40 cores (two 20-core chips), 2.40 GHz, 768 GB memory
- Eight Tesla V100 GPUs, 16 GB GPU memory each
- Ubuntu 18.04.2 with CUDA 10.1.168 / cuDNN 7.5.1
- NVIDIA driver 418.39
- Software: TFLMSv2 - WML-CE 1.6.1 tensorflow-large-model-support 2.0.1

You can find the DeepLabv3+ source code with LMS enabled at: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models

We use tensorflow-large-model-support 2.0.1, tensorflow 1.14.0a1, and the IBM Distributed Deep Learning (DDL) library available with WML-CE 1.6.1 on both platforms. We use DDL for the multi-GPU runs on both platforms.

TFLMS parameters: swapout_threshold=1, swapin_ahead=1, swapin_groupby=0, sync_mode=0
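
The sketch below shows how these values could be supplied to TFLMSv2. The module and class names are our assumptions based on the package name (tensorflow-large-model-support 2.0.1) and the WML-CE 1.6.1 documentation linked earlier; verify the exact API against that guide.

# Assumed import path for the tensorflow-large-model-support 2.0.1 package;
# check the WML-CE 1.6.1 TFLMS documentation for the exact usage.
from tensorflow_large_model_support import LMS

lms = LMS(swapout_threshold=1,
          swapin_ahead=1,
          swapin_groupby=0,
          sync_mode=0)   # parameter values used for the runs in this blog

# How the LMS object is attached to training (for example, as a Keras callback
# or by rewriting the constructed graph) depends on the training path; see the
# documentation linked above.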

Note: The results are based on the IBM internal measurements for running 1000 iterations.

Results

Figure 1. Maximum resolution attainable on DeepLabv3+ using TFLMS

TFLMSv2 helps go from a resolution of 1400^2 to a resolution of 3200^2 for a batch size of 1, roughly a 5X increase in the number of pixels. For applications that benefit from examining or working on high resolution images, LMS makes such high resolutions attainable within the limited GPU memory available.

Figure 2. Maximum batch size attainable on DeepLabv3+ using TFLMS

TFLMSv2 helps go from a batch size of 1 to a batch size of 5 at a resolution of 1400^2, a 5X increase in batch size. TFLMS can be used to increase both batch size and resolution depending on the model's needs. We could push to even higher resolutions or batch sizes using sync mode 3 or serialization.

Competitive comparison

As mentioned, TFLMSv2 performs swap-in and swap-out operations on tensors to allow training of large models that do not fit into GPU memory. Thus, the link bandwidth between the GPU and CPU has a significant impact on training speed. The following two charts showcase the benefits of training on a Power AC922 server with NVLink 2.0 versus a Xeon x86 server with PCIe Gen3.

Figure 3. Throughput comparison of Power AC922 4 GPU versus Xeon x86 8 GPU


Figure 4. Throughput comparison of Power AC922 4 GPU versus Xeon x86 4 GPU


We observe that the POWER9 server with four GPUs delivers 3.13X and 2.77X better throughput than the x86 server with four and eight GPUs, respectively.

Conclusion

TFLMSv2 in WML-CE 1.6.1 lets us train DeepLabv3+ at roughly 5X higher resolutions or batch sizes than GPU memory alone permits, and the NVLink 2.0 connection between the POWER9 CPUs and the GPUs in the Power AC922 delivers roughly 3X better throughput than a PCIe-attached x86 server on these swap-heavy workloads.

References

[1] TFLMS: Large Model Support in TensorFlow by Graph Rewriting

[2] Tung D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2019. Automatic GPU memory management for large neural models in TensorFlow. In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM 2019). ACM, New York, NY, USA, 1-13. DOI

[3] DeepLab

[4] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

[5] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

[6] Semantic Segmentation: Introduction to the Deep Learning Technique Behind Google Pixel’s Camera!

[7] PASCAL VOC 2012 Development Kit

[8] Performance results with TensorFlow Large Model Support v2

[9] Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support
