Large Model Support (LMS) technology enables training of large deep neural networks that do not fit into GPU memory. In this article, we showcase the advantages of using IBM® Watson™ Machine Learning Community Edition (WML CE) 1.6.1 TensorFlow Large Model Support (TFLMS) on the DeepLabv3+ model and perform a competitive comparison to highlight the NVLink 2.0 advantages of the IBM POWER9™ processor when training such large neural networks.
TFLMS rewrites the computational graph, introducing swap-in and swap-out operations using formal rules. The initial release of TFLMS used a breadth-first search approach and chose the tensors to swap from the links between the forward and the backward phases. We had to specify a scope for the optimizer, specify a starting point in the graph, and manually tune the LMS parameters. The latest TFLMS takes a different approach and also provides better opportunities for swapping.
The latest TFLMS uses the topological sort distance between two operations to determine which tensors can be swapped. We no longer need to specify either a starting point or the operations that are in the backward phase. Additionally, the latest TFLMS comes with a tuning simulator that can autotune the parameters for best performance. It also provides synchronization modes to synchronize data transfer and computation on the GPU, and a serialization feature to serialize operations at the same level in the topological sort; both help to fit larger models.
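The idea of selecting tensors by topological sort distance can be illustrated with a minimal sketch. This is not the TFLMS implementation: the toy graph, the function names, and the `threshold` argument are hypothetical, and the sketch assumes that an operation's position is its longest-path level in the graph. A tensor (an edge from producer to consumer) becomes a swap candidate when the consumer sits more than `threshold` levels after the producer.

```python
from collections import deque

def topological_levels(graph):
    """Map each op to its level: the longest path length from any source op."""
    indegree = {op: 0 for op in graph}
    for consumers in graph.values():
        for consumer in consumers:
            indegree[consumer] += 1
    level = {op: 0 for op in graph}
    queue = deque(op for op, deg in indegree.items() if deg == 0)
    while queue:  # Kahn's algorithm: visit ops in topological order
        op = queue.popleft()
        for consumer in graph[op]:
            level[consumer] = max(level[consumer], level[op] + 1)
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                queue.append(consumer)
    return level

def swap_candidates(graph, threshold):
    """Return (producer, consumer, distance) for edges longer than threshold."""
    level = topological_levels(graph)
    return sorted(
        (src, dst, level[dst] - level[src])
        for src, consumers in graph.items()
        for dst in consumers
        if level[dst] - level[src] > threshold
    )

# Hypothetical toy graph: forward ops f1..f3, a loss, and backward ops b3..b1
# that reuse the forward outputs. Keys are ops, values are their consumers.
graph = {
    "f1": ["f2", "b1"],
    "f2": ["f3", "b2"],
    "f3": ["loss", "b3"],
    "loss": ["b3"],
    "b3": ["b2"],
    "b2": ["b1"],
    "b1": [],
}

candidates = swap_candidates(graph, threshold=1)
for src, dst, distance in candidates:
    print(f"{src} -> {dst}: distance {distance}")
```

In this toy graph, the forward outputs that are consumed again late in the backward pass are selected, without marking which operations belong to the backward phase or naming a starting point.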
For more detailed information on these technologies, refer to the WML CE 1.6.1 documentation.
DeepLabv3+ and PASCAL data set
DeepLabv3+ is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign semantic labels (such as a person, a dog, a cat, and so on) to every pixel in the input image. DeepLab was open sourced by Google in 2016, and multiple improvements have since been made to the model, the latest being DeepLabv3+. The DeepLabv3+ model has an encoding phase and a decoding phase. The encoding phase extracts the essential information from the image using a convolutional neural network (CNN), whereas the decoding phase reconstructs an output of the appropriate dimensions from the information produced by the encoding phase. The decoder module was added to give better segmentation results along object boundaries. DeepLab supports the following network backbones: MobileNetv2, Xception, ResNet, PNASNet, and Auto-DeepLab. We use the Xception network backbone for training the DeepLab model. The trained model has reportedly been used in Google's Pixel smartphone for various image segmentation tasks.
We use the PASCAL Visual Object Classes (VOC) 2012 data set from the PASCAL VOC challenge. The goal of this challenge is to recognize objects from several visual object classes in realistic scenes (that is, not pre-segmented objects). The segmentation training data set contains 1464 images.
DeepLabv3+ is a large model with a large number of parameters to train. As the image resolution and batch size increase, the model can no longer be trained within the limited GPU memory. For instance, a resolution of 500 with a batch size of 16 has been reported as the limit on a 32 GB GPU. If DeepLab were used with higher resolution images, such as satellite or medical images, LMS would be needed to train the model.
IBM POWER9 and NVIDIA NVLink
In scenarios where the amount of data transfer between the CPU and the GPU is high, the link bandwidth between them becomes a bottleneck for training speed. IBM POWER9 with NVLink 2.0, which has a unidirectional bandwidth of 75 GBps, allows faster data transfer than Intel® Xeon® x86 processor-based servers that connect the GPUs over Peripheral Component Interconnect Express (PCIe) with a bandwidth of 16 GBps.
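To put the two bandwidth figures in perspective, the following back-of-the-envelope sketch estimates how long one swap of a tensor would take over each link. The 1 GB tensor size is a hypothetical value chosen for illustration, and the calculation assumes the full quoted bandwidth is achieved, which real transfers typically do not reach.

```python
def transfer_time_ms(size_gb, bandwidth_gbps):
    """Milliseconds needed to move size_gb gigabytes at bandwidth_gbps GB/s."""
    return size_gb / bandwidth_gbps * 1000.0

NVLINK_GBPS = 75.0  # unidirectional NVLink 2.0 figure quoted in the text
PCIE_GBPS = 16.0    # PCIe figure quoted in the text

tensor_gb = 1.0     # hypothetical tensor size, for illustration only

nvlink_ms = transfer_time_ms(tensor_gb, NVLINK_GBPS)
pcie_ms = transfer_time_ms(tensor_gb, PCIE_GBPS)

print(f"NVLink 2.0: {nvlink_ms:.1f} ms")
print(f"PCIe:       {pcie_ms:.1f} ms")
print(f"PCIe takes {pcie_ms / nvlink_ms:.2f}x longer per swap")
```

Because LMS issues many such swaps in every training iteration, this per-transfer gap accumulates whenever the transfers cannot be fully overlapped with computation.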
This section lists the hardware and the software used in the experimental setup.
| IBM Power System AC922 | 2x Intel Xeon E5-2698 |
|---|---|
| 40 cores (two 20c chips), POWER9 with NVLink 2.0 | 40 cores (two 20c chips) |
| 3.8 GHz, 1 TB memory | 2.40 GHz, 768 GB memory |
| Four Tesla V100 GPUs, 16 GB per GPU | Eight Tesla V100 GPUs, 16 GB per GPU |
| Red Hat® Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / CUDNN 7.5.1 | Ubuntu 18.04.2 with CUDA 10.1.168 / CUDNN 7.5.1 |
| nvidia-driver 418.67 | nvidia-driver 418.39 |
| Software: IBM TFLMS (POWER9), TFLMSv2: WML CE 1.6.1 tensorflow-large-model-support 2.0.1 | Software: TFLMSv2: WML CE 1.6.1 tensorflow-large-model-support 2.0.1 |
You can find the DeepLabv3+ source code with LMS enabled at: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
On both platforms, we use tensorflow-large-model-support 2.0.1, tensorflow-1.14.0a1, and the IBM Distributed Deep Learning library (DDL) available with WML CE 1.6.1. DDL is used for the multi-GPU runs.
TFLMS parameters: swapout_threshold=1, swapin_ahead=1, swapin_groupby=0, sync_mode=0
Note: The results are based on IBM internal measurements for running 1000 iterations.
As shown in figure 1, TFLMSv2 raises the maximum resolution from 1400^2 to 3200^2 for a batch size of 1. This is roughly a 5.2X increase in the number of pixels per image. For applications that benefit from examining or working on high resolution images, LMS makes such resolutions achievable within the limited GPU memory available.
Figure 1. Maximum resolution attainable on DeepLabv3+ using TFLMS
As shown in figure 2, TFLMSv2 raises the maximum batch size from 1 to 5 at a resolution of 1400^2. This is a 5X increase in batch size. TFLMS can be used to increase both batch size and resolution, depending on the model's needs. Higher resolutions or batch sizes could be reached using sync mode 3 or serialization.
Figure 2. Maximum batch size attainable on DeepLabv3+ using TFLMS
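The scaling factors behind figures 1 and 2 follow from simple arithmetic, sketched below. The helper name is hypothetical; the resolutions and batch sizes are the ones reported above, and the resolution is assumed to describe square images, as the ^2 notation suggests.

```python
def pixel_ratio(side_before, side_after):
    """Ratio of pixel counts between two square image resolutions."""
    return (side_after ** 2) / (side_before ** 2)

# Figure 1: maximum resolution without and with LMS at batch size 1.
resolution_gain = pixel_ratio(1400, 3200)

# Figure 2: maximum batch size without and with LMS at resolution 1400^2.
batch_gain = 5 / 1

print(f"Pixels per image: {resolution_gain:.2f}x")
print(f"Batch size:       {batch_gain:.0f}x")
```

Both experiments increase the amount of input data per training step by about a factor of five.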
As mentioned, TFLMSv2 swaps tensors in and out of GPU memory to allow training of large models that do not fit into GPU memory. Thus, the link bandwidth between the GPU and the CPU has a significant impact on training speed. The following two charts showcase the benefits of training on an IBM Power® System AC922 server with NVLink 2.0 versus a Xeon x86 server with PCIe Gen3.
Figure 3. Throughput comparison of Power AC922 4 GPU versus Xeon x86 8 GPU
Figure 4. Throughput comparison of Power AC922 4 GPU versus Xeon x86 4 GPU
We observe that the POWER9 server with four GPUs exhibits 3.13X and 2.77X better throughput than the x86 server with four and eight GPUs, respectively.
With TFLMSv2, you can train at higher resolutions and reach larger batch sizes for a given resolution. With LMS enabled at very high resolutions, the IBM Power AC922 server also trains faster because of its high-speed NVLink 2.0.
Tung D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2019. Automatic GPU memory management for large neural models in TensorFlow. In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM 2019). ACM, New York, NY, USA, 1-13.