Large model support (LMS) technology enables training of large deep neural networks that do not fit into GPU memory. In this blog, we showcase the advantages of using IBM's WML-CE 1.6.1 TensorFlow Large Model Support (TFLMS) on the DeepLabv3+ model and perform a competitive comparison to highlight the IBM® POWER9™ processor's NVLink 2.0 advantages while training such large neural networks.
TFLMS rewrites the computational graph, introducing swap-in and swap-out operations according to formal rules. The initial release of TFLMS used a breadth-first search approach and chose tensors to swap across the links between the forward and backward phases. We had to specify a scope for the optimizer, specify a starting point in the graph, and manually tune the LMS parameters. The latest TFLMS takes a different approach and also provides better opportunities for swapping.
The latest TFLMS uses the topological-sort distance between two operations to determine which tensors can be swapped. We no longer need to specify a starting point or identify the operations in the backward phase. Additionally, the latest TFLMS comes with a tuning simulator that can automatically find the parameters that achieve the best performance. It also offers synchronization modes to synchronize data transfer and computation on the GPU, and a serialization feature to serialize operations at the same level of the topological sort; both features help fit larger models.
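To make the topological-sort-distance idea concrete, here is a minimal, self-contained sketch in plain Python. It is illustrative only: the graph, the `swap_candidates` helper, and the threshold are hypothetical stand-ins for what TFLMS does internally on the real TensorFlow graph (where the threshold role is played by tuned LMS parameters).

```python
from collections import deque

def toposort_levels(graph):
    """Assign each op its level in a topological sort.

    graph maps op -> list of downstream ops (consumers).
    Returns {op: level}; ops with no inputs get level 0.
    """
    indegree = {op: 0 for op in graph}
    for consumers in graph.values():
        for op in consumers:
            indegree[op] = indegree.get(op, 0) + 1
    queue = deque(op for op, d in indegree.items() if d == 0)
    level = {op: 0 for op in queue}
    while queue:
        op = queue.popleft()
        for nxt in graph.get(op, []):
            # An op's level is one past its deepest producer.
            level[nxt] = max(level.get(nxt, 0), level[op] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return level

def swap_candidates(graph, threshold):
    """Flag tensors whose producer and consumer are far apart in
    topological-sort distance -- these sit unused in GPU memory for
    a long time, so they are good candidates to swap out to host
    memory and swap back in shortly before the consumer runs."""
    level = toposort_levels(graph)
    return [(p, c)
            for p, consumers in graph.items()
            for c in consumers
            if level[c] - level[p] >= threshold]

# Toy graph: a -> b -> c -> d, plus a skip edge a -> d, mimicking a
# forward activation ("a") that is only reused much later ("d", e.g.
# in the backward phase).
g = {"a": ["b", "d"], "b": ["c"], "c": ["d"], "d": []}
print(swap_candidates(g, 3))  # only the long-lived a -> d tensor
```

In the toy graph, only the `a -> d` edge spans a distance of 3 levels, so only that tensor would be swapped; the short-lived intermediate tensors stay resident on the GPU.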
For more detailed information on these technologies, refer to the WML-CE 1.6.1 documentation: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_tflmsv2.html.
DeepLabv3+ and PASCAL data set
DeepLabv3+ is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign a semantic label (such as person, dog, or cat) to every pixel in the input image. Open sourced by Google back in 2016, the model has seen multiple improvements, the latest being DeepLabv3+. The DeepLabv3+ model has an encoding phase and a decoding phase. The encoding phase extracts the essential information from the image using a convolutional neural network (CNN), whereas the decoding phase reconstructs an output of the appropriate dimensions from the information obtained in the encoding phase. The decoder module was added to give better segmentation results along object boundaries. DeepLab supports the following network backbones: MobileNetv2, Xception, ResNet, PNASNet, and Auto-DeepLab. We use the Xception backbone for training the DeepLab model. The trained model is reported to have been used in Google's Pixel smartphones for various image segmentation tasks.
We use the PASCAL Visual Object Classes 2012 data set from the PASCAL VOC challenge. The goal of this challenge is to recognize objects from several visual object classes in realistic scenes (that is, not pre-segmented objects). The segmentation training data set contains 1464 images.
DeepLabv3+ is a large model with many parameters to train, and as we move to higher image resolutions and batch sizes, the model no longer fits in the limited GPU memory. For instance, we observe that we can go up to a resolution of 500 with a batch size of 16 on a 32 GB GPU. If DeepLab were used with higher-resolution images, such as satellite or medical images, we would definitely need LMS to train the model.
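A rough back-of-the-envelope calculation shows why memory runs out so quickly. The layer shape below (a stride-2 feature map with 64 channels) is a hypothetical example for illustration, not taken from the actual Xception backbone configuration; real training also holds gradients, optimizer state, and many such activation tensors at once.

```python
def tensor_bytes(batch, height, width, channels, dtype_bytes=4):
    """Memory footprint of one dense fp32 activation tensor."""
    return batch * height * width * channels * dtype_bytes

# Hypothetical early-backbone activation for a 500x500 input at
# batch size 16: stride-2 output (250x250) with 64 channels, fp32.
one_activation = tensor_bytes(16, 250, 250, 64)
print(one_activation / 1e6, "MB")  # 256.0 MB for a single tensor
```

At 256 MB per tensor of this shape, a few dozen retained activations alone approach the 32 GB card limit, before counting weights, gradients, and workspace memory, which is why swapping activations to the much larger host memory lets training proceed.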
IBM POWER9 and NVIDIA NVLink
In scenarios where the amount of data transferred between the CPU and GPU is high, the link bandwidth between the CPU and the GPU becomes a bottleneck for faster training. IBM POWER9 with NVLink 2.0, which has a unidirectional bandwidth of 75 GBps, allows faster data transfer than Intel Xeon x86 processor-based servers, which connect the CPU and GPU over Peripheral Component Interconnect Express (PCIe) Gen3 with a bandwidth of 16 GBps.
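The bandwidth gap translates directly into a lower bound on swap time. The sketch below uses the 75 GBps and 16 GBps figures from the text; the 2 GB swap volume per step is a hypothetical example, and the calculation ignores latency and protocol overhead.

```python
GB = 1e9  # the GBps figures above are decimal gigabytes per second

def transfer_seconds(nbytes, link_gbps):
    """Lower-bound time to move nbytes over a link of the given
    unidirectional bandwidth (ignores latency and overhead)."""
    return nbytes / (link_gbps * GB)

swap_bytes = 2 * GB  # hypothetical tensors swapped per training step
nvlink = transfer_seconds(swap_bytes, 75)  # NVLink 2.0: ~0.027 s
pcie = transfer_seconds(swap_bytes, 16)    # PCIe Gen3 x16: 0.125 s
print(f"NVLink {nvlink:.3f}s vs PCIe {pcie:.3f}s "
      f"({pcie / nvlink:.2f}x faster on NVLink)")
```

On raw bandwidth alone, NVLink 2.0 moves the same swap traffic about 4.7x (75/16) faster, so the more a model relies on swapping, the more the interconnect dominates end-to-end training speed.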
This section lists the hardware and the software used in the experimental setup.
| IBM Power System AC922 | 2x Intel Xeon E5-2698 |
| --- | --- |
| 40 cores (two 20-core chips), POWER9 with NVLink 2.0 | 40 cores (two 20-core chips) |
| 3.8 GHz, 1 TB memory | 2.40 GHz, 768 GB memory |
| Four Tesla V100 GPUs, 16 GB GPU memory | Eight Tesla V100 GPUs, 16 GB GPU memory |
| Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1 | Ubuntu 18.04.2 with CUDA 10.1.168 / cuDNN 7.5.1 |
| NVIDIA driver 418.67 | NVIDIA driver 418.39 |
| Software: IBM TFLMSv2, WML-CE 1.6.1 tensorflow-large-model-support 2.0.1 | Software: TFLMSv2, WML-CE 1.6.1 tensorflow-large-model-support 2.0.1 |
You can find the DeepLabv3+ source code with LMS enabled at: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
We use tensorflow-large-model-support 2.0.1, tensorflow-1.14.0a1, and the IBM Distributed Deep Learning library (DDL) available with WML-CE 1.6.1 on both platforms. We use DDL for multi-GPU runs on both platforms.
Note: The results are based on the IBM internal measurements for running 1000 iterations.
Figure 1. Maximum resolution attainable on DeepLabv3+ using TFLMS
Figure 2. Maximum batch size attainable on DeepLabv3+ using TFLMS
As mentioned, TFLMSv2 performs swap-in and swap-out operations on tensors to allow training of large models that do not fit into GPU memory. Thus, the link bandwidth between the GPU and CPU has a significant impact on training speed. The following two charts showcase the benefits of training on a Power AC922 server with NVLink 2.0 versus a Xeon x86 server with PCIe Gen3.
Figure 3. Throughput comparison of Power AC922 4 GPU versus Xeon x86 8 GPU
Figure 4. Throughput comparison of Power AC922 4 GPU versus Xeon x86 4 GPU
We observe that POWER9 delivers 3.13x and 2.77x better throughput than the x86 server with four and eight GPUs, respectively.
 Tung D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2019. Automatic GPU memory management for large neural models in TensorFlow. In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM 2019). ACM, New York, NY, USA, 1-13. DOI