Introduction

Advances in deep learning are creating use cases that require larger models and larger datasets. One such use case is MRI image segmentation to identify brain tumors. Training such models increases GPU memory requirements, but GPUs are limited in their memory capacity: the largest GPUs currently available offer only 16GB or 32GB of memory, so larger models simply cannot fit on the device. This article discusses how IBM PowerAI TensorFlow Large Model Support (TF-LMS) enables training of large models by overcoming this GPU memory limitation, using CPU memory in conjunction with GPU memory. The approach is seamless, applies generally to any model, and is easily enabled through a small set of parameters specified when training the model. Although the additional swap-in/swap-out operations between CPU memory and GPU memory might be perceived as overhead, the IBM Power AC922 system, with high-bandwidth NVLink 2.0 connecting CPU and GPU, reduces this overhead and enables efficient training of large models compared to competitive platforms, as is evident from the performance results discussed in the sections below.

TensorFlow Large Model Support

TensorFlow Large Model Support (TF-LMS) [1] was introduced to customers as a technology preview in PowerAI 1.5.2 [2]. TF-LMS enables the use of high-resolution datasets, larger models, and/or larger batch sizes by allowing system memory to be used in conjunction with GPU memory. TF-LMS modifies the TensorFlow graph prior to training, injecting swap nodes that move tensors out of GPU memory to system memory and back in as needed. It also provides controls to configure when and what is swapped. The detailed graph-rewriting methodology is described in the paper TF-LMS: Large Model Support in TensorFlow by Graph Rewriting [3]. TF-LMS is part of the TensorFlow contrib in PowerAI and has been contributed to the community as a pull request on GitHub [4]. PowerAI also includes the Distributed Deep Learning (DDL) library [5], an optimized component for multi-GPU/multi-node distributed training. In this article, TF-LMS is used together with DDL for optimized model training on the AC922 with 4x V100 GPUs.
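For a Keras model such as the one used in this article, TF-LMS is enabled by adding its Keras callback to the training call. The snippet below is a minimal sketch; the module path tensorflow.contrib.lms, the LMSKerasCallback class name, and the toy model are assumptions based on the PowerAI 1.5.x contrib packaging and may need to be adjusted for the TF-LMS version installed on your system.

```python
# Minimal sketch: enabling TF-LMS tensor swapping for a Keras model.
# The import path and class name are assumptions based on the PowerAI 1.5.x
# contrib packaging; adjust them to match your TF-LMS installation.
import numpy as np
from keras.models import Sequential
from keras.layers import Conv3D
from tensorflow.contrib.lms import LMSKerasCallback   # assumed PowerAI contrib path

# Toy stand-in model; the real workload is the 3DUnet CNN described below.
model = Sequential([
    Conv3D(16, 3, padding='same', activation='relu',
           input_shape=(192, 192, 192, 4)),
    Conv3D(4, 1, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# The callback rewrites the TensorFlow graph when training starts, injecting
# swap-in/swap-out nodes so tensors can spill to system memory.
lms_callback = LMSKerasCallback()

x = np.random.rand(1, 192, 192, 192, 4).astype('float32')
y = np.random.rand(1, 192, 192, 192, 4).astype('float32')
model.fit(x, y, batch_size=1, epochs=1, callbacks=[lms_callback])
```

Only the callback registration is specific to TF-LMS; everything else is ordinary Keras training code.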

3DUnet CNN Model for Medical Image Segmentation

The large-model training results discussed in this article use a large 3DUnet CNN model proposed by Isensee et al. [8] for the Brain Tumor Segmentation (BraTS) 2017 Challenge [7]. The BraTS Challenge focuses on the evaluation of state-of-the-art methods for the segmentation of brain tumors in magnetic resonance imaging (MRI) scans. BraTS 2017 utilizes multi-institutional pre-operative MRI scans and focuses on the segmentation of intrinsically heterogeneous (in appearance, shape, and histology) brain tumors. 3DUnet CNN is based on the popular U-Net architecture [9], but makes different design choices regarding the exact architecture, normalization schemes, number of feature maps throughout the network, nonlinearity, and the structure of the up-sampling pathway. The 3DUnet network architecture shown below was trained with randomly sampled patches of 192x192x192 voxels and a batch size of 1, using the Adam optimizer. Larger patch sizes improve the accuracy of tumor prediction in the MRI images. 3DUnet and the BraTS dataset are a good example of a large DL model being used in a real-world scenario: without TF-LMS, the model cannot fit in 16GB of GPU memory at the 192x192x192 patch size.
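To get an intuition for why a 192x192x192 patch exhausts a 16GB GPU, consider a back-of-the-envelope estimate of activation memory. The numbers below are illustrative assumptions (4 MRI modalities as input channels, 16 feature maps in the first encoder level, float32 activations), not measurements of the actual model.

```python
# Back-of-the-envelope activation memory for one 192^3 patch (illustrative only).
voxels = 192 ** 3                      # 7,077,888 voxels per patch
bytes_per_value = 4                    # float32

input_mb = voxels * 4 * bytes_per_value / 2.0**20          # 4 MRI modalities
first_level_mb = voxels * 16 * bytes_per_value / 2.0**20   # assume 16 feature maps

print("input patch:           %.0f MB" % input_mb)         # ~108 MB
print("one 16-channel tensor: %.0f MB" % first_level_mb)   # ~432 MB

# A 3D U-Net keeps encoder activations for its skip connections and produces
# several such tensors per level, plus gradients and optimizer state, so the
# total footprint quickly exceeds the 16 GB available on a single V100.
```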

A Keras-based 3DUnet Convolutional Neural Network (CNN) model implementing the architecture proposed by Isensee et al. was used for the evaluations. This Keras model [10] was originally written by David G. Ellis and targeted a single GPU. The model was enhanced to invoke the TF-LMS module; a TF-LMS case study based on this enhanced 3DUnet model was published here. In addition, to optimize training performance on the AC922 and to scale to multiple GPUs, the PowerAI DDL (Distributed Deep Learning) library was integrated into the model. For a fair comparison using a comparable technology to DDL on the x86 platform, we modified the model to use Horovod [11], the open-source distributed multi-GPU deep learning framework. This article highlights the multi-GPU competitive comparison using TF-LMS.
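The Horovod changes on the x86 side follow the standard Horovod-with-Keras recipe. The sketch below shows that pattern; build_3dunet_model() and training_generator() are hypothetical placeholders standing in for the actual model-building and data-loading code in the 3DUnet repository.

```python
# Sketch of the standard Horovod + Keras multi-GPU training pattern (TF 1.x era).
import tensorflow as tf
import keras
import horovod.keras as hvd

hvd.init()                                              # one process per GPU

# Pin each worker process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

model = build_3dunet_model()                            # placeholder for the real model

# Scale the learning rate with the number of workers and wrap the optimizer
# so gradients are averaged across GPUs with allreduce.
opt = keras.optimizers.Adam(lr=5e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='categorical_crossentropy')

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # The TF-LMS Keras callback would also be added here to enable tensor swapping.
]

model.fit_generator(training_generator(),               # placeholder data pipeline
                    steps_per_epoch=228 // hvd.size(),
                    epochs=1,
                    callbacks=callbacks,
                    verbose=1 if hvd.rank() == 0 else 0)
```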

Competitive Comparison

We evaluated the performance of the 3DUnet CNN model with TensorFlow Large Model Support on IBM POWER9 AC922 systems and on a competitive x86 server. Below are the details of the hardware and software used in this comparison, followed by the results of our measurements.

Hardware and Software Setup

The IBM AC922 servers pair POWER9 processors with NVIDIA Tesla V100 (Volta) GPUs connected through high-speed NVLink 2.0, which provides a maximum CPU-GPU bandwidth of 150GB/s. The x86-based competitive platform connects CPU and GPU over a 32GB/s PCIe Gen3 link, with NVLink connectivity between the GPUs.

Hardware Stack:
  • IBM Power AC922; 40 cores (2 x 20-core chips), POWER9 with NVLink 2.0; 2.25 GHz; 512GB memory; 4x Tesla V100 GPUs (16GB); RHEL 7.5 for POWER9; CUDA 9.2/396.44; cuDNN 7.2.1
  • x86 server (Intel Xeon); 40 cores (2 x 20-core chips); 2.20 GHz; 512GB memory; 8x Tesla V100 GPUs (16GB); Ubuntu 16.04; CUDA 9.0/384.145; cuDNN 7.2.1
Software Stack:
  • Framework on P9 (AC922)/V100: TensorFlow 1.10 with LMS, from PowerAI 1.5.3
  • Framework on x86: TensorFlow 1.10 standard distribution with the LMS contrib code
  • Model on IBM AC922: the 3DUnet CNN model was modified to run on the AC922 using the TF-LMS Keras callback and to use DDL for scaling to 4 GPUs
  • Model on x86: the 3DUnet CNN model was modified to scale to multiple GPUs using the Horovod distributed training framework for TensorFlow and Keras; the model also uses the TF-LMS Keras callback to enable LMS tensor swapping
  • The Keras 3DUnet CNN model was written to process the TCGA and MICCAI BraTS 2017 datasets [12]. The BraTS 2017 dataset is preprocessed and converted to .h5 files. The dataset has 285 images/subjects: 228 (80%) for training and 57 (20%) for validation (the sketch below illustrates this split)
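That 80/20 partition can be expressed as a simple split over subject indices. The following sketch is purely illustrative; it is not the preprocessing code from the model repository.

```python
# Illustrative 80/20 train/validation split over the 285 preprocessed subjects.
import numpy as np

num_subjects = 285
rng = np.random.RandomState(42)            # fixed seed for a reproducible split
indices = rng.permutation(num_subjects)

split = int(0.8 * num_subjects)            # 228 training subjects
train_ids, val_ids = indices[:split], indices[split:]

print(len(train_ids), len(val_ids))        # prints: 228 57
```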

Results

TF-LMS modifies the TensorFlow graph to insert swap-in/swap-out nodes that enable large models to use both GPU and CPU memory. These nodes transfer tensors from GPU memory to CPU memory and back as required during training. The AC922 provides higher-bandwidth CPU-GPU communication through NVLink 2.0 (a maximum of 150GB/s bidirectional between CPU and GPU) compared to PCIe 3.0 (a maximum of 32GB/s bidirectional) on x86. This enables efficient swapping of data between system memory and GPU memory, increasing throughput on the AC922 servers. In addition, on the x86 server two consecutive GPUs share the same PCIe switch connecting to the CPU socket, which further reduces the available bandwidth in multi-GPU scenarios.
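As a rough illustration of what this bandwidth difference means per swap, the sketch below reuses the assumed ~432 MB activation tensor from the earlier estimate. These are peak link rates only; real transfer times also depend on latency, achievable link efficiency, and how well transfers overlap with compute.

```python
# Illustrative swap-transfer times for a single large activation tensor.
tensor_gb = 0.432            # ~432 MB tensor, from the earlier estimate

nvlink2_gbps = 150.0         # AC922 CPU-GPU NVLink 2.0 peak bandwidth (GB/s)
pcie3_gbps = 32.0            # x86 CPU-GPU PCIe Gen3 peak bandwidth (GB/s)

print("NVLink 2.0: %.1f ms" % (tensor_gb / nvlink2_gbps * 1000))   # ~2.9 ms
print("PCIe Gen3:  %.1f ms" % (tensor_gb / pcie3_gbps * 1000))     # ~13.5 ms

# At peak rates the same swap takes ~4.7x longer over PCIe Gen3, and this cost
# is paid for every tensor that TF-LMS moves between GPU and system memory.
```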

The results from our performance evaluation and comparison between the IBM AC922 and the x86 server are shown below.

TF-LMS – IBM AC922/V100 vs. x86/V100 – 1 GPU

The chart below shows the per-epoch time comparison of TF-LMS on the AC922 and on x86 with one V100 GPU. With the 3DUnet CNN image segmentation model, a patch size of 192^3, and a batch size of 1, TF-LMS on P9 (AC922)/V100 with 1 GPU achieves a 2.4x better epoch time than x86/V100 with 1 GPU.

*Note that the results are based on IBM internal measurements running 1 epoch of training of the 3DUnet CNN model (mini-batch size = 1, patch size = 192^3) on the BraTS dataset [12].

TF-LMS – IBM AC922/V100 vs. x86/V100 – 4 GPU

The chart below shows the per-epoch time comparison of TF-LMS on the AC922 and on x86 in a multi-GPU scenario with four V100 GPUs. With the 3DUnet CNN image segmentation model, a patch size of 192^3, and a batch size of 1, TF-LMS on P9 (AC922)/V100 with 4 GPUs, optimized with DDL, achieves a 3.6x better epoch time than x86/V100 with 4 GPUs.

*Note that the results are based on IBM internal measurements running 1 epoch of training of the 3DUnet CNN model (mini-batch size = 1, patch size = 192^3) on the BraTS dataset [12].

TF-LMS – IBM AC922/4xV100 vs. x86/8xV100

The chart below shows the per-epoch time comparison of TF-LMS on the AC922 with 4 V100 GPUs versus 8 V100 GPUs on x86. With the 3DUnet CNN image segmentation model, a patch size of 192^3, and a batch size of 1, TF-LMS on P9 (AC922)/V100 with 4 GPUs achieves a 2.1x better epoch time than x86/V100 with 8 GPUs.

*Note that the results are based on IBM internal measurements running 1 epoch of training of the 3DUnet CNN model (mini-batch size = 1, patch size = 192^3) on the BraTS dataset [12].

Conclusion

TF-LMS enables training of large deep learning models with high-resolution datasets that cannot fit into GPU memory. The multi-GPU performance comparison of the 3DUnet model for medical image segmentation using TF-LMS shows the advantage of the IBM AC922 hardware platform with high-speed NVLink 2.0 and the IBM PowerAI optimized software distribution with Large Model Support: together they enable faster training of such large models on high-resolution data, thereby maximizing the productivity of data scientists and researchers.

References / Citations

  1. TensorFlow Large Model Support – https://developer.ibm.com/linuxonpower/deep-learning-powerai/#lms
  2. PowerAI – https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/
  3. TF-LMS: Large Model Support in TensorFlow by Graph Rewriting – https://arxiv.org/abs/1807.02037
  4. TF-LMS pull request – https://github.com/tensorflow/tensorflow/pull/19845
  5. Distributed Deep Learning (DDL)- https://www.ibm.com/blogs/research/2017/08/distributed-deep-learning/
  6. Gradient checkpointing – https://github.com/openai/gradient-checkpointing
  7. BraTS 2017 Challenge – https://www.cbica.upenn.edu/sbia/Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2017_proceedings_shortPapers.pdf
  8. Isensee F., et al., “Brain Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017 Challenge”, arXiv:1802.10508v1
  9. Ronneberger O., Fischer P., Brox T., “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
  10. 3DUnet CNN – https://github.com/ellisdg/3DUnetCNN
  11. Horovod – https://github.com/uber/horovod
  12. BraTS 2017 dataset – https://www.med.upenn.edu/sbia/brats2017/data.html

Citations for BraTS’17 – Multimodal Brain Tumor Segmentation Challenge 2017 Dataset:

  1. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, Lanczi L, Gerstner E, Weber MA, Arbel T, Avants BB, Ayache N, Buendia P, Collins DL, Cordier N, Corso JJ, Criminisi A, Das T, Delingette H, Demiralp Ç, Durst CR, Dojat M, Doyle S, Festa J, Forbes F, Geremia E, Glocker B, Golland P, Guo X, Hamamci A, Iftekharuddin KM, Jena R, John NM, Konukoglu E, Lashkari D, Mariz JA, Meier R, Pereira S, Precup D, Price SJ, Raviv TR, Reza SM, Ryan M, Sarikaya D, Schwartz L, Shin HC, Shotton J, Silva CA, Sousa N, Subbanna NK, Szekely G, Taylor TJ, Thomas OM, Tustison NJ, Unal G, Vasseur F, Wintermark M, Ye DH, Zhao L, Zhao B, Zikic D, Prastawa M, Reyes M, Van Leemput K. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694
  2. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, Freymann JB, Farahani K, Davatzikos C. “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features”, Nature Scientific Data, 4:170117 (2017) DOI: 10.1038/sdata.2017.117
  3. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.KLXWJJ1Q
  4. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.GJQ7R0EF
