In PowerAI 1.6, the TensorFlow Large Model Support (TFLMS) module has a new implementation and has graduated from tech preview status. This new implementation can achieve much higher levels of swapping, which in turn enables training and inferencing with higher-resolution data, deeper models, and larger batch sizes.
In this article, we investigated the runtime performance of model training with TensorFlow Large Model Support across image resolutions on three different models: ResNet50 from keras_applications run with TensorFlow Keras, DeepLabV3+, and 3D U-Net.
For these tests, we used a single NVIDIA V100 GPU with 32 GB of memory. The performance test runs were done on the IBM Power Systems AC922 server, which is the server used in the Summit supercomputer. The POWER architecture used in the IBM Power Systems AC922 is the only system architecture that has NVLink connections between the CPU and GPU. For a deeper look into the benefits of using TensorFlow Large Model Support on this architecture, see these resources:
- TensorFlow Large Model Support Case Study with 3D Image Segmentation
- Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support
The software levels that were used are the levels included in PowerAI 1.6:
- TensorFlow 1.13.1
- TensorFlow Large Model Support 2.0.0
- CUDA 10.1
- cuDNN 7.5
For testing purposes, we ran each model training for a small number of iterations across a wide range of image resolutions. Since larger images take more time to process due to the larger amount of data they contain, the data rate was normalized to pixels or voxels per second to allow data rate comparisons across the resolution spectrum.
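As a minimal sketch of that normalization, the helper below converts a timed run into pixels or voxels per second. The function name, its parameters, and the timings are illustrative assumptions, not the measurement code or results from these tests.

```python
def data_rate(shape, batch_size, iterations, elapsed_seconds):
    """Return pixels (2D) or voxels (3D) processed per second."""
    elements_per_image = 1
    for dim in shape:
        elements_per_image *= dim
    return elements_per_image * batch_size * iterations / elapsed_seconds


# Illustrative timings only, not measurements from this article.
rate_2d = data_rate((3000, 3000), batch_size=1, iterations=10, elapsed_seconds=5.0)
rate_3d = data_rate((128, 128, 128), batch_size=1, iterations=10, elapsed_seconds=4.0)

print(f"{rate_2d / 1e6:.1f} megapixels/s")
print(f"{rate_3d / 1e6:.2f} megavoxels/s")
```

Dividing by elapsed time after multiplying by the element count is what makes a slow run on a large image comparable with a fast run on a small one.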
The first model we looked at was ResNet50. Why ResNet50? First, it is a common network to benchmark with. Second, while image classification itself may not benefit from high-resolution data, image classification networks such as ResNet50 can be used as feature extractors for object detection and image segmentation networks such as DeepLabV3+, which can benefit from higher-resolution data.
The model we used is provided as a PowerAI example here: https://github.com/IBM/powerai/tree/powerai-1.6.0/examples/tensorflow_large_model_support/v2. This example uses TensorFlow Keras and the ResNet50 model defined in the keras_applications module.
We started with a resolution that fits in GPU memory for training and then incremented each image dimension by 500. Eventually the training failed with out-of-memory errors. At that point we enabled TFLMS and continued incrementing the image dimensions until the training failed with out-of-memory errors even with TFLMS enabled. Here is a graph of the data rate over the various resolutions:
As we can see from the graph, TFLMS allows training with images that contain 10 times more megapixels than the maximum resolution without TFLMS. The megapixels-per-second data rate of the model training climbs and then levels off as the image resolution increases. Once TFLMS is enabled, the rate drops a bit due to the overhead of tensor swapping, but generally stays level through 6000×6000. At the 6000×6000 resolution, which has 4x the megapixels of the maximum resolution that could fit in memory, the data rate has only degraded about 5% from the non-TFLMS level. After 6000×6000, the data rate gradually drops off as fewer and fewer tensors fit in GPU memory and more must be swapped. At the 9500×9500 resolution, the largest operation output is 16 GiB. When factoring in memory space for the model, GPU kernels, and input tensors, very few operation tensors are able to remain in memory.
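The test procedure used for this sweep (grow the resolution until training runs out of memory, enable TFLMS, then keep growing until it fails again) can be sketched roughly as follows. Everything here is hypothetical scaffolding: `OutOfMemory`, `sweep`, and `fake_train_step` are illustrative names, the starting size of 500 is an assumption, and the 3000 limit is only inferred from the statement that 6000×6000 has 4x the megapixels of the largest image that fit without TFLMS. In a real TensorFlow run the failure would typically surface as `tf.errors.ResourceExhaustedError`.

```python
class OutOfMemory(Exception):
    """Stand-in for the framework's out-of-memory error (hypothetical)."""


def sweep(train_step, start, step):
    """Grow a square resolution until training fails twice:
    once without swapping, then again with swapping enabled.

    train_step(side, use_lms) is a hypothetical callable that returns a
    data rate or raises OutOfMemory.
    """
    results = []
    side, use_lms = start, False
    while True:
        try:
            rate = train_step(side, use_lms)
        except OutOfMemory:
            if use_lms:
                break       # failed even with swapping: stop the sweep
            use_lms = True  # retry the same resolution with swapping on
            continue
        results.append((side, use_lms, rate))
        side += step
    return results


def fake_train_step(side, use_lms):
    # Simulated memory limits so the sketch runs anywhere; not real training.
    limit = 9500 if use_lms else 3000
    if side > limit:
        raise OutOfMemory(f"{side}x{side} does not fit")
    return 1.0  # placeholder data rate


runs = sweep(fake_train_step, start=500, step=500)
print(runs[-1])
```

Retrying the failed resolution with swapping enabled, rather than skipping ahead, gives a data point at the first size that needs TFLMS.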
The next model we looked at was 3D U-Net. For 3D U-Net, we used this model: https://github.com/ellisdg/3DUnetCNN, and then enabled the TFLMS Keras callback and converted the model to use TensorFlow Keras. The updated model code is here. We started with a resolution that fits in GPU memory and incremented each image dimension by 16 voxels. Eventually the training failed with out-of-memory errors, and we enabled TFLMS to continue incrementing the image dimensions.
The data rate graph is similar to the graph from ResNet50:
As you can see from the graph, TFLMS allows a 5x increase in MRI resolution above the maximum resolution without TFLMS. The data rate is level from 256^3 to 320^3 with TFLMS enabled and shows a 14% degradation from the rate measured while the training fit in GPU memory. It is interesting to note that a 320^3 image has 15.6x the voxels of a 128^3 image, so the training incurs only a 14% performance overhead on images with more than 15x the data. Similar to the ResNet50 model, the data rate gradually drops as the resolution increases. At the 400^3 resolution, the largest operation output of the model was 19 GiB, which again shows how few tensors can fit in GPU memory at this resolution.
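The 15.6x figure follows directly from cubing the ratio of the side lengths. A small check, using a hypothetical helper name:

```python
def element_ratio(large_side, small_side, dims):
    """Ratio of element counts between two square (dims=2) or cubic (dims=3) inputs."""
    return (large_side / small_side) ** dims


voxel_ratio = element_ratio(320, 128, dims=3)
print(voxel_ratio)  # 15.625
```

The same helper with `dims=2` reproduces the 4x megapixel factor quoted for ResNet50, assuming the largest image that fit without TFLMS was 3000×3000: `element_ratio(6000, 3000, dims=2)` is 4.0.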
The last model we looked at was DeepLabV3+. For this model, we used the DeepLab implementation from the TensorFlow models repository here: https://github.com/tensorflow/models/tree/master/research/deeplab (at commit 078575a) with the PASCAL VOC data set. For DeepLabV3+, we used a constant batch size of 16 across the resolution range and enabled fine_tune_batch_norm, which enables training of the embedded xception65 network. For the DeepLabV3+ network, we also used the NVIDIA profiler (nvprof) to profile 5 batches at several resolutions and measure the amount of data moved from system memory to GPU memory and the average GPU utilization. The resulting data graph is:
As you can see from the graph, TFLMS was enabled starting at the 600×600 resolution. TFLMS allows a 10x image resolution increase over the maximum non-LMS resolution of 500×500. In the 600×600 to 800×800 resolution range, TFLMS shows a 15% overhead relative to the data rate seen at 500×500.
The nvprof data shows us that despite the swapping overhead, the GPU compute utilization actually increases. The amount of system-to-GPU memory transfers continues to climb to a very impressive 826 GB at 1300×1300 and 1.4 TB at 1600×1600. The nvprof data showed that the average throughput on the NVLink 2.0 connection between the CPU and GPU was 71.4 GB/s, which is 95% of the link’s maximum. At the 1600×1600 resolution, the GPU compute utilization drops to 64%. This is due in some part to the memory overhead of nvprof. At the 1600×1600 resolution, a higher amount of swapping was enabled while profiling to allow the training to succeed with the additional memory overhead of nvprof. This higher amount of swapping in turn leads to the lower GPU utilization.
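To put those transfer volumes in perspective, here is a rough back-of-the-envelope calculation using only the figures quoted above. It assumes the transfers ran at the reported average throughput and takes 1 TB as 1000 GB. The result is an estimate of time spent moving data over the 5 profiled batches, not a measured value, and it says nothing about how much of that time overlapped with computation.

```python
measured_gb_per_s = 71.4  # average NVLink 2.0 throughput reported by nvprof
fraction_of_peak = 0.95   # "95% of the link's maximum"

# Link maximum implied by the two figures above (about 75 GB/s).
implied_peak_gb_per_s = measured_gb_per_s / fraction_of_peak


def transfer_seconds(total_gb, throughput_gb_per_s=measured_gb_per_s):
    """Estimated time to move total_gb at the given average throughput."""
    return total_gb / throughput_gb_per_s


t_1300 = transfer_seconds(826)   # 826 GB at 1300x1300
t_1600 = transfer_seconds(1400)  # 1.4 TB at 1600x1600, assuming 1 TB = 1000 GB

print(f"implied link maximum: {implied_peak_gb_per_s:.1f} GB/s")
print(f"1300x1300: about {t_1300:.1f} s of transfers")
print(f"1600x1600: about {t_1600:.1f} s of transfers")
```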
TensorFlow Large Model Support in PowerAI 1.6 allows training models with much higher resolution data. Combining the large model support with the IBM Power Systems AC922 server allows the training of these high resolution models with low data rate overhead.
June 11, 2019: Update links to models used.