Introduction

IBM Watson Machine Learning Community Edition 1.7.0 (WML CE) includes TensorFlow 2.1.0, which has been enhanced with Large Model Support. Large Model Support (LMS) enables successful training of deep learning models that would otherwise exhaust GPU memory, allowing them to scale significantly beyond what was previously possible and, ultimately, to generate more accurate results.

In this article, we investigate the runtime performance of model training with LMS across a range of image resolutions on three different models: ResNet50 and ResNet152v2 from TensorFlow Keras (tf.keras.applications), and 3D U-Net.

For these tests, we used a single NVIDIA V100 GPU with 32 GB of memory in an IBM Power Systems AC922 server, the same server used in the Summit supercomputer. The POWER architecture in the AC922 is the only system architecture that has NVLink connections between the CPU and the GPUs.

The software levels that were used are the levels included in WML CE 1.7.0:

  • TensorFlow 2.1.0 with Large Model Support
  • CUDA 10.2
  • cuDNN 7.6.5

For testing purposes, we ran each model training for a small number of iterations across a wide range of image resolutions. Larger images have more data and therefore take longer to process, so we normalized the data rate to pixels or voxels per second to allow data rate comparisons across the resolution spectrum.
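To make that normalization concrete, here is a minimal sketch in Python; the function name and the example numbers are ours, not taken from the benchmark code:

    # Normalize training throughput to (mega)pixels or (mega)voxels per second
    # so that runs at different image resolutions can be compared directly.
    def normalized_rate(batch_size, image_dims, seconds_per_step):
        elements = batch_size
        for dim in image_dims:
            elements *= dim
        return elements / seconds_per_step / 1e6

    # Example: a batch of one 2500x2500 image taking 0.5 seconds per step.
    print(normalized_rate(1, (2500, 2500), 0.5))   # 12.5 megapixels per second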

3D U-Net

The first model we looked at was 3D U-Net. This is a common model for 3D image segmentation and is very memory intensive. For 3D U-Net, we started with this model, converted it to use TensorFlow Keras, and then updated it to support TensorFlow 2.1.0. The updated model code is here. The model is run with LMS enabled, LMS defragmentation enabled, and LMS statistic logging enabled. The LMS runtime statistics, which include the average data rate and the GiB of data swapped/reclaimed per iteration, are logged to a file. The following graph shows the data rate (megavoxels per second) and GiB swapped per batch curves:
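For reference, here is a minimal sketch of how these LMS options can be turned on, assuming the setter names documented for the WML CE 1.7.0 TensorFlow build; they are not part of stock TensorFlow, so verify them against your installed documentation:

    import tensorflow as tf

    # These setters exist only in the WML CE build of TensorFlow 2.1.0; the
    # names follow the WML CE 1.7.0 LMS documentation (verify locally).
    tf.config.experimental.set_lms_enabled(True)         # enable tensor swapping to system memory
    tf.config.experimental.set_lms_defrag_enabled(True)  # enable GPU memory defragmentation

    # The per-iteration statistics logging used in this article comes from a
    # callback shipped with the LMS examples (alongside ManyModel.py), not
    # from a core TensorFlow API.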

As you can see from the graph, LMS allows a 7.1 times increase in MRI resolution above the maximum resolution possible without LMS. The data rate begins to drop after LMS is enabled, but shows only a 21% reduction while processing 7.1 times the number of voxels per batch. The 400³ resolution is the largest possible with this model. Above this resolution, the tensors begin to contain more than 2 giga-elements, which is not supported by cuDNN.
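To see why this limit binds near 400³, consider a single early-layer activation tensor; the 32-filter count below is illustrative rather than taken from the model code:

    # cuDNN indexes tensors with 32-bit signed integers, capping any single
    # tensor at 2**31 (about 2.1 giga) elements. A hypothetical activation
    # with 32 filters over a 400**3 volume sits just under that cap:
    limit = 2**31                  # 2,147,483,648 elements
    print(32 * 400**3)             # 2,048,000,000 -- still fits
    print(32 * 416**3 > limit)     # True: a modest bump in resolution exceeds the cap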

ResNet152v2

The next model we looked at was ResNet152v2. ResNet152 is a variant of the ResNet model with more layers than the typically used ResNet50. The tensors produced by the additional layers consume more memory than those of ResNet50, making this model a good candidate to benefit from LMS.

The model we used is the ResNet152v2 model from TensorFlow Keras (tf.keras.applications). The LMS example, ManyModel.py, provides an easy way to test LMS with the various models provided by tf.keras; we ran it with the ResNet152v2 model option.
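As a rough stand-in for what ManyModel.py does (its actual command line is not reproduced here), the sketch below builds the same tf.keras.applications model at increasing resolutions and times one training step on synthetic data at each size; the resolution range, optimizer, and loss are illustrative choices of ours:

    import time
    import numpy as np
    import tensorflow as tf

    # Sweep input resolutions in steps of 500, timing one training step at
    # each size. weights=None gives a randomly initialized network, which is
    # all a throughput measurement needs.
    for resolution in range(500, 10000, 500):   # start/stop values are illustrative
        model = tf.keras.applications.ResNet152V2(
            weights=None, input_shape=(resolution, resolution, 3), classes=1000)
        model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

        # Synthetic data: one random image and label are enough for a timing run.
        images = np.random.rand(1, resolution, resolution, 3).astype("float32")
        labels = tf.keras.utils.to_categorical([0], num_classes=1000)

        start = time.time()
        model.fit(images, labels, batch_size=1, epochs=1, verbose=0)
        seconds = time.time() - start   # a real benchmark would time many steps
                                        # and discard the first (tracing) step
        print(f"{resolution}x{resolution}: "
              f"{resolution * resolution / seconds / 1e6:.1f} megapixels/second")
        tf.keras.backend.clear_session()   # release graph state between sizes

Without LMS enabled, a loop like this fails with an out-of-memory error once the resolution passes the 2500×2500 point described below.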

We start with a resolution that fits in GPU memory for training and then increment each image dimension by 500. The model is run with LMS enabled, LMS defragmentation enabled, and LMS statistic logging enabled. The LMS runtime statistics, which include the average data rate and the GiB of data swapped/reclaimed per iteration, are logged to a file. The following graph shows the data rate (megapixels per second) and GiB swapped per batch curves:

This time the graph shows that LMS allows training with images that contain 14.4 times more megapixels than the maximum resolution without LMS. The megapixels per second data rate of the model training climbs and then begins to level off when the image resolution reaches 2500×2500, which is the maximum resolution that can be processed without LMS. As the resolution increases further, LMS begins to reclaim GPU memory by swapping tensors to system memory. The rate drops a bit due to the overhead of tensor swapping, but levels off at the 5000×5000 resolution. At the 9500×9500 resolution, the data rate has dropped only 30% while processing 14.4 times the number of pixels per batch.
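The 14.4 times figure is simply the ratio of per-image pixel counts between the largest LMS-enabled run and the largest run that fits without LMS:

    # Pixels per image at 9500x9500 versus 2500x2500.
    print(9500**2 / 2500**2)   # 14.44 -- the 14.4x figure above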

ResNet50

The last model we looked at was ResNet50. Why ResNet50? First, it is a common network for benchmarking. Second, while image classification itself may not benefit from high resolution data, image classification networks such as ResNet50 are often used as feature extractors for object detection and image segmentation networks, which can benefit from higher resolution data.

The model we used is the ResNet50 model from TensorFlow Keras (tf.keras.applications). As with ResNet152v2, we ran the LMS example ManyModel.py with the ResNet50 model option.

We start with a resolution that fits in GPU memory for training and then increment each image dimension by 500. The model is run with LMS enabled, LMS defragmentation enabled, and LMS statistic logging enabled. The LMS runtime statistics, which include the average data rate and the GiB of data swapped/reclaimed per iteration, are logged to a file. The following graph shows the data rate (megapixels per second) and GiB swapped per batch curves:

The graph shows that LMS allows training with images that contain 8.3 times more megapixels than the maximum resolution without LMS. The data rate climbs and then generally levels off before 4000×4000, which is the maximum resolution that can be processed without LMS. After LMS is enabled, the data rate drops off a bit faster than with ResNet152v2, with a 45% reduction from the peak data rate at the 11000×11000 resolution. The maximum resolution for this model is 11500×11500. Above this resolution, the tensors begin to contain more than 2 giga-elements, which is not supported by cuDNN.

Conclusion

Large Model Support for TensorFlow 2 in WML CE 1.7.0 allows you to train models with much higher resolution data. Combining Large Model Support with the IBM Power Systems AC922 server allows these high resolution models to be trained with only moderate data rate overhead.
