What is Large Model Support?

Deep Learning is a rapidly evolving field under the umbrella of Artificial Intelligence. It has already demonstrated the capability to solve a variety of problems in Computer Vision, Natural Language Processing, and Video and Text Processing. Deep Learning neural networks consist of multiple hidden layers, and the number of layers often depends on the number of features that you would like the network to learn. As the complexity of the problem grows, the neural networks, or models, become larger. Deeper neural networks with hundreds of layers (such as ResNet-152) or even thousands (ResNet-1001) are currently available. The larger the model, the more memory is required to train it.

Deep Learning (DL) training is generally run on accelerated hardware such as GPUs, which can satisfy the high computational demands of neural networks. But GPUs are limited in memory capacity. The latest available GPUs offer no more than 16GB or 32GB of memory, which is far less than the memory available to CPUs. Hence, larger models (with deeper or wider layers) cannot fit into the limited memory available on GPUs. A similar memory limitation arises with large input data (especially high-resolution images such as MRI or satellite images) and larger batch sizes.

To overcome this memory limitation in GPUs and enable large model support, multiple approaches can be used. In the IBM PowerAI 1.5.2 release, the TensorFlow framework includes a Python module called Large Model Support (TF-LMS). TF-LMS addresses the memory limitation in GPUs by using CPU memory as temporary space to store tensors during the Deep Learning training phase. This approach can be used generally for any model and is enabled through a set of command line parameters passed when training your model. Although the additional swap-in/swap-out operations between CPU memory and GPU memory add overhead, the IBM Power AC922 systems, with high-bandwidth NVLink 2.0 connecting the CPU and GPU, reduce the impact of this overhead and enable more efficient training of large models than competitive platforms, as the results in the following sections show.

About TensorFlow Large Model Support (TF-LMS)

TensorFlow Large Model Support (TF-LMS) is a new feature that is part of PowerAI 1.5.2, released as a Tech Preview to customers. TF-LMS enables the use of high-resolution datasets, larger models and/or larger batch sizes by allowing system memory to be used in conjunction with GPU memory. TF-LMS modifies the TensorFlow graph prior to training to inject swap nodes that swap tensors out of GPU memory to system memory and back in when they are needed. It also provides controls to configure when and what is swapped in and out. The graph rewriting methodology is discussed in detail in the paper TFLMS: Large Model Support in TensorFlow by Graph Rewriting.
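To make the idea of injecting swap nodes more concrete, here is a toy sketch in plain Python. It is not the TF-LMS implementation and does not use any TensorFlow API: the Op class, the inject_swaps function, and the min_distance threshold are all hypothetical names invented for this illustration. The sketch assumes a topologically ordered list of operations and reroutes any tensor that is consumed "far" from where it was produced through a swap-out/swap-in pair.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """A toy graph node; each op produces one tensor named after the op."""
    name: str
    inputs: List[str] = field(default_factory=list)


def inject_swaps(ops, min_distance):
    """Return a new op list with swap nodes added (illustrative only).

    Assumes `ops` is topologically ordered. An input produced at least
    `min_distance` positions before its consumer is routed through a
    swap_out node (placed right after the producer) and a swap_in node
    (placed right before the consumer).
    """
    position = {op.name: i for i, op in enumerate(ops)}
    swapped = set()
    with_swap_ins = []
    for j, op in enumerate(ops):
        new_inputs = []
        for src in op.inputs:
            if j - position[src] >= min_distance:
                swapped.add(src)
                swap_in = Op(f"swap_in/{src}/for/{op.name}", [f"swap_out/{src}"])
                with_swap_ins.append(swap_in)
                new_inputs.append(swap_in.name)
            else:
                new_inputs.append(src)
        with_swap_ins.append(Op(op.name, new_inputs))

    # Second pass: place one swap_out directly after each swapped producer.
    result = []
    for op in with_swap_ins:
        result.append(op)
        if op.name in swapped:
            result.append(Op(f"swap_out/{op.name}", [op.name]))
    return result


# A tiny forward/backward chain: conv1's output is needed again much later.
example = [
    Op("conv1"),
    Op("conv2", ["conv1"]),
    Op("conv3", ["conv2"]),
    Op("grad3", ["conv3", "conv2"]),
    Op("grad2", ["grad3", "conv1"]),
]
rewritten = inject_swaps(example, min_distance=3)
for op in rewritten:
    print(op.name, "<-", op.inputs)
```

In this toy graph only conv1's output is consumed far enough away to be swapped, so it can leave GPU memory while conv2, conv3 and grad3 run. The real TF-LMS rewriting works on the TensorFlow graph itself and exposes its own controls for what gets swapped and when.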

TF-LMS is part of TensorFlow contrib in PowerAI 1.5.2. It has been contributed to the community (not yet accepted), and the pull request is available at https://github.com/tensorflow/tensorflow/pull/19845

PowerAI includes the Distributed Deep Learning (DDL) library, a component optimized for multi-GPU/multi-node distributed deep learning training. TF-LMS uses DDL for multi-GPU training on POWER systems for optimized performance.

Enlarged GoogLeNet Model

GoogLeNet is a deep neural network and one of the incarnations of the Inception architecture from Google, which won the ILSVRC14 ImageNet competition in 2014. GoogLeNet is 22 layers deep, and the architecture is described in the paper Going deeper with convolutions. The original GoogLeNet model that comes with the TensorFlow benchmarks (HPM) uses an image crop size of 224×224 when running with the ImageNet dataset. The model was modified to use an image size of 2240×2240, thereby increasing the input data size of the model. This serves as a good use case to show the advantages of Large Model Support. The performance evaluation results in the following sections use this Enlarged GoogLeNet with the Enlarged ImageNet dataset (2240×2240).

With larger batch sizes per iteration of the Enlarged GoogLeNet, the data and model size become too big to fit in GPU memory. With regular TensorFlow 1.8, only a batch size of 11 or fewer can be run on the GPU without an out-of-memory error for the Enlarged GoogLeNet model. TF-LMS can go beyond that batch size because it uses CPU memory as swap space. We picked a batch size of 15 for the following evaluations using Enlarged GoogLeNet.
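As a rough back-of-the-envelope check on how much the enlarged images grow the input, the sketch below computes the size of one input batch at both resolutions. It assumes 3 color channels and 2 bytes per value (FP16), which are assumptions made for this illustration; the helper name input_batch_bytes is hypothetical. It counts only the input batch, not the intermediate activations, which typically account for much more of the training memory.

```python
def input_batch_bytes(batch_size, height, width, channels=3, bytes_per_value=2):
    """Bytes needed to hold one input batch (assumes RGB images stored as FP16)."""
    return batch_size * height * width * channels * bytes_per_value


original = input_batch_bytes(batch_size=15, height=224, width=224)
enlarged = input_batch_bytes(batch_size=15, height=2240, width=2240)

print("original batch bytes:", original)
print("enlarged batch bytes:", enlarged)
print("growth factor:", enlarged // original)  # 10x per side -> 100x pixels
```

Scaling each side by 10 multiplies the pixel count, and therefore the input size, by 100.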

Competitive Comparison of TF-LMS

We evaluated the performance of TensorFlow Large Model Support on IBM POWER9 AC922 systems and on a competitive x86 server. The following section describes the details of the hardware and software used in this comparison and the results based on our observations.

HARDWARE OVERVIEW

IBM’s AC922 servers come with the flagship POWER9 processors and NVIDIA Volta (V100) GPUs connected through high-speed NVLink 2.0, with a maximum bandwidth of 150GB/s between CPU and GPU.

The x86-based competitive platform used for the comparison has a 32GB/s PCIe Gen3 link between CPU and GPU, and NVLink connectivity between the GPUs.

HARDWARE STACK:

  • IBM Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 512GB memory, 4xTesla V100 GPU, RHEL7.5 for P9, CUDA9.2/396.26, cuDNN7.1.4
  • Intel Xeon processor, 40 cores (2 x 20c chips); 2.20 GHz; 512GB memory, 8xTesla V100 GPU, Ubuntu 16.04, CUDA9.0/384.81, cuDNN7.1.1.5

In addition to the IBM AC922 hardware infrastructure, IBM has demonstrated its leadership in enabling Cognitive systems through the IBM PowerAI software offering. PowerAI includes the leading Deep Learning frameworks and features that differentiate the IBM POWER platform. In the following measurements, we used TensorFlow with LMS, which is part of PowerAI Release 1.5.2.

SETUP FOR COMPETITIVE COMPARISON

  1. On IBM AC922: Install PowerAI 1.5.2 packages to get TensorFlow 1.8 with LMS and HPM benchmarks
  2. On x86 platform: Install the standard TensorFlow 1.8 distribution and integrate the LMS contrib code (from PR 19845) using the steps below
    • Pip install TF 1.8.0
    • Integrate LMS code to contrib
      • Copy the LMS code to the TensorFlow contrib directory on the competitive platform (e.g. ~/tf-v1.8.0/lib/python2.7/site-packages/tensorflow/contrib)
      • The LMS code can be found in the pull request https://github.com/tensorflow/tensorflow/pull/19845
      • NOTE: for Python 3 based installations, use the path lib/python3.6/site-packages/tensorflow/contrib/lms
      • Change the contrib init to pick up lms
      • Open __init__.py in the contrib directory and add the line: from tensorflow.contrib import lms
  3. HPM benchmark:
    Use the HPM benchmarks that are installed as part of PowerAI 1.5.2 on both platforms.

  4. On x86 platform, make the change below in benchmark_cnn.py
      To avoid the memory_optimizer incompatibility issue, set the memory optimizer’s config to SCHEDULING_HEURISTICS by adding the lines below in benchmark_cnn.py
    • + if params.lms:
      +   config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.SCHEDULING_HEURISTICS
  5. Dataset and Model: Enlarged ImageNet Dataset (2240×2240) and Enlarged GoogLeNet Model (crop size – 2240×2240) with TensorFlow HPM (High Performance Benchmark)
    • Use the HPM command line options --model=googlenet --image_size=2240 to use the enlarged ImageNet Dataset and enlarged GoogLeNet model
  6. Run parameters:
    • Mini-Batch size: 15
    • LMS options used: --lms=True --lms_lb=1 --lms_n_tensors=-1 (When --lms_lb=1 --lms_n_tensors=-1 options are used, all the reachable tensors on the GPU are swapped, ensuring minimum memory use on GPU)
    • All results are with FP16 option of TensorFlow
  7. Command to run:
    • On P9(AC922): ddlrun --mpiarg -pami_noib -H hostname python tf_cnn_benchmarks.py --num_batches=500 --num_gpus=4 --batch_size=15 --local_parameter_device=gpu --variable_update=ddl --data_name=imagenet --data_dir=/tmp/imagenet --use_fp16=True --lms=True --lms_lb=1 --lms_n_tensors=-1 --image_size=2240 --model=googlenet
    • On x86 platform: python tf_cnn_benchmarks.py --num_batches=500 --num_gpus=4 --batch_size=15 --local_parameter_device=gpu --variable_update=replicated --data_name=imagenet --data_dir=/tmp/imagenet --use_fp16=True --lms=True --lms_lb=1 --lms_n_tensors=-1 --image_size=2240 --model=googlenet
    • NOTE: Replace hostname and data_dir with the hostname of the system and the directory where the TFRecords for the ImageNet data reside, respectively.
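The two run commands above differ only in the launcher and the --variable_update value. The following Python sketch assembles both from one shared set of flags, which can help keep the two runs aligned. The function name benchmark_command and its parameters are hypothetical helpers written for this illustration; the flags themselves are copied from the commands listed in the steps above.

```python
def benchmark_command(platform, hostname="hostname", data_dir="/tmp/imagenet"):
    """Build the benchmark command as a list of arguments.

    `platform` is "ac922" (launched through ddlrun with DDL) or "x86"
    (plain python with replicated variable updates).
    """
    if platform == "ac922":
        launcher = ["ddlrun", "--mpiarg", "-pami_noib", "-H", hostname]
        variable_update = "ddl"
    elif platform == "x86":
        launcher = []
        variable_update = "replicated"
    else:
        raise ValueError(f"unknown platform: {platform}")

    flags = [
        "--num_batches=500",
        "--num_gpus=4",
        "--batch_size=15",
        "--local_parameter_device=gpu",
        f"--variable_update={variable_update}",
        "--data_name=imagenet",
        f"--data_dir={data_dir}",
        "--use_fp16=True",
        # LMS options: swap all reachable tensors to minimize GPU memory use.
        "--lms=True",
        "--lms_lb=1",
        "--lms_n_tensors=-1",
        # Enlarged GoogLeNet on the enlarged ImageNet data.
        "--image_size=2240",
        "--model=googlenet",
    ]
    return launcher + ["python", "tf_cnn_benchmarks.py"] + flags


for name in ("ac922", "x86"):
    print(" ".join(benchmark_command(name)))
```

The resulting list could be passed to a process launcher of your choice; here it is only printed so the two commands can be compared side by side.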
TF-LMS – IBM AC922/V100 Vs x86/V100 – 1 GPU

The following chart shows the throughput comparison of TF-LMS on the AC922 and the x86 server with 1 V100 GPU. With 1 GPU, TF-LMS on the AC922 achieves 2.7x better throughput than the x86 server for the Enlarged GoogLeNet model with a batch size of 15.

TF-LMS modifies the TensorFlow graph to insert swap-in/swap-out nodes, so tensors are swapped in and out as required during training. The AC922 has high-bandwidth CPU-GPU communication through NVLink (maximum of 150GB/s bidirectional between CPU and GPU), compared to PCIe 3.0 (maximum of 32GB/s bidirectional) on the x86 server. This makes swapping data between system memory and GPU memory more efficient, thereby increasing throughput.
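To get a feel for what the two peak link bandwidths mean for swapping, the sketch below computes an idealized transfer time for a given amount of swapped data. The 10 GB figure is a made-up example value, not a measurement, and the function name transfer_seconds is hypothetical. Real sustained bandwidth is lower than the quoted peaks, and end-to-end training throughput also depends on how much of the transfer overlaps with computation, so this is not a prediction of the measured results.

```python
NVLINK2_PEAK_GB_S = 150.0  # peak CPU-GPU figure quoted for the AC922
PCIE3_PEAK_GB_S = 32.0     # peak CPU-GPU figure quoted for the x86 server


def transfer_seconds(gigabytes, peak_gb_per_s):
    """Idealized time to move `gigabytes` of data at the peak link rate."""
    return gigabytes / peak_gb_per_s


swapped_gb = 10.0  # hypothetical amount of tensor data swapped per step

nvlink_time = transfer_seconds(swapped_gb, NVLINK2_PEAK_GB_S)
pcie_time = transfer_seconds(swapped_gb, PCIE3_PEAK_GB_S)

print(f"NVLink 2.0: {nvlink_time:.4f} s")
print(f"PCIe 3.0:   {pcie_time:.4f} s")
print(f"peak bandwidth ratio: {pcie_time / nvlink_time:.2f}x")
```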

    *Note that the results are based on IBM Internal Measurements running 500 steps of Enlarged GoogLeNet model (mini-batch size=15) on Enlarged ImageNet Dataset (2240×2240).

TF-LMS – IBM AC922/V100 Vs x86/V100 – 4 GPU

The following chart shows the throughput comparison of TF-LMS on the AC922 and the x86 server in a multi-GPU scenario with 4 V100 GPUs. To optimize the performance of multi-GPU training on the AC922, we used the PowerAI DDL (Distributed Deep Learning) library integrated with TensorFlow HPM to distribute the training across 4 GPUs. On x86, the regular TensorFlow HPM benchmark’s replicated multi-tower multi-GPU support was used.

The competitive comparison shows that TF-LMS on the AC922 with 4 GPUs, optimized with DDL, achieves 4.7x better throughput than the x86 server for the Enlarged GoogLeNet model with a batch size of 15.

The AC922 has high-bandwidth CPU-GPU communication through NVLink (maximum of 150GB/s between CPU and GPU), compared to PCIe 3.0 (maximum of 32GB/s) on the x86 server. In addition, in the x86 server topology, two consecutive GPUs share the same PCIe switch connecting to the CPU socket, which further reduces the bandwidth.

    *Note that the results are based on IBM Internal Measurements running 500 steps of Enlarged GoogLeNet model (mini-batch size=15) on Enlarged ImageNet Dataset (2240×2240).

Conclusion

Deep Learning is rapidly transforming the way AI is used in real-life applications. Researchers and industry are exploring the possibilities along different dimensions. On one side, deeper and wider models are being introduced to improve training accuracy. On the other side, studies are ongoing to use large datasets for real-life applications, such as cancer detection using high-resolution MRI images with 3D CNNs and Multi-View DCNNs. The IBM AC922 hardware platform with high-speed NVLink 2.0 and the IBM PowerAI optimized software distribution with Large Model Support enable the training of such large models and high-resolution data that do not fit in GPU memory. The results that we showcased demonstrate up to 4.7x improvement in Deep Learning training throughput on the IBM AC922 with 4 NVIDIA Volta GPUs compared to the competitive platform, thereby maximizing the research productivity of data scientists and researchers.
