Introduction

Large Model Support (LMS) is a feature provided in IBM Caffe that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out of memory” errors.

LMS makes this possible by oversubscribing GPU memory: tensors are temporarily swapped to host memory when they are not needed.

IBM POWER Systems with NVLink technology are especially well-suited to LMS because their hardware topology enables fast communication between CPU and GPU.

Use cases

One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:

  • Model depth and complexity
  • Base data size (e.g. high-resolution images)
  • Batch size

Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.

With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.

LMS in Action

Let’s take a look at an example using ResNet-152[1] on the ImageNet[2] dataset. ResNet-152 is a deep residual network that requires a significant amount of GPU memory.

In this example, let’s define six scenarios: A, B, C, D, E, and F, with batch sizes of 8, 16, 32, 64, 128, and 256, respectively. We’ll train each scenario on a system with four 16 GB NVIDIA Tesla V100 GPUs.
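
For reference, a scenario’s batch size in Caffe is set in the data layer of the network definition that its solver file references. A minimal, hypothetical sketch of solver-B.prototxt is shown below; the file name and hyperparameter values are illustrative, not those used in this experiment.

# solver-B.prototxt (hypothetical sketch; values are illustrative)
net: "train_val-B.prototxt"   # data layer in this file sets batch_size: 16
base_lr: 0.1
momentum: 0.9
weight_decay: 0.0001
lr_policy: "multistep"
gamma: 0.1
max_iter: 600000
solver_mode: GPU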

Only scenario ‘A’ trains successfully without LMS. The others run into GPU memory limitations. For example, attempting to train scenario ‘B’ yields the following:

$ caffe train -gpu 0,1,2,3 --solver=solver-B.prototxt 
...
I0824 1780 solver.cpp:294] Solving ResNet-152
...
F0824 1780 syncedmem.cpp:569] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
Aborted (core dumped)
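
Here, CUDA error 2 (cudaErrorMemoryAllocation) indicates that a GPU memory allocation failed.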


To avoid the memory limitation, simply enable Large Model Support by including -lms on the command line:

$ caffe train -gpu 0,1,2,3 --solver=solver-B.prototxt -lms
...
I0828 3878 solver.cpp:294] Solving ResNet-152
...
I0828 3878 caffe.cpp:421] Optimization Done.


Enabling LMS allows all of our scenarios to successfully complete training without any modification to the model itself.

Performance Considerations

Before digging further into the results of running with LMS, let’s take a look at a useful tunable: the -lms_exclude command-line option.

This option allows the user to define a soft limit on the GPU memory allocated for LMS tensors, where limit = GPU capacity - <user-specified value in MB>.
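
For example, on a 16 GB (16,384 MB) GPU, -lms_exclude 5120 yields a soft limit of 16384 - 5120 = 11264 MB of GPU memory for LMS tensors.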

By default, LMS favors GPU memory reuse (moving inactive tensors to host memory) over new allocations. This effectively minimizes GPU memory consumption.

However, when a limit is defined via -lms_exclude, the algorithm favors allocating GPU memory up to that limit before swapping any tensors out to host memory. This gives the user control over the amount of GPU memory consumed when using LMS.

Tuning this option to optimize GPU memory utilization can therefore reduce data transfers and improve performance. Since the ideal tuning for any given scenario may differ, it is best practice to determine the value of -lms_exclude experimentally, arriving at the smallest value that does not result in an out-of-memory error, as sketched below.
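
As a rough illustration, this search can be scripted. The following shell sketch (the candidate values and solver file name are assumptions, not recommendations) probes candidates from largest to smallest and keeps the smallest value that still completes training:

# Hypothetical tuning sketch: try -lms_exclude candidates (in MB) from
# largest to smallest; the smallest value that still trains wins.
best=""
for mb in 8192 6144 5120 4096 2048 1024 1; do
  if caffe train -gpu 0,1,2,3 --solver=solver-B.prototxt -lms -lms_exclude $mb; then
    best=$mb    # training completed; try excluding less GPU memory next
  else
    break       # out of memory; the previous value was the smallest safe one
  fi
done
echo "Smallest working -lms_exclude: $best MB"

In practice, each probe run should use a solver with a small max_iter so that the search completes quickly.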

Results

With this in mind, let us now examine the results from our example scenarios, looking at three variations:

Variation | Description                  | Command
Off       | LMS off                      | caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt
Default   | LMS on                       | caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt -lms
Tuned     | LMS on, with -lms_exclude[3] | caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt -lms -lms_exclude <value>


Let’s look first at GPU memory utilization:

Without LMS, training fails due to memory exhaustion in scenarios B through F, as indicated by the Xs. Note the difference in memory use between the default and tuned variations, and that this difference is most pronounced in the scenarios with lower memory demands.
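
One simple way to observe this kind of memory behavior yourself (assuming nvidia-smi is available on the system) is to sample GPU memory usage once per second while training runs:

$ nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1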

Now let’s examine training performance. The figure below plots the training duration of a fixed number of training iterations. The values are normalized to show relative performance compared to the default LMS variation in each scenario:

Again, observe that tuning is most effective for the ‘smaller’ models. Transferring tensors between host and GPU memory does have a performance cost. Because smaller models that nearly fit in GPU memory have less need for those transfers, they benefit most from tuning. Larger models necessarily force more data to be transferred and therefore benefit less.

This leads us to an important point concerning performance and LMS. LMS allows training of large models that would not otherwise be possible using GPU memory alone. The data transfers LMS performs come at a cost, however, and its performance will depend on the interconnect between GPU, CPU, and system memory.

As stated before, IBM POWER Systems with NVLink technology are especially well-suited to LMS because of their hardware topology. Specifically, the NVLink 2.0 connections allow 150 GB/s of communication in each direction between CPU and GPU, compared to the 32 GB/s of PCIe Gen3 in traditionally connected GPUs. See the article referenced in the Further Reading section for a detailed analysis of this point.

Conclusion

LMS allows you to increase the accuracy of your deep learning workloads by preserving the fidelity of your data, and it does so without any change to your model’s network architecture.

Have you hit memory limits with your models? You don’t have to anymore. Try LMS today in the latest PowerAI release!

Further Reading

While increasing batch size is a simple way to demonstrate the mechanics of LMS, it is admittedly not the most compelling use case. For a real-world case study using LMS, along with a detailed analysis of the benefits of IBM POWER Systems with NVLink technology specific to LMS, see TensorFlow Large Model Support Case Study with 3D Image Segmentation.

See also Getting started with Caffe in the IBM Knowledge Center for more information on PowerAI Caffe, including additional optimizations and enhancements from IBM.


[1] ResNet-152 example model based on https://github.com/antingshen/resnet-protofiles
[2] ImageNet image database: http://image-net.org/
[3] Tuned -lms_exclude values, determined experimentally: 1 (scenarios A, B), 5120 (scenarios C, D, E), 6144 (scenario F).
