In PowerAI 1.6, the TensorFlow Large Model Support (TFLMS) module has a new implementation and has graduated from tech preview status. The new implementation can achieve much higher levels of swapping, which in turn enables training and inference with higher-resolution data, deeper models, and larger batch sizes. For a review of TFLMS and the performance advantages of running it on IBM Power systems, see these articles:

- TensorFlow Large Model Support Case Study with 3D Image Segmentation
- Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support

The new implementation addresses several usability issues from the old version and introduces many new features that we will discuss:

- Easier to enable in model code
- Automatic tuning of swapping parameters and faster graph modification times
- More swapping allows higher resolution
- Finer tuning of asynchronous compute and memory transfer
- Serialization of operations in layers
- More model and tensor information output
- Simulated GPU memory usage graphs

## Easier to enable in model code

The original implementation of TFLMS (TFLMSv1) required model developers to put their optimizer operation declarations within a named scope when using base TensorFlow APIs or the Estimator API, and then pass this name into TFLMS. TFLMSv2 removes this requirement and stumbling block. Enabling TFLMS in a model can now be as simple as importing TFLMS, instantiating the class, and either passing it in as a callback or calling the `run()` method.
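As a minimal sketch of the callback route with Keras, the snippet below builds a callback list that includes TFLMS only when it is installed. The module path `tensorflow_large_model_support`, the no-argument `LMS()` constructor, and the helper name `lms_callbacks` are assumptions made for illustration; check them against the TFLMS documentation for your release.

```python
# Assumed import path; adjust it to match your PowerAI installation.
try:
    from tensorflow_large_model_support import LMS
except ImportError:
    LMS = None  # TFLMS is not installed; training proceeds without swapping


def lms_callbacks():
    """Return a Keras callback list that enables TFLMS when it is available."""
    if LMS is None:
        return []
    # Assumption: constructing LMS with no arguments lets TFLMSv2 auto-tune.
    return [LMS()]


callbacks = lms_callbacks()
# Hypothetical Keras usage, where `model` is an already compiled model:
# model.fit(x_train, y_train, callbacks=callbacks)
```

Guarding the import keeps the same training script usable on systems where TFLMS is not present.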

## Automatic tuning of swapping parameters and faster graph modification times

The previous version of TFLMS required a lot of time-consuming manual tuning of the swapping control parameters. TFLMSv2 shortens the time to training by automatically tuning its parameters to minimize the amount of swapping required to avoid out-of-memory errors. It achieves this by running a quick simulated iteration through the model, keeping track of tensor memory allocations and garbage collections. TFLMSv2 tuning then uses binary searches with repeated simulations to quickly find the optimal parameter combination. Once the optimal set of values is found for a given model and batch size, the values can be set directly on the LMS constructor to save the auto-tune time on future runs.
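The idea of combining a memory simulation with a binary search can be sketched with a toy model. This is not the TFLMS simulator: the `Tensor` record, the rule that a tensor is swapped when the distance between its producer and last consumer exceeds the threshold, and the function names are all assumptions chosen to keep the example small.

```python
from typing import List, NamedTuple, Optional


class Tensor(NamedTuple):
    produced: int    # graph level where the tensor is created
    consumed: int    # graph level of its last consumer
    size_gib: float  # tensor size in GiB


def peak_memory(tensors: List[Tensor], threshold: int, levels: int) -> float:
    """Simulate one iteration and return the peak memory across all levels."""
    usage = [0.0] * levels
    for t in tensors:
        if t.consumed - t.produced > threshold:
            # Swapped: resident only where it is produced and where it is consumed.
            usage[t.produced] += t.size_gib
            usage[t.consumed] += t.size_gib
        else:
            # Not swapped: resident from production through last use.
            for level in range(t.produced, t.consumed + 1):
                usage[level] += t.size_gib
    return max(usage)


def tune_swapout_threshold(tensors: List[Tensor], levels: int,
                           budget_gib: float) -> Optional[int]:
    """Binary-search the largest threshold (least swapping) that fits the budget."""
    lo, hi, best = 1, levels, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if peak_memory(tensors, mid, levels) <= budget_gib:
            best, lo = mid, mid + 1  # fits: try swapping less
        else:
            hi = mid - 1             # does not fit: swap more
    return best


# Ten 1 GiB activations: produced in the forward pass (levels 0-9) and
# consumed in reverse order during the backward pass (levels 19-10).
tensors = [Tensor(i, 19 - i, 1.0) for i in range(10)]
print(tune_swapout_threshold(tensors, levels=20, budget_gib=4.0))  # prints 8
```

The search works because peak memory never decreases as the threshold grows in this toy model, so each simulation halves the remaining range.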

In addition to tuned swapping control parameters, the time required for modifying the graph has been greatly reduced. Example graph modification times and automatic tuning times for a deep 3D U-Net model using a 192^3 image on a 16GB V100 GPU are shown in this table:

| | TFLMSv1 | TFLMSv2 |
|---|---|---|
| Graph modification / startup time | 4.4 minutes | 18 seconds |
| LMS parameter tuning | Hour(s)? | 6.3 minutes |

## More swapping allows higher resolution

The new TFLMSv2 implementation can add more swapping nodes than the TFLMSv1 implementation. This allows training with higher resolution data. Here is the comparison of resolution increases measured with TFLMSv1 and TFLMSv2:

| Resolution increase factors | TFLMSv1 | TFLMSv2 |
|---|---|---|
| 2D models (such as GoogleNet, ResNet50) | 5x | 10x |
| 3D models (such as 3D U-Net) | 2.5x | 5x |

## Finer tuning of asynchronous compute and memory transfer

While training models with very large tensors, out-of-memory errors can occur because the output tensors of previous operations are not fully swapped out before the current operation allocates memory for its output tensor. TFLMSv2 gives finer control over whether the memory transfers used for swapping run asynchronously or synchronously with compute. This control allows the data scientist to synchronize compute and transfer when necessary to achieve deeper models, higher resolutions, and larger batch sizes.
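The effect can be illustrated with a small timeline simulation. It is a toy model, not TFLMS behavior: it assumes a simple chain of operations, a single transfer channel, and that each operation's input is swapped out once the operation finishes. The function name and parameters are hypothetical.

```python
from typing import List


def simulate_peak(sizes_gib: List[float], compute_time: float,
                  transfer_rate: float, synchronous: bool) -> float:
    """Return peak GPU memory (GiB) for a chain of operations.

    Operation k allocates sizes_gib[k] when it starts. After it finishes,
    the output of operation k-1 is swapped out at transfer_rate GiB per
    time unit and is freed only when that transfer completes.
    """
    events = []         # (time, memory delta) pairs
    now = 0.0
    channel_free = 0.0  # time at which the last swap-out completes
    for k, size in enumerate(sizes_gib):
        if synchronous:
            # Wait for outstanding swap-outs before allocating new output.
            now = max(now, channel_free)
        events.append((now, size))
        now += compute_time
        if k > 0:
            start = max(now, channel_free)
            channel_free = start + sizes_gib[k - 1] / transfer_rate
            events.append((channel_free, -sizes_gib[k - 1]))
    peak = current = 0.0
    # Sorting places a free before an allocation that happens at the same time.
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return peak


sizes = [4.0, 4.0, 4.0, 4.0]
print(simulate_peak(sizes, 1.0, 2.0, synchronous=False))  # prints 16.0
print(simulate_peak(sizes, 1.0, 2.0, synchronous=True))   # prints 8.0
```

In this sketch the asynchronous run finishes its compute sooner but lets allocations pile up while transfers lag behind, whereas the synchronous run trades time for a lower memory peak.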

## Serialization of operations in layers

TensorFlow runs model operations in each layer in parallel. This can cause out-of-memory errors if the operations in the layer produce large tensors that cannot co-reside in GPU memory. TFLMSv2 addresses this limitation by enabling the data scientist to serialize all operations in selected layers of the model. Operations that produce very large tensors can then run one at a time while the rest of the model operations continue to run in parallel, which in turn allows larger models and higher resolutions.

## More model and tensor information output

TFLMSv2 makes models easier to understand and tune by producing more information about the model graph and its operations during analysis and modification. For example, the reported size of the largest operation allows the data scientist to gauge the size of the model and how much swapping will be performed. Other example information is shown below:

```
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] The graph has 14678 vertices and 18812 edges.
INFO:tensorflow:[LMS][0] The graph has 179.18 MiB of learning parameters
INFO:tensorflow:[LMS][0] The largest operation is training/RMSprop/gradients/bn2c_branch2c/cond/FusedBatchNorm_grad/FusedBatchNormGrad consuming 10.06 GiB
INFO:tensorflow:[LMS][0] Original categorized topological sort has 742 levels
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 1, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 185, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 278, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 324, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 347, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 353, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 356, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 357, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 358, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 358, swapin_ahead 1, swapin_groupby 0, sync_mode 0
INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: swapout_threshold 358, swapin_ahead 1, swapin_groupby 742, sync_mode 0
INFO:tensorflow:[LMS][0] Added 292 operations to the model (146 swap-out operations (108.89 GiB) and 146 swap-in operations (108.89 GiB))
```
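Because the tuned values can be reused on later runs, it can be handy to pull the final parameter set out of a saved log. The sketch below assumes the log lines have exactly the format shown above; the function name `last_parameter_set` is hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Matches the "[Simulator] Found a parameter set: ..." lines shown above.
_PARAM_SET = re.compile(r"\[Simulator\] Found a parameter set: (.*)$")


def last_parameter_set(log_lines: Iterable[str]) -> Optional[Dict[str, int]]:
    """Return the last parameter set reported by the simulator, if any."""
    result = None
    for line in log_lines:
        match = _PARAM_SET.search(line.strip())
        if match:
            # Each entry looks like "swapout_threshold 358".
            pairs = (entry.split() for entry in match.group(1).split(", "))
            result = {name: int(value) for name, value in pairs}
    return result


sample = [
    "INFO:tensorflow:[LMS][0] Original categorized topological sort has 742 levels",
    "INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: "
    "swapout_threshold 358, swapin_ahead 1, swapin_groupby 0, sync_mode 0",
    "INFO:tensorflow:[LMS][0] [Simulator] Found a parameter set: "
    "swapout_threshold 358, swapin_ahead 1, swapin_groupby 742, sync_mode 0",
]
params = last_parameter_set(sample)
print(params)
```

If the LMS constructor accepts keyword arguments with these same names, which is an assumption to verify in the TFLMS documentation, the resulting dictionary could be passed along to skip auto-tuning on later runs.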

## Simulated GPU memory usage graphs

The TFLMS simulator used for auto-tuning can also produce a graph of expected GPU memory utilization over the course of an iteration. These graphs provide useful memory usage insights both with and without LMS enabled. For example, the simulator generated two expected memory usage graphs for a 32GB GPU with a TensorFlow Keras ResNet50 model and an image resolution of 7500×7500, one with and one without TFLMSv2 enabled.

These graphs show that, without TFLMS, this particular model is expected to use 140 GiB of GPU memory at its peak. After the TFLMS graph modifications, the expected memory allocation peak drops to around 25 GiB, which fits within a 32GB GPU.

For more information about these features see the TensorFlow Large Model Support documentation.