Previous blogs and videos have discussed tensor swapping with TensorFlow Large Model Support (TFLMS) while running on the IBM Power Systems AC922. Unlike other systems, IBM Power Systems connect their GPUs to their CPUs using high bandwidth NVLink connections. This has been shown to produce substantial speed improvements to model training while using TensorFlow Large Model Support. This blog will discuss the performance characteristics of TensorFlow’s built-in swapping for recurrent neural networks (RNNs).
TensorFlow implements RNNs using the while loop. RNNs can build up many intermediate tensors during the forward phase of the while loop cycle. These tensors live in memory until the backward phase. To conserve GPU memory, the while loop has an option to swap out the tensors to CPU / system memory once the GPU memory reaches 70% utilization. These tensors are then automatically brought back to GPU memory for processing during the backward phase.
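The swap-out and swap-in behavior described above can be pictured with a small toy model. This is only a sketch in plain Python, not TensorFlow's implementation: the class `ToySwapLoop`, its abstract memory units, and the capacity of 100 are all hypothetical, and only the 70% threshold comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ToySwapLoop:
    """Toy model of a loop that parks intermediates in host memory past a threshold."""
    gpu_capacity: int                 # abstract memory units (hypothetical)
    threshold_pct: int = 70           # utilization figure quoted in the text
    gpu: dict = field(default_factory=dict)
    host: dict = field(default_factory=dict)
    gpu_used: int = 0

    def forward(self, step, size):
        # Keep the intermediate on the GPU while utilization stays at or below
        # the threshold; otherwise place it in host memory instead.
        if (self.gpu_used + size) * 100 <= self.threshold_pct * self.gpu_capacity:
            self.gpu[step] = size
            self.gpu_used += size
        else:
            self.host[step] = size

    def backward(self):
        # The backward phase visits steps in reverse order; record whether each
        # intermediate was already on the GPU or has to come back from the host.
        order = []
        for step in sorted(list(self.gpu) + list(self.host), reverse=True):
            origin = "host" if step in self.host else "gpu"
            order.append((step, origin))
        return order


loop = ToySwapLoop(gpu_capacity=100)
for step in range(10):
    loop.forward(step, size=10)

swapped_steps = sorted(loop.host)
backward_order = loop.backward()
print("swapped out:", swapped_steps)
print("backward order:", backward_order)
```

In this toy run the later loop iterations are the ones that land in host memory, and they are also the first ones needed when the backward phase starts.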
The TensorFlow implementation of the Keras APIs always sets the while loop swap_memory option in RNNs and the LSTM child classes. This means that by default, TensorFlow models built using the RNN or LSTM layers will automatically swap tensors to avoid out of memory failures.
To investigate the performance impact of swapping on LSTMs, a simple model was trained on a single GPU of an AC922 with 32GB of memory. To make the model larger and reach the memory utilization threshold that initiates swapping, the number of LSTM units was increased. The Keras summary of the model is:
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 250, 300)          15000000
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 250, 300)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 1000)              5204000
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 9009
=================================================================
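As a sanity check, the parameter counts in the summary can be reproduced with the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 50,000 is not stated in the text; it is inferred here from 15,000,000 / 300 and should be treated as an assumption.

```python
# Shapes read from the model summary.
embed_dim = 300
lstm_units = 1000
num_classes = 9

# Assumption: vocabulary size inferred from 15,000,000 embedding parameters / 300.
vocab_size = 50_000

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Fully connected layer: weights plus one bias per output.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The LSTM term grows with the square of the number of units, which is why increasing the units is an effective way to enlarge the model.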
The batch size was started at 250 and increased by 250 until the model failed to train due to out of memory errors. Here is a graph of the data rate in steps per second over the batch size range:
We observed that the while loop began swapping tensors at batch size 1000. The data rate was continually increasing before swapping kicked in. This is because the GPU can process increasing amounts of data, and the ratio of GPU batch processing time to batch handling overhead is continually increasing. After batch size 1000, the data rate goes through a gradual drop off as batch size and swapping increase.
We used NVIDIA's nvprof profiler to measure the GPU utilization and the amount of data swapped per batch. The result is shown below:
GPU utilization declines as the batch size increases, which matches the decrease in the steps-per-second data rate. The amount of data swapped per batch increases dramatically with batch size, topping out at 111 GB per batch. At batch size 6000, the NVLink connection between the CPU and GPU had an average throughput of 66 GB/s. This high-bandwidth connection helps keep GPU utilization high despite the large amount of data being swapped. Previous experiments comparing model training with tensor swapping on the AC922 server versus PCIe-connected GPUs on an x86 server show that the dedicated NVLink CPU-GPU connections allow the models to train multiple times faster. Similar testing with RNNs and while loop swapping would be expected to show similar speed differences between the platforms.
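A back-of-the-envelope calculation shows why link bandwidth matters at this scale. The sketch below assumes that the 111 GB and 66 GB/s figures describe the same configuration and that transfers do not overlap with computation, neither of which the text confirms. The 16 GB/s figure is a nominal PCIe 3.0 x16 per-direction rate used only for illustration, not a measurement from these experiments.

```python
def transfer_seconds(gigabytes, gigabytes_per_second):
    """Time to move a given amount of data at a fixed average throughput."""
    return gigabytes / gigabytes_per_second


swapped_gb_per_batch = 111.0   # peak swap volume quoted in the text
nvlink_gb_per_s = 66.0         # average NVLink throughput quoted in the text
pcie_gb_per_s = 16.0           # assumption: nominal PCIe 3.0 x16 rate, not measured here

nvlink_time = transfer_seconds(swapped_gb_per_batch, nvlink_gb_per_s)
pcie_time = transfer_seconds(swapped_gb_per_batch, pcie_gb_per_s)

print(f"NVLink: {nvlink_time:.2f} s per batch")
print(f"PCIe (nominal): {pcie_time:.2f} s per batch")
```

Under these assumptions the same swap volume would spend several times longer on the slower link, which is consistent in direction with the platform comparison described above.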
Rather than using the while loop operation in TensorFlow, loops can be unrolled by passing unroll=True to the layer. The Keras documentation states the following about unrolling: “Unrolling can speed-up a RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.” In practice, when the loop is unrolled, TensorFlow does not use the while loop operation, and the underlying model graph becomes very large because the operations in the loop are repeated for every time step. As the Keras documentation states, this is very memory intensive because all of the intermediate tensors remain in memory. TFLMS can be used to add swapping for unrolled loops.
To test the effectiveness of TFLMS with unrolled loops, we specified unroll=True on the LSTM layer of the model and ran with batch size 2500. As expected, model training failed with an out-of-memory error. TFLMS was then enabled on the model and minimally tuned, which allowed the unrolled model to train without out-of-memory errors. Despite not being optimally tuned, the unrolled model with TFLMS swapping trained 20% faster than the model with the while loop. Some of this speed-up can be attributed to TFLMS's ability to schedule swap-in operations earlier in the graph execution, so that execution can continue without waiting for them.
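For readers who want to try the unroll option, here is a minimal sketch of a model with the same layer types as the one above, built once with the default loop and once with unroll=True. It is not the script used for these experiments: the layer sizes are deliberately tiny and hypothetical so it runs quickly on a CPU, the dropout rate and softmax activation are assumptions, and TFLMS itself is not shown.

```python
import numpy as np
import tensorflow as tf


def build_model(unroll):
    # Small, hypothetical sizes so the sketch runs quickly without a GPU.
    # A fixed sequence length is needed for the loop to be unrolled.
    inputs = tf.keras.Input(shape=(20,), dtype="int32")
    x = tf.keras.layers.Embedding(input_dim=100, output_dim=8)(inputs)
    x = tf.keras.layers.SpatialDropout1D(0.2)(x)        # rate is an assumption
    x = tf.keras.layers.LSTM(16, unroll=unroll)(x)      # the option discussed above
    outputs = tf.keras.layers.Dense(9, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


looped = build_model(unroll=False)
unrolled = build_model(unroll=True)

# Random token ids standing in for real input data.
batch = np.random.randint(0, 100, size=(4, 20), dtype=np.int32)
looped_out = looped.predict(batch, verbose=0)
unrolled_out = unrolled.predict(batch, verbose=0)

print(looped.count_params(), unrolled.count_params())
print(looped_out.shape, unrolled_out.shape)
```

Both variants have the same weights and output shape; unrolling changes how the graph is constructed and how much memory it needs, not what the model computes.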
RNNs (and LSTMs) in TensorFlow Keras automatically take advantage of the NVLink connection between the CPU and GPU to train models beyond the constraints of GPU memory without large performance penalties for tensor swapping. TensorFlow Large Model Support can be used with unrolled RNNs to avoid out-of-memory errors while allowing the models to train faster than their default looped versions.