IBM Watson Machine Learning Community Edition (WML CE) 1.7.0 contains a new implementation of TensorFlow Large Model Support.
For readers unfamiliar with TensorFlow Large Model Support (TFLMS, or simply LMS), it is a feature that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out-of-memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed. With LMS, deep learning models can scale significantly beyond what was previously possible, for example to larger networks or higher-resolution inputs, which can ultimately produce more accurate results.
Why a new TFLMS?
WML CE 1.7.0 is the first release to contain TensorFlow 2.x. In version 2, TensorFlow made “eager execution” the default execution mode for models. Previous releases of TensorFlow built up a “whole graph view” of the neural network and then executed the graph, an approach that made it difficult for model authors to write and debug their models. Previous releases of TFLMS relied on that “whole graph view” of the model to make intelligent decisions about when to swap tensors. The easier-to-use eager execution mode does not provide a whole graph view up front, so TensorFlow 2 in WML CE 1.7.0 requires a new implementation of Large Model Support (LMS).
Built-in, simple to enable, self-learning
LMS for TensorFlow 2 is built directly into TensorFlow’s GPU memory management. Because LMS is built in, there are no additional modules to install, and a simple on/off TensorFlow API enables it. The integration into TensorFlow’s GPU memory management allows the new implementation to avoid out-of-memory conditions that the static graph analysis of previous releases could not. Neural networks typically follow the same set of steps for each batch of data, which leads to predictable memory allocation patterns. LMS uses this behavior to learn the model’s memory allocation pattern over the course of a few iterations. As LMS learns the pattern, it can preemptively move inactive tensors to system memory to free GPU memory before it is needed. Because memory is freed ahead of demand rather than at the moment an allocation would otherwise fail, the self-learning and speculative swapping functionality increases the speed of training and inference jobs that oversubscribe GPU memory.
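The idea of learning an allocation pattern and swapping ahead of demand can be illustrated with a small, pure-Python toy model. This is only a sketch under stated assumptions, not IBM's implementation and not a TensorFlow API: the ToyGpu class, the run function, the "act" tensor names, and the unit tensor sizes are all hypothetical, eviction is plain least-recently-used, and only the cost of swap-outs that block an allocation is counted (swap-in latency is ignored).

```python
from collections import OrderedDict


class ToyGpu:
    """Hypothetical GPU memory pool; tensors are names with integer sizes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # name -> size, least recently used first
        self.stalls = 0                # allocations that had to wait for a swap-out

    def free(self):
        return self.capacity - sum(self.resident.values())

    def make_room(self, size):
        """Swap least recently used tensors out until `size` units are free.

        Assumes size <= capacity. Returns how many tensors were swapped out.
        """
        swapped = 0
        while self.free() < size:
            self.resident.popitem(last=False)  # toy swap-out to host memory
            swapped += 1
        return swapped

    def touch(self, name, size):
        """Use a tensor, allocating it or swapping it in if it is not resident."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        if self.make_room(size) > 0:
            self.stalls += 1  # the swap-out happened on the critical path
        self.resident[name] = size


def run(trace, capacity, iterations, learn):
    """Replay `trace` once per iteration; return the stall count of each one."""
    gpu = ToyGpu(capacity)
    pattern = None
    per_iteration = []
    for _ in range(iterations):
        observed = []
        before = gpu.stalls
        for i, (name, size) in enumerate(trace):
            gpu.touch(name, size)
            observed.append((name, size))
            if pattern is not None:
                # Speculative step: free space for the next predicted request
                # now, modelling a swap-out that overlaps with computation.
                next_name, next_size = pattern[(i + 1) % len(pattern)]
                if next_name not in gpu.resident:
                    gpu.make_room(next_size)
        per_iteration.append(gpu.stalls - before)
        if learn and pattern is None:
            pattern = observed  # learned from the first observed iteration
    return per_iteration


# Forward pass creates six activations; backward pass uses them in reverse.
layers = [("act%d" % i, 1) for i in range(6)]
trace = layers + layers[::-1]

print(run(trace, capacity=3, iterations=3, learn=False))  # [6, 6, 6]
print(run(trace, capacity=3, iterations=3, learn=True))   # [6, 0, 0]
```

In this toy model the purely reactive pool stalls six times in every iteration, while the learning pool stalls only during the first iteration it observes. The same number of tensors still moves to host memory in both cases; the difference is that the learned pattern lets the swap-outs happen before the space is requested.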
TensorFlow Large Model Support in IBM Watson Machine Learning Community Edition 1.7.0 provides an easy-to-use way to avoid out-of-memory errors and to scale deep learning models beyond GPU memory capacity. IBM Watson Machine Learning Community Edition 1.7.0 is available now.