What is Large Model Support?
IBM Caffe with Large Model Support (LMS) loads the neural model and data set in system memory and caches activity to GPU memory, allowing models and training batch size to scale significantly beyond what was previously possible.
You can enable LMS by adding the flag -lms <size in KB> to the caffe command line, for example -lms 1000. Any memory chunk larger than 1000 KB is then kept in CPU memory and fetched to GPU memory only when it is needed for computation. The value controls the performance trade-off: a very large value such as -lms 10000000000 effectively disables the feature, while a smaller value makes LMS more aggressive.
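As a sketch of how the flag fits into a training run (the solver file name and GPU index here are placeholders you would replace with your own; the exact command line depends on your IBM Caffe installation):

```shell
# Sketch only: assumes an IBM Caffe build with LMS and a solver.prototxt you supply.
# With -lms 1000, memory chunks larger than 1000 KB stay in CPU memory and are
# fetched to GPU memory only when needed for computation.
caffe train -solver solver.prototxt -gpu 0 -lms 1000
```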
LMS uses system memory and GPU memory to support more complex and higher resolution data.
TensorFlow Large Model Support (TLMS) provides an approach to training large models, batch sizes, and data sizes that cannot fit into GPU memory. It achieves this by automatically moving tensor data between GPU and system memory. TensorFlow Large Model Support is currently available as a technology preview. For more information on how to enable TensorFlow Large Model Support, start with the README. If you are using TLMS with PowerAI and need additional information, see the PowerAI README.
PyTorch Large Model Support (LMS) is a technology preview provided in the latest release of PowerAI PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with "out of memory" errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
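The swapping idea common to these LMS variants can be illustrated with a small sketch (this is a toy model of the policy, not IBM's implementation; the class and names here are hypothetical): tensors live in host memory by default, and when one is needed on a fixed-capacity "GPU", the least recently used resident tensors are swapped back out to make room.

```python
# Toy illustration of Large Model Support's swapping policy (not IBM's code):
# a simulated GPU of fixed capacity, with LRU eviction to host memory.
from collections import OrderedDict

class SwappingAllocator:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # name -> size, in least-recently-used order
        self.host = {}             # name -> size

    def allocate(self, name, size):
        # New tensors start in host memory, as in the LMS approach.
        self.host[name] = size

    def touch(self, name):
        """Bring a tensor into GPU memory for computation, evicting LRU tensors."""
        size = self.host.pop(name, None)
        if size is None:
            size = self.gpu.pop(name)  # already resident; refresh its LRU position
        while self.gpu and sum(self.gpu.values()) + size > self.gpu_capacity:
            evicted, evicted_size = self.gpu.popitem(last=False)
            self.host[evicted] = evicted_size  # swap out to host memory
        self.gpu[name] = size

alloc = SwappingAllocator(gpu_capacity=100)
alloc.allocate("a", 60)
alloc.allocate("b", 60)
alloc.touch("a")   # "a" becomes resident on the simulated GPU
alloc.touch("b")   # "a" is evicted to host memory to make room for "b"
print(sorted(alloc.gpu), sorted(alloc.host))  # ['b'] ['a']
```

The real implementations decide what to swap automatically from the computation graph rather than from access order, but the capacity-driven host/GPU movement is the same basic mechanism.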
See the "Getting started with PyTorch" topic in the IBM Knowledge Center for more information.