by Animesh Singh, Scott Boag, Tommy Li, Waldemar Hummer | Updated July 3, 2018 - Published March 21, 2018
Artificial intelligence, Data Science, Cloud
As a deep learning practitioner, you want reliability and scalability when orchestrating your training jobs, and you want both in a consistent manner across multiple libraries. Fabric for Deep Learning (FfDL) on Kubernetes provides this by letting users run deep learning libraries such as Caffe, Torch, and TensorFlow in the cloud in a resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
Training deep neural networks, known as deep learning (a branch of machine learning), is highly complex and computationally intensive. A typical user of deep learning is unnecessarily exposed to the details of the underlying hardware and software infrastructure: configuring expensive GPU machines, installing deep learning libraries, and managing jobs during execution to handle failures and recovery. Although hardware is easy to obtain from IaaS clouds and pay for by the hour, the user still needs to manage those machines, install the required libraries, and ensure the resiliency of the deep learning training jobs.
This is where the opportunity of deep learning as a service lies. In this code pattern, we show you how to deploy and use a deep learning Fabric on Kubernetes, using cloud native architectural artifacts like Kubernetes, microservices, Helm charts, and object storage. This Fabric spans multiple deep learning engines like TensorFlow, Caffe, and PyTorch. It combines the flexibility, ease of use, and economics of a cloud service with the power of deep learning. You’ll find it easy to use, and by using REST APIs, you can customize the training with different resources per user requirements or budget. This lets users focus on deep learning and their applications instead of on handling faults.
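To make the idea of sizing a training job to requirements or budget more concrete, here is a minimal Python sketch of a job description that could be serialized and sent to a training service. Everything in it is an assumption made for illustration: the `TrainingJobSpec` class, its field names, and the `fit_to_gpu_budget` helper are hypothetical and are not the FfDL manifest schema or REST API, which are documented in the README.

```python
import json
from dataclasses import asdict, dataclass, replace

# Engines named in the text above; treating them as the allowed set is an assumption.
SUPPORTED_FRAMEWORKS = {"tensorflow", "caffe", "pytorch"}


@dataclass(frozen=True)
class TrainingJobSpec:
    """Hypothetical job description; field names are illustrative, not FfDL's schema."""
    name: str
    framework: str
    command: str
    cpus: float = 1.0      # CPU cores per learner
    gpus: int = 0          # GPUs per learner
    memory_gb: float = 1.0 # memory per learner
    learners: int = 1      # number of parallel learners

    def validate(self) -> None:
        """Reject specs that name an unknown engine or request impossible resources."""
        if self.framework not in SUPPORTED_FRAMEWORKS:
            raise ValueError(f"unsupported framework: {self.framework}")
        if self.cpus <= 0 or self.memory_gb <= 0:
            raise ValueError("cpus and memory_gb must be positive")
        if self.gpus < 0 or self.learners < 1:
            raise ValueError("gpus must be >= 0 and learners must be >= 1")

    def to_json(self) -> str:
        """Validate, then serialize to a JSON body a REST client could submit."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


def fit_to_gpu_budget(spec: TrainingJobSpec, max_gpus: int) -> TrainingJobSpec:
    """Return a copy of spec whose total GPU count (gpus * learners) fits max_gpus."""
    if max_gpus < 0:
        raise ValueError("max_gpus must be >= 0")
    if spec.gpus == 0:
        return spec  # CPU-only job: nothing to shrink
    if max_gpus < spec.gpus:
        # Budget cannot cover one full learner: run a single, smaller learner.
        return replace(spec, gpus=max_gpus, learners=1)
    # Keep GPUs per learner and drop learners until the total fits.
    return replace(spec, learners=min(spec.learners, max_gpus // spec.gpus))


if __name__ == "__main__":
    job = TrainingJobSpec(
        name="mnist-demo",
        framework="tensorflow",
        command="python3 train.py",
        gpus=2,
        learners=4,
    )
    print(fit_to_gpu_budget(job, max_gpus=4).to_json())
```

In this sketch, the same job definition is reused across budgets: only the resource fields change, while the framework and training command stay the same.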
Find the detailed steps for this pattern in the README.