As a deep learning practitioner, you want reliability and scalability when orchestrating your training jobs, and you want both to work consistently across multiple libraries. Fabric for Deep Learning (FfDL) on Kubernetes provides this: it lets users run deep learning libraries such as Caffe, Torch, and TensorFlow in the cloud in a resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
Training deep neural networks, known as deep learning (a branch of machine learning), is highly complex and computationally intensive. A typical user of deep learning is unnecessarily exposed to the details of the underlying hardware and software infrastructure, including configuring expensive GPU machines, installing deep learning libraries, and managing jobs during execution to handle failures and recovery. Although it is easy to obtain hardware from IaaS clouds and pay by the hour, the user still needs to manage those machines, install the required libraries, and ensure the resiliency of the deep learning training jobs.
This is where the opportunity of deep learning as a service lies. In this code pattern, we show you how to deploy and use a deep learning Fabric on Kubernetes, built from cloud native architectural artifacts like Kubernetes, microservices, Helm charts, and object storage. This Fabric spans multiple deep learning engines like TensorFlow, Caffe, and PyTorch. It combines the flexibility, ease of use, and economics of a cloud service with the power of deep learning. It is easy to use, and through its REST APIs you can customize training with different resources to match user requirements or budget. This lets users focus on deep learning and their applications instead of on handling faults.
- The FfDL deployer deploys the FfDL code base to a Kubernetes cluster. The Kubernetes cluster is configured to use GPUs, CPUs, or both, and has access to S3-compatible object storage. If no object storage is specified, a locally simulated S3 pod is created.
- Once deployed, the data scientist uploads the model training data to the S3-compatible object store. FfDL assumes the data is already in the required format as prescribed by different deep learning frameworks.
- The user creates an FfDL model manifest file. The manifest file contains fields that describe the model in FfDL, its object store information, its resource requirements, and several arguments (including hyperparameters) that are required for model execution during training and testing. The user then interacts with FfDL through the CLI, SDK, or UI to deploy the FfDL model manifest file together with a model definition file. The user launches the training job and monitors its progress.
- The user downloads the trained model and associated logs once the training job is complete.
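To make the manifest step in the flow above more concrete, here is a minimal Python sketch of the kinds of information the manifest carries: a model description, object store information, resource requirements, and training arguments such as hyperparameters. Every key name, value, and helper function below is a hypothetical choice made for illustration; this is not FfDL's actual manifest schema, so check the FfDL documentation for the real field names and file format.

```python
import json

# NOTE: all section names here are hypothetical, chosen for this sketch only.
# They mirror the categories the text lists, not FfDL's real manifest schema.
REQUIRED_SECTIONS = ("name", "framework", "resources", "data_store", "arguments")


def build_manifest(name, framework, command, resources, data_store, arguments):
    """Collect the model description, resources, storage, and arguments in one dict."""
    return {
        "name": name,
        "framework": framework,
        "command": command,
        "resources": dict(resources),
        "data_store": dict(data_store),
        "arguments": dict(arguments),
    }


def missing_sections(manifest):
    """Return the required sections that are absent or empty."""
    return [key for key in REQUIRED_SECTIONS if not manifest.get(key)]


manifest = build_manifest(
    name="example-convnet",
    framework="tensorflow",
    command="python train.py",  # hypothetical training entry point
    resources={"gpus": 1, "cpus": 4, "memory": "8Gb"},
    data_store={
        "type": "s3",
        "training_data_bucket": "example-training-data",      # hypothetical bucket
        "training_results_bucket": "example-training-results",  # hypothetical bucket
    },
    arguments={"learning_rate": 0.001, "batch_size": 64},  # hyperparameters
)

# A manifest with only a name is flagged as incomplete.
incomplete = {"name": "example-convnet"}

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print("Missing in incomplete manifest:", missing_sections(incomplete))
```

Checking for missing sections before submitting a job is a design choice of this sketch: a training job that fails late because of an absent object store entry wastes GPU time, so a cheap local check can catch the problem first.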
Find the detailed steps for this pattern in the README. The steps will show you how to:
- Compile the code and build Docker images.
- Install the FfDL components with helm install.
- Run a script to configure Grafana for monitoring FfDL.
- Obtain your Grafana, FfDL Web UI, and FfDL REST API endpoints.
- Run some simple jobs to train a convolutional network model by using TensorFlow and Caffe.