Fabric for Deep Learning (FfDL)

Deep learning frameworks such as TensorFlow, PyTorch, Caffe, Torch, Theano, and MXNet have contributed to the popularity of deep learning by reducing the effort and skills needed to design, train, and use deep learning models. Fabric for Deep Learning (FfDL, pronounced “fiddle”) provides a consistent way to run these frameworks as a service on Kubernetes.

The FfDL platform uses a microservices architecture to reduce coupling between components, keep each component simple and as stateless as possible, isolate component failures, and allow each component to be developed, tested, deployed, scaled, and upgraded independently. Built on Kubernetes, FfDL provides a scalable, resilient, and fault-tolerant platform for deep learning.

The platform uses a distribution and orchestration layer that spreads training across compute nodes, so that models can learn from a large amount of data in a reasonable amount of time. A resource provisioning layer enables flexible job management on heterogeneous resources, such as GPUs and CPUs, on top of Kubernetes.
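
To make the idea of job management on heterogeneous resources concrete, here is a toy Python sketch of first-fit placement of jobs onto CPU and GPU nodes. It is an illustration only: the `Node`, `Job`, and `place_jobs` names are hypothetical, and the logic is neither FfDL's provisioning layer nor the Kubernetes scheduler, which weigh many more factors.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical compute node with a count of free CPUs and GPUs."""
    name: str
    free_cpus: int
    free_gpus: int


@dataclass
class Job:
    """Hypothetical training job with its resource request."""
    name: str
    cpus: int
    gpus: int = 0


def place_jobs(jobs: List[Job], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Assign each job to the first node that can satisfy its request.

    Returns a mapping from job name to node name, or None when no node
    currently has enough free CPUs and GPUs for the job.
    """
    placement: Dict[str, Optional[str]] = {}
    for job in jobs:
        placement[job.name] = None
        for node in nodes:
            if node.free_cpus >= job.cpus and node.free_gpus >= job.gpus:
                # Reserve the resources so later jobs see the reduced capacity.
                node.free_cpus -= job.cpus
                node.free_gpus -= job.gpus
                placement[job.name] = node.name
                break
    return placement


if __name__ == "__main__":
    nodes = [Node("cpu-node", free_cpus=8, free_gpus=0),
             Node("gpu-node", free_cpus=4, free_gpus=2)]
    jobs = [Job("train-gpu", cpus=2, gpus=1),
            Job("preprocess", cpus=4),
            Job("big-gpu", cpus=2, gpus=4)]
    print(place_jobs(jobs, nodes))
```

In this sketch a job that requests more GPUs than any node has free is left unplaced rather than rejected, which mirrors the general idea of a job waiting until resources become available.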

The diagram below shows the basic architecture of the Fabric for Deep Learning project.

FfDL project diagram

FfDL uses REST APIs to access multiple deep learning libraries. FfDL is based on a microservices architecture, and users can deploy it by running a single command or by following detailed instructions that use Helm charts to walk through each step of the deployment process.

Once FfDL is deployed, users perform four steps to use it:

  1. Prepare the deep learning model
  2. Upload the model and training data
  3. Start the training job and monitor its progress
  4. Download the trained model once the training job is complete
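
The four steps above can be sketched as a small Python program. This is a minimal illustration of the job lifecycle, not the FfDL API: `FakeTrainingService`, its methods, the `JobStatus` values, and `run_workflow` are all hypothetical names, and the service is an in-memory stand-in so that the example runs without a cluster.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class JobStatus(Enum):
    """Hypothetical job states; the real platform may use different ones."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"


@dataclass
class TrainingJob:
    job_id: str
    framework: str
    status: JobStatus = JobStatus.PENDING
    polls: int = 0


class FakeTrainingService:
    """In-memory stand-in for a training service (an assumption, not FfDL)."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._jobs: Dict[str, TrainingJob] = {}
        self._models: Dict[str, bytes] = {}
        self._polls_until_done = polls_until_done

    def upload(self, job_id: str, framework: str,
               model_definition: bytes, training_data: bytes) -> None:
        # Step 2: store the model definition; the data is accepted but unused here.
        self._models[job_id] = model_definition
        self._jobs[job_id] = TrainingJob(job_id, framework)

    def start(self, job_id: str) -> None:
        # Step 3a: mark the job as running.
        self._jobs[job_id].status = JobStatus.RUNNING

    def status(self, job_id: str) -> JobStatus:
        # Step 3b: each poll advances the fake job toward completion.
        job = self._jobs[job_id]
        if job.status is JobStatus.RUNNING:
            job.polls += 1
            if job.polls >= self._polls_until_done:
                job.status = JobStatus.COMPLETED
        return job.status

    def download(self, job_id: str) -> bytes:
        # Step 4: only a completed job has a trained model to return.
        if self._jobs[job_id].status is not JobStatus.COMPLETED:
            raise RuntimeError("training job is not complete")
        return b"trained:" + self._models[job_id]


def run_workflow(service: FakeTrainingService, job_id: str, framework: str,
                 model_definition: bytes, training_data: bytes,
                 max_polls: int = 10) -> bytes:
    """Upload, start, monitor, and download, in that order."""
    service.upload(job_id, framework, model_definition, training_data)
    service.start(job_id)
    for _ in range(max_polls):
        if service.status(job_id) is JobStatus.COMPLETED:
            return service.download(job_id)
    raise TimeoutError("job did not finish within max_polls status checks")


if __name__ == "__main__":
    # Step 1 (preparing the model) is represented by the bytes passed in.
    svc = FakeTrainingService(polls_until_done=3)
    print(run_workflow(svc, "job-1", "tensorflow", b"model-def", b"data"))
```

A real client would replace the fake service with calls to the deployed platform and would wait between status checks instead of polling in a tight loop.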

Why FfDL?

Efficiently training large neural network models in scalable cloud infrastructures is increasingly important as the use of machine learning and deep learning continues to grow. Machine learning workloads have traditionally been run in high-performance computing (HPC) environments, where users log in to dedicated machines and use the attached GPUs to run jobs that train models on huge datasets. Providing a similar user experience in a multitenant cloud environment presents challenges in fault tolerance, performance, and security. The FfDL project tackles these challenges by offering a deep learning stack that was specifically designed for on-demand cloud environments.

With FfDL, you can choose the deep learning framework that your developers are most comfortable with and run it on the same scalable, resilient, and fault-tolerant Kubernetes-based platform.

Why should I contribute?

We want to provide developers with a common deep learning platform across different cloud platforms. If you want to simplify machine learning development for users on your cloud platform or within your organization, you should contribute to FfDL by testing and adding support for your Kubernetes offerings, as well as for deep learning engines that are not yet supported. By contributing to FfDL, you’ll gain a better understanding of the technologies and infrastructure required for successfully developing, deploying, and managing microservices.