In our last update, at OSCON 2018 in Portland, we discussed the major updates to Fabric for Deep Learning, or FfDL (pronounced “fiddle”), the open source deep learning platform. We added support for Jupyter Notebooks and for invoking the Adversarial Robustness Toolbox to launch attacks and detect vulnerabilities, and we worked with key community players to integrate their capabilities into FfDL: Seldon for model deployment and serving, and H2O.ai for distributed machine learning via FfDL on top of Kubernetes. That release also added support for PyTorch 0.4.1 distributed training leveraging Uber’s Horovod mechanism.

Distributed training leveraging PyTorch 1.0

Today, we are excited to announce another major update. Fabric for Deep Learning (FfDL) now supports both PyTorch 1.0 and the ONNX model format.

PyTorch is a key part of IBM’s product offerings, and both Watson Studio Deep Learning and IBM PowerAI support it. PowerAI Enterprise with Spectrum Conductor Deep Learning Impact (DLI) will be adding PyTorch 1.0 support, and the PowerAI version of PyTorch includes support for the IBM Distributed Deep Learning (DDL) library for performing training across a cluster.

Additionally, IBM has contributors supporting the open source PyTorch codebase, and we are adding multiarchitecture support in PyTorch by enabling builds for the Power architecture. Other interesting projects have come out of IBM Research as well, such as Large Model Support and an open source framework for seq2seq models in PyTorch.

With Fabric for Deep Learning, we are further investing in PyTorch by adding support for the distributed deep learning training capability found in PyTorch 1.0. With this updated feature set, PyTorch 1.0 takes the flexibility and ease-of-use of the existing PyTorch framework and merges it with the large-scale production capabilities of Caffe2 to give developers a seamless path from research to production.

Fabric for Deep Learning now supports PyTorch 1.0 with its latest distributed learning back end. FfDL can provision the requested number of nodes and GPUs with a shared file system on Kubernetes, which lets each node easily initialize and synchronize with the collective process group. From there, users can synchronize gradients using point-to-point, collective, or multi-GPU collective communication. We also provide several examples that demonstrate how to define the PyTorch process group with different communication back ends and then train the model with distributed data parallelism.
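A minimal sketch of that pattern follows. The model here is a placeholder, and we assume the usual environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that PyTorch’s env:// initialization method reads; FfDL’s examples cover the full setup.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# FfDL injects the cluster topology; here we assume the standard
# environment variables used by PyTorch's env:// init method.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Choose "gloo" for CPU training, "nccl" for GPU, or "mpi" if your
# PyTorch build includes MPI support (see the table below).
dist.init_process_group(backend="gloo", init_method="env://",
                        rank=rank, world_size=world_size)

model = torch.nn.Linear(10, 2)          # placeholder for your model
model = DistributedDataParallel(model)  # gradients now sync automatically

# ... run the usual training loop; each backward() call all-reduces
# gradients across the process group.
```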

We’ve also fully tested FfDL with the new PyTorch distributed training using the GLOO, NCCL, and MPI communication back ends to synchronize the model parameters:

        GLOO    MPI    NCCL
CPU      ✓       ✓      ✗
GPU      ✓       ✓      ✓

Announcing tech preview for ONNX

In addition to supporting PyTorch 1.0, IBM is also active in the ONNX community, and ONNX support is a key feature of PyTorch 1.0. IBM contributed the TensorFlow-ONNX converter, since the format is not yet natively supported in TensorFlow. Fabric for Deep Learning now supports converting PyTorch and TensorFlow models to the ONNX format. We previewed and demonstrated this capability at the Open Source Summit in Vancouver, where it was well received by the community.

To save a model in ONNX format, run your usual training functions and then save the model with the native torch.onnx.export function, much as you would save a regular PyTorch model. This removes the friction of converting models between the different training and serving frameworks in your organization. Once your model is converted to ONNX, you can load it into any serving back end that supports the format and start using it.
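For example, the export itself is a one-liner around torch.onnx.export; the model and input shape below are placeholders for your own trained model:

```python
import torch
import onnx

model = torch.nn.Linear(10, 2)   # placeholder for your trained model
model.eval()

# torch.onnx.export traces the model with a dummy input, so the tensor
# only needs to match the shape your model expects.
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, "model.onnx")

# Optional sanity check before handing the file to a serving back end.
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)
```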

Complete the pipeline: Deploy your ONNX-based models using Seldon with nGraph

And to complete the pipeline, Fabric for Deep Learning integrates with Seldon. Apart from serving PyTorch and TensorFlow models, Seldon recently announced the ability to serve ONNX models with an nGraph back end, which is designed to optimize inference performance on CPUs.

With this, we can craft an end-to-end pipeline that converts FfDL-trained models to ONNX and serves them with Seldon. Furthermore, because FfDL can save trained models to Object Storage using the FlexVolume driver on Kubernetes, we have also improved the Seldon integration to load the saved model directly from the FlexVolume, which saves disk space in the serving image, generalizes the model wrapper definition, and improves scalability.
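Once the Seldon deployment is up, clients can score the model over Seldon Core’s REST prediction API. A minimal sketch follows, assuming a deployment named ffdl-onnx-model exposed through an ingress at localhost:8080; the deployment name, host, and port are illustrative, not part of FfDL itself:

```python
import requests

# Seldon Core's REST prediction endpoint; the deployment name, host,
# and port here are assumptions for illustration.
url = "http://localhost:8080/seldon/ffdl-onnx-model/api/v0.1/predictions"

# Seldon's prediction protocol wraps inputs in a "data" envelope.
payload = {"data": {"ndarray": [[0.1, 0.2, 0.3, 0.4]]}}
response = requests.post(url, json=payload)
print(response.json())  # predictions come back in the same envelope
```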

Get started with PyTorch 1.0, ONNX, and FfDL today

FfDL with PyTorch 1.0 support is now available on GitHub, along with AI Fairness 360, the Adversarial Robustness Toolbox (ART), the Model Asset Exchange (MAX), and other open source AI projects from the Center for Open Source Data and AI Technologies group.

We hope you’ll explore these tools and share your feedback. As with any open source project, its quality is only as good as the contributions it receives from the community.