This post is co-authored by Animesh Singh from IBM and Alex Sergeev from Uber.


Fabric for Deep Learning, or FfDL (pronounced “fiddle”), the open source deep learning platform, has had a major release update. In addition to adding key features around object storage access using S3FS (leveraging another open source IBM project, the Kubernetes Object Storage Plugin), the new release adds the ability to launch model training on FfDL from a Jupyter Notebook. From a notebook, we can invoke the Adversarial Robustness Toolbox to launch attacks and detect vulnerabilities in trained models stored in the FfDL object store. We also worked with key community players to integrate their capabilities into FfDL: Seldon for model deployment and serving, and H2O.ai for enabling distributed machine learning capabilities via FfDL on top of Kubernetes.

To enable developers to easily get started with these new updates, we launched two IBM Code patterns.

In addition, we’ve enabled distributed training leveraging Horovod in FfDL.

Announcing support for PyTorch distributed training using Horovod in FfDL

The release also adds support for Uber’s Horovod mechanism for distributed deep learning training. Horovod provides a unified user experience for distributed training across TensorFlow, Keras, and PyTorch. It enables distributed model training through the Message Passing Interface (MPI), a low-level interface for high-performance parallel computing. With MPI, both TensorFlow and PyTorch models can be trained across a distributed Kubernetes cluster. FfDL already supported distributed TensorFlow training using the Parameter Server approach; Horovod adds another mechanism and also enables PyTorch distributed training.
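To make this concrete, here is a minimal sketch of how a PyTorch training script is typically adapted for Horovod before being submitted as a training job. The toy model, dataset, and hyperparameters are placeholders and not FfDL-specific; only the Horovod calls themselves matter.

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                        # start MPI-backed Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())       # pin one GPU per process

# Toy model and data; replace with your own in a real job.
model = nn.Linear(10, 1)
dataset = torch.utils.data.TensorDataset(torch.randn(1000, 10),
                                          torch.randn(1000, 1))

# Shard the data so each worker sees a different slice.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged with allreduce on every step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make sure every worker starts from the same weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = nn.MSELoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

The script is then launched with one process per GPU (for example via `mpirun` or `horovodrun`); the FfDL guide linked below describes how to package such a script as a training job.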

Why Horovod?

We find that Horovod makes it extremely easy to switch from single-GPU training to large-scale distributed training, which improves ML engineers’ velocity. It also makes it easier for users to switch between frameworks, all while training at scale. Horovod enables efficient inter-GPU communication through ring-based allreduce and requires only a few lines of modification to user code, enabling faster, easier distributed training. Additionally, a package of new features is coming to Horovod aimed at both large-scale systems and systems with lower-grade interconnects, new capabilities for very large model training, and self-diagnostics.
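The same handful of changes carries over to other frameworks, which is what makes switching cheap. Below is a similar hedged sketch for Keras; the one-layer model and random data are placeholders.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Placeholder model; the Horovod changes are the same for any Keras model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

# Wrap the optimizer and scale the learning rate, exactly as in the PyTorch case.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

# Broadcast initial weights from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x, y = np.random.randn(1000, 10), np.random.randn(1000, 1)
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```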

Horovod is built on the concept of allreduce and relies on an MPI implementation such as Open MPI. Allreduce performs an element-wise reduction on arrays of data spread across the nodes of a cluster; at the end of the allreduce operation, every node holds a copy of the result. Additional details on the framework can be found in the Horovod white paper.
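As a toy, single-process illustration of these allreduce semantics (not an actual distributed run), consider three nodes that each hold a gradient array:

```python
import numpy as np

# Each "node" holds an array of gradients; allreduce combines them
# element-wise and gives every node a copy of the combined result.
node_arrays = [
    np.array([1.0, 2.0, 3.0]),   # gradients on node 0
    np.array([4.0, 5.0, 6.0]),   # gradients on node 1
    np.array([7.0, 8.0, 9.0]),   # gradients on node 2
]
reduced = np.sum(node_arrays, axis=0)            # element-wise reduction
results = [reduced.copy() for _ in node_arrays]  # every node gets the result
print(results[0])  # [12. 15. 18.] -- identical on every node
```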

In addition, Horovod uses NCCL2 under the covers for GPU communication.

Try distributed TensorFlow and PyTorch training leveraging Horovod in FfDL today!

You can find details on how to use Horovod with FfDL in the open source FfDL readme file and guide. Deploy FfDL, use it, and extend it with the capabilities you find helpful. We look forward to your feedback and pull requests!
