Get FfDL up and running on a private cloud

As many companies collect massive amounts of data, they want to use artificial intelligence (AI) and the collected data to improve user experiences for their products. Providing an easy way to use AI environments can lower the entry barrier and enable both developers and data scientists to focus on what they do best: analyzing data and defining and training cutting-edge neural network models (with automation) over these large data sets.

Fabric for Deep Learning (FfDL) is an open source collaboration platform for running deep learning workloads in private or public Kubernetes-based clouds. Leveraging the power of Kubernetes, FfDL provides a scalable, resilient, and fault-tolerant deep-learning framework by combining the right software, drivers, compute, memory, network, and storage resources.

IBM Cloud Private is an integrated environment for managing containers that includes the container orchestrator Kubernetes, a private image registry, a management console, and monitoring frameworks that are all running within your data center. Various types of solutions can benefit by deploying IBM Cloud Private on-premises for data privacy, data protection, and full control over the environment.

Together, IBM Cloud Private and FfDL can provide a solution with the flexibility, ease of use, and economics of a cloud service, combined with the power of deep learning.

Learning objectives

This tutorial shows you how to deploy FfDL on IBM Cloud Private Community Edition (with Kubernetes 1.10).


For this tutorial, you need the following prerequisites:

  • one Ubuntu 16.04.4 server, running on bare metal or as a CPU-only virtual machine, as the master node, and
  • one Ubuntu 16.04.4 server, running on bare metal or as a virtual machine with GPUs, as the worker node.

Estimated time

Allow one hour to complete this tutorial.

Step 1. Set up IBM Cloud Private – Community Edition

To set up IBM Cloud Private from the beginning, follow the steps in Preparing your cluster for installation.

Then, follow the steps in Installing IBM Cloud Private-Community Edition.

Step 2. Set up the Kubernetes CLI client and the Helm client for your IBM Cloud Private cluster

  1. Install kubectl.

    To install kubectl on the master node, run the following commands:

     sudo curl -L <kubectl-download-URL> -o /usr/local/bin/kubectl
     sudo chmod +x /usr/local/bin/kubectl
  2. Install the Helm client.

    To install Helm client on the master node, follow the steps in Setting up the Helm CLI.

  3. Configure kubectl to use the service account token as access credentials.

    Get the existing service account secret name:

     $ kubectl get secret
     NAME                  TYPE                                  DATA      AGE
     calico-etcd-secrets   Opaque                                3         19h
     default-token-b9pfk   kubernetes.io/service-account-token   3         19h

    Write down the secret name of your service-account-token (that is default-token-b9pfk in the previous example), and run the following command with it.

     $ kubectl config set-credentials mycluster-user --token=$(kubectl get secret <your-token-secret-name> -o jsonpath={.data.token} | base64 -d)
     User "mycluster-user" set.
  4. Append the --tls option to all Helm command calls.

    For security reasons, IBM Cloud Private requires all Helm commands to use the --tls flag. To automatically append the --tls option to all Helm command calls, complete the following steps:

     # Get the helm version
     helm version --tls
     # Append the helm version to the executable filename (2.7.3 in this example)
     sudo mv /usr/local/bin/helm /usr/local/bin/helm-v273
     # Create a script to call this helm executable
     sudo vi /usr/local/bin/helm
     # Add the following lines into /usr/local/bin/helm
     if [ "$1" = "delete" ] || [ "$1" = "del" ] ||
        [ "$1" = "history" ] || [ "$1" = "hist" ] ||
        [ "$1" = "install" ] ||
        [ "$1" = "list" ] || [ "$1" = "ls" ] ||
        [ "$1" = "status" ] ||
        [ "$1" = "upgrade" ] ||
        [ "$1" = "version" ]
     then
       /usr/local/bin/helm-v273 "$@" --tls
     else
       /usr/local/bin/helm-v273 "$@"
     fi
     # Add execute permission for all users who can access this file
     sudo chmod +x /usr/local/bin/helm
     # Verify --tls is appended to every helm command call
     helm version
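The wrapper script's dispatch logic can be sketched with a case statement; `helm_stub` below is a hypothetical stand-in for the real helm-v273 binary, so the logic can be exercised without a cluster:

```shell
# Stand-in for the real helm binary: just echoes the command it receives.
helm_stub() { echo "helm $*"; }

# Subcommands that talk to Tiller get --tls appended; all others pass through.
helm_wrapper() {
  case "$1" in
    delete|del|history|hist|install|list|ls|status|upgrade|version)
      helm_stub "$@" --tls ;;
    *)
      helm_stub "$@" ;;
  esac
}

helm_wrapper version    # prints: helm version --tls
helm_wrapper repo list  # prints: helm repo list
```

A case statement scales better than the chained `[ ... ] || [ ... ]` tests in the script above, but both implement the same dispatch.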

Step 3. Set up IBM Cloud Private to enable GPU support

  1. Install the Nvidia driver on each worker node.

    Either use the example yaml file, or get the driver installer file from the Google Cloud GitHub repository and remove the affinity section:

     wget -O driver-installer.yaml <driver-installer-yaml-URL>

    Then deploy the daemon set from the master node to install the driver on each worker node:

     # Launch the daemonset
     kubectl create -f driver-installer.yaml

     # Verify the driver is installed
     kubectl describe ds nvidia-driver-installer -n kube-system
     # Optionally, ssh to each worker node and confirm the driver is loaded
  2. Deploy the Kubernetes device plug-in for Nvidia GPUs on your IBM Cloud Private cluster.

    Either use the example yaml file, or get the device plug-in file from the Kubernetes GitHub repository and remove the affinity section:

     wget -O device-plugin.yaml <device-plugin-yaml-URL>

    Then deploy the daemon set from the master node to install the device plug-in on each worker node:

     # Launch the daemonset
     kubectl create -f device-plugin.yaml

     # Verify the device plug-in is installed
     kubectl describe ds nvidia-gpu-device-plugin -n kube-system

    To verify that GPUs are enabled on each worker node, run the following commands:

     kubectl get nodes
     kubectl describe node <your-node-name> | grep Capacity -A 15
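What the verification step surfaces can be sketched as follows; `sample_output` is a hypothetical Capacity section standing in for real `kubectl describe node` output:

```shell
# Hypothetical Capacity section from `kubectl describe node <name>`;
# on a real cluster you would pipe the kubectl output instead.
sample_output='Capacity:
 cpu:             8
 memory:          32942128Ki
 nvidia.com/gpu:  2'

# The nvidia.com/gpu line reports how many GPUs the node exposes.
gpus=$(printf '%s\n' "$sample_output" | awk '/nvidia.com\/gpu/ {print $2}')
echo "GPUs reported: $gpus"   # prints: GPUs reported: 2
```

A node showing `nvidia.com/gpu: 0` (or no such line at all) means the driver or device plug-in is not working on that node.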

Step 4. Set up IBM Cloud Private to allow dynamic storage provisioning based on storage class

If you already have a storage class available on your IBM Cloud Private cluster, you can skip to step 5.

If you have an existing shared storage system served by an NFS server, you can use it: follow the advice at Add a dynamic NFS provisioner to your Kubernetes cluster to create a storage class on your IBM Cloud Private cluster, then go to step 5.

If you don’t have a storage class or NFS storage available in your environment, you can set up an NFS server to export a shared directory and mount it on all worker nodes, so that pods running on different nodes can all access it. Run the following commands on the master node:

# Create the shared directory
sudo mkdir -p /data-nfs

# Install NFS kernel server
sudo apt update
sudo apt install -y nfs-kernel-server

# Update /etc/exports
echo "/data-nfs *(rw,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports

# Restart NFS kernel server
sudo service nfs-kernel-server restart

To mount the shared directory on the worker nodes, run the following commands on all worker nodes:

# Install NFS client
sudo apt update
sudo apt install -y nfs-common

# Create the directory for the mount point
sudo mkdir /data-nfs
sudo chmod 777 /data-nfs

# Mount the shared directory
sudo mount -t nfs -o proto=tcp,port=2049 <hostname-or-IP-address-of-master-node>:/data-nfs /data-nfs

# Update /etc/fstab
echo "<hostname-or-IP-address-of-master-node>:/data-nfs /data-nfs   nfs    auto  0  0" | sudo tee -a /etc/fstab
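The two configuration lines appended above follow a fixed shape; this sketch builds them from variables so server and worker entries stay consistent (`MASTER` is a hypothetical address standing in for your master node):

```shell
# Hypothetical master-node IP; substitute your own.
MASTER=192.0.2.10
SHARE=/data-nfs

# Server side: the export entry appended to /etc/exports.
exports_line="$SHARE *(rw,no_root_squash,no_subtree_check)"

# Worker side: the mount entry appended to /etc/fstab.
fstab_line="$MASTER:$SHARE $SHARE   nfs    auto  0  0"

echo "$exports_line"
echo "$fstab_line"
```

Keeping the share path in one variable avoids the common mistake of exporting one directory on the server while mounting a differently spelled path on the workers.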

Then, follow the guidance at Add a dynamic NFS provisioner to your Kubernetes cluster to create a storage class on your IBM Cloud Private cluster.

Step 5. Install FfDL

Now, the IBM Cloud Private cluster is ready for FfDL deployment. You can use the following commands:

# Clone the FfDL repository to the master node
git clone https://github.com/IBM/FfDL.git
cd FfDL

# Setup variables
export VM_TYPE=none
export PUBLIC_IP=<IP-address-of-master-node>

# Change the storage class to example-nfs-local or whatever is available on your ICP.
export SHARED_VOLUME_STORAGE_CLASS="example-nfs-local"

# Install an object storage plug-in for FfDL
helm install storage-plugin

# Create a static volume to store any metadata from FfDL

# Wait until static-volume-1 is bound to a volume
kubectl get pvc

# Install FfDL
helm install . --set lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS

# Wait until all FfDL components are in running state
helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1)

# Configure Grafana to monitor FfDL

# Get Grafana, FfDL Web UI, and FfDL restapi endpoints
grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}')
ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}')
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')

echo "Monitoring dashboard: http://$PUBLIC_IP:$grafana_port/ (login: admin/admin)"
echo "Web UI: http://$PUBLIC_IP:$ui_port/#/login?endpoint=$PUBLIC_IP:$restapi_port&username=test-user"
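How the endpoint URLs are assembled can be sketched with hypothetical port values; on a real cluster, the ports come from the kubectl jsonpath queries above:

```shell
PUBLIC_IP=192.0.2.10   # hypothetical; use your master node's IP
grafana_port=30601     # hypothetical NodePort values; real ones are
ui_port=30312          # whatever Kubernetes assigned to each service
restapi_port=30313

# The dashboard URL needs only the Grafana NodePort; the Web UI URL also
# embeds the REST API endpoint as a query parameter.
dashboard_url="http://$PUBLIC_IP:$grafana_port/"
webui_url="http://$PUBLIC_IP:$ui_port/#/login?endpoint=$PUBLIC_IP:$restapi_port&username=test-user"

echo "Monitoring dashboard: $dashboard_url (login: admin/admin)"
echo "Web UI: $webui_url"
```

Note that both the UI address and the embedded REST API endpoint use the same master-node IP; only the ports differ.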

Step 6. Verify the installation by running a Jupyter Notebook

To verify the proper operation of FfDL, set up a Jupyter Notebook to run code on FfDL.

  1. Prepare a Dockerfile for Jupyter Notebook.

    Create a Dockerfile on the master node with the following content. It installs PyTorch and downloads a Jupyter Notebook that uses the torchtext package. You can also replace it with your own Jupyter Notebook file.

     FROM jupyter/scipy-notebook
     RUN conda install --quiet --yes pytorch torchvision -c pytorch
     RUN git clone <examples-repo-URL> ~/examples && \
             cd ~/examples && \
             git fetch && \
             git checkout -b torchtext remotes/origin/torchtext && \
             cd word_language_model && \
             pip install -r requirements.txt && \
             pip install torchtext && \
             pip install spacy && \
             python -m spacy download en && \
             sed -i "s/#c.NotebookApp.password = ''/c.NotebookApp.password = 'sha1:590c6011243f:1f7af09d03abfd06ac8b49185cc72fabec5a199f'/g" /home/jovyan/.jupyter/jupyter_notebook_config.py
  2. Build the Docker image described above and load it onto all worker nodes.

    Run the following commands:

     # Build the above docker image on the master node
     sudo docker build -f Dockerfile -t pytorch_jupyter_notebook:v1 .
     # Save the image into a file
     sudo docker save pytorch_jupyter_notebook:v1 > pytorch_jupyter_notebook.tar
     # scp the tar file to each worker node from the master node
     scp pytorch_jupyter_notebook.tar username@<worker-node-ip>:/home/username/.
     # On each worker node load the image into the docker repository
     sudo docker load < pytorch_jupyter_notebook.tar
  3. Create the Jupyter Notebook deployment and service.

    Create the following deploy.yaml and service.yaml files on the master node:

     $ cat deploy.yaml
     apiVersion: apps/v1beta1
     kind: Deployment
     metadata:
       name: jupyter-notebook
     spec:
       replicas: 1
       template:
         metadata:
           labels:
             app: jupyter-notebook
         spec:
           containers:
           - name: pytorch-notebook
             image: pytorch_jupyter_notebook:v1
             command:
             - "jupyter"
             - "notebook"
             - "/home/jovyan/examples/word_language_model/word_language_model_and_torchtext.ipynb"

     $ cat service.yaml
     apiVersion: v1
     kind: Service
     metadata:
       name: jupyter-notebook
     spec:
       ports:
       - port: 8888
         targetPort: 8888
       selector:
         app: jupyter-notebook
       type: NodePort
     # Create the above deployment and service
     kubectl create -f deploy.yaml
     kubectl create -f service.yaml
  4. Open the Jupyter Notebook.

Use the following command to get the port number for the Jupyter Notebook service: kubectl get svc jupyter-notebook -o jsonpath='{.spec.ports[0].nodePort}'
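What the jsonpath query extracts can be sketched as follows; `svc_json` is a hypothetical fragment of `kubectl get svc jupyter-notebook -o json` output, not a live query:

```shell
# Hypothetical service JSON; nodePort is the externally reachable port
# that the jsonpath query above returns.
svc_json='{"spec":{"ports":[{"port":8888,"targetPort":8888,"nodePort":31888}]}}'

# Pull out the nodePort value with sed (a local stand-in for jsonpath).
node_port=$(printf '%s\n' "$svc_json" | sed -n 's/.*"nodePort":\([0-9]*\).*/\1/p')

echo "http://<your-master-node-ip>:$node_port"   # prints: http://<your-master-node-ip>:31888
```

The `port` and `targetPort` (8888) are only reachable inside the cluster; it is the NodePort in the 30000-32767 range that you put in the browser URL.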

Now, open http://your-master-node-ip-address:your-jupyter-notebook-service-port in a browser and log in with time4fun as the password. Then, you can double-click word_language_model_and_torchtext.ipynb to open the notebook.

If there is no error, your FfDL environment is ready.


Now you know how to get FfDL running on an IBM Cloud Private cluster. Try it in your own environment to explore the power and flexibility of running deep learning workloads in your private cloud.