Train and serve a machine learning model using Kubeflow in IBM Cloud Private

This tutorial is part of the Get started with Kubeflow learning path.

Introduction

Kubeflow is known as a machine learning toolkit for Kubernetes. It is an open source project dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. IBM Cloud Private is a platform for developing and managing on-premises, containerized applications. The Modified National Institute of Standards and Technology (MNIST) database is a large database of handwritten digits that is commonly used for machine learning training and testing purposes.

In this tutorial, I explain how to train and serve a machine learning model for the MNIST database based on a GitHub sample using Kubeflow in IBM Cloud Private-CE. The following topics are covered:

  • Enabling the LoadBalancer service on IBM Cloud Private
  • Creating PV and PVC for the sample application
  • Compiling the source
  • Uploading a pipeline application to Kubeflow Dashboard and testing

Prerequisites

To run this tutorial, you need an Ubuntu 18 machine with a minimum of 8 cores, 16 GB of RAM, and 250 GB of storage. You also need root privileges to run the tutorial steps. Kubeflow must be installed in IBM Cloud Private; if you have not already installed it, follow the installation tutorial in this learning path.

You’ll also need to set up the development environment.

Estimated time

It should take you approximately 30 minutes to complete this tutorial.

Enable the LoadBalancer service on IBM Cloud Private

This step is not needed if you have already enabled the LoadBalancer service on IBM Cloud Private. By default, IBM Cloud Private doesn’t provide built-in support for the LoadBalancer service type. To enable it, I use MetalLB in layer2 mode. For other options, see Working with LoadBalancer services on IBM Cloud Private.

  1. Install MetalLB.

     kubectl create clusterrolebinding privileged-metallb-clusterrolebinding \
     --clusterrole=ibm-privileged-clusterrole \
     --group=system:serviceaccounts:metallb-system
    
     kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
    

    If you encounter an image permission problem, create a file called image-policy.yaml with the following content.

     apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
     kind: ImagePolicy
     metadata:
       name: image-policy
     spec:
       repositories:
         - name: docker.io/*
           policy: null
         - name: k8s.gcr.io/*
           policy: null
         - name: gcr.io/*
           policy: null
         - name: ibmcom/*
           policy: null
         - name: quay.io/*
           policy: null
    

    Then run the following commands to create the image policy in the metallb-system namespace and install MetalLB again.

     kubectl create -n metallb-system -f image-policy.yaml
     kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
    
  2. Configure MetalLB in layer2 mode. Verify that the MetalLB pod is running before you continue.

     kubectl -n metallb-system get pod
     NAME                        READY     STATUS    RESTARTS   AGE
     controller-9c57dbd4-sh2ms   1/1       Running   0          1d
    

    Then run the following command to configure MetalLB (remember to replace {IP address range} with an actual range of addresses that MetalLB is allowed to assign).

     kubectl create -f - << EOF
     apiVersion: v1
     kind: ConfigMap
     metadata:
       namespace: metallb-system
       name: config
     data:
       config: |
         address-pools:
         - name: default
           protocol: layer2
           addresses:
           -  {IP address range}
     EOF
    
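
    To confirm that MetalLB is handing out addresses, you can create a throwaway LoadBalancer service and check that it receives an external IP from the pool you configured. The deployment name lb-test and the nginx image below are illustrative, not part of the sample.

    ```shell
    # Create a test deployment and expose it through a LoadBalancer service
    kubectl create deployment lb-test --image=nginx
    kubectl expose deployment lb-test --port=80 --type=LoadBalancer

    # The EXTERNAL-IP column should show an address from the MetalLB pool
    # (not <pending>)
    kubectl get svc lb-test

    # Clean up once verified
    kubectl delete svc lb-test
    kubectl delete deployment lb-test
    ```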

Create a PV and PVC for the sample application

The sample application mnist_pipeline.py in GitHub was originally written to run on Google Cloud Platform and was later modified to run on premises. To run the application in our single-node cluster, we need to create a Persistent Volume (PV) and a Persistent Volume Claim (PVC).

  1. Create a PV. Create a YAML file called jane-pv.yaml (or any name you like) with the following content.

     apiVersion: v1
     kind: PersistentVolume
     metadata:
       name: jane-pv-volume
       labels:
         type: local
     spec:
       storageClassName: manual
       capacity:
         storage: 10Gi
       accessModes:
         - ReadWriteMany
       hostPath:
         path: "/mnt/data"
    
  2. Run the following command to create the PV.

     kubectl apply -f /root/jane/jane-pv.yaml -n kubeflow
    
  3. Create a PVC. Create a YAML file called jane-pv-claim.yaml (or any name you like) with the following content:

     apiVersion: v1
     kind: PersistentVolumeClaim
     metadata:
       name: jane-pv-claim
     spec:
       storageClassName: manual
       accessModes:
         - ReadWriteMany
       resources:
         requests:
           storage: 3Gi
    
  4. Run the following command to create the PVC.

     kubectl apply -f /root/jane/jane-pv-claim.yaml -n kubeflow
    
  5. Run the following command to check the status of the PV you created.

     kubectl get pv jane-pv-volume -n kubeflow
    

You should see a “Bound” status like the following.

NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                    STORAGECLASS   REASON    AGE
jane-pv-volume   10Gi       RWX            Retain           Bound     kubeflow/jane-pv-claim   manual                   3d
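
Because the PV is backed by hostPath storage, the /mnt/data directory must exist on the node; hostPath directories are not created automatically. You can also confirm the binding from the PVC side. (The commands below assume the names used in this tutorial.)

```shell
# Ensure the backing directory exists on the node
mkdir -p /mnt/data

# The PVC should also report a Bound status
kubectl get pvc jane-pv-claim -n kubeflow
```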

Compile the source

I created a directory called /root/src to clone the sample from GitHub. You can use any directory you like. Before you can compile the sample, make sure you have set up a compilation environment as described in set up the development environment, or follow the instructions on the Kubeflow site to set up a development environment.

mkdir /root/src
cd /root/src
git clone https://github.com/kubeflow/examples.git
cd /root/src/examples/pipelines/mnist-pipelines
conda activate mlpipeline
pip install -r requirements.txt --upgrade
sed -i.sedbak "s/platform = 'GCP'/platform = 'onprem'/" mnist_pipeline.py
python3 mnist_pipeline.py

After compilation, a file called mnist_pipeline.py.tar.gz is created in the /root/src/examples/pipelines/mnist-pipelines directory.
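If you want to double-check what the sed step above does before running it against the real source, here is the same substitution applied to a throwaway file. The file name and its one-line content are illustrative, not taken from the sample.

```shell
# Create a scratch file containing the line the sed command targets
printf "platform = 'GCP'\n" > /tmp/platform_demo.py

# Apply the same in-place substitution, keeping a .sedbak backup
sed -i.sedbak "s/platform = 'GCP'/platform = 'onprem'/" /tmp/platform_demo.py

cat /tmp/platform_demo.py          # platform = 'onprem'
cat /tmp/platform_demo.py.sedbak   # platform = 'GCP'
```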

Upload to Kubeflow Dashboard and test

Follow the instructions in set up the development environment to upload your application (mnist_pipeline.py.tar.gz), create an experiment, and run it. Enter the following run parameters (use defaults for other parameters).

  • Under “model-export-dir”: /mnt/export
  • Under “pvc-name”: jane-pv-claim

Run parameters window

For a successful run, you should see something like the following.

Run3 graph

In the source file mnist_pipeline.py in GitHub, the sample application launches three containers to train a model, serve a model, and launch a web UI to test the model. A full description of the sample application can be found at the mnist-pipelines GitHub site.

After the pipeline runs successfully, you can find the URL of the web UI by running the following command.

kubectl get svc web-ui -n kubeflow
NAME      TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
web-ui    LoadBalancer   10.0.0.50    x.xx.xxx.xx   80:30096/TCP   23h

The web UI is available at http://x.xx.xxx.xx/ (the EXTERNAL-IP on port 80), or through node port 30096 on any cluster node.
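
If you prefer to script this lookup, the external IP can be extracted with a jsonpath query. This assumes MetalLB has already assigned an IP to the web-ui service in the kubeflow namespace; the WEB_UI_IP variable name is illustrative.

```shell
# Extract only the external IP assigned by MetalLB
WEB_UI_IP=$(kubectl get svc web-ui -n kubeflow \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo "Web UI: http://${WEB_UI_IP}/"
```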

Web UI

Now you can experiment and test your model with different inputs (handwriting digits).

Summary

In this tutorial, I explained how to train and serve a machine learning model for the MNIST database, based on a GitHub sample, using Kubeflow in IBM Cloud Private. As you can see, Kubeflow Pipelines makes this process simple and easy.

This tutorial is the final part of the Get started with Kubeflow learning path. You should now have a better understanding of Kubeflow: how to install it, how to set up a development environment, and how to use it to train and serve a machine learning model.

Acknowledgment

Many thanks to Jin Chi He for his assistance on LoadBalancer in IBM Cloud Private.

Jane Man