
IBM Developer Blog


Explore this machine learning toolkit for Kubernetes and OpenShift.

Machine learning must address a daunting breadth of functionalities around building, training, serving, and managing models. Doing so in a consistent, composable, portable, and scalable manner is hard. The Kubernetes framework is well suited to address these issues, which is why it’s a great foundation for deploying machine learning workloads. The Kubeflow project’s development has been a journey to realize this promise, and we are excited that journey has reached its first major destination – Kubeflow 1.0.

Always ready to work with a strong and diverse community, IBM joined this Kubeflow journey early on. Over the course of the last year, IBM became the largest code contributor to Kubeflow after Google. More than 20 IBMers have contributed code to Kubeflow, with over 500 commits and 900K lines of code.

Our focus has been on contributing to key components of Kubeflow, including Katib (hyperparameter optimization), KFServing (model serving), Fairing (SDK), Kubeflow Pipelines, kfctl (control plane), Manifests (configurations), TF-Operator, and PyTorch-Operator (model training). Additionally, we have been running internal projects and performance tests as well as evaluating production fits for enterprise customers, while also working on requirements for certain projects that we are invested in (for example, Containerd, OpenShift, and Power Systems). As part of 1.0, many Kubeflow components around the core tenets of build, train, deploy, and manage have matured and are ready for production-grade deployments. For more details, see the community blog post on Kubeflow 1.0.

Kubeflow provides instructions for deployment on Google Cloud Platform (GCP) and Amazon Web Services (AWS). Additionally, we have instructions to run Kubeflow on IBM Cloud, Kubeflow on OpenShift, and Kubeflow on Power. We are working with enterprise customers in telco, banking, farming, and other domains to enable an end-to-end machine learning platform that uses Kubeflow and other open source technologies. A couple of those client stories will be highlighted at the upcoming IBM THINK conference in May.

Some of the highlights of the work where we collaborated with the Kubeflow community leading toward an enterprise-grade Kubeflow 1.0 are listed below.

Kubeflow Operator for deployment and management of Kubeflow

One of our most recent collaborative efforts was around the development of the Kubeflow Operator, which helps deploy, monitor, and manage the lifecycle of Kubeflow. It is built using the Operator Framework, which is an open source toolkit used to build, test, package, and manage the lifecycle of operators. The Kubeflow Operator is now available in Kubeflow GitHub. Additionally, we created the metadata and code for the operator to be officially published on OperatorHub. This will help us leverage the ecosystem and tools around OpenShift, mainly Operator Lifecycle Manager.

Kubeflow Operator Design

Start using the Kubeflow Operator

To get started with using the Kubeflow Operator, deploy it in an Operator namespace, and then use the deployed Operator to deploy Kubeflow.

  1. Deploy the Operator using the following commands.

     git clone && cd kfctl
     kubectl create ns ${OPERATOR_NAMESPACE}
     kubectl create -f deploy/crds/kfdef.apps.kubeflow.org_kfdefs_crd.yaml
     kubectl create -f deploy/service_account.yaml -n ${OPERATOR_NAMESPACE}
     kubectl create clusterrolebinding kubeflow-operator --clusterrole cluster-admin --serviceaccount=${OPERATOR_NAMESPACE}:kubeflow-operator
     kubectl create -f deploy/operator.yaml -n ${OPERATOR_NAMESPACE}
  2. Point to the IBM Cloud or Red Hat OpenShift KFDEF file (or the one for your cloud provider).

    • To deploy to IBM Cloud, point to the following kfdef file.

      export KFDEF_URL=
    • To deploy to Red Hat OpenShift, point to the following kfdef file.

      export KFDEF_URL=
  3. Update the KFDEF file with your Kubeflow deployment name

     export KUBEFLOW_DEPLOYMENT_NAME=kubeflow
     export KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)
     curl -L ${KFDEF_URL} > ${KFDEF}
     yq w ${KFDEF} '' ${KUBEFLOW_DEPLOYMENT_NAME} > ${KFDEF}.tmp && mv ${KFDEF}.tmp ${KFDEF}
  4. Deploy Kubeflow

     kubectl create -f ${KFDEF} -n ${KUBEFLOW_NAMESPACE}
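
The steps above rely on shell variables that the listing never sets (OPERATOR_NAMESPACE and KUBEFLOW_NAMESPACE) and on a small pipeline that derives a local file name from KFDEF_URL. The following is a minimal sketch of that setup, not the project's official script. The namespace values and the example.com URL are placeholders chosen purely for illustration, and the snippet does not contact a cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed namespace names; the steps above do not specify them.
export OPERATOR_NAMESPACE="${OPERATOR_NAMESPACE:-operators}"
export KUBEFLOW_NAMESPACE="${KUBEFLOW_NAMESPACE:-kubeflow}"

# Placeholder URL for illustration only; substitute the kfdef file for your platform.
export KFDEF_URL="https://example.com/kfdef/kfdef_example.yaml"

# Same pipeline as step 3: reverse the string, keep the first "/"-separated
# field (the last path segment, reversed), then reverse it back.
KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)

# Equivalent pure-shell form: strip everything up to the last "/".
KFDEF_ALT="${KFDEF_URL##*/}"

echo "operator namespace: ${OPERATOR_NAMESPACE}"
echo "kubeflow namespace: ${KUBEFLOW_NAMESPACE}"
echo "local kfdef file:   ${KFDEF}"
```

Both forms yield the last path segment of the URL, which is the file name that curl writes to and that kubectl later reads. The target namespace presumably has to exist before the final kubectl create call, so creating it first is a reasonable precaution.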

TF Operator, PyTorch Operator, Katib for distributed training and hyperparameter optimization

Distributed model training on Kubernetes for frameworks such as TensorFlow and PyTorch has been the foundation of Kubeflow. We have contributed Python SDKs for the TensorFlow Operator and the PyTorch Operator, which allow you to run distributed training jobs from your notebooks. Additionally, Katib is a component of Kubeflow that enables hyperparameter tuning and neural architecture search. We are a major contributor to Katib, and led the design and implementation of various features such as the metrics collector, trial template, suggestion service, Katib API, and more. We also published a detailed deep dive into various features of Katib, which gives you behind-the-scenes details as well as guidance on getting started with Katib.


KFServing for model inferencing and model management

IBM helped found KFServing, working jointly with Google, Bloomberg, Seldon, and others. The team has contributed many features, including the SKLearn and PyTorch servers, storage, the SDK, Knative updates, Pipelines integration, and E2E test infrastructure, and co-led the payload logging design. This year, one major focus is bringing trusted AI features to KFServing, such as bias detection, adversarial detection, and explainability, using the IBM suite of trusted AI projects.

You can get started with KFServing in 5 minutes, and deploy a default and canary version of a model using a simple spec. For more advanced inferencing scenarios, follow our talk from KubeCon.

apiVersion: ""
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  default:
    predictor:
      # 90% of traffic is sent to this model
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
  canaryTrafficPercent: 10
  canary:
    predictor:
      # 10% of traffic is sent to this model
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers-2"
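
To make the traffic split concrete: the spec sets only canaryTrafficPercent, and its comments indicate that the default model receives the remainder. Here is a small sketch of that arithmetic. The request count is an arbitrary number chosen for illustration, and real routing is probabilistic, so actual counts would only approximate these figures.

```shell
#!/usr/bin/env bash
set -euo pipefail

CANARY_TRAFFIC_PERCENT=10   # value of canaryTrafficPercent in the spec above
DEFAULT_TRAFFIC_PERCENT=$((100 - CANARY_TRAFFIC_PERCENT))

TOTAL_REQUESTS=2000         # hypothetical request volume
CANARY_REQUESTS=$((TOTAL_REQUESTS * CANARY_TRAFFIC_PERCENT / 100))
DEFAULT_REQUESTS=$((TOTAL_REQUESTS - CANARY_REQUESTS))

echo "default: ${DEFAULT_TRAFFIC_PERCENT}% (about ${DEFAULT_REQUESTS} of ${TOTAL_REQUESTS} requests)"
echo "canary:  ${CANARY_TRAFFIC_PERCENT}% (about ${CANARY_REQUESTS} of ${TOTAL_REQUESTS} requests)"
```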


Kubeflow Pipelines for ML workflow orchestration

Our engagement with Kubeflow Pipelines started with contributing Pipeline Components and samples for Spark, the Watson Portfolio (Watson Machine Learning and Watson OpenScale), KFServing, Katib, AI Fairness 360, and the Adversarial Robustness 360 Toolbox.

Watson pipeline

Additionally, IBM has helped facilitate Pipelines execution with containerd (instead of just Docker), working with the Argo community. IBM has contributed to a wide range of roadmap and design discussions for Kubeflow Pipelines and TFX, on-prem authentication/authorization, and more. Currently, IBM is driving the Kubeflow Pipelines and Tekton comparative study and an initial prototype for the KFP-Tekton YAML code and compiler. We are running the MLOps SIG in the CD Foundation to drive this.

We have also published an end-to-end pipeline sample including Katib, TFJob, and KFServing.

Kubeflow pipelines

Fairing to provide a consistent multicloud Kubeflow SDK

By using Kubeflow Fairing and adding a few lines of code, you can run your machine learning training job locally or in the cloud, directly from Python code or a Jupyter Notebook. IBM has been contributing heavily to Kubeflow Fairing, including Python packaging for Fairing, release management and maintenance, integration with KFServing, Fairing CI/CD enhancements, and other fixes and feature enhancements.

Join us to build an enterprise-grade machine learning platform

Here are a few ways you can engage with us:

  • To contribute and build an enterprise-grade, end-to-end machine learning platform on OpenShift and Kubernetes, please join the Kubeflow community and reach out with any questions, comments, and feedback!
  • If you want help deploying and managing Kubeflow on your on-premises Kubernetes platform, on OpenShift, or on IBM Cloud, please connect with us.
  • Check out the OpenDataHub if you are interested in other open source projects in the Data and AI portfolio, namely Kafka, Hive, Hue, and Spark, and how to bring them together in a cloud-native way.

Thanks to the IBM contributors to the Kubeflow project, namely Jin Chi He, Tommy Li, Hou Gang Liu, Weiqiang Zhuang, Guang Ya Liu, Christian Kadner, Andrew Butler, Jane Man, and many others, for contributing to the various aspects of the project, both internally and externally.

Additionally, kudos to our Red Hat colleagues for helping make Kubeflow on OpenShift hardened and production-ready for enterprise use cases. And, last but not least, thanks to the collaborative community members, including Google, Arrikto, Cisco, Bloomberg, Microsoft, and many others, for getting this to the first major milestone, Kubeflow 1.0.

Animesh Singh is the Chief Architect, Data and AI Open Source Platform in the IBM Cognitive Applications group.