Tekton optimizations for Kubeflow Pipelines 2.0

Challenges and benefits of the new Kubeflow Pipelines capabilities

Kubeflow Pipelines on Tekton is an open source machine learning workflow platform for building and deploying portable, scalable machine learning workflows on Kubernetes and Red Hat OpenShift. The Kubeflow Pipelines project is the core of Watson Pipelines, the Watson Studio offering that lets users create repeatable and scheduled flows that automate notebook, data refinery, and machine learning pipelines on the watsonx platform. With the Kubeflow Pipelines on Tekton 2.0 release, the platform is now much more scalable and produces enriched metadata.

In Kubeflow Pipelines V1, the pipeline spec that describes ML flows is platform-dependent, which makes it challenging to bring your ML flows to other pipeline frameworks. To lower this barrier, Kubeflow Pipelines V2.0 is built on a new Intermediate Representation (IR): a generic, component/step-oriented specification that fits container orchestration frameworks in general. With Kubeflow Pipelines on Tekton 2.0, the user experience between the Argo and Tekton backends becomes far more seamless.
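
To make the IR concrete, here is a minimal sketch that authors a pipeline with the Kubeflow Pipelines V2 Python SDK and compiles it to the platform-neutral IR YAML. The component and pipeline names are our own illustrative placeholders, not taken from this post.

    # Author a trivial pipeline and compile it to IR YAML with the KFP v2 SDK.
    from kfp import compiler, dsl

    @dsl.component
    def say_hello(name: str) -> str:
        # A self-contained, single-step component.
        message = f"Hello, {name}!"
        print(message)
        return message

    @dsl.pipeline(name="hello-pipeline")
    def hello_pipeline(name: str = "world") -> str:
        task = say_hello(name=name)
        return task.output

    # The output file is the IR spec, so the same YAML can be submitted to
    # either an Argo-backed or a Tekton-backed KFP 2.0 installation.
    compiler.Compiler().compile(
        pipeline_func=hello_pipeline,
        package_path="hello_pipeline.yaml",
    )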

This blog reviews the challenges and benefits of having Kubeflow Pipelines V2.0 support multiple backends, and then covers some Tekton-specific optimizations on Kubeflow Pipelines that can benefit other backend providers.

Kubeflow Pipelines V2.0 design

In Kubeflow Pipelines V1, the pipeline spec describing ML flows is platform-dependent across all the clients, including the web UI, the Python SDK, and most backend microservices. Furthermore, the metadata in Kubeflow Pipelines V1 is preliminary and does not provide useful information about the pipeline execution itself. Kubeflow Pipelines V2.0 aims to address these two major limitations with a new design.

Kubeflow Pipelines V2.0 is designed with the following major features:

  • Authoring end-to-end ML workflows natively in Python
  • Creating fully customized ML components or using an ecosystem of existing components
  • Easily managing, tracking, and visualizing pipeline definitions, runs, experiments, and ML artifacts
  • Efficiently using compute resources through parallel task execution and through caching to eliminate redundant executions (see the caching sketch after this list)
  • Maintaining cross-platform pipeline portability through a platform-neutral IR YAML pipeline definition
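
The caching behavior mentioned above can be controlled per task in the V2 SDK. Here is a hedged sketch using the set_caching_options() method on an illustrative component of our own:

    # Control result caching per task with the KFP v2 SDK.
    from kfp import dsl

    @dsl.component
    def preprocess(rows: int) -> int:
        print(f"Processing {rows} rows")
        return rows

    @dsl.pipeline(name="caching-demo")
    def caching_demo(rows: int = 1000):
        task = preprocess(rows=rows)
        # Force this task to re-execute on every run; by default, a task
        # whose inputs are unchanged reuses its cached results.
        task.set_caching_options(False)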

To achieve these features, the new pipeline definition is expressed in a universal IR format that takes in dynamic metadata updates as part of the pipeline execution. Kubeflow Pipelines V2.0 therefore introduces the concepts of a "driver" and a "publisher" for each pipeline and task to produce real-time metadata, handle caching and conditions, and update pipeline status.

At the pipeline level, the driver is responsible for configuring the runtime, checking for caching decisions, producing the execution IDs for its dependent tasks, and updating all of this information back to the metadata service. If the pipeline is not cached and is being executed, the publisher takes the parameters, artifacts, and execution context produced at the end of the pipeline and reports them back to the metadata service. Together, these steps cover all the information users need, from configuring the pipeline runtime to receiving metadata and pipeline status updates.
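
As a mental model, here is a highly simplified, hypothetical Python sketch of the pipeline-level driver's control flow. All the function and client names are our own stand-ins; the real driver lives in the KFP backend and is not this code.

    # Hypothetical sketch: what a pipeline-level driver does, in order.
    def run_pipeline_driver(pipeline_spec, runtime_config, mlmd_client):
        # 1. Configure the runtime for this pipeline run.
        context = mlmd_client.create_context(pipeline_spec, runtime_config)

        # 2. Check whether an identical execution was already cached.
        cached = mlmd_client.lookup_cached_execution(context)
        if cached:
            mlmd_client.publish_status(cached, state="CACHED")
            return cached

        # 3. Produce the execution ID that dependent tasks reference, and
        #    record everything back to the metadata service.
        execution_id = mlmd_client.create_execution(context)
        mlmd_client.publish_status(execution_id, state="RUNNING")
        return execution_id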

At the task level, the driver is responsible for checking for caching decisions, producing the execution IDs for its dependent tasks, and updating all of this information back to the metadata service. In addition, it produces the executor input, which represents the dynamic pod configuration at run time. Because Tekton pipelines only support static configuration out of the box, the pod spec would otherwise have to be fully defined before the run starts. By leveraging the new Kubeflow Pipelines V2.0 design, the pod spec can be modified and optimized dynamically based on real-time pipeline conditions.
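
For illustration, a trimmed executor-input payload might look like the following. The shape mirrors the KFP v2 executor-input idea (resolved runtime inputs plus output destinations), but the concrete values and paths here are placeholders.

    # Illustrative executor-input payload produced by a task-level driver.
    executor_input = {
        "inputs": {
            # Parameter values resolved by the driver at run time.
            "parameterValues": {"name": "world"},
            # Input artifacts resolved by the driver at run time.
            "artifacts": {},
        },
        "outputs": {
            # Where the user container writes output metadata for publishing.
            "outputFile": "/tmp/kfp_outputs/output_metadata.json",
        },
    }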

The task-level publisher is then embedded as part of the user container and proceeds with the following steps (sketched in code after the list):

  1. Download input artifacts and prepare the container environment.
  2. Run the user command.
  3. Upload output artifacts to the designated object storage.
  4. Publish parameter/artifact metadata and update task status to the metadata service.
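
A hypothetical sketch of those four steps follows. The helper objects (object_store, mlmd_client) and their methods are our own stand-ins; the real launcher is compiled into the KFP backend images.

    # Hypothetical sketch of the embedded launcher/publisher steps.
    import subprocess

    def run_task(executor_input, user_command, object_store, mlmd_client):
        # 1. Download input artifacts and prepare the container environment.
        object_store.download(executor_input["inputs"]["artifacts"])

        # 2. Run the user command.
        result = subprocess.run(user_command, check=True)

        # 3. Upload output artifacts to the designated object storage.
        object_store.upload(executor_input["outputs"])

        # 4. Publish parameter/artifact metadata and update the task status
        #    in the metadata service.
        mlmd_client.publish(executor_input["outputs"], state="COMPLETE")
        return result.returncode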

The new driver and publisher design enables Kubeflow Pipelines to update parameters, artifacts, and pipeline status without syncing with the actual Kubernetes and Tekton runtime status. As a result, all the new Kubeflow Pipelines pipelines and metadata can be seamlessly transferred between Argo and Tekton at run time.

Figure: KFP driver and publisher features

Tekton Optimizations for Kubeflow Pipelines V2.0

Although Kubeflow Pipelines V2.0 reduces the pipeline's dependencies on the backend runtime and provides better caching and metadata mechanisms, it now requires running the drivers and publishers for each task and sub-pipeline in order to update the pipeline status, metadata, and artifact objects. Because the drivers and publishers are very lightweight and repetitive, we built a dedicated Tekton custom task controller with dedicated workers to handle these repetitive tasks. This custom task controller significantly improves latency and throughput compared to spinning up a new pod for each of these tasks.

When comparing pipeline speed on cached tasks, the Kubeflow Pipelines on Tekton implementation does not need to create any additional pods. Therefore, it can save a significant amount of time on the same pipeline compared to the current Kubeflow Pipelines V2.0.2 Argo backend implementation.

Figure: Cached run speed comparison

Another improvement in the Kubeflow Pipelines on Tekton implementation is reducing the number of tasks and reconciliation cycles. When a large number of Tekton tasks sits in the Tekton controller queue, handling new events and processing reconciliation cycles becomes a significant bottleneck. Therefore, we integrated all the new Kubeflow Pipelines V2.0 driver and task-construction logic into a single reconciliation cycle, which reduces the number of queued tasks by 50% and processes both the driver and Tekton task logic in one cycle.

Figure: KFP-Tekton task improvement

Conclusion

Kubeflow Pipelines V2.0 significantly improves the caching and metadata mechanisms with a new design built on top of the platform-agnostic intermediate representation. It not only makes pipelines more performant and informative, but also reduces platform dependencies, enabling pipelines to run on different backends and environments. Furthermore, with the unique features of the Tekton backend, we were able to further optimize the runtime architecture to significantly improve latency and throughput for running Kubeflow Pipelines.

If you want to learn more about how to run production-level pipeline services for AI and large language models (LLMs), check out our product Watson Pipelines in watsonx.ai. Watsonx.ai provides new generative AI capabilities, powered by foundation models, and traditional machine learning capabilities in one powerful platform that spans the AI lifecycle.