
Create AI pipelines using Elyra and Kubeflow Pipelines

Use JupyterLab and Elyra to create and run machine learning pipelines

Data scientists frequently use Jupyter Notebooks to do their work. Whether you are loading or processing data, analyzing data, using data to train a model, or performing other tasks of the data science workflow, notebooks are usually key.

Let's say you create a set of notebooks that load, cleanse, and analyze time-series data, which is made available periodically. Instead of running each notebook manually (or performing all tasks in a single notebook, which limits reusability of task-specific code), you could create and run a reusable machine learning pipeline like the following:

Diagram of a pipeline example

With the open source Elyra project, you can do this in JupyterLab, Apache Airflow, or Kubeflow Pipelines.

Quick intro to JupyterLab and Elyra

JupyterLab extensions make it possible for anyone to customize the user experience. Extensions provide new functionality, like a CSV file editor or a visualization, and integrate services (like git for sharing and version control) or themes.

Elyra is a set of AI-centric extensions for JupyterLab that aim to simplify and streamline day-to-day activities. Its main feature is the Visual Pipeline Editor, which enables you to create workflows from Python notebooks or scripts and run them locally in JupyterLab, or remotely on Kubeflow Pipelines or Apache Airflow.

Diagram of a local and remote notebook pipeline example

Assembling a pipeline

You will use the Visual Pipeline Editor to assemble pipelines in Elyra. The pipeline assembly process generally involves:

  • Creating a new pipeline
  • Adding Python notebooks or Python scripts and defining their runtime properties
  • Connecting the notebooks and scripts to define execution dependencies

Animation that shows how to create a pipeline, add Python notebooks, and connect a notebook and scripts to define execution dependencies
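
Behind the scenes, Elyra saves each pipeline as a JSON-formatted .pipeline file in your workspace. If you prefer a terminal, recent Elyra releases also include an elyra-pipeline command-line tool; as a minimal sketch (the file name is illustrative, and the available subcommands can vary between releases), you can print a summary of a saved pipeline like this:

elyra-pipeline describe analyze-time-series.pipeline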

Create a pipeline

To create a new pipeline in Elyra, open a Pipeline Editor from the Launcher. There are three editors that you can choose from: a generic pipeline editor, an editor for Kubeflow Pipelines, and an editor for Apache Airflow.

Screen capture of opening the Pipeline Editor from the Launcher

With the generic editor, you can use notebooks and scripts to produce pipelines that are runnable in JupyterLab, Kubeflow Pipelines, or Apache Airflow. The Kubeflow Pipelines and Apache Airflow editors also support execution of notebooks and scripts, as well as runtime-specific components, which are custom pieces of code that implement tasks.

Adding Python notebooks and scripts to the pipeline

You can add Python notebooks and scripts to the pipeline by dragging them from the JupyterLab File Browser onto the canvas.

Screen capture of adding a node within the Launcher

Each notebook or script is represented by a node that includes input and output ports.

Magnified screen capture of a node icon

You can access node properties from the context menu. These properties define the execution environment (container image) in which the notebook or script runs during remote execution, resource constraints, inputs (file dependencies and environment variables), and output files. Note that resource settings only apply to pipelines that are executed on Kubeflow Pipelines or Apache Airflow.

Screen capture of example node properties within a context menu

Optionally, you can associate nodes with comments to describe their purpose.

Screen capture of an example comment associated with a node

Defining dependencies between notebooks and scripts

Dependencies between notebooks or scripts are defined by connecting output ports to input ports.

Screen capture of two connected nodes

Dependencies are used to determine the order in which the nodes will be executed during a pipeline run.

The following rules are applied:

  • Circular dependencies are not allowed.
  • If two nodes are not connected (directly or indirectly), they can be executed in parallel.
  • If two nodes are connected, the node producing the inputs for the other node is executed first.

There are some distinct differences between how pipelines are executed in JupyterLab and on a third-party workflow orchestration framework, such as Kubeflow Pipelines.

Running pipelines in JupyterLab

You can execute pipelines in JupyterLab as long as the environment provides access to the pipeline's prerequisites. For example, the kernels that the notebooks are associated with must already be installed, as must any required packages (unless the notebooks install them themselves).
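
For example, if your notebooks import packages that they do not install themselves, install those packages into the JupyterLab environment before running the pipeline (the package names below are only illustrative):

pip install pandas scikit-learn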

Screen capture of a Run pipeline dialog with a pipeline name and runtime config of Run in-place locally

Running pipelines in the JupyterLab environment should be possible if:

  • You are assembling a new pipeline and are testing it using relatively small data volumes.
  • The pipeline tasks don't require hardware resources in excess of what's available in the environment.
  • The pipeline tasks complete in an acceptable amount of time, given existing resource constraints.

In the JupyterLab environment:

  • Nodes are executed as sub-processes in the JupyterLab environment and are always processed sequentially.
  • Output files (such as processed data files or training artifacts) are stored in the local file system and can be accessed using the JupyterLab File Browser.
  • Processed notebooks are updated in place, meaning their output cells reflect the execution results.
  • Script output, such as messages sent to STDOUT or STDERR, is displayed in the JupyterLab console.

Screen capture of sample local execution output information

Elyra currently does not provide a pipeline-monitoring capability in the JupyterLab UI aside from a message after processing has completed. However, the relevant information is contained in the JupyterLab console output.

To learn more about how to create a pipeline and run it in JupyterLab, take a look at the Running generic pipelines in JupyterLab tutorial.
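
If you prefer working from a terminal, recent Elyra releases can also run a pipeline in-place outside the JupyterLab UI. A minimal sketch, assuming an illustrative pipeline file name:

# run all nodes locally, in dependency order, inside the current environment
elyra-pipeline run analyze-time-series.pipeline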

Running pipelines on Kubeflow Pipelines

While running pipelines locally might be feasible in some scenarios, it's rather impractical if large data volumes need to be processed or if compute tasks require special-purpose hardware like GPUs or TPUs to perform resource-intensive calculations.

You can configure Elyra to delegate pipeline execution to Kubeflow Pipelines by defining a runtime configuration, which contains connectivity information. When you run a pipeline, you can select which configuration to use, making it easy to leverage multiple environments, such as development, quality assurance, or production.

Screen capture of a Run pipeline dialog with a pipeline name and runtime config of Kubeflow Pipelines test environment
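
A runtime configuration can be created in the JupyterLab UI or, as a rough sketch, from a terminal with the elyra-metadata tool. The endpoint, credential, and bucket values below are placeholders, and the option names can differ between Elyra releases, so check the runtime configuration documentation for your version:

elyra-metadata install runtimes \
 --schema_name=kfp \
 --display_name="Kubeflow Pipelines test environment" \
 --api_endpoint=https://your-kubeflow-host/pipeline \
 --cos_endpoint=http://minio-service.kubeflow:9000 \
 --cos_username=minio \
 --cos_password=minio123 \
 --cos_bucket=test-bucket

Once a configuration exists, you can also submit a pipeline from the terminal, for example with elyra-pipeline submit analyze-time-series.pipeline --runtime-config kfp_test_env (again, the names are illustrative).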

The main difference between local pipeline execution and execution on Kubeflow Pipelines is that with Kubeflow Pipelines each node is processed in an isolated container on Kubernetes, allowing for better portability, scalability, and manageability.

The following chart illustrates this for two dependent notebook nodes.

Diagram of two separate Docker containers running a Jupyter kernel and sharing input and output artifacts via S3-compatible cloud storage

Data is shared between nodes using S3-compatible cloud storage. Before a notebook or script is executed, the declared input file dependencies are automatically downloaded from cloud storage into the container. After processing is complete, the declared output files are automatically uploaded from the container to cloud storage.

Elyra also supports mounting of data volumes, which are the preferred way to exchange large amounts of data.

You can monitor pipeline run progress by using the Central Dashboard, which is the administration interface for Kubeflow.

Screen capture of the Central Dashboard

You can find additional details, along with step-by-step instructions, in the Run generic pipelines on Kubeflow Pipelines tutorial.

Ways to try Elyra and pipelines

The referenced tutorials are a great way to get started with pipelines.

If you'd like to try out Elyra and start building your own pipelines, you have three options: running Elyra in a sandbox environment on the cloud, running one of the Elyra container images, or installing Elyra locally.

Note that Kubeflow Pipelines itself is not included in any of the Elyra installation options.

Running Elyra in a sandbox environment on the cloud

You can test drive Elyra on mybinder.org, without having to install anything. Try out the latest stable release or the latest development version (if you feel adventurous) in a sandbox environment.

Screen capture of the Binder launch buttons in the Elyra GitHub repository's README file

The sandbox environment contains a getting_started markdown document, which provides a short tour of the Elyra features:

Screen capture of the binder-demo files within the sandbox environment, including getting_started.md

A couple of things to note:

  • Performance can sometimes be sluggish since this is a shared environment.
  • The sandbox environment is not persistent and any changes you make will be lost when it is shut down.

If you have a Docker-compatible runtime installed on your machine, consider using one of the pre-built container images instead.

Running Elyra container images

The Elyra community publishes ready-to-use container images, which have JupyterLab and the Elyra extensions pre-installed:

  • elyra/elyra:latest includes the latest stable release.
  • elyra/elyra:x.y.z includes the x.y.z release.

After you decide which image to use (elyra/elyra:latest is always an excellent choice because you won't miss out on the latest features!), you can spin up a sandbox container as follows:

docker run -it -p 8888:8888 \
 -v ${HOME}/jupyter-notebooks/:/home/jovyan/work \
 -w /home/jovyan/work \
 elyra/elyra:latest jupyter lab

Open your web browser to the URL displayed in the console output, and you are ready to start:

To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://4d17829ecd4c:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
     or http://127.0.0.1:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82

Screen capture of the sandbox container within Launcher

The caveat: in sandbox mode, you cannot access existing files (such as notebooks) on your local machine, and all changes you make are discarded when you shut down the container.

Therefore, it's better to launch the container like so, replacing ${HOME}/jupyter-notebooks/ and ${HOME}/jupyter-data-dir with the names of existing local directories:

docker run -it -p 8888:8888 \
 -v ${HOME}/jupyter-notebooks/:/home/jovyan/work \
 -w /home/jovyan/work \
 -v ${HOME}/jupyter-data-dir:/home/jovyan/.local/share/jupyter \
 elyra/elyra:latest jupyter lab

This way all changes are preserved when you shut down the container, and you won't have to start from scratch when you bring it up again.

Installing Elyra locally

If your local environment meets the prerequisites, you can install JupyterLab and Elyra using pip, conda, or from source code, following the instructions in the installation guide.
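
As a minimal sketch, a pip-based installation might look like the following (the package extras and prerequisites depend on the Elyra release, so double-check the installation guide):

# install JupyterLab and Elyra, including optional dependencies
pip install --upgrade "elyra[all]"
# start JupyterLab with the Elyra extensions enabled
jupyter lab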

Closing thoughts

Elyra is a community-driven effort. We welcome contributions of any kind: bug reports, feature requests, and, of course, pull requests. You can reach us in the community chatroom, the discussion forum, or by joining our weekly community call.