Data scientists frequently use Jupyter Notebooks to do their work. Whether you are loading or processing data, analyzing data, using data to train a model, or performing other tasks of the data science workflow, notebooks are usually key.
Let's say you create a set of notebooks that load, cleanse, and analyze time-series data, which is made available periodically. Instead of running each notebook manually (or performing all tasks in a single notebook, which limits the reusability of task-specific code), you could create and run a reusable machine learning pipeline that executes the notebooks in the proper order.
JupyterLab extensions make it possible for anyone to customize the user experience. Extensions provide new functionality, like a CSV file editor or a visualization, and integrate services (like git for sharing and version control) or themes.
Elyra is a set of AI-centric extensions for JupyterLab that aim to simplify and streamline day-to-day activities. Its main feature is the Visual Pipeline Editor, which enables you to create workflows from Python notebooks or scripts and run them locally in JupyterLab, or remotely on Kubeflow Pipelines or Apache Airflow.
Assembling a pipeline
You will use the Visual Pipeline Editor to assemble pipelines in Elyra. The pipeline assembly process generally involves:
Creating a new pipeline
Adding Python notebooks or Python scripts and defining their runtime properties
Connecting the notebooks and scripts to define execution dependencies
Creating a pipeline
To create a new pipeline in Elyra, open a Pipeline Editor from the Launcher. There are three editors that you can choose from: a generic pipeline editor, an editor for Kubeflow Pipelines, and an editor for Apache Airflow.
With the generic editor, you can create pipelines from notebooks and scripts that are runnable in JupyterLab, Kubeflow Pipelines, or Apache Airflow. The Kubeflow Pipelines and Apache Airflow editors also support execution of notebooks and scripts, as well as runtime-specific components, which are custom pieces of code that implement tasks.
Adding Python notebooks and scripts to the pipeline
You can add Python notebooks and scripts to the pipeline by dragging them from the JupyterLab File Browser onto the canvas.
Each notebook or script is represented by a node that includes input and output ports.
You can access node properties from the context menu. These node properties define the execution environment (container image) in which the notebook or script is run during remote execution, resource constraints, inputs (file dependencies and environment variables), and output files. Note that resource settings apply only to pipelines that are executed on Kubeflow Pipelines or Apache Airflow.
Optionally, you can associate nodes with comments to describe their purpose.
Defining dependencies between notebooks and scripts
Dependencies between notebooks or scripts are defined by connecting output ports to input ports.
Dependencies are used to determine the order in which the nodes will be executed during a pipeline run.
The following rules are applied:
Circular dependencies are not allowed.
If two nodes are not connected (directly or indirectly), they can be executed in parallel.
If two nodes are connected, the node producing the inputs for the other node is executed first.
There are some distinct differences between how pipelines are executed in JupyterLab and on a third-party workflow orchestration framework, such as Kubeflow Pipelines.
Running pipelines in JupyterLab
You can execute pipelines in JupyterLab as long as the environment provides access to the pipeline's prerequisites. For example, the kernels that the notebooks are associated with must already be installed, as must any required packages (unless the notebooks install them themselves).
Running pipelines in the JupyterLab environment should be possible if:
You are assembling a new pipeline and are testing it using relatively small data volumes.
The pipeline tasks don't require hardware resources in excess of what's available in the environment.
The pipeline tasks complete in an acceptable amount of time, given existing resource constraints.
In the JupyterLab environment:
Nodes are executed as sub-processes in the JupyterLab environment and are always processed sequentially.
Output files (such as processed data files or training artifacts) are stored in the local file system and can be accessed using the JupyterLab File Browser.
Processed notebooks are updated in place, meaning their output cells reflect the execution results.
Script output, such as messages sent to STDOUT or STDERR, is displayed in the JupyterLab console.
Elyra currently does not provide a pipeline-monitoring capability in the JupyterLab UI aside from a message after processing has completed. However, the relevant information is contained in the JupyterLab console output.
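You can also trigger a local run from a terminal using the elyra-pipeline command-line interface that ships with Elyra. A minimal sketch, where analysis.pipeline is a placeholder for your own pipeline file:

# Run a generic pipeline locally; nodes execute sequentially as sub-processes
elyra-pipeline run analysis.pipeline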
Running pipelines on Kubeflow Pipelines
While running pipelines locally might be feasible in some scenarios, it's rather impractical if large data volumes need to be processed or if compute tasks require special-purpose hardware like GPUs or TPUs to perform resource-intensive calculations.
You can configure Elyra to delegate pipeline execution to Kubeflow Pipelines by defining a runtime configuration, which contains connectivity information. When you run a pipeline, you can select which configuration to use, making it easy to leverage multiple environments, such as development, quality assurance, or production.
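Runtime configurations can be created in the JupyterLab UI or with the elyra-metadata command-line tool. The following sketch is modeled on the Elyra documentation; all endpoint, credential, and bucket values are placeholders, and option names can vary between Elyra releases:

# Define a runtime configuration for a Kubeflow Pipelines deployment (placeholder values)
elyra-metadata install runtimes \
  --display_name="Kubeflow Pipelines (dev)" \
  --api_endpoint=https://your-kubeflow-host/pipeline \
  --cos_endpoint=http://your-object-storage:9000 \
  --cos_username=minio \
  --cos_password=minio123 \
  --cos_bucket=pipeline-artifacts \
  --schema_name=kfp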
The main difference between local pipeline execution and execution on Kubeflow Pipelines is that with Kubeflow Pipelines each node is processed in an isolated container on Kubernetes, allowing for better portability, scalability, and manageability.
Take, for example, a pipeline with two dependent notebook nodes.
Data is shared between nodes using S3-compatible cloud storage. Before a notebook or script is executed, the declared input file dependencies are automatically downloaded from cloud storage into the container. After processing is complete, the declared output files are automatically uploaded from the container to cloud storage.
Elyra also supports mounting of data volumes, which are the preferred way to exchange large amounts of data.
You can monitor pipeline run progress by using the Central Dashboard, which is the administration interface for Kubeflow.
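If you prefer the command line, you can also submit a pipeline for remote execution with the elyra-pipeline tool; in this sketch, kfp_dev is a hypothetical runtime configuration name:

# Submit the pipeline to the environment described by the kfp_dev runtime configuration
elyra-pipeline submit analysis.pipeline --runtime-config kfp_dev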
Note that Kubeflow Pipelines itself is not included in any of the Elyra installation options; you need access to an existing deployment.
Running Elyra in a sandbox environment on the cloud
You can test drive Elyra on mybinder.org, without having to install anything. Try out the latest stable release or the latest development version (if you feel adventurous) in a sandbox environment.
The sandbox environment contains a getting_started markdown document, which provides a short tour of the Elyra features.
A couple of things to note:
Performance can sometimes be sluggish since this is a shared environment.
The sandbox environment is not persistent and any changes you make will be lost when it is shut down.
If you have a Docker-compatible runtime installed on your machine, consider using one of the pre-built container images instead.
Running Elyra container images
The Elyra community publishes ready-to-use container images, which have JupyterLab and the Elyra extension pre-installed:
elyra/elyra:latest includes the latest stable release.
elyra/elyra:x.y.z includes the x.y.z release.
After you decide which image to use (elyra/elyra:latest is always an excellent choice because you won't miss out on the latest features!), you can spin up a sandbox container as follows:
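# Launch a disposable sandbox container (adapted from the Elyra documentation);
# JupyterLab is published on port 8888
docker run -it --rm -p 8888:8888 elyra/elyra:latest jupyter lab --debug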
Open your web browser to the displayed URL and you are ready to start.
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://4d17829ecd4c:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
or http://127.0.0.1:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
The caveat: in sandbox mode, you cannot access existing files (such as notebooks) on your local machine, and all changes you make are discarded when you shut down the container.
Therefore, it's better to launch the container like so, replacing ${HOME}/jupyter-notebooks/ and ${HOME}/jupyter-data-dir with the names of existing local directories:
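# Map a local notebook directory into the container and persist JupyterLab's state;
# adapted from the Elyra documentation (paths assume the image's default jovyan user)
docker run -it --rm -p 8888:8888 \
  -v ${HOME}/jupyter-notebooks/:/home/jovyan/work \
  -w /home/jovyan/work \
  -v ${HOME}/jupyter-data-dir:/home/jovyan/.local/share/jupyter \
  elyra/elyra:latest jupyter lab --debug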
This way all changes are preserved when you shut down the container, and you won't have to start from scratch when you bring it up again.
Installing Elyra locally
If your local environment meets the prerequisites, you can install JupyterLab and Elyra using pip, conda, or from source code, following the instructions in the installation guide.
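For example, a typical pip-based installation looks like this; treat it as a sketch and check the installation guide for the exact steps and prerequisites for your environment:

# Install JupyterLab and all Elyra extensions from PyPI ...
pip install --upgrade "elyra[all]"
# ... or from conda-forge
conda install -c conda-forge elyra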