Notebook Pipelines

Elyra uses its canvas component to let you assemble multiple notebooks into a workflow. Its visual editor for building notebook-based AI pipelines simplifies converting multiple notebooks into batch jobs or workflows. By leveraging cloud-based resources to run their experiments faster, data scientists, machine learning engineers, and AI developers become more productive and can spend more of their time applying their technical skills.

Notebook Pipeline Editor (image: ../_images/pipeline-editor1.png)

Each pipeline node, in this case representing a notebook, provides a menu that gives access to opening the notebook file directly in the Notebook Editor.

Notebook Pipeline Editor properties menu (image: ../_images/pipeline-editor-properties-menu.png)

The properties menu also enables users to set additional properties related to running the notebook, such as environment variables and file dependencies.

Notebook Pipeline Editor properties (image: ../_images/pipeline-editor-properties1.png)

Using the Elyra Pipeline Editor

Main Page (image: ../_images/elyra-main-page1.png)

  • In the Jupyter Lab Launcher, click the Pipeline Editor Icon to create a new pipeline.
  • On the left side of the screen, open the file browser; you should see a list of available notebooks.
  • Drag each notebook, representing a step in your pipeline, onto the canvas. Repeat until all notebooks needed for the pipeline are present.
  • Define the notebook execution order by connecting the nodes to form a graph.

Pipeline Editor (image: ../_images/pipeline-editor1.png)

  • Define the properties for each node / notebook in your pipeline (a hedged sketch of how these properties might be recorded in the pipeline file follows the screenshot below):

Parameter | Description | Example
Docker Image | The Docker image used to run your notebook. | TensorFlow 2.0
Output Files | A list of files generated by the notebook inside the image, to be passed as inputs to the next step of the pipeline. One file per line. | contributions.csv
Env Vars | A list of environment variables to be set in the container. One variable per line. | GITHUB_TOKEN = sometokentobeused
File Dependencies | A list of files to be passed from the LOCAL working environment into each respective step of the pipeline. Files should be in the same directory as the notebook they are associated with. One file per line. | dependent-script.py

Pipeline Node Properties (image: ../_images/pipeline-editor-properties1.png)
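
These properties are ultimately recorded on the node's entry in the .pipeline file. The following is a minimal, hypothetical sketch of how one node's properties might appear there; the key names and layout vary between Elyra versions, so treat it as illustrative rather than authoritative:

    # Hypothetical sketch of one node's properties as they might appear in
    # the .pipeline JSON; key names and layout vary by Elyra version.
    node_properties = {
        "filename": "load_data.ipynb",                    # notebook, relative to the workspace
        "runtime_image": "tensorflow/tensorflow:2.0.0",   # Docker image for the step
        "outputs": ["contributions.csv"],                 # files passed to downstream steps
        "env_vars": ["GITHUB_TOKEN=sometokentobeused"],   # set inside the container
        "dependencies": ["dependent-script.py"],          # files alongside the notebook
    }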

  • Click on the RUN Icon and give your pipeline a name.
  • Hit OK to start your pipeline.
  • Use the link provided in the response to access your experiment in Kubeflow. By default, Elyra creates the pipeline template for you and also starts an experiment and run.

Pipeline Flow (image: ../_images/pipeline-editor.gif)

Distributing Your Pipeline

Oftentimes you’ll want to share or distribute your pipeline (including its notebooks and their dependencies) with colleagues. This section covers some of the best practices for accomplishing that, but first, it’s good to understand the relationships between components of a pipeline.

Pipeline Component Relationships

The primary component of a pipeline is the pipeline file itself. This JSON file (with a .pipeline extension) contains all of the pipeline's relationships. Each notebook execution node specifies a notebook file (a JSON file with a .ipynb extension) whose path is currently relative to the working directory of the hosting notebook server (i.e., the notebook server workspace). Each dependency of a given node is relative to the notebook location itself, not the notebook server workspace. When a pipeline is submitted for processing or export, the pipeline file itself is not sent to the server; only a portion of its contents is sent.
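
To make the two frames of reference concrete, here is a minimal sketch in Python (the directory and file names are illustrative):

    from pathlib import Path

    # The notebook server workspace: the directory jupyter-lab was started in.
    workspace = Path("/home/user/work")

    # A node's notebook path is recorded relative to the workspace ...
    notebook = workspace / "my_analytics_pipeline/load_data.ipynb"

    # ... while that node's file dependencies resolve relative to the
    # notebook's own directory, not the workspace.
    dependency = notebook.parent / "dependent-script.py"

    print(notebook)    # /home/user/work/my_analytics_pipeline/load_data.ipynb
    print(dependency)  # /home/user/work/my_analytics_pipeline/dependent-script.py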

Distributing Pipelines - Best Practices

Prior to distributing your pipeline, and to preserve the component relationships, it is best to commit the pipeline file, notebooks, and dependencies (and their directories) to a GitHub repository. An alternative approach is to archive the files using tar or zip, again preserving the component relationships relative to their location within the notebook server workspace.
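
If you take the archive route, the key is to add files by their workspace-relative paths. A minimal sketch using Python's tarfile module (the file names are illustrative):

    import tarfile

    # Run from the notebook server workspace so each member is stored with
    # its workspace-relative path.
    files = ["my_analytics.pipeline", "load_data.ipynb", "dependent-script.py"]
    with tarfile.open("my_analytics_pipeline.tar.gz", "w:gz") as archive:
        for name in files:
            archive.add(name)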

When deploying a shared or distributed pipeline repository or archive, it is very important that the pipeline notebooks be extracted into the same directory hierarchy relative to the target notebook server’s workspace.
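
For example, extracting an archive directly into the target workspace reproduces the original hierarchy. A minimal sketch, assuming the archive was built as above:

    import tarfile

    # Extract into the target notebook server workspace so the members land
    # at the same workspace-relative paths they were archived with.
    with tarfile.open("my_analytics_pipeline.tar.gz") as archive:
        archive.extractall(path="/path/to/target/workspace")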

Scenario - Single Project in Workspace

If you committed your pipeline to a GitHub repository named my_analytics_pipeline directly from the notebook server's workspace (i.e., the project files are not within a sub-directory of the workspace), then consumers of this repository would need to do the following:

  1. Clone the repository: git clone https://github.com/<your-github-org>/my_analytics_pipeline.git
  2. Change directory to my_analytics_pipeline (this makes the repository root the workspace, preserving the notebook nodes' locations relative to it)
  3. Run jupyter-lab
  4. In a browser, open the pipeline file

Note that changing directory to my_analytics_pipeline can be avoided by passing the workspace as an option to jupyter-lab: jupyter-lab --notebook-dir=<path to parent of>/my_analytics_pipeline

Scenario - Multiple Projects in Workspace

The scenario above describes the behavior when you have a single pipeline “project” in your workspace. To accommodate multiple pipeline projects within a single workspace, build each project within its own sub-directory of the notebook server workspace and capture it such that the sub-directory is included in the captured results. This way, extracting it, again into the same location relative to the notebook server’s workspace, preserves the relative paths to the notebook files.
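
A minimal sketch of capturing such a project, assuming a hypothetical sub-directory named project_a in the workspace:

    import tarfile

    # Run from the workspace root. Adding the sub-directory by name keeps the
    # "project_a/" prefix on every member, so extracting the archive into
    # another workspace reproduces the same relative layout.
    with tarfile.open("project_a.tar.gz", "w:gz") as archive:
        archive.add("project_a")  # recursively adds project_a/*.pipeline, *.ipynb, ...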

Confirming Notebook Locations

An easy way to confirm that your extracted pipeline’s notebook references are correctly located is to double-click the notebook node in your pipeline and ensure the notebook appears in an adjacent tab. If nothing happens or a 404 error shows on the server’s console, the notebook is not properly positioned relative to the server’s workspace, and attempts to run or export the pipeline will fail.

At this time, if it is not possible to fix the relative locations by redeploying the pipeline, the only recourse is to edit each of the filename: fields in the .pipeline file so that the notebook locations resolve, then refresh the browser.
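
If many nodes need the same correction, the edit can be scripted. The sketch below assumes each node records its notebook path in an app_data.filename entry; the exact layout varies between Elyra versions, so inspect your .pipeline file before relying on it:

    import json

    # Load the pipeline definition (the file name is illustrative).
    with open("my_analytics.pipeline") as f:
        pipeline = json.load(f)

    # Assumed layout: pipelines -> nodes -> app_data -> filename.
    # Verify this against your own .pipeline file first.
    for p in pipeline.get("pipelines", []):
        for node in p.get("nodes", []):
            app_data = node.get("app_data", {})
            if "filename" in app_data:
                # Prepend the directory the notebooks actually live in.
                app_data["filename"] = "my_analytics_pipeline/" + app_data["filename"]

    with open("my_analytics.pipeline", "w") as f:
        json.dump(pipeline, f, indent=2)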