Best practices for file-based pipeline nodes
Generic pipelines and typed pipelines natively support file-based nodes for Jupyter notebooks, Python scripts, and R scripts. To support heterogeneous execution, that is, to make these nodes runnable in any runtime environment (JupyterLab, Kubeflow Pipelines, and Apache Airflow), follow the guidelines listed below.
On Kubeflow Pipelines and Apache Airflow, notebooks and scripts are executed in containers. Elyra provides example runtime images to get you started, but you should consider using purpose-built images instead. If possible, pre-install all software prerequisites in the runtime image you are using instead of installing them on the fly (e.g. by running pip install my-package==1.2.3 in a notebook cell).
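One way to confirm that a runtime image already contains the expected prerequisites is to check for them at the start of a notebook or script and fail fast if something is missing. The following is a minimal sketch, not part of Elyra; the helper name `missing_packages` is hypothetical and `my-package` mirrors the placeholder used above.

```python
from importlib import metadata


def missing_packages(requirements):
    """Return a dict describing requirements that the environment does not meet.

    `requirements` maps a distribution name to an exact version string,
    or to None if any installed version is acceptable.
    """
    problems = {}
    for name, wanted in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if wanted is not None and installed != wanted:
            problems[name] = f"found {installed}, expected {wanted}"
    return problems


# Hypothetical requirement list for illustration only.
problems = missing_packages({"my-package": "1.2.3"})
if problems:
    print("Runtime image is missing prerequisites:", problems)
```

Reporting the gap up front makes it clear that the image, not the notebook, needs to be updated.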
In runtime environments where containers are used to run notebooks/scripts (such as Kubeflow Pipelines and Apache Airflow), special consideration must be given to file input and output operations.
If a notebook/script requires access to files that are stored on your local system, those files must be declared as File dependencies. Elyra collects the declared files, uploads them to cloud storage, and makes them available to the notebook/script at runtime.
File dependencies must be located in the notebook/script file directory or a subdirectory of that location. Symlinks can be used to avoid the need to maintain multiple copies if files are shared among pipelines.
Valid directory layout examples:
./my-pipeline.pipeline
./my-notebook.ipynb
./a-notebook-symlink.ipynb # can reference a notebook in any local directory
./a-dependency-file
./a-symlink # can reference any file or directory
./a-subdir/a-dependency-file
./a-subdir/a-symlink # can reference any file or directory
./a-subdir/a-script-symlink.py # can reference a script in any local directory
Invalid directory layout examples:
./my-pipeline.pipeline
./my-script.py
../my-other-notebook.ipynb # must be symlinked
../a-dependency-in-parent-dir # must be symlinked
/some/dir/some-dependency # must be symlinked
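The location rule illustrated by these layouts can be approximated with a small check. This is a sketch under stated assumptions, not Elyra's own validation logic: the function name `is_valid_dependency` is hypothetical, paths are given relative to the notebook/script directory, and the comparison is purely lexical, so a symlink inside the directory passes regardless of its target.

```python
from pathlib import PurePosixPath


def is_valid_dependency(relative_path: str) -> bool:
    """Return True if the path stays within the notebook/script directory.

    The check is lexical only: it does not touch the file system and
    therefore does not follow symlinks.
    """
    path = PurePosixPath(relative_path)
    if path.is_absolute():
        return False
    depth = 0
    for part in path.parts:
        if part == "..":
            depth -= 1
            # Stepping above the starting directory is not allowed.
            if depth < 0:
                return False
        else:
            depth += 1
    return True


for candidate in ["./a-subdir/a-symlink", "../a-dependency-in-parent-dir"]:
    print(candidate, is_valid_dependency(candidate))
```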
All changes to the file system (e.g. new or modified files) are discarded after processing of the notebook/script has completed. To retain such files, you must either store them on cloud storage or declare them as output files in the notebook/script node properties.
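As an illustration of the second option, a script can write its results to a file that the node declares as an output file. The sketch below assumes a hypothetical output file named `results/summary.json`; the helper `write_summary` is illustrative and not part of Elyra.

```python
import json
from pathlib import Path


def write_summary(summary: dict, output_path: str = "results/summary.json") -> Path:
    """Write `summary` as JSON to `output_path`, creating parent directories.

    The default path is hypothetical; it is assumed to match an output
    file declared in the node properties.
    """
    target = Path(output_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(summary, indent=2))
    return target
```

A file written this way but not declared as an output file (and not copied to cloud storage) would be discarded once the node finishes.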
You can customize notebooks/scripts by setting environment variables in the pipeline node. Environment variable values are not shared across nodes belonging to the same pipeline.
The Visual Pipeline Editor can detect which environment variables a notebook/script accesses and automatically add those variables to the runtime properties if the following approaches are used to read variable values:
- R script
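For a Python script, reading node-level environment variables might look like the sketch below. The variable names `DATASET_URL` and `BATCH_SIZE` and their defaults are hypothetical, and the sketch makes no claim about which exact access patterns the editor detects; it only uses the standard library's `os.getenv`.

```python
import os


def load_config() -> dict:
    """Read settings from environment variables, falling back to defaults.

    Environment variable values are always strings, so numeric settings
    are converted explicitly.
    """
    return {
        "dataset_url": os.getenv("DATASET_URL", "data/input.csv"),
        "batch_size": int(os.getenv("BATCH_SIZE", "32")),
    }
```

Because values are not shared across nodes, each node that calls `load_config()` needs the variables set in its own properties.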
Refer to the next section for a list of proprietary environment variables that cannot be modified using the node properties settings.
Proprietary environment variables
Elyra makes a set of proprietary environment variables available to notebooks and scripts during execution. Unless indicated otherwise, these variables are defined in all runtime environments.
ELYRA_RUN_NAME is an identifier that is unique for each pipeline run but the same for all nodes in the pipeline. You can use this identifier to generate predictable file names.
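A minimal sketch of building such a file name in Python follows. The helper `run_scoped_name` and the `local-run` fallback are assumptions added for illustration, for example to cover running the script outside of a pipeline.

```python
import os


def run_scoped_name(base_name: str, extension: str) -> str:
    """Return a file name that embeds the current pipeline run identifier.

    The fallback value "local-run" is an assumption, used only when
    ELYRA_RUN_NAME is not defined.
    """
    run_name = os.getenv("ELYRA_RUN_NAME", "local-run")
    return f"{base_name}-{run_name}.{extension}"
```

Since the identifier is the same for every node in a run, a downstream node can call `run_scoped_name` with the same arguments to reconstruct the name an upstream node used.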
ELYRA_RUNTIME_ENV identifies the runtime environment that the notebook or script is executed in:
- kfp: Kubeflow Pipelines
- airflow: Apache Airflow
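A script can branch on this variable to adapt its behavior to the runtime. The sketch below handles only the two values listed above and treats anything else as unrecognized; the function name `describe_runtime` is hypothetical.

```python
import os

# Values taken from the list above; other values are treated as unrecognized.
RUNTIME_LABELS = {
    "kfp": "Kubeflow Pipelines",
    "airflow": "Apache Airflow",
}


def describe_runtime() -> str:
    """Return a human-readable label for the current runtime environment."""
    value = os.getenv("ELYRA_RUNTIME_ENV")
    if value is None:
        return "ELYRA_RUNTIME_ENV is not set"
    return RUNTIME_LABELS.get(value, f"unrecognized runtime: {value}")
```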