Configuring Apache Airflow on Kubernetes for use with Elyra

Pipelines in Elyra can be run locally in JupyterLab, or remotely on Kubeflow Pipelines or Apache Airflow to take advantage of shared resources that speed up processing of compute-intensive tasks.

Note: Support for Apache Airflow is experimental.

This document outlines how to set up a new Elyra-enabled Apache Airflow environment or add Elyra support to an existing deployment.

This guide assumes a general working knowledge of Kubernetes cluster administration.

Prerequisites

  • A private repository on github.com or GitHub Enterprise that is used to store DAGs.
  • S3-based cloud object storage, e.g. IBM Cloud Object Storage, Amazon S3, or MinIO

AND

  • A Kubernetes Cluster without Apache Airflow installed
    • Ensure Kubernetes is at least v1.18. Earlier versions might work but have not been tested.
    • Helm v3.0 or later
    • Use the Helm chart available in the Airflow source distribution with the Elyra sample configuration.

OR

  • An existing Apache Airflow cluster
    • Ensure Apache Airflow is at least v1.10.8 and below v2.0.0. Other versions might work but have not been tested.
    • Apache Airflow is configured to use the Kubernetes Executor.
    • Ensure the KubernetesPodOperator is installed and available in the Apache Airflow deployment.

Setting up a DAG repository on GitHub

In order to use Apache Airflow with Elyra, it must be configured to use a GitHub repository to store DAGs.

  • Create a private repository on github.com or GitHub Enterprise. (Elyra produces DAGs that contain credentials, which are not encrypted. Therefore you should not use a public repository.)
  • Generate a personal access token with push access to the repository. This token is used by Elyra to upload DAGs.
  • Generate an SSH key with read access to the repository. Apache Airflow uses a git-sync container to keep its collection of DAGs in sync with the contents of the GitHub repository; the SSH key is used to authenticate (see the example commands after this list).
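
For example, the following sketch generates the key pair and the known_hosts file that git-sync needs; the file names, key type, and github.com host name are assumptions to adjust for your environment (e.g. for GitHub Enterprise):

    # Generate a passphrase-less key pair so the git-sync container can authenticate unattended
    ssh-keygen -t rsa -b 4096 -C "airflow-git-sync" -f ~/.ssh/id_rsa -N ""
    # Record GitHub's host key so git-sync can verify the server it connects to
    ssh-keyscan github.com > ~/.ssh/known_hosts
    # Then register ~/.ssh/id_rsa.pub as a read-only deploy key in the repository settings on GitHub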

Take note of the following information:

  • GitHub API endpoint (e.g. https://api.github.com if the repository is hosted on github.com)
  • Repository name (e.g. your-git-org/your-dag-repo)
  • Repository branch name (e.g. main)
  • Personal access token (e.g. 4d79206e616d6520697320426f6e642e204a616d657320426f6e64)

You need to provide this information in addition to your cloud object storage credentials when you create a runtime configuration in Elyra for the Apache Airflow deployment.

Example Apache Airflow runtime configuration
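
A runtime configuration can be created in the JupyterLab UI or with Elyra's metadata CLI. The following sketch uses the CLI; the flag names follow the properties of Elyra's Apache Airflow runtime schema and may vary between Elyra releases, and all values shown are placeholders for your environment:

    elyra-metadata install runtimes \
      --schema_name=airflow \
      --name=my_airflow \
      --display_name="My Apache Airflow deployment" \
      --api_endpoint=https://your-airflow-webserver:8080 \
      --github_api_endpoint=https://api.github.com \
      --github_repo=your-git-org/your-dag-repo \
      --github_branch=main \
      --github_repo_token=your-personal-access-token \
      --cos_endpoint=https://your-object-storage:9000 \
      --cos_username=your-access-key \
      --cos_password=your-secret-key \
      --cos_bucket=your-bucket-name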

Deploying Airflow on a new Kubernetes cluster

To deploy Apache Airflow on a new Kubernetes cluster:

  1. Create a Kubernetes secret containing the SSH key that you created earlier. The example below creates a secret named airflow-secret from three files. Replace the secret name, file names and locations as appropriate for your environment.

    kubectl create secret generic airflow-secret \
      --from-file=id_rsa=.ssh/id_rsa \
      --from-file=known_hosts=.ssh/known_hosts \
      --from-file=id_rsa.pub=.ssh/id_rsa.pub \
      -n airflow
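
    Note that the namespace (airflow in this example) must exist before the secret can be created in it. The commands below are a sketch using the names from this step:

    # Create the target namespace first (skip this if it already exists)
    kubectl create namespace airflow
    # Confirm that the secret contains the three expected files
    kubectl describe secret airflow-secret -n airflow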
    
  2. Download, review, and customize the sample Helm configuration (or customize an existing configuration):

    • Set git.url to the URL of the private repository you created earlier, e.g. ssh://git@github.com/your-git-org/your-dag-repo
    • Set git.ref to the DAG branch, e.g. main.
    • Set git.secret to the name of the secret you created, e.g. airflow-secret.
    • Adjust the git.gitSync.refreshTime as desired.

    Example excerpt from a customized configuration:

    ## configs for the DAG git repository & sync container
    ##
    git:
      ## url of the git repository
      ##
      ## EXAMPLE: (HTTP)
      ##   url: "https://github.com/torvalds/linux.git"
      ##
      ## EXAMPLE: (SSH)
      ##   url: "ssh://git@github.com/torvalds/linux.git"
      ##
      url: "ssh://git@github.com/your-git-org/your-dag-repo"
    
      ## the branch/tag/sha1 which we clone
      ##
      ref: "main"
    
      ## the name of a pre-created secret containing files for ~/.ssh/
      ##
      ## NOTE:
      ## - this is ONLY RELEVANT for SSH git repos
      ## - the secret commonly includes files: id_rsa, id_rsa.pub, known_hosts
      ## - known_hosts is NOT NEEDED if `git.sshKeyscan` is true
      ##
      secret: "airflow-secret"
      ...
      gitSync:
        ...
        refreshTime: 10
    
    airflow:
      ## configs for the docker image of the web/scheduler/worker
      ##
      image:
        repository: elyra/airflow
    

    The container image is created using this Dockerfile and published on Docker Hub and quay.io.

  3. Install Apache Airflow using the customized configuration.

    helm install "airflow" stable/airflow --values path/to/your_customized_helm_values.yaml
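
    If the stable chart repository is not yet known to your Helm installation, add it first; the repository URL below is the archived location of the stable charts, and the airflow namespace is an assumption matching the secret created in step 1:

    # Register the (archived) stable chart repository and refresh the index
    helm repo add stable https://charts.helm.sh/stable
    helm repo update
    # After installing, watch the Airflow pods come up
    kubectl get pods -n airflow --watch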
    

Once Apache Airflow is deployed you are ready to create and run pipelines, as described in the tutorial.

Enabling Elyra pipelines in an existing Apache Airflow deployment

To enable running of notebook pipelines on an existing Apache Airflow deployment:

  • Verify that the deployment meets the prerequisites listed above: Apache Airflow v1.10.8 or later (below v2.0.0), configured to use the Kubernetes Executor, with the KubernetesPodOperator installed.
  • Set up a private DAG repository, personal access token, and SSH key as described in Setting up a DAG repository on GitHub.
  • Configure the deployment's git-sync container to pull DAGs from that repository, using the git.url, git.ref, and git.secret settings shown in the example configuration above.
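
One way to verify the KubernetesPodOperator prerequisite is to try importing the operator in the deployment's Python environment. A minimal sketch, assuming Apache Airflow 1.10.x import paths:

    # Exits without error if the operator is available in this Airflow 1.10.x environment
    python -c "from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator"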

Once Apache Airflow is deployed you are ready to create and run pipelines, as described in the tutorial.