Metadata Services

While building enterprise extensions for the Elyra project we identified a requirement to integrate with external runtimes, data sources, and other components that require additional metadata in order to connect to these external components.

The Metadata Service provides a generic service that can be used to store metadata that can be easily integrated with Elyra backend and/or frontend components.

Metadata Service Overview

The core design of the metadata service is that it be primarily driven by schema. That is, all metadata instances must adhere to a schema. Metadata schema are organized, both logically and physically, into schemaspaces where a given schemaspace contains a set of schemas and each instance of a schemaspace’s schemas are co-located relative to the configured storage mechanism.

Externally, both in its persisted and serialized forms, a metadata instance is represented as a JSON structure. Within the metadata service, however, this JSON structure is represented as a Metadata class instance. The Metadata class can be extended via subclassing to enable custom behaviors that are triggered following load operations and/or before and after create, update and delete operations on a per schema level. That is, it is the respective schema that identifies the name of the class to use as the internal representation.

As previously mentioned, the metadata service has a pluggable storage mechanism. By default, metadata instances are stored within the file system. However, alternative storage mechanisms can be configured for, say, NoSQL DB or traditional, table-oriented, databases.

Schemaspaces and their schemas can be dynamically introduced via entrypoints. When the system starts, it will identify applicable entrypoints and load their corresponding objects to seed the list of configured schemaspaces and schemas.

Metadata Components

The Elyra metadata service is composed of several components, many of which were previously referenced, and each responsible for a different task.
MetadataServiceRelationships

The following briefly describes each of those components.

  • Schemaspace: This is the top-most component and represents a container of schemas. It has id (UUID), name (symbolic name), display_name (descriptive name), and description properties, along with a set of schemas. Instances of a schemaspace’s schemas are logically grouped together in whatever storage model is in use. Schemaspaces are discovered and loaded via entrypoints.
  • SchemasProvider: A schemas provider is a class that is also discovered and loaded via entrypoints. Its purpose is to provide schemas. Since a schemaspace is always referenced within the schema, a given SchemasProvider instance could provide schemas corresponding to different schemaspaces although a schemas provider will typically correspond to a single schemaspace.
  • Metadata: This class is used internally to house the JSON instance data. It can be subclassed to enable applications to hook persistence and retrieval operations.
  • SchemaManager: This class is a singleton instance that loads and houses all schemas organized by their respective Schemaspace. All schemas provided by a SchemasProvider must adhere to the meta-schema and the SchemaManager is responsible for validating provided schemas.
  • MetadataManager: This class operates on Metadata instances. Prior to each persistence operation, and following each load, the MetadataManager is responsible for validating the instance against its corresponding schema, which it obtains from the SchemaManager. It is also responsible for driving the persistence and retrieval of instances against the configured MetadataStore.
  • MetadataStore: This class is an abstract base class that defines methods relative to the persistence and retrieval of metadata instances. It can be subclassed such that each subclass typically represents an alternative storage mechanism. The system provides a FileMetadataStore instance - which is configured as the default storage mechanism.
  • REST/CLI API: The metadata service entities can be manipulated via a REST API consisting of /elyra/metadata, /elyra/schema and /elyra/schemaspace endpoints. Similar capabilities can also be performed using the command-line tool (CLI) elyra-metadata.

Metadata Services structure using the default file system store

The default implementation is for the metadata services to store metadata files in the file system, grouped by directories according to the metadata instance’s schemaspace.

The root directory for metadata is relative to the ‘Jupyter Data directory’ (e.g. jupyter --data-dir)

/Users/jovyan/Library/Jupyter/metadata/

Each instance of metadata is then stored in a child directory representing the instance’s schemaspace.

As an example, runtimes is the schemaspace for runtime metadata instances that reside in the following directory:

/Users/jovyan/Library/Jupyter/metadata/runtimes

The contents of this folder would then include multiple JSON files, each associated with a schema corresponding to the desired runtime platform.

For example, the following contains runtime metadata instances for two runtime platforms, airflow and kfp, where each runtime has 1 and 2 runtime instances defined, each of which adheres to their respective schemas.

/Users/jovyan/Library/Jupyter/metadata/runtimes/airflow-cloud.json
/Users/jovyan/Library/Jupyter/metadata/runtimes/kfp-fyre.json
/Users/jovyan/Library/Jupyter/metadata/runtimes/kfp-qa.json

And each metadata file contains information similar to the following (note the reference to its schema_name):

{
  "display_name": "Kubeflow Pipeline - Fyre",
  "schema_name": "kfp",
  "metadata": {
    "api_endpoint": "http://weakish1.fyre.ibm.com:32488/pipeline",
    "api_username": "username@email.com",
    "api_password": "mypassword",
    "cos_endpoint": "http://weakish1.fyre.ibm.com:30427",
    "cos_username": "minio",
    "cos_password": "minio123",
    "cos_bucket": "test-bucket",
    "tags": [
      "kfp", "v1.1"
    ]
  }
}

Metadata Service REST API

A REST API is available for easy integration with frontend components:

Retrieve all metadata for a given schemaspace:

GET /elyra/metadata/<schemaspace>

Retrieve a given metadata resource from a given schemaspace:

GET /elyra/metadata/<schemaspace>/<resource>

Metadata Client Tool

Users can easily manipulate metadata via the Client Tool

elyra-metadata list runtimes
Available metadata instances for runtimes (includes invalid):

Schema   Instance       Resource  
------   --------       -------- 
kfp      kfp-fyre       /Users/jovyan/Library/Jupyter/metadata/runtimes/kfp-fyre.json
kfp      kfp-qa         /Users/jovyan/Library/Jupyter/metadata/runtimes/kfp-qa.json
airflow  airflow-cloud  /Users/jovyan/Library/Jupyter/metadata/runtimes/airflow-cloud.json

Metadata APIs

A Python API is also available for accessing and manipulating metadata. This is accomplished using the MetadataManager along with a corresponding storage class. The default storage class is FileMetadataStore.

from elyra.metadata import MetadataManager, FileMetadataStore

metadata_manager = MetadataManager(schemaspace="runtimes",
                                   store=FileMetadataStore(schemaspace='runtimes'))

runtime_configuration = metadata_manager.get('kfp')

if not runtime_configuration:
    raise RuntimeError("Runtime metadata not available.")

api_endpoint = runtime_configuration.metadata['api_endpoint']
api_username = runtime_configuration.metadata['api_username']
api_password = runtime_configuration.metadata['api_password']
cos_endpoint = runtime_configuration.metadata['cos_endpoint']
cos_username = runtime_configuration.metadata['cos_username']
cos_password = runtime_configuration.metadata['cos_password']
bucket_name = runtime_configuration.metadata['cos_bucket']
tags = runtime_configuration.metadata['tags']

Bring Your Own Schemas

The metadata service enables the ability to introduce your own schemaspaces and bring custom schemas to existing schemaspaces via entrypoints. It is necessary to introduce Schemaspaces when developing a new extension or adding functionality in which it makes sense to group instances of the functionality together. For example, Code Snippets consist of their own Schemaspace and, currently, are based on a single schema. Applications extending existing functionality do not need to introduce a new Schemaspace, since one will already exist, but must be able to introduce their custom schema. Introducing your own runtime would be an example where a Schemaspace is not required. In this case, only a SchemasProvider would be necessary.

Schemaspace

A Schemaspace instance is used to represent a set of schemas. Custom schemaspace classes must derive from the Schemaspace base class and are responsible for providing basic information like id (UUID), name (used as a symbolic name for lookups), display_name (a human-readible, descriptive name) and description. They also contain a dictionary of JSON schemas, indexed by schema name. However, the dictionary is empty at construction and filled during system initialization via the SchemasProvider mechanism. By convention, schemaspace names are plural, while schema names are singular.

Registration

Schemaspaces are discovered via the entrypoints functionality. When the SchemaManager starts, it fetches all entrypoint instances named by the group 'metadata.schemaspaces', and uses the entrypoint data to load each found instance. Entrypoints are registered in the package’s setup configuration.

Here is an example of the schemaspace entrypoints defined in Elyra:

entry_points={
  'metadata.schemaspaces': [
    'runtimes = elyra.metadata.schemaspaces:Runtimes',
    'runtimes-images = elyra.metadata.schemaspaces:RuntimeImages',
    'code-snippets = elyra.metadata.schemaspaces:CodeSnippets',
    'component-registries = elyra.metadata.schemaspaces:ComponentRegistries'
  ]
}

As you can see, each entry has a name that equates to a module and object/class.

SchemasProvider

As mentioned above, Schemaspace instances do not know their set of schemas at the time of their construction. Instead, the SchemaManager loads a different entrypoints group following the loading of schemaspaces named 'metadata.schemas_providers'. These entrypoints identify a SchemasProvider object and this object is responsible for returning schemas when called upon by the SchemaManager. SchemasProvider instances are concrete instances of the abstract base class SchemasProvider, which requires an implementation of the get_schemas() method:

class SchemasProvider(ABC):
    """Abstract base class used to obtain schema definitions from registered schema providers."""

    @abstractmethod
    def get_schemas(self) -> List[Dict]:
        """Returns a list of schemas"""
        pass

When called, the SchemasProvider instance is responsible for producing a list of schemas. These schemas must conform to the metadata service’s meta-schema and, as a result, will include the schemaspace name and schemaspace id to which they correspond, in addition to a name property representing that schema’s name. The SchemaManager uses this meta-information to confirm the schemaspace has been loaded and register that schema with the schemaspace following its successful validation against the meta-schema.

Some SchemasProvider instances may have schema information which is dynamic, based on the active configuration of the system. In such cases, they are responsible for returning information pertinent to the current configuration. For example, the component-registry schema (of the component-registries schemaspace) includes an enumerated property named runtime. However, the values/choices included in the enum are a function of what packages are configured. As a result, the get_schemas() method for the component-registries SchemasProvider determines, based on installed packages, which entries to add to the runtime enumeration.

Registration

Like Schemaspaces, SchemasProvider implementations are discovered via the entrypoints functionality. When the SchemaManager starts, and after it loads all Schemaspaces, it fetches all entrypoint instances named by the group 'metadata.schemas_providers', and uses the entrypoint data to load each found instance.

Here is an example of the schemas_providers entrypoints defined in Elyra:

entry_points={
  'metadata.schemas_providers': [
    'runtimes = elyra.metadata.schemasproviders:RuntimesSchemas',
    'runtimes-images = elyra.metadata.schemasproviders:RuntimeImagesSchemas',
    'code-snippets = elyra.metadata.schemasproviders:CodeSnippetsSchemas',
    'component-registries = elyra.metadata.schemasproviders:ComponentRegistriesSchemas'  ]
}

Implementing Custom Metadata Classes

Applications can also influence behavior by introducing their own sublcass of Metadata to wrap instance data. When this is done, those implementations will be called prior to and after any create, update and delete operations, in addition to following any load (get) operations. As a result, applications can influence what is persisted and what is retrieved.

It should be noted that any exceptions raised from the method overrides will affect the overall operation. This may be warranted in some cases in which the application finds some invariant condition violations, perhaps those that can’t be expressed in the schema, in which the operation should be terminated. If the exceptions are raised in either post_save() or post_delete(), the MetadataManager will also apply a compensating transaction corresponding to the current operation. For example, if an instance is being created and the post_save() method raises an exception, the MetadataManager will apply a compensating transaction that deletes the newly-created instance.

Registration

Unlike Schemaspaces and SchemasProvider registrations, metadata class overrides are registered within their respective schema file via the metadata_class_name property. The value of this property denotes the dotted notation corresponding to the location of the implementation, the hosting package of which must be installed and locatable. (Note that someday, we may choose to change this value to be an entrypoint name of a well-documented entrypoint group. In such cases, the value would not be “dotted” and, therefore, backwards compatible with the current approach.)

Here’s an example of what the component-registry schema provides:

"metadata_class_name": "elyra.pipeline.component_metadata.ComponentRegistryMetadata"

In this case, the ComponentRegistry module overrides the post_save() and post_delete() methods to update its component’s cache.