Integration: Databricks
The Databricks integration enables you to ingest dataset metadata from your Databricks Unity Catalog into the Backstage Software Catalog. Databricks' legacy Hive metastore is not supported.
Configuration & Authentication
- Create a Service Principal in Databricks. This can be at either the account or workspace level.
- This service principal must be granted the USE CATALOG, USE SCHEMA, and SELECT privileges on each catalog, schema, and table (respectively) you wish to ingest datasets from. See the Databricks API documentation for more details.
- For tags support: The service principal must also have access to query the system.information_schema.table_tags system table to retrieve tag information. This requires the USE SCHEMA privilege on the system.information_schema schema.
- SQL Warehouse requirement: To retrieve table tags, you must configure a SQL Warehouse ID in your Databricks source configuration. The service principal must have CAN_USE permission on this warehouse.
- Generate OAuth credentials for the service principal, and take note of the client ID and secret. Be mindful of the lifetime set for these credentials and remember to rotate the secrets before they expire to prevent ingestion failures.
- Back in Portal, visit the Databricks configuration page in Config Manager.
- For each workspace you'd like to ingest datasets from, add a new source with the workspace URL, service principal credentials, and optionally a warehouse ID for tags support. The integration will discover all catalogs, schemas, and tables which the service principal has access to.
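The privilege requirements above can be sketched as a small helper that prints the GRANT statements an administrator might run in Databricks SQL. This is an illustrative sketch only: the function name, the principal identifier, and the catalog, schema, and table names are hypothetical, and the statement syntax should be checked against the Databricks documentation for your workspace. The CAN_USE permission on the SQL Warehouse is managed separately and is not covered here.

```python
def build_grant_statements(principal, catalog, schema, tables, include_tags=False):
    """Build GRANT statements for one schema (hypothetical helper, not part of the integration)."""
    # The principal is wrapped in backticks because identifiers such as
    # application IDs usually contain characters that need quoting.
    quoted = f"`{principal}`"
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted};",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {quoted};",
    ]
    # SELECT is granted per table, matching the "respectively" wording above.
    for table in tables:
        statements.append(
            f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {quoted};"
        )
    # Tag ingestion additionally needs USE SCHEMA on system.information_schema.
    if include_tags:
        statements.append(
            f"GRANT USE SCHEMA ON SCHEMA system.information_schema TO {quoted};"
        )
    return statements


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration.
    for stmt in build_grant_statements(
        "my-sp-application-id", "main", "sales", ["orders", "customers"], include_tags=True
    ):
        print(stmt)
```

Generating the statements from a list keeps the grants reviewable before anyone runs them, which helps when the same service principal needs access to many schemas.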
Naming
The naming structure for Datasets created from Databricks is as follows: [metastore_id].[catalog].[schema].[table_name]. The metastore ID is used to ensure the uniqueness of names in the Software Catalog, but typically you will not need to be aware of it. Learn more about the metastore ID in the Databricks documentation.
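As a rough illustration of the naming structure, the sketch below joins the four components with dots. The function name and the sample metastore ID are hypothetical, and the integration may apply additional normalization (for example to case or special characters) that is not described here.

```python
def dataset_name(metastore_id, catalog, schema, table_name):
    """Compose a name in the form metastore_id.catalog.schema.table_name."""
    parts = [metastore_id, catalog, schema, table_name]
    # Every component is required; a missing one would produce an ambiguous name.
    if any(not part for part in parts):
        raise ValueError("metastore_id, catalog, schema and table_name are all required")
    return ".".join(parts)


if __name__ == "__main__":
    # "abc123" stands in for a real metastore ID.
    print(dataset_name("abc123", "main", "sales", "orders"))
```

Because the metastore ID is the leading component, two workspaces attached to different metastores can each hold a main.sales.orders table without their names colliding.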
Ingestible Assets
The Databricks integration currently ingests the following asset types:
- Tables
- Views
- Materialized Views
Tags and Labels
The Databricks integration supports ingesting table tags from the Unity Catalog system.information_schema.table_tags system table. Tags are converted to labels in the Backstage Software Catalog.
Requirements for Tags Support
To enable tag ingestion, you must:
- Configure a SQL Warehouse: Add a warehouseId to your Databricks source configuration in Config Manager. This warehouse is used to execute SQL queries against the system tables.
- Grant warehouse permissions: The service principal must have CAN_USE permission on the specified SQL Warehouse.
- Grant system schema access: The service principal must have USE SCHEMA privilege on the system.information_schema schema to query the table_tags system table.
How Tags Work
- Table tags defined in Databricks Unity Catalog are retrieved using SQL queries against system.information_schema.table_tags
- Each tag consists of a tag_name and optional tag_value
- Tags are converted to labels in Backstage with the tag name as the key and tag value as the value
- If a tag has no value, the label value will be an empty string
If warehouseId is not configured, the integration will still work but will skip tag ingestion and log a warning message.
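The tag-to-label behaviour described in this section can be sketched as follows. This is not the integration's actual code: the function names, the logger name, and the row shape (dictionaries with tag_name and tag_value keys) are assumptions made for illustration, and the warning text is invented.

```python
import logging

logger = logging.getLogger("databricks_tags_sketch")  # hypothetical logger name


def tags_to_labels(rows):
    """Convert tag rows into a label mapping: tag name -> tag value."""
    labels = {}
    for row in rows:
        value = row.get("tag_value")
        # A tag without a value becomes a label with an empty string value.
        labels[row["tag_name"]] = "" if value is None else value
    return labels


def labels_for_table(rows, warehouse_id=None):
    """Return labels for a table, skipping tags when no warehouse is configured."""
    if not warehouse_id:
        # Mirrors the documented behaviour: ingestion continues without tags.
        logger.warning("warehouseId is not configured; skipping tag ingestion")
        return {}
    return tags_to_labels(rows)


if __name__ == "__main__":
    sample_rows = [
        {"tag_name": "owner", "tag_value": "data-eng"},
        {"tag_name": "pii", "tag_value": None},
    ]
    print(labels_for_table(sample_rows, warehouse_id="example-warehouse-id"))
    print(labels_for_table(sample_rows))
```

In this sketch a missing warehouse ID yields an empty label set rather than an error, so the rest of the dataset metadata would still be ingested.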
Troubleshooting
If you are experiencing difficulties ingesting Databricks resources, verify your service principal has the necessary privileges to access the following APIs: