Getting Started with Data Experience α
Overview
What is the Data Experience α?
The Data Experience α provides visibility into your data ecosystem by integrating it with core parts of Portal so Data Practitioners and Developers can easily find and manage datasets. The Data Experience enables you to ingest dataset metadata from different sources and model the datasets in Backstage's Software Catalog.
Who is the Data Experience α for?
Data Platform Teams
You've seen Backstage successfully rolled out by your tech champion and wonder if you can use it to solve your own data challenges. You wish your company had more context about its data ecosystem and want to reduce the dependencies on your team. You want to promote best practices, information, tools, and processes to boost the productivity of your Data Practitioners.
Data Owners
You own a bunch of datasets and wish you had an interface to see everything in one place. Or maybe you're part of a backend team that already uses Portal, but also produces and owns datasets that you need to manage.
Data Consumers
You hope to discover datasets that are available across your organization for consumption and need to identify a given dataset's owner to learn more about it.
Why use the Data Experience α?
- Improve ownership and information about your datasets
- Centralize your view of software and data in the Software Catalog
- Reduce the silo between your Data Platform and IDP
- Increase adoption of Backstage by expanding its value and use cases to Data Practitioners
Key Concepts
Ingest Sources
The Data Experience α comes with integrations that enable the ingestion of table or view metadata from Data Warehouses like BigQuery, Snowflake, and Redshift.
Data Registry
The Data Registry enables the ingestion of table or view metadata from supported Data Warehouses and centralizes this metadata in one location. The Data Registry contains more granular information about a table, such as the schema and description. This provides a unified view of tables across all Data Warehouses and provides data entities to the Software Catalog. You can filter which dataset(s) in the Data Registry you want to add to or exclude from the Software Catalog. Common use cases for this are internal or staging tables that should not be searchable.
Simply put, the Data Registry serves as an extension of the Software Catalog, providing additional metadata fields that are unique to Data Platforms. All this while seamlessly integrating with the Software Catalog surface to provide a single, cohesive experience across software and data in the Portal IDP.
Dataset API
A table or view from the Data Registry becomes a Dataset API once it's added and modeled in the Software Catalog.
Datasets are modeled under kind:API and type:dataset and act as an interface to a Component. Dataset APIs are the main entity we care about in the Backstage Software Ecosystem Model. Dataset APIs are part of the Software Catalog, which means they can be searched for in Portal's search.
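As a rough illustration only (the exact fields here are assumptions; the Registry provider generates these entities for you), a dataset could appear in the catalog roughly like this:

```yaml
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: analytics.public.daily_orders        # hypothetical dataset name
  description: Daily order rollups ingested from the Data Registry
spec:
  type: dataset
  owner: group:default/data-platform          # required, see below
  lifecycle: production                       # required, see below
  # the provider also emits other fields (such as the schema) not shown here
```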
Catalog Requirements and Constraints
While ingesting datasets into the catalog provides a lot of benefits, it does impose some requirements that are Backstage-specific. All API entities within the catalog are required to have an owner and lifecycle populated.
For datasets, the Owner metadata represents the group or user in the Catalog that controls or maintains the data. The Lifecycle metadata denotes the development or support state of your dataset - for example, production.
Backstage accepts any lifecycle value, but an organization should take great care to establish a proper taxonomy for these.
The current set of well-known and common values for this field is:
- experimental: an experiment or early, non-production Dataset. This signals that users may prefer not to consume it over other more established Datasets, or that there are low to no reliability guarantees
- production: an established, owned, and maintained Dataset
- deprecated: a Dataset that is at the end of its lifecycle, and may disappear at a later point in time
Because they are required, fallback defaults for these values are specified in the dataExperience config section (see Configuration below). Each source connector may also specify a mechanism to define the owner and lifecycle for a specific dataset, which is discussed further in its respective section.
Installation
Core Data Experience α Plugins
@spotify/backstage-plugin-data-registry
@spotify/backstage-plugin-data-registry-backend
@spotify/backstage-plugin-catalog-backend-module-data-registry-provider
@spotify/backstage-plugin-search-backend-module-data-registry
Source Specific Plugins
These can be installed in the same way as the core plugins above.
@spotify/backstage-plugin-data-registry-backend-module-redshift
@spotify/backstage-plugin-data-registry-backend-module-snowflake
@spotify/backstage-plugin-data-registry-backend-module-bigquery
@spotify/backstage-plugin-data-registry-backend-module-dbt
Configuration
Catalog
Defaults
These are the final fallbacks you can set for required information on imported datasets. They will only be used if no other options are available from the dataset or a source-specific setting.
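A minimal sketch of what these defaults could look like, assuming they live under the dataExperience section mentioned above (the exact nesting may differ in your version):

```yaml
dataExperience:
  catalog:
    entityDefaults:
      owner: group:default/data-platform   # hypothetical fallback owner
      lifecycle: experimental               # hypothetical fallback lifecycle
```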
Schedules
The schedule tab has three configurable time settings. The frequency setting accepts standard cronjob syntax; frequency is how often the data stored in the registry will be written to the catalog.
The other schedule options use a human-readable time configuration. They are processed the same way as the cron setting but may be easier to set:
- timeout, how long a catalog sync runs before it's considered timed out and ended so a new invocation can start
- initialDelay, how long the registry waits before the first sync starts
The example below has a timeout of 5 minutes for each scheduled sync and the registry would sync all datasets in it to the catalog every 5 minutes.
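A minimal sketch of such a schedule, assuming the same dataExperience nesting (key names follow the settings described above and may differ in your version):

```yaml
dataExperience:
  catalog:
    schedule:
      frequency:
        cron: '*/5 * * * *'          # write registry data to the catalog every 5 minutes
      timeout: { minutes: 5 }         # end a sync that runs longer than 5 minutes
      initialDelay: { seconds: 30 }   # wait 30 seconds before the first sync
```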
Registry
History
The registry keeps history for easier reference or rollbacks of data. The default setting if not configured is 90 days worth of data, but this can be fine-tuned in the registry configs by setting retentionTimeInDays.
cleanup schedules how often data past its retention is cleaned up from the history. The scheduler is in the same format as the catalog sync. The default is every day at midnight if unset.
record is a finer scheduler that determines how often a snapshot of the registry is taken. The scheduler is in the same format as the catalog sync. The default if unset is every 12 hours.
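A hedged sketch of these registry settings (the key names are taken from this section; the nesting is an assumption):

```yaml
dataExperience:
  registry:
    retentionTimeInDays: 30    # keep 30 days of history instead of the default 90
    cleanup:
      frequency:
        cron: '0 0 * * *'      # purge expired history daily at midnight (the default)
    record:
      frequency:
        cron: '0 */12 * * *'   # snapshot the registry every 12 hours (the default)
```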
Integration Defaults
The registry also has default scheduling settings. These defaults control how often the sources sync to the registry when a schedule is not explicitly defined at the individual source level. The config settings are the same as those available on the catalog; reference Schedules above.
Sources
Snowflake
Configuration & Authentication
Each Snowflake account to be ingested must be configured using the format 2 account identifier. For each account listed, every Table within that account will be ingested as a dataset within the registry.
Within each account block, the relevant authentication type and details must also be provided. Of Snowflake's authentication options, only password-based auth is currently supported. Set the authenticator key to 'SNOWFLAKE' and add username and password key-values representing the account Backstage will connect with.
Roles
Users in Snowflake can have multiple roles, but the default role for the user will be the one used during all queries to Snowflake. This can be overridden by adding the optional role key to the config underneath the relevant account. Note that the role selection will affect which tables are ingested: tables that require a role higher in the role hierarchy than the one used in the connection won't appear.
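A minimal sketch of an account block (the key names and nesting are illustrative assumptions based on the settings described in this section):

```yaml
dataExperience:
  registry:
    snowflake:
      accounts:
        - account: xy12345.us-east-1     # format 2 account identifier
          authenticator: SNOWFLAKE        # password-based auth
          username: ${SNOWFLAKE_USER}
          password: ${SNOWFLAKE_PASSWORD}
          role: PUBLIC                    # optional: overrides the user's default role
          warehouse: COMPUTE_WH           # optional: used for tag ingestion queries
```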
Using the example config above, the role is specifically set to PUBLIC. This would mean any tables visible only to ACCOUNTADMINs would not be ingested into the Registry.
Naming
The naming structure for Datasets created from Snowflake is as follows: [database].[schema].[table_name].
Tags & Labels
Snowflake Tags are pieces of metadata that can be added to various resources. Since these are key-value pairs in the source system, the Snowflake connector for the Registry will convert them to labels (which are a key-value pair mapping) within the Software Catalog. Tags that can't be converted into Catalog labels (because they're too long, contain invalid UTF-8 characters, etc.) will be omitted from the Dataset and emit a WARN log.
The Snowflake connector will parse the tags looking for ones where the key is OWNER or LIFECYCLE in order to populate those values onto the dataset. If an OWNER key is not found within the tags, it will use the ownership metadata (which is usually the principal role on the table). Finally, it will use the values populated within the entityDefaults section of the config as a fallback option for both owner and lifecycle.
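As a purely hypothetical excerpt, a converted tag could surface on the ingested entity roughly like this (actual label keys and casing may differ):

```yaml
metadata:
  labels:
    data_domain: finance   # from a hypothetical Snowflake tag data_domain = 'finance'
    OWNER: data-platform   # also parsed to populate the dataset's owner
```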
Warehouses
Snowflake Warehouses are required for certain queries to be run. While the basic ingestion of tables doesn't rely on any such queries, ingesting the tags on each table does. For this step, the default warehouse of the user associated with the account in the config will be used. To override this selection, add the optional warehouse key-value pair to the relevant account. Reference the example config in the Roles section above to see where to configure this.
Since warehouses are also scoped to roles, it's important to ensure the warehouse used in the connection (be it the user's default or defined via the config) can be accessed by the role used in the connection. Also note that the warehouse must be running (or have auto-resume enabled) in order for tag ingestion to occur.
Redshift
Configuration
A source in the Redshift integration requires an accountId and a list of regions. Each region must be enabled for the associated account. The connector will search each region under the associated account and ingest every table's metadata.
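A minimal sketch of a Redshift source entry (the exact key names and nesting are assumptions):

```yaml
dataExperience:
  registry:
    redshift:
      sources:
        - accountId: '123456789012'   # hypothetical AWS account
          regions:
            - us-east-1
            - eu-west-1
```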
Authentication
The Redshift integration uses the @backstage/integration-aws-node package to create a credential provider, which is then passed into the AWS client SDKs. Since authentication is handled by a separate plugin, its setup is found under the app config, as seen below:
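For example, a credentials block for @backstage/integration-aws-node could look roughly like this (see that package's documentation for the authoritative shape):

```yaml
aws:
  accounts:
    - accountId: '123456789012'
      accessKeyId: ${AWS_ACCESS_KEY_ID}
      secretAccessKey: ${AWS_SECRET_ACCESS_KEY}
```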
Every accountId included in the data registry config must have an associated auth config like the one above.
Naming
The naming structure for Datasets created from Redshift is as follows: [accountId].[region].[cluster].[database].[schema].[table].
Tags & Labels
Tags for Redshift resources are not currently supported by the Redshift connector, since they cannot be applied to resources at the database granularity or below.