Data Registry
This integration is available in Portal only. To learn more, see the Data Registry documentation.
The Data Registry fact collector is installed by default on Portal and exposes information from Portal's Data Registry as facts to Soundcheck. Before configuring the collector, ensure the Data Registry itself has been enabled and configured.
This collector enables the creation of checks against metadata stored inside the Registry, to ensure that it is in compliance with your organization's standards and best practices.
It supports the collection of the following fact:
dataset-schema - contains schema information about dataset entities.
Prerequisites
Configure Data Registry in Backstage
The dataset-schema fact is typically only applicable to Catalog entities with Kind:Api and Type:dataset, which are programmatically ingested by the Data Registry. To make full use of this collector, be sure to configure some dataset integrations.
Data Registry Fact Collector Configuration
The collection of Data Registry facts is driven by configuration. To learn more about the configuration, consult the Defining Data Registry Fact Collections section.
Like other collectors, the Data Registry Fact Collector can be configured via YAML or the No-Code UI. If you configure it via both, the configurations will be merged. It's preferable to choose a single source for the fact collector configuration (either the No-Code UI or YAML) to avoid confusing merge results. Since this collector is designed for Portal, the No-Code UI is the easiest option, as YAML isn't configurable within the Portal site.
No-Code UI Configuration
To enable the Data Registry integration, go to Soundcheck > Integrations > Data Registry and click the Configure button. To learn more about the No-Code UI config, see Configuring a fact collector (integration) via the no-code UI.
By default, the dataset-schema fact is filtered to Kind:Api, Type:dataset entities and is collected on a one-hour interval.
YAML Configuration Option
Add a soundcheck.collectors.data-registry.collects field to the app-config.yaml.
A simple example Data Registry fact collector configuration is listed below.
# app-config.yaml
soundcheck:
  collectors:
    data-registry:
      collects:
        type: 'dataset-schema'
        filter:
          - spec.type: 'dataset'
        frequency:
          hours: 1
        cache: false
Defining Data Registry Fact Collections
This section describes the data shape and semantics of Data Registry Fact Collection configurations.
Shape Of A Data Registry Fact Collection Configuration
The following is an example of a Data Registry Fact Collection Configuration in YAML:
soundcheck:
  collectors:
    data-registry:
      collects:
        type: 'dataset-schema'
        filter:
          kind: api
          spec.type: dataset
        cache: false
Below are the details for each field.
type
[required]
The type of the collector: dataset-schema.
frequency
[optional]
The frequency at which the fact collection should be executed. Possible values are either a cron expression { cron: ... } or HumanDuration.
If not provided, the fact will only be collected on demand.
Example:
frequency:
  minutes: 10
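The cron form of frequency might look like the following. This is a sketch that assumes the standard five-field cron syntax, as used in the check example later in this document. The schedule itself (the top of every hour) is only an illustration.

```yaml
# Hypothetical schedule: collect at minute 0 of every hour.
frequency:
  cron: '0 * * * *'
```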
initialDelay
[optional]
The amount of time that should pass before the first invocation happens. Possible values are either a cron expression { cron: ... } or HumanDuration.
Example:
initialDelay:
  seconds: 30
batchSize
[optional]
The number of entities to collect facts for at once. The default value is 1.
Example:
batchSize: 100
filter
[optional]
A filter specifying which entities to collect the specified facts for. Matches the filter format used by the Catalog API.
The dataset-schema fact in particular is only relevant to datasets, and so by default the No-Code UI has filters for Kind:Api, Type:dataset.
exclude
[optional]
Entities matching this filter will be skipped during the fact collection process. Can be used in combination with filter. Matches the filter format used by the Catalog API.
Example:
filter:
  - kind: component
exclude:
  - spec.type: documentation
cache
[optional]
Whether the collected facts should be cached and, if so, for how long. Possible values are true, false, or a nested { duration: HumanDuration } field.
If not provided, the fact will not be cached.
Example:
cache:
  duration:
    hours: 24
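Taken together, the fields described above could be combined into a single collection as sketched below. The values are illustrative assumptions rather than recommended settings, and the exclude entry (spec.lifecycle: deprecated) is a hypothetical filter chosen only to show the syntax.

```yaml
# app-config.yaml
soundcheck:
  collectors:
    data-registry:
      collects:
        type: 'dataset-schema'
        # Collect once per hour.
        frequency:
          hours: 1
        # Wait 30 seconds before the first collection.
        initialDelay:
          seconds: 30
        # Collect facts for 100 entities at a time.
        batchSize: 100
        # Only dataset entities are relevant to this fact.
        filter:
          kind: api
          spec.type: dataset
        # Hypothetical exclusion, shown for syntax only.
        exclude:
          - spec.lifecycle: deprecated
        # Keep collected facts cached for 24 hours.
        cache:
          duration:
            hours: 24
```

Sticking to one configuration source, as recommended earlier, means a block like this would replace rather than supplement any No-Code UI settings.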
Shape of A Data Registry Fact
The shape of a Data Registry Fact is based on the Fact Schema.
The following is an example of the collected dataset-schema fact:
factRef: data-registry:default/dataset-schema
entityRef: component:default/test-dataset-1
data:
  fields:
    - name: user_id
      type: varchar
      description: unique identifier for the user
    - name: num_connections
      type: int2
      description: number of connections associated with the user
timestamp: 2025-02-20T15:20:35Z
See Software Catalog's descriptor format documentation for more details about the shape of an entity.
Shape of A Data Registry Check
The shape of a Data Registry Check matches the Shape of a Check.
The following is an example of a Data Registry check:
- id: all_dataset_fields_have_descriptions
  rule:
    all:
      - factRef: data-registry:default/dataset-schema
        path: $.fields[*].description
        operator: all:matches
        value: .+
  schedule:
    frequency:
      cron: '* * * * *'
    filter:
      kind: 'Api'
  passedMessage: |
    All fields in the dataset schema have a populated description!
  failedMessage: |
    At least one field in the dataset schema has an empty string for the description.
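Other properties of the fact can likely be checked in the same way. The sketch below reuses the all:matches operator, schedule, and fact shape from the example above and points the path at the type property instead. The check id and messages are hypothetical.

```yaml
- id: all_dataset_fields_have_types # hypothetical check id
  rule:
    all:
      - factRef: data-registry:default/dataset-schema
        # Select the type of every field in the collected schema.
        path: $.fields[*].type
        operator: all:matches
        value: .+
  schedule:
    frequency:
      cron: '* * * * *'
    filter:
      kind: 'Api'
  passedMessage: |
    All fields in the dataset schema have a populated type.
  failedMessage: |
    At least one field in the dataset schema has an empty string for the type.
```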