
Configuration

Catalog

Defaults

Defaults are the final fallbacks for information that imported datasets are required to have. They are used only when no value is available from the dataset itself or from a source-specific setting.
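The fallback behavior can be sketched in a few lines of Python. This is an illustration only, not the Registry's implementation: the function name `resolve_required_field`, the field name `owner`, and the dictionary shapes are hypothetical, and the relative order of the dataset value and the source-specific setting is an assumption.

```python
from typing import Mapping, Optional


def resolve_required_field(
    field: str,
    dataset: Mapping[str, Optional[str]],
    source_settings: Mapping[str, Optional[str]],
    defaults: Mapping[str, Optional[str]],
) -> Optional[str]:
    """Return the first non-empty value for `field`.

    Assumed lookup order: the dataset itself, then the source-specific
    setting, and the configured default last.
    """
    for layer in (dataset, source_settings, defaults):
        value = layer.get(field)
        if value:  # skips missing keys, None, and empty strings
            return value
    return None


# Hypothetical values: only the defaults know an owner for the first call.
defaults = {"owner": "team-data"}
print(resolve_required_field("owner", {}, {}, defaults))
print(resolve_required_field("owner", {"owner": "team-a"}, {}, defaults))
```

The default is consulted only after every more specific layer has come up empty, which is what makes it a "final fallback".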

Entity Defaults

Customizing Naming Validation

The software catalog applies default naming constraints to everything it ingests. Because some entity names that come from the data registry can be long, you may need to raise the maximum allowed length of an entity name. You can see how to do that here.

Filters

The Registry can be configured to exclude datasets from syncing to the Catalog. This is useful, for example, for excluding test or dev tables. The 'filters' section of the config lists regex patterns; any dataset whose name matches one of them is excluded.
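A minimal sketch of how regex exclusion behaves, using Python's standard `re` module. The patterns, dataset names, and function names are hypothetical, and the use of `re.search` (which matches anywhere in the name unless the pattern is anchored) is an assumption; the Registry may apply patterns differently.

```python
import re
from typing import Iterable, List


def is_excluded(dataset_name: str, patterns: Iterable[str]) -> bool:
    """True if any exclusion pattern matches the dataset name.

    Assumption: patterns are applied with re.search, so an unanchored
    pattern can match anywhere in the name.
    """
    return any(re.search(p, dataset_name) for p in patterns)


def datasets_to_sync(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep only the dataset names that no exclusion pattern matches."""
    pattern_list = list(patterns)  # materialize once so it can be reused per name
    return [n for n in names if not is_excluded(n, pattern_list)]


# Hypothetical patterns: drop names starting with "test_" or ending in "_dev".
exclusions = [r"^test_", r"_dev$"]
names = ["test_orders", "orders", "customers_dev", "customers"]
print(datasets_to_sync(names, exclusions))
```

Anchoring patterns with `^` and `$` keeps a filter such as `_dev$` from also excluding a name like `orders_devices`.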

Exclusion Filters

The 'Catalog Exclusions' tab on the Data Overview Dashboard will display all datasets in the Registry which are currently being excluded from the catalog based on the configured filters.

Schedules

The schedule tab has three configurable time settings. The frequency setting accepts standard cron syntax and controls how often the data stored in the registry is written to the catalog.

The other two settings use a human-readable time format. They are processed the same way as the cron setting but may be easier to write:

  1. timeout: how long a catalog sync may run before it is considered timed out and is ended so that a new invocation can start
  2. initialDelay: how long the registry waits before the first sync starts

In the example below, each scheduled sync has a timeout of 5 minutes, and the registry syncs all of its datasets to the catalog every 5 minutes.

Cron Scheduler
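The three schedule settings can be modeled as follows. This is only a sketch of the semantics described above: the class and function names are hypothetical, durations are represented as `timedelta` values rather than in the Registry's human-readable format, and the 30-second initial delay is an invented value. The cron expression `*/5 * * * *` is standard cron syntax for "every 5 minutes".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SyncSchedule:
    frequency: str            # cron expression, e.g. "*/5 * * * *"
    timeout: timedelta        # how long a single sync may run
    initial_delay: timedelta  # wait before the first sync starts


def first_sync_not_before(registry_started: datetime, schedule: SyncSchedule) -> datetime:
    """Earliest moment the first sync may start."""
    return registry_started + schedule.initial_delay


def is_timed_out(sync_started: datetime, now: datetime, schedule: SyncSchedule) -> bool:
    """True once a sync has been running for longer than the timeout."""
    return now - sync_started > schedule.timeout


schedule = SyncSchedule(
    frequency="*/5 * * * *",
    timeout=timedelta(minutes=5),
    initial_delay=timedelta(seconds=30),  # hypothetical value
)
started = datetime(2024, 1, 1, 12, 0, 0)
print(first_sync_not_before(started, schedule))
print(is_timed_out(started, started + timedelta(minutes=6), schedule))
```

With a timeout equal to the sync interval, as in this configuration, a sync that overruns is ended at about the time the next one is due.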

Registry

entityRetentionTimeInDays controls how long entities (either datasets or dbt projects) persist in the registry after they stop appearing in ingestions from the source systems. By default, this value is 0. Setting it to 10, for example, means that a BigQuery table would have to be missing from ingestions for 10 consecutive days before being removed from Portal. A higher value is useful if you want to avoid losing an entity's history when the entity is only temporarily missing from an ingestion rather than truly deleted (for example, while testing config changes or after a permissions change).


For example, say a BigQuery table was ingested into the Data Registry and then deleted the next day. On the next scheduled BigQuery ingestion, the table's entry would be removed from the Data Registry, and eventually from the Software Catalog. There would then no longer be a Dataset Entity page for this table, nor any reference to it in Portal. If the table is added back in BigQuery, it will be re-ingested into the Data Registry (and synced into the Software Catalog).
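The retention rule can be sketched as a single comparison. The function name and the `last_seen` timestamp are hypothetical, and the exact boundary behavior (removal once the entity has been missing for at least `retention_days`) is an assumption about how the Registry counts days.

```python
from datetime import datetime, timedelta


def should_remove(last_seen: datetime, now: datetime, retention_days: int = 0) -> bool:
    """Decide whether an entity missing from the current ingestion is removed.

    `last_seen` is the time of the last ingestion that still contained the
    entity. With the default retention of 0 days, a missing entity is
    removed at the next ingestion.
    """
    return now - last_seen >= timedelta(days=retention_days)


now = datetime(2024, 1, 20)

# Default (0 days): a table last seen yesterday is removed right away.
print(should_remove(datetime(2024, 1, 19), now))

# 10-day retention: missing for 3 days is kept, missing for 10 days is removed.
print(should_remove(datetime(2024, 1, 17), now, retention_days=10))
print(should_remove(datetime(2024, 1, 10), now, retention_days=10))
```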

History

The registry keeps history for easier reference or rollbacks of data. If not configured, it keeps 90 days of history; this can be tuned in the registry config by setting retentionTimeInDays.

cleanup sets how often data past its retention period is removed from the history. It uses the same schedule format as the catalog sync. If unset, the default is every day at midnight.

record is a finer-grained schedule that determines how often a snapshot of the registry is taken. It uses the same schedule format as the catalog sync. If unset, the default is every 12 hours.
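To illustrate how record and cleanup interact, the sketch below builds snapshots at the default 12-hour interval and prunes those older than the default 90-day retention. The snapshot structure (a dict with a `taken_at` key) and the function name `prune_history` are hypothetical; only the 90-day and 12-hour defaults come from the text above.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def prune_history(snapshots: List[Dict], now: datetime, retention_days: int = 90) -> List[Dict]:
    """Keep only snapshots taken within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [s for s in snapshots if s["taken_at"] >= cutoff]


now = datetime(2024, 6, 1)

# 200 snapshots, one every 12 hours, reaching back just under 100 days.
snapshots = [{"taken_at": now - timedelta(hours=12 * i)} for i in range(200)]

kept = prune_history(snapshots, now)
print(len(snapshots), len(kept))
```

A shorter record interval produces more snapshots inside the same retention window, so the two settings together determine how much history is stored.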


Integration Defaults

The registry also provides default scheduling settings. These defaults determine how often sources sync to the registry when a schedule is not explicitly defined for an individual source. The available settings are the same as those for the catalog; see Schedules above.