Configuration
Catalog
Defaults
These are the final fallbacks for required information on imported datasets. They are used only when no value is available from the dataset itself or from a source-specific setting.
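The fallback order described above can be sketched as a first-match lookup. This is a minimal illustration in Python, not the product's implementation; the function name resolve_field and the example owner values are hypothetical.

```python
from typing import Optional


def resolve_field(
    dataset_value: Optional[str],
    source_setting: Optional[str],
    catalog_default: Optional[str],
) -> Optional[str]:
    """Return the first value that is set, checking the most specific level first."""
    for candidate in (dataset_value, source_setting, catalog_default):
        if candidate:  # treat None and empty strings as "not provided"
            return candidate
    return None


# Hypothetical field: the owner of an imported dataset.
print(resolve_field(None, None, "team-data-platform"))              # only the default is set
print(resolve_field(None, "team-analytics", "team-data-platform"))  # source setting takes precedence
```

The default is consulted last, so a value on the dataset or on the source always takes precedence over it.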
Customizing Naming Validation
The software catalog has default constraints on naming for everything ingested. Because some entity names coming from the data registry can be long, you may need to raise the maximum length of an entity name. You can see how to do that here.
Filters
The Registry can be configured to exclude datasets from syncing to the Catalog. This can be useful for excluding test or dev tables, for example. The 'filters' section of the config lets you list regex patterns that are matched against dataset names to decide which datasets to exclude.
The 'Catalog Exclusions' tab on the Data Overview Dashboard will display all datasets in the Registry which are currently being excluded from the catalog based on the configured filters.
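The effect of exclusion filters can be sketched in a few lines of Python. This is only an illustration: the function name, dataset names and patterns are hypothetical, and it assumes a dataset is excluded when any pattern matches anywhere in its name, which may differ from how the Registry anchors its matches.

```python
import re
from typing import Iterable, List, Tuple


def split_by_filters(
    dataset_names: Iterable[str], patterns: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (synced, excluded) dataset names for the given regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    synced: List[str] = []
    excluded: List[str] = []
    for name in dataset_names:
        # Assumption: one matching pattern is enough to exclude the dataset.
        if any(rx.search(name) for rx in compiled):
            excluded.append(name)
        else:
            synced.append(name)
    return synced, excluded


# Hypothetical dataset names and patterns targeting test and dev tables.
names = ["sales.orders", "sales.orders_test", "dev_scratch.tmp", "finance.ledger"]
patterns = [r"_test$", r"^dev_"]
synced, excluded = split_by_filters(names, patterns)
print(synced)    # names that would still sync to the Catalog
print(excluded)  # names that would be listed as excluded
```

Trying patterns against a sample of real dataset names in this way is a cheap check that a filter is not broader than intended.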
Schedules
The schedule tab has three time-configurable settings. The frequency setting accepts standard cron syntax. frequency is how often the data stored in the registry is written to the catalog.
The other schedule options use a human-readable time configuration. It is processed the same way as the cron setting but may be easier to set:
timeout, how long a catalog sync runs before it is considered timed out and ended so that a new invocation can start
initialDelay, how long the registry waits before the first sync starts
The example below has a timeout of 5 minutes for each scheduled sync, and the registry syncs all of its datasets to the catalog every 5 minutes.
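As a rough illustration of those two values, the Python sketch below models a 5-minute frequency with the standard cron expression */5 * * * * and a 5-minute timeout as a duration. The names FREQUENCY_CRON, TIMEOUT, next_run and timed_out are hypothetical, and the product's actual config keys and duration format are not reproduced here.

```python
from datetime import datetime, timedelta

FREQUENCY_CRON = "*/5 * * * *"   # standard cron: every 5 minutes
TIMEOUT = timedelta(minutes=5)   # a sync running this long is considered timed out


def next_run(after: datetime, cron: str = FREQUENCY_CRON) -> datetime:
    """Next fire time strictly after `after` for a cron of the form '*/N * * * *'."""
    minute_field, *rest = cron.split()
    if not minute_field.startswith("*/") or rest != ["*"] * 4:
        raise ValueError("this sketch only handles '*/N * * * *'")
    step = int(minute_field[2:])
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute % step != 0:
        candidate += timedelta(minutes=1)
    return candidate


def timed_out(started: datetime, now: datetime, timeout: timedelta = TIMEOUT) -> bool:
    """True once a sync has been running for at least the timeout (assumed inclusive)."""
    return now - started >= timeout


start = next_run(datetime(2024, 1, 1, 12, 2, 30))
print(start)                                           # 2024-01-01 12:05:00
print(timed_out(start, start + timedelta(minutes=4)))  # False
print(timed_out(start, start + timedelta(minutes=5)))  # True
```

With the timeout equal to the frequency, a sync that hangs is ended at about the time the next one is due to start.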
Registry
entityRetentionTimeInDays controls how long entities (either datasets or dbt projects) persist in the registry after they stop appearing in ingestions from the source systems. By default, this value is 0. Setting it to 10, for example, would mean that an arbitrary BigQuery table would have to be missing from the ingestions for 10 consecutive days before being removed from Portal. A higher value can be desirable if you want to avoid losing an entity's history when the entity is temporarily missing from an ingestion but not truly deleted (for example, while testing config changes or when permissions change).
For example, say a BigQuery table was ingested into the Data Registry but then deleted the next day. In the next scheduled BigQuery ingestion, the table's entry would be removed from the Data Registry and, eventually, from the Software Catalog. This means there would no longer be a Dataset Entity page for this table, nor any reference to it in Portal. If the table is added back in BigQuery itself, it will be re-ingested into the Data Registry (and synced into the Software Catalog).
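The retention rule can be sketched as a simple date comparison. This is an illustration only, written in Python; the function and parameter names are hypothetical, and it assumes the window is counted in whole days from the first ingestion in which the entity was missing.

```python
from datetime import date
from typing import Optional


def should_remove(
    missing_since: Optional[date], today: date, retention_days: int = 0
) -> bool:
    """Decide whether an entity should be dropped from the registry.

    missing_since is the date of the first ingestion that no longer contained
    the entity, or None if the latest ingestion still included it.
    """
    if missing_since is None:
        return False  # entity is still present in the source system
    # Assumption: removal happens once the entity has been missing for
    # at least retention_days whole days.
    return (today - missing_since).days >= retention_days


first_missing = date(2024, 1, 1)
print(should_remove(first_missing, date(2024, 1, 1)))                      # default of 0
print(should_remove(first_missing, date(2024, 1, 10), retention_days=10))  # missing for 9 days
print(should_remove(first_missing, date(2024, 1, 11), retention_days=10))  # missing for 10 days
```

Under these assumptions a value of 0 removes the entity at the first ingestion that no longer contains it, while a value of 10 keeps it through a shorter outage.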
History
The registry keeps history for easier reference or rollbacks of data. If not configured, the default is 90 days' worth of data, but this can be fine-tuned in the registry configs by setting retentionTimeInDays.
cleanup schedules how often data past its retention is cleaned up from the history. The scheduler is in the same format as the catalog sync. The default is every day at midnight if unset.
record is a finer scheduler that determines how often a snapshot of the registry is taken. The scheduler is in the same format as the catalog sync. The default if unset is every 12 hours.
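How the record and cleanup schedules interact with retentionTimeInDays can be sketched as follows. This is a simplified Python model with hypothetical names, assuming a snapshot is dropped once it is older than the retention window; it does not reflect how the registry actually stores history.

```python
from datetime import datetime, timedelta
from typing import List

DEFAULT_RETENTION_DAYS = 90  # default stated above when retentionTimeInDays is unset


def cleanup(
    snapshots: List[datetime],
    now: datetime,
    retention_days: int = DEFAULT_RETENTION_DAYS,
) -> List[datetime]:
    """Keep only the snapshots taken within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [taken_at for taken_at in snapshots if taken_at >= cutoff]


# Simulate the default record schedule: one snapshot every 12 hours.
now = datetime(2024, 6, 1)
snapshots = [now - timedelta(hours=12 * i) for i in range(200)]

kept = cleanup(snapshots, now)
print(len(snapshots), len(kept))
```

A shorter record interval produces more snapshots inside the same window, so the two settings together determine how much history is held.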
Integration Defaults
The registry also has default settings for scheduling. These defaults control how often the sources sync to the registry when a schedule is not explicitly defined at the individual source level. The config settings are the same as those available on the catalog; see Schedules above.