Data Experience Quick Start

This guide covers the minimal steps needed to connect to the Data Experience. The required plugins all run by default, so they only need to be configured.

More detailed information can be found for the Data Experience and each of the integrations in the sidebar.

What you'll achieve

Ingest data warehouse tables as APIs into Portal's Software Catalog
Make datasets searchable alongside your software components
Provide ownership and lifecycle information for data governance

Prerequisites

Admin access to Portal's Config Manager
Access to create credentials in your preferred integration(s)

Step 1: Configure Authentication

BigQuery
Redshift
Snowflake
Databricks

BigQuery

Create a GCP service account with the following roles:
- roles/bigquery.dataViewer
- roles/bigquery.jobUser
Download the JSON credentials file
Navigate to Config Manager > Data Experience
Expand the keys on the sidebar dataExperience > registry > integrations > bigquery
Add an item to the sources list:
- Enter your GCP project ID
- Paste the service account JSON into the credentials field
Scroll to the bottom of the page and click the Save changes button
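
Before pasting the service account JSON into the credentials field, it can help to confirm that the file is intact. The following is a minimal sketch, not part of Portal: the function name `check_service_account_key` is hypothetical, and the check only looks for fields that a standard GCP service account key file contains.

```python
import json
from typing import List

# Fields found in a standard GCP service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw: str) -> List[str]:
    """Return a list of problems found in the key JSON (empty if none)."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["expected a JSON object"]
    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # A user credential or other key type will not work as a service account key.
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

An empty list means the JSON parsed and the expected fields are present. It does not confirm that the roles above were granted.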

Redshift

Create an IAM User with one of the following policies, depending on your needs.

Allow access to all Redshift clusters

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "redshift:DescribeClusters",
        "redshift-data:ListSchemas",
        "redshift-data:ListDatabases",
        "redshift:GetClusterCredentialsWithIAM",
        "redshift-data:ExecuteStatement",
        "redshift-data:DescribeStatement",
        "redshift-data:GetStatementResult"
      ],
      "Resource": "*"
    }
  ]
}

Allow access to limited Redshift clusters

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "redshift:DescribeClusters",
        "redshift-data:ListSchemas",
        "redshift-data:ListDatabases",
        "redshift-data:ExecuteStatement"
      ],
      "Resource": "arn:aws:redshift:YOUR_REGION:YOUR_AWS_ACCOUNT_ID:cluster:YOUR_CLUSTER_NAME"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "redshift:GetClusterCredentialsWithIAM"
      ],
      "Resource": "arn:aws:redshift:YOUR_REGION:YOUR_AWS_ACCOUNT_ID:dbname:YOUR_CLUSTER_NAME/*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "redshift-data:DescribeStatement",
        "redshift-data:GetStatementResult"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "redshift-data:statement-owner-iam-userid": "${aws:userid}"
        }
      }
    }
  ]
}
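
If you use the limited policy, the three placeholders need to be substituted consistently in two ARNs. The sketch below builds the same policy document from those values. The function name `limited_redshift_policy` and the sample region, account ID, and cluster name are hypothetical.

```python
import json


def limited_redshift_policy(region: str, account_id: str, cluster: str) -> dict:
    """Build the limited-cluster policy shown above for a single cluster."""
    arn_prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "redshift:DescribeClusters",
                    "redshift-data:ListSchemas",
                    "redshift-data:ListDatabases",
                    "redshift-data:ExecuteStatement",
                ],
                "Resource": f"{arn_prefix}:cluster:{cluster}",
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": ["redshift:GetClusterCredentialsWithIAM"],
                "Resource": f"{arn_prefix}:dbname:{cluster}/*",
            },
            {
                "Sid": "VisualEditor2",
                "Effect": "Allow",
                "Action": [
                    "redshift-data:DescribeStatement",
                    "redshift-data:GetStatementResult",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # IAM policy variable, kept literal on purpose.
                        "redshift-data:statement-owner-iam-userid": "${aws:userid}"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical values; replace with your own region, account ID, and cluster.
    policy = limited_redshift_policy("us-east-1", "123456789012", "analytics")
    print(json.dumps(policy, indent=2))
```

To cover several clusters, one option is to generate a policy per cluster, or to turn the first two `Resource` values into lists.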

Create an access key for this newly created IAM User.
Navigate to Config Manager > App and expand the aws > accounts key in the sidebar.
Add a new item under the accounts list
- Enter the AWS accountId
- Enter the IAM User's accessKeyId and secretAccessKey
- The remaining fields can be left blank.
Scroll to the bottom of the page and click the Save changes button
Navigate to Config Manager > Data Experience
Expand the keys on the sidebar dataExperience > registry > integrations > redshift
Add an item to the sources list:
- Enter your AWS accountId
- Expand the section below which applies to your situation and follow the instructions

I've enabled access to all Redshift clusters

Ensure the Option 1 tab is selected
Add an item to the sources list
Enter the AWS accountId
Add the regions in which your clusters reside to the regions list

I've enabled access to limited Redshift clusters

Ensure the Option 2 tab is selected
Add an item to the sources list
Enter the AWS accountId
Add the clusters you've granted access to in the clusters list
Enter the cluster identifier and region for each cluster

Scroll to the bottom of the page and click the Save changes button

Snowflake

Follow Snowflake's guide to generate a key-pair for authentication. Be sure to follow the instructions to assign the public key to a Snowflake user.
Navigate to Config Manager > Data Experience
Expand the keys on the sidebar dataExperience > registry > integrations > snowflake
Under the sources key, select the Option 2 tab and add a new item to the list
- Enter SNOWFLAKE_JWT in the authenticator field
- Enter the username of the user the public key was assigned to
- Enter the privateKey
- Enter the warehouse that should be used for executing queries. If omitted, the user's default warehouse will be used.
- Enter the role that should be used for executing queries. If omitted, the user's default role will be used.

Scroll to the bottom of the page and click the Save changes button
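
A private key that is truncated or pasted in the wrong form is an easy mistake to make when filling in the privateKey field. The sketch below reads the PEM header and footer so you can tell which kind of key you have. The function name `pem_label` is hypothetical, and whether Portal accepts an encrypted key is not covered here, so treat the result as a sanity check only.

```python
from typing import Optional

BEGIN = "-----BEGIN "
DASHES = "-----"


def pem_label(pem_text: str) -> Optional[str]:
    """Return the label from the PEM header, or None if the text is not PEM-shaped."""
    lines = [line.strip() for line in pem_text.strip().splitlines() if line.strip()]
    # A PEM block needs a header line, at least one body line, and a footer line.
    if len(lines) < 3:
        return None
    head, tail = lines[0], lines[-1]
    if not (head.startswith(BEGIN) and head.endswith(DASHES)):
        return None
    label = head[len(BEGIN):-len(DASHES)]
    # The footer must carry the same label as the header.
    if tail != f"-----END {label}-----":
        return None
    return label
```

With keys produced by the openssl commands in Snowflake's guide, an unencrypted PKCS#8 key yields `PRIVATE KEY` and an encrypted one yields `ENCRYPTED PRIVATE KEY`. A `None` result usually means the paste lost its line breaks or its footer.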

Databricks

Create a Service Principal in Databricks. This can be at either the account or workspace level.
This service principal must be granted the USE CATALOG and USE SCHEMA privileges on each catalog and schema you wish to ingest datasets from. See the Databricks API documentation for more details.
Generate OAuth credentials for the service principal, and take note of the client ID and secret. Be mindful of the lifetime set for these credentials and remember to rotate the secrets before they expire to prevent ingestion failures.
Back in Portal, navigate to Config Manager > Data Experience > Databricks
For each workspace you'd like to ingest datasets from, add a new source with the workspace URL and service principal credentials. The integration will discover all catalogs, schemas, and tables which the service principal has access to.
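
Granting USE CATALOG and USE SCHEMA on many catalogs and schemas by hand is repetitive. The sketch below generates the corresponding Databricks SQL GRANT statements for you to review and run yourself. The function name `grant_statements` and the sample catalog and schema names are hypothetical, and the principal is assumed to be the service principal's application ID.

```python
from typing import Dict, List


def grant_statements(principal: str, schemas_by_catalog: Dict[str, List[str]]) -> List[str]:
    """Build GRANT statements for each catalog and its listed schemas."""
    statements = []
    for catalog, schemas in schemas_by_catalog.items():
        # Backticks quote identifiers that contain hyphens or other special characters.
        statements.append(f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`;")
        for schema in schemas:
            statements.append(
                f"GRANT USE SCHEMA ON SCHEMA `{catalog}`.`{schema}` TO `{principal}`;"
            )
    return statements


if __name__ == "__main__":
    # Hypothetical principal and object names.
    for line in grant_statements("my-service-principal-id", {"main": ["sales", "finance"]}):
        print(line)
```

The output is plain SQL text. Nothing is executed against the workspace.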

(Optional) Step 2: Configure Registry Ingest Schedule

Configure how often datasets are ingested from your sources to the data registry.

From Config Manager > Data Experience, expand the keys on the sidebar dataExperience > registry > integrations > defaults > schedule > frequency > cron
Enter a valid crontab string. The default is 0 */6 * * * (every 6 hours).
Scroll to the bottom of the page and click the Save changes button
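
To see what a crontab string means before saving it, you can expand its fields. The sketch below handles the common field forms (`*`, `*/n`, `a-b`, `a-b/n`, and comma lists) and is only an illustration. It is not how Portal parses the schedule, and the function name `expand_cron_field` is hypothetical.

```python
from typing import List


def expand_cron_field(field: str, low: int, high: int) -> List[int]:
    """Expand one cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if step < 1 or start < low or end > high or start > end:
            raise ValueError(f"invalid cron field part: {part!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


if __name__ == "__main__":
    minute, hour = "0 */6 * * *".split()[:2]
    print(expand_cron_field(minute, 0, 59))  # [0]
    print(expand_cron_field(hour, 0, 23))    # [0, 6, 12, 18]
```

For the default schedule, this shows ingestion running at minute 0 of hours 0, 6, 12, and 18.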

Step 3: Test & Verify

Wait for the first sync to complete. When this happens depends on the schedule you configured in Step 2. You can monitor progress by visiting the Data Overview page, accessible from Portal's navigation.
Search for your datasets in Portal's search

Next Steps

Add filters to exclude test/staging tables
Configure integrations with other data warehouses
Integrate with dbt to bring your dbt projects into the Software Catalog
Integrate with TechDocs to add documentation for your datasets
Create your first check for datasets in Soundcheck
Add new tags or labels to help discovery of your datasets with Entity Overlays

Troubleshooting

No datasets appearing? Verify service account permissions and project visibility. Consider extending the maximum entity name length in Portal.
Missing owner/lifecycle? Verify your entity defaults are set to valid catalog entities
