# Databricks SaaS installation

## Summary

This guide details how to create a service principal that gives DataFlint read-only access for ingesting Databricks performance metadata.

The entire process should take just a few minutes.

The steps are:

1. Install DataFlint OSS as init script
2. Create a new service principal for DataFlint, with CAN\_VIEW access
3. Enable cluster logs, and give read-only access to DataFlint
4. (Optional) Supply a mapping from your tag names to DataFlint ones

## Installation

### Step 1: Install DataFlint OSS as init script

See [install-on-databricks](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks "mention")

### Step 2: Create a new service principal for DataFlint, with CAN\_VIEW access

{% hint style="warning" %}
This process requires workspace admin permissions
{% endhint %}

#### Create new service principal

Open the workspace settings from the top-right corner:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F91GW3MUZU4vwLLjWNfCv%2Fimage.png?alt=media&#x26;token=fdd7a51a-428c-488f-85ce-41be4935dc18" alt=""><figcaption></figcaption></figure>

Then choose Identity -> Service principals:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FMIUOxDr46RYRBKAA4tIz%2Fimage.png?alt=media&#x26;token=aa692651-c8e5-40ec-b3cc-634e8d1de59d" alt=""><figcaption></figcaption></figure>

Choose "Add New Principal":

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FzoddrnhotdU9BZPWHW2C%2Fimage.png?alt=media&#x26;token=9f64180d-7fbc-4c8c-aa07-5a64ab53b06e" alt=""><figcaption></figcaption></figure>

Give it a descriptive name (such as "DataFlintServicePrincipal"):

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FPBxticWp2uLvUWnv1EUs%2Fimage.png?alt=media&#x26;token=23cc95ce-e50d-4ebe-a0f0-30136da544fb" alt=""><figcaption></figcaption></figure>

Choose the newly added service principal:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FdzGRAkt9CkPONULMfgNz%2Fimage.png?alt=media&#x26;token=628bd7fa-33c2-453e-b016-b6c15a599bc0" alt=""><figcaption></figcaption></figure>

Go to Secrets, and press "Generate Secret":

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FjzIR4vpL9gQo6qCQKezm%2Fimage.png?alt=media&#x26;token=1747b8b4-d07d-41e0-8dd8-b9c235b100b0" alt=""><figcaption></figcaption></figure>

You will then see a screen with the secret and client ID:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F2jwMi7fvx20k8jR9dgtf%2Fimage.png?alt=media&#x26;token=2fad7933-f32e-4ba4-8183-7dec48db9326" alt=""><figcaption></figcaption></figure>

Please supply us with:

* Display name (for example "data infra team workspace")
* Client ID
* Secret
* Workspace URL (looks something like this: "<https://dbc-1234556-abcde.cloud.databricks.com/>")

#### Add Permission

{% hint style="warning" %}
We demonstrate with a single job in the Databricks UI, but you will likely want to apply this process to a job template or via a Databricks orchestration operator.
{% endhint %}

Go to Jobs & Pipelines and choose a job:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FfG39W2cPZqjbM1ROiCSn%2Fimage.png?alt=media&#x26;token=08b4ee6b-ec51-4776-9868-2f36593b9d26" alt=""><figcaption></figcaption></figure>

Press "Edit permissions" and grant the service principal we created "Can view" permission on the job:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FxTAcTLQYvezOALYZgBO4%2Fimage.png?alt=media&#x26;token=6b2960a4-4d67-413f-9485-540282e5d328" alt=""><figcaption></figcaption></figure>
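If you prefer to grant the permission programmatically, the same change can be made with the Databricks Permissions REST API (`PATCH /api/2.0/permissions/jobs/{job_id}`). A minimal request body might look like the sketch below; the application ID is a placeholder for your service principal's client ID:

```json
{
    "access_control_list": [
        {
            "service_principal_name": "<service-principal-client-id>",
            "permission_level": "CAN_VIEW"
        }
    ]
}
```

Note that `PATCH` adds to the job's existing access control list, while `PUT` replaces it, so `PATCH` is usually the safer choice here.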

### Step 3: Enable cluster logs and give read-only access

{% hint style="warning" %}
We demonstrate with a single cluster in the Databricks UI, but you will likely want to apply this process to a cluster template or via a Databricks orchestration operator.
{% endhint %}

Open the "Compute" configuration on the job page:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FZ2GfxehE28gO2FuDfUEe%2Fimage.png?alt=media&#x26;token=d20f6804-5c63-4a8e-abe5-e2d96fe37432" alt=""><figcaption></figcaption></figure>

Press "Configure".

Go to Advanced -> Logging, and enable cluster logging:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F5pegmSqOvJ3zns9Au65v%2Fimage.png?alt=media&#x26;token=15431196-f3e6-4bc8-aa57-7de1b3c89a8c" alt=""><figcaption></figcaption></figure>
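If you define clusters as code rather than through the UI, cluster logging corresponds to the `cluster_log_conf` field of the cluster spec in the Databricks Clusters API. A sketch of an S3 configuration, with a hypothetical bucket name and region:

```json
{
    "cluster_log_conf": {
        "s3": {
            "destination": "s3://my-cluster-logs-bucket/logs",
            "region": "us-east-1"
        }
    }
}
```

For DBFS, the field instead takes a `dbfs` object, e.g. `{"dbfs": {"destination": "dbfs:/cluster-logs"}}`.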

**If using DBFS** - no additional steps are needed; CAN\_VIEW access automatically grants read-only access to DBFS logs

**If using S3** - you will need to add the following statement to the policy of the bucket where you store the logs:

```json
{
    "Sid": "AllowDataFlintToReadDatabricksLogs",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::{DataFlint ACCOUNT ID}:role/eks-dataflint-service-role"
    },
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::{YOUR BUCKET NAME}/*",
        "arn:aws:s3:::{YOUR BUCKET NAME}"
    ]
}
```

Contact us privately to get our DataFlint account ID; it may differ between SaaS and BYOC installations.

### Step 4 (Optional): Supply a mapping from your tag names to DataFlint ones

You might already have tags at the Databricks job, job run, or cluster level for values such as team, group, version, and workflow ID.

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FEbTHZeq4qmN1p8ak7Osl%2Fimage.png?alt=media&#x26;token=c0a76004-3eae-4e52-bf3f-634b015e460f" alt=""><figcaption></figcaption></figure>

You can add the following default tags, or map your existing tags to the DataFlint format (via our admin console).

Supported tags are:

1. DATAFLINT\_ENV - the environment (e.g. prod, dev); the default value is "default"
2. DATAFLINT\_TEAM - the team that owns the job
3. DATAFLINT\_DOMAIN - the owning group, business unit, or product the job belongs to
4. DATAFLINT\_DAG\_ID - a unique ID of the DAG the job belongs to
5. DATAFLINT\_DAG\_RUN\_ID - a unique ID of the DAG run that triggered the job
6. DATAFLINT\_VERSION - the version of the job, in semver format
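As an illustration, these can be set as ordinary Databricks tags on the job or cluster; the values below are hypothetical:

```json
{
    "tags": {
        "DATAFLINT_ENV": "prod",
        "DATAFLINT_TEAM": "data-infra",
        "DATAFLINT_DOMAIN": "billing",
        "DATAFLINT_VERSION": "1.4.2"
    }
}
```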

