🤖 Databricks SaaS installation

Summary

This guide details how to create the service principal that allows DataFlint to ingest Databricks performance metadata with read-only access only.

The entire process should take just a few minutes.

The steps are:

  1. Install DataFlint OSS as an init script

  2. Create a new DataFlint service principal with CAN_VIEW access

  3. Enable cluster logs and give read-only access to DataFlint

  4. (Optional) Supply a mapping from your tag names to DataFlint ones

Installation

Step 1: Install DataFlint OSS as an init script

See Install on Databricks
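The linked page walks through the details. At the cluster level this typically means referencing the DataFlint init script in the cluster's init_scripts configuration; a minimal sketch of the relevant cluster-spec fragment, assuming a hypothetical workspace path for the script:

{
    "init_scripts": [
        {
            "workspace": {
                "destination": "/Shared/dataflint/install-dataflint.sh"
            }
        }
    ]
}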

Step 2: Create a new DataFlint service principal with CAN_VIEW access

Create a new service principal

Go to the workspace settings from the top-right corner:

Then choose Identity -> Service principals:

Choose "Add New Principal"

Give it a descriptive name (such as "DataFlintServicePrincipal")

Choose the newly added service principal:

Go to secrets, and press "Generate Secret"

Afterward, you will see a screen with the secret and client ID:

Please supply us with the client ID and the secret from this screen.
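If you prefer to script this step, the service principal can also be created through the workspace SCIM API (POST to /api/2.0/preview/scim/v2/ServicePrincipals); a minimal request-body sketch, with the display name as an example value:

{
    "displayName": "DataFlintServicePrincipal",
    "active": true
}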

Add Permission

Go to Jobs & Pipelines and choose a job:

Press "Edit permissions" and add to the job "can view" permission for the service principal we created

Step 3: Enable cluster logs and give read-only access

Choose the "Compute" config under the job page:

Press "Configure"

Go to Advanced -> Logging and enable cluster logging:
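If your clusters are defined as JSON (for example through the Jobs or Clusters API), the same setting corresponds to the cluster_log_conf field; a minimal sketch with a placeholder DBFS destination (an "s3" block with a destination such as s3://<your bucket>/logs can be used instead):

{
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster-logs"
        }
    }
}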

If using DBFS - no additional steps are needed; CAN_VIEW access automatically gives read-only access to the DBFS logs.

If using S3 - you will need to add the following statement to the bucket policy of the bucket where you save the logs:

{
    "Sid": "Allowing DataFlint to read Databricks logs",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::{DataFlint ACCOUNT ID}:role/eks-dataflint-service-role"
    },
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::{YOUR BUCKET NAME}/*",
        "arn:aws:s3:::{YOUR BUCKET NAME}"
    ]
}

Contact us privately to get our DataFlint ACCOUNT ID; it may differ between SaaS and BYOC installations.
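For context, the statement above goes inside the Statement array of the log bucket's policy; a complete policy might look like this, with the account ID and bucket name still placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allowing DataFlint to read Databricks logs",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::{DataFlint ACCOUNT ID}:role/eks-dataflint-service-role"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{YOUR BUCKET NAME}/*",
                "arn:aws:s3:::{YOUR BUCKET NAME}"
            ]
        }
    ]
}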

Step 4 (Optional): Supply a mapping from your tag names to DataFlint ones

You might already have tags at the Databricks job, job run, or cluster level for values such as team, group, version, or workflow ID.

You can add the following default tags, or map your existing tags to the DataFlint format via our admin console (see the example after the list below).

Supported tags are:

  1. DATAFLINT_ENV - the environment (e.g. prod, dev); the default value is "default"

  2. DATAFLINT_TEAM - the team that owns the job

  3. DATAFLINT_DOMAIN - the owning group, business unit, or product name the job belongs to

  4. DATAFLINT_DAG_ID - a unique ID of the DAG the job belongs to

  5. DATAFLINT_DAG_RUN_ID - a unique ID of the DAG run the job was triggered from

  6. DATAFLINT_VERSION - the version of the job, in semver format
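These tags can be set like any other Databricks tag, for example as custom_tags on the job cluster (or as tags on the job itself); a sketch with hypothetical example values:

{
    "custom_tags": {
        "DATAFLINT_ENV": "prod",
        "DATAFLINT_TEAM": "data-platform",
        "DATAFLINT_DOMAIN": "billing",
        "DATAFLINT_DAG_ID": "daily_revenue_pipeline",
        "DATAFLINT_DAG_RUN_ID": "daily_revenue_pipeline__2024-05-01",
        "DATAFLINT_VERSION": "1.4.2"
    }
}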
