# Databricks SaaS installation

## Summary

This guide details how to create a service principal that gives DataFlint read-only access for ingesting Databricks performance metadata.

The entire process should take just a few minutes.

The steps are:

1. Install DataFlint OSS as init script
2. Create a new service principal for DataFlint, with CAN\_VIEW access
3. Enable cluster logs, and give read-only access to DataFlint
4. (Optional) Supply a mapping from your tag names to DataFlint ones

## Installation

### Step 1: Install DataFlint OSS as init script

See [install-on-databricks](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks "mention")

### Step 2: Create a new service principal for DataFlint, with CAN\_VIEW access

{% hint style="warning" %}
This process requires workspace admin permissions
{% endhint %}

#### Create new service principal

Open the workspace settings from the top-right corner:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F91GW3MUZU4vwLLjWNfCv%2Fimage.png?alt=media&#x26;token=fdd7a51a-428c-488f-85ce-41be4935dc18" alt=""><figcaption></figcaption></figure>

Then choose Identity -> Service principals:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FMIUOxDr46RYRBKAA4tIz%2Fimage.png?alt=media&#x26;token=aa692651-c8e5-40ec-b3cc-634e8d1de59d" alt=""><figcaption></figcaption></figure>

Choose "Add New Principal":

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FzoddrnhotdU9BZPWHW2C%2Fimage.png?alt=media&#x26;token=9f64180d-7fbc-4c8c-aa07-5a64ab53b06e" alt=""><figcaption></figcaption></figure>

Give it a descriptive name (such as "DataFlintServicePrincipal"):

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FPBxticWp2uLvUWnv1EUs%2Fimage.png?alt=media&#x26;token=23cc95ce-e50d-4ebe-a0f0-30136da544fb" alt=""><figcaption></figcaption></figure>

Choose the newly added service principal:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FdzGRAkt9CkPONULMfgNz%2Fimage.png?alt=media&#x26;token=628bd7fa-33c2-453e-b016-b6c15a599bc0" alt=""><figcaption></figcaption></figure>

Go to Secrets, and press "Generate Secret":

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FjzIR4vpL9gQo6qCQKezm%2Fimage.png?alt=media&#x26;token=1747b8b4-d07d-41e0-8dd8-b9c235b100b0" alt=""><figcaption></figcaption></figure>

You will then see a screen with the secret and client ID:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F2jwMi7fvx20k8jR9dgtf%2Fimage.png?alt=media&#x26;token=2fad7933-f32e-4ba4-8183-7dec48db9326" alt=""><figcaption></figcaption></figure>

Please supply us with:

* Display name (for example "data infra team workspace")
* Client ID
* Secret
* Workspace URL (looks something like this: "<https://dbc-1234556-abcde.cloud.databricks.com/>")

#### Add Permission

{% hint style="warning" %}
We demonstrate with a single job in the Databricks UI, but you will likely want to apply this process to a job template or via a Databricks orchestration operator.
{% endhint %}

Go to Jobs & Pipelines and choose a job:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FfG39W2cPZqjbM1ROiCSn%2Fimage.png?alt=media&#x26;token=08b4ee6b-ec51-4776-9868-2f36593b9d26" alt=""><figcaption></figcaption></figure>

Press "Edit permissions" and grant the service principal we created "Can view" permission on the job:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FxTAcTLQYvezOALYZgBO4%2Fimage.png?alt=media&#x26;token=6b2960a4-4d67-413f-9485-540282e5d328" alt=""><figcaption></figcaption></figure>
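If you prefer to grant the permission programmatically, the same change can be made with the Databricks Permissions REST API (`PATCH /api/2.0/permissions/jobs/{job_id}`). A minimal request body might look like the sketch below; the application ID is a placeholder for your service principal's client ID:

```json
{
    "access_control_list": [
        {
            "service_principal_name": "<service-principal-client-id>",
            "permission_level": "CAN_VIEW"
        }
    ]
}
```

Note that `PATCH` adds to the job's existing access control list, while `PUT` replaces it, so `PATCH` is usually the safer choice here.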

### Step 3: Enable cluster logs and give read-only access

{% hint style="warning" %}
We demonstrate with a single cluster in the Databricks UI, but you will likely want to apply this process to a cluster template or via a Databricks orchestration operator.
{% endhint %}

Open the "Compute" configuration on the job page:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FZ2GfxehE28gO2FuDfUEe%2Fimage.png?alt=media&#x26;token=d20f6804-5c63-4a8e-abe5-e2d96fe37432" alt=""><figcaption></figcaption></figure>

Press "Configure".

Go to Advanced -> Logging, and enable cluster logging:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F5pegmSqOvJ3zns9Au65v%2Fimage.png?alt=media&#x26;token=15431196-f3e6-4bc8-aa57-7de1b3c89a8c" alt=""><figcaption></figcaption></figure>
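If you define clusters as code rather than through the UI, cluster logging corresponds to the `cluster_log_conf` field of the cluster spec in the Databricks Clusters API. A sketch of an S3 configuration, with a hypothetical bucket name and region:

```json
{
    "cluster_log_conf": {
        "s3": {
            "destination": "s3://my-cluster-logs-bucket/logs",
            "region": "us-east-1"
        }
    }
}
```

For DBFS, the field instead takes a `dbfs` object, e.g. `{"dbfs": {"destination": "dbfs:/cluster-logs"}}`.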

**If using DBFS** - no additional steps are needed; CAN\_VIEW access automatically grants read-only access to DBFS logs

**If using S3** - you will need to add the following statement to the policy of the bucket where you store the logs:

```json
{
    "Sid": "AllowDataFlintToReadDatabricksLogs",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::{DataFlint ACCOUNT ID}:role/eks-dataflint-service-role"
    },
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::{YOUR BUCKET NAME}/*",
        "arn:aws:s3:::{YOUR BUCKET NAME}"
    ]
}
```

Contact us privately to get our DataFlint account ID; it may differ between SaaS and BYOC installations.

### Step 4 (Optional): Supply a mapping from your tag names to DataFlint ones

You might already have tags at the Databricks job, job run, or cluster level for values such as team, group, version, and workflow ID.

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FEbTHZeq4qmN1p8ak7Osl%2Fimage.png?alt=media&#x26;token=c0a76004-3eae-4e52-bf3f-634b015e460f" alt=""><figcaption></figcaption></figure>

You can add the following default tags, or map your existing tags to the DataFlint format (via our admin console).

Supported tags are:

1. DATAFLINT\_ENV - the environment (e.g. prod, dev); the default value is "default"
2. DATAFLINT\_TEAM - the team that owns the job
3. DATAFLINT\_DOMAIN - the owning group, business unit, or product the job belongs to
4. DATAFLINT\_DAG\_ID - a unique ID of the DAG the job belongs to
5. DATAFLINT\_DAG\_RUN\_ID - a unique ID of the DAG run that triggered the job
6. DATAFLINT\_VERSION - the version of the job, in semver format
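As an illustration, these can be set as ordinary Databricks tags on the job or cluster; the values below are hypothetical:

```json
{
    "tags": {
        "DATAFLINT_ENV": "prod",
        "DATAFLINT_TEAM": "data-infra",
        "DATAFLINT_DOMAIN": "billing",
        "DATAFLINT_VERSION": "1.4.2"
    }
}
```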

