# Databricks SaaS installation

## Summary

This guide details how to create the service account DataFlint needs in order to ingest Databricks performance metadata, using read-only access only.

The entire process should take just a few minutes.

The steps are:

1. Install DataFlint OSS as an init script
2. Create a new DataFlint service principal with CAN\_VIEW access
3. Enable cluster logs, and give read-only access to DataFlint
4. (Optional) Map your tag names to DataFlint ones

## Installation

### Step 1: Install DataFlint OSS as an init script

See [Install on Databricks](/dataflint-for-spark/getting-started/install-on-databricks.md)

### Step 2: Create a new DataFlint service principal with CAN\_VIEW access

{% hint style="warning" %}
This process requires workspace admin permissions
{% endhint %}

#### Create new service principal

Open the workspace settings from the top-right corner:

<figure><img src="/files/DIEzIZCGTzsTOgCYC0vj" alt=""><figcaption></figcaption></figure>

Then choose Identity -> Service principals:

<figure><img src="/files/w1blaPgRc7sio48LlPG2" alt=""><figcaption></figcaption></figure>

Choose "Add New Principal"

<figure><img src="/files/AcjVP5GlHOVIqVanWIGo" alt=""><figcaption></figcaption></figure>

Give it a descriptive name (such as "DataFlintServicePrincipal"):

<figure><img src="/files/JqoP9GB7eDO0Dp9Kdtwm" alt=""><figcaption></figcaption></figure>

Choose the newly added service principal:

<figure><img src="/files/YHARCUQWBKhhTBsql89n" alt=""><figcaption></figcaption></figure>

Go to Secrets and press "Generate Secret":

<figure><img src="/files/l12Sp5jRUI91iWra4Ba4" alt=""><figcaption></figcaption></figure>

Afterwards you will see a screen with the secret and client ID:

<figure><img src="/files/uGRID0hnaCMNtGUTrNBh" alt=""><figcaption></figcaption></figure>
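If you prefer to script this step, creating the service principal can also be done through the API. Below is a minimal sketch using the `databricks-sdk` Python package; the display name is just an example, and the secret itself is still generated in the UI as shown above:

```python
from databricks.sdk import WorkspaceClient

# Credentials are picked up from the environment or ~/.databrickscfg;
# this requires workspace admin permissions, as noted above.
w = WorkspaceClient()

sp = w.service_principals.create(display_name="DataFlintServicePrincipal")

# The application ID is the "Client ID" you will supply to us.
print(sp.application_id)
```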

Please supply us with:

* Display name (for example "data infra team workspace")
* Client ID
* Secret
* Workspace URL (looks something like this: "<https://dbc-1234556-abcde.cloud.databricks.com/>")

#### Add Permission

{% hint style="warning" %}
We will demonstrate with one job in the Databricks UI, but you will probably want to apply this process to a job template or via a Databricks orchestration operator.
{% endhint %}

Go to Jobs & Pipelines and choose a job:

<figure><img src="/files/BhNGAB2hzDic8rLqwZWr" alt=""><figcaption></figcaption></figure>

Press "Edit permissions" and add to the job "can view" permission for the service principal we created<br>

<figure><img src="/files/1GiDbRsW57ibYSwg7KLc" alt=""><figcaption></figcaption></figure>
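If you manage job permissions as code, granting CAN\_VIEW can look like the sketch below (again using the `databricks-sdk` Python package; the job ID and client ID are placeholders):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()

# Grant the DataFlint service principal CAN_VIEW on a single job.
# Replace the job ID and the service principal's client ID with your own.
w.permissions.update(
    request_object_type="jobs",
    request_object_id="123456789",
    access_control_list=[
        iam.AccessControlRequest(
            service_principal_name="<dataflint-service-principal-client-id>",
            permission_level=iam.PermissionLevel.CAN_VIEW,
        )
    ],
)
```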

### Step 3: Enable cluster logs, and give read-only access

{% hint style="warning" %}
We will demonstrate with one cluster in the Databricks UI, but you will probably want to apply this process to a cluster template or via a Databricks orchestration operator.
{% endhint %}

Choose the "Compute" config under the job page:

<figure><img src="/files/3RonIKHeh0zTAyUOmkAC" alt=""><figcaption></figcaption></figure>

Press "Configure"

Go to Advanced -> Logging, and enable cluster logging:

<figure><img src="/files/BJIsGSwHZi3pnSl10D8T" alt=""><figcaption></figcaption></figure>
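If your clusters are defined in code rather than in the UI, the same logging configuration can be set through the API. Here is a minimal sketch with the `databricks-sdk` Python package; all cluster values are hypothetical:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Deliver cluster logs to DBFS; an S3 destination can be used instead
# via compute.S3StorageInfo. The cluster values below are examples only.
w.clusters.create(
    cluster_name="example-job-cluster",
    spark_version="15.4.x-scala2.12",
    node_type_id="m5.xlarge",
    num_workers=2,
    cluster_log_conf=compute.ClusterLogConf(
        dbfs=compute.DbfsStorageInfo(destination="dbfs:/cluster-logs")
    ),
)
```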

**If using DBFS** - no additional steps are needed; CAN\_VIEW access automatically grants read-only access to the DBFS logs

**If using S3** - you will need to add the following statement to the policy of the bucket where you save the logs:

```json
{
    "Sid": "AllowDataFlintToReadDatabricksLogs",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::{DataFlint ACCOUNT ID}:role/eks-dataflint-service-role"
    },
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::{YOUR BUCKET NAME}/*",
        "arn:aws:s3:::{YOUR BUCKET NAME}"
    ]
}
```

Note that the `Sid` must be alphanumeric (no spaces), or S3 will reject the policy.

Contact us privately to get our DataFlint account ID; it might differ between SaaS and BYOC installations.
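If you manage the bucket policy programmatically, appending the statement can look like this sketch using boto3 (the bucket name is a placeholder, and the account ID is the value we supply privately):

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "<your-cluster-logs-bucket>"      # placeholder
dataflint_account_id = "<provided-by-us>"  # supplied privately by DataFlint

statement = {
    "Sid": "AllowDataFlintToReadDatabricksLogs",
    "Effect": "Allow",
    "Principal": {
        "AWS": f"arn:aws:iam::{dataflint_account_id}:role/eks-dataflint-service-role"
    },
    "Action": ["s3:ListBucket", "s3:GetObject"],
    "Resource": [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"],
}

# get_bucket_policy raises if the bucket has no policy yet; in that case,
# create a new policy document containing just this statement instead.
policy = json.loads(s3.get_bucket_policy(Bucket=bucket)["Policy"])
policy["Statement"].append(statement)
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```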

### Step 4 (Optional): Map your tag names to DataFlint ones

You might already have tags at the Databricks job, job run, or cluster level for values such as team, group, version, or workflow ID.

<figure><img src="/files/BodCsSmxkJVgN59EpnCt" alt=""><figcaption></figcaption></figure>

You can add the following default tags, or map your existing tags to the DataFlint format (via our admin console); a scripted example follows the list below.

Supported tags are:

1. DATAFLINT\_ENV - the environment (e.g. prod, dev); the default value is "default"
2. DATAFLINT\_TEAM - the team that owns the job
3. DATAFLINT\_DOMAIN - the owning group, business unit, or product name the job belongs to
4. DATAFLINT\_DAG\_ID - a unique ID of the DAG the job belongs to
5. DATAFLINT\_DAG\_RUN\_ID - a unique ID of the DAG run the job was triggered from
6. DATAFLINT\_VERSION - the version of the job, in semver format
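
For example, attaching these tags to an existing job through the API could look like the sketch below (`databricks-sdk` Python package; the job ID and all tag values are hypothetical):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Example tag values only; use whatever matches your environment.
dataflint_tags = {
    "DATAFLINT_ENV": "prod",
    "DATAFLINT_TEAM": "data-infra",
    "DATAFLINT_DOMAIN": "billing",
    "DATAFLINT_DAG_ID": "daily-billing-etl",
    "DATAFLINT_VERSION": "1.4.2",
}

# Attach the tags to an existing job (the job ID is a placeholder).
w.jobs.update(
    job_id=123456789,
    new_settings=jobs.JobSettings(tags=dataflint_tags),
)
```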
