# Databricks

## DataFlint Integration with Databricks

### Overview

DataFlint is a production-aware AI copilot for Apache Spark that provides cost optimization and performance monitoring for Databricks workloads. DataFlint connects to your Databricks workspace via the REST API to collect Spark event logs and cluster metadata, then analyzes them to surface optimization opportunities, detect bottlenecks, and reduce infrastructure costs.

#### Supported Clouds

| Cloud Provider           | Status      |
| ------------------------ | ----------- |
| AWS (Databricks on AWS)  | ✅ Supported |
| Azure (Azure Databricks) | ✅ Supported |
| GCP (Databricks on GCP)  | ✅ Supported |

#### What DataFlint Collects

DataFlint has two components that work together:

1. **DataFlint Spark Plugin** — An open-source Spark plugin (`io.dataflint.spark.SparkDataflintPlugin`) installed on your Databricks clusters that enriches Spark event logs with detailed execution metrics, query plans, and optimization metadata
2. **DataFlint SaaS Platform** — Connects to your Databricks workspace via the REST API to collect the enriched logs and metadata for analysis

#### Data Collected

* **Enriched Spark event logs** — Execution plans, stage metrics, task-level statistics, shuffle data, and DataFlint optimization metadata (via the Spark plugin)
* **Cluster metadata** — Cluster configurations, node types, autoscaling settings, and runtime versions
* **Job and run metadata** — Job definitions, run history, execution durations, and status information

DataFlint also reads from **Unity Catalog system tables** for cost and compute analytics:

* **`system.billing.usage`** — DBU consumption by cluster, job, warehouse, and workspace
* **`system.compute.node_timeline`** — Per-node CPU, memory, and network utilization over time

DataFlint does **not** access or read any of your business data, user tables, or files stored in Delta Lake or cloud storage. The only Unity Catalog assets accessed are system tables, which contain operational and billing metadata.
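
To make the scope of system-table access concrete, here is a minimal sketch of a read-only billing query wrapped in a request body for the `/api/2.0/sql/statements` endpoint. It is an illustration only: the function name `billing_usage_payload` and the warehouse ID are hypothetical, the column names (`usage_date`, `sku_name`, `usage_quantity`) are assumed from the Databricks billing table schema, and the queries DataFlint actually runs are not documented here.

```python
import json


def billing_usage_payload(warehouse_id: str, days: int = 7) -> str:
    """Build a JSON body for POST /api/2.0/sql/statements (illustrative only)."""
    if days <= 0:
        raise ValueError("days must be positive")
    # Read-only aggregate over operational billing metadata; no user tables involved.
    statement = (
        "SELECT usage_date, sku_name, SUM(usage_quantity) AS total_usage "
        "FROM system.billing.usage "
        f"WHERE usage_date >= date_sub(current_date(), {int(days)}) "
        "GROUP BY usage_date, sku_name "
        "ORDER BY usage_date"
    )
    body = {
        "warehouse_id": warehouse_id,  # placeholder: a SQL warehouse the principal can use
        "statement": statement,
        "wait_timeout": "30s",
    }
    return json.dumps(body)
```

Statement execution runs on a SQL warehouse, so the service principal would also need permission to use the warehouse named in the payload.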

<figure><img src="/files/AJllKQMtFhv0BizQtmbr" alt=""><figcaption></figcaption></figure>

***

### Prerequisites

Before setting up the DataFlint integration, ensure you have:

* An active Databricks workspace on **AWS**, **Azure**, or **GCP**
* **Workspace admin** or **account admin** permissions (required to create a service principal)
* A DataFlint account (sign up at [dataflint.io](https://www.dataflint.io))

***

### Setup Guide

#### Step 1: Install the DataFlint Spark Plugin on Databricks

The DataFlint Spark plugin enriches your Spark event logs with detailed performance and optimization data. There are two installation methods:

**Option A: Init Script (Recommended)**

This method automatically installs the plugin on cluster startup.

1. In your Databricks workspace, go to **Workspace** → **Create** → **File**
2. Paste the following init script and save it:

```bash
#!/bin/bash
# Init script: downloads the DataFlint plugin JAR on every node and
# registers the plugin in the driver's Spark defaults.
DATAFLINT_VERSION="0.9.9"
JAR_NAME="spark_2.12-${DATAFLINT_VERSION}.jar"
SPARK_DEFAULTS_FILE="/databricks/driver/conf/00-custom-spark-driver-defaults.conf"

mkdir -p /databricks/jars/
wget --quiet \
  -O "/databricks/jars/${JAR_NAME}" \
  "https://repo1.maven.org/maven2/io/dataflint/spark_2.12/${DATAFLINT_VERSION}/${JAR_NAME}"

# DB_IS_DRIVER is set to "TRUE" by Databricks on the driver node only.
if [[ "$DB_IS_DRIVER" = "TRUE" ]]; then
  mkdir -p /mnt/driver-daemon/jars/
  cp "/databricks/jars/${JAR_NAME}" "/mnt/driver-daemon/jars/${JAR_NAME}"

  # Append a [driver] block that enables the plugin.
  {
    echo "[driver] {"
    echo "  spark.plugins = io.dataflint.spark.SparkDataflintPlugin"
    echo "}"
  } >> "$SPARK_DEFAULTS_FILE"
fi
```

3. In your cluster configuration, go to **Advanced** → **Init Scripts** and add the path to your init script
4. Restart the cluster

> **Note:** Init scripts are not supported on Databricks Community Edition. Use Option B instead.
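
After the cluster restarts, one way to confirm that the init script took effect is to inspect the `spark.plugins` setting from a notebook. The helper below is a minimal sketch: `has_dataflint_plugin` is a hypothetical name, and it assumes only that `spark.plugins` holds a comma-separated list of class names.

```python
DATAFLINT_PLUGIN = "io.dataflint.spark.SparkDataflintPlugin"


def has_dataflint_plugin(spark_plugins_value: str) -> bool:
    """Return True if the comma-separated spark.plugins value lists the DataFlint plugin."""
    plugins = [name.strip() for name in spark_plugins_value.split(",")]
    return DATAFLINT_PLUGIN in plugins


# In a Databricks Python notebook, where `spark` is predefined:
# has_dataflint_plugin(spark.conf.get("spark.plugins", ""))
```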

**Option B: Notebook Installation**

This method works on both Databricks Community Edition and paid versions.

1. Go to your cluster → **Libraries** tab → **Install New**
2. Choose **Maven** and enter the coordinates:

```
io.dataflint:spark_2.12:0.9.9
```

3. In your notebook, run the following (add `%scala` if using a Python notebook):

```scala
%scala
import io.dataflint.spark.SparkDataflint
SparkDataflint.install(spark.sparkContext)
```

After installation, a **DataFlint** tab will appear in the Spark UI. Use the "Open in new tab" link for the best experience.

> **Note:** The DataFlint Spark UI is only available while the cluster is running.

#### Step 2: Create a Databricks Service Principal

DataFlint authenticates using OAuth machine-to-machine (M2M) with a Databricks service principal. This is the recommended and most secure authentication method.

1. Navigate to your Databricks **Account Console** → **User management** → **Service principals**
2. Click **Add service principal** and give it a descriptive name (e.g., `dataflint-integration`)
3. Note the **Application ID** — this will be your `client_id`

#### Step 3: Generate an OAuth Secret

1. In the Account Console, select the service principal you just created
2. Go to the **Secrets** tab
3. Click **Generate secret**
4. Copy and securely store both the **Client ID** and **Client Secret** — the secret will only be shown once

#### Step 4: Grant Required Permissions

The DataFlint service principal needs read access to the following Databricks resources:

**Workspace-level permissions:**

| Permission                     | Purpose                                  |
| ------------------------------ | ---------------------------------------- |
| `CAN_VIEW` on clusters         | Read cluster configurations and metadata |
| `CAN_VIEW` on jobs             | Access job definitions and run history   |
| Access to cluster log delivery | Read Spark event logs                    |
| `SELECT` on `system` catalog   | Read billing and compute system tables   |

**Unity Catalog permissions:**

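Grant the service principal read access to system tables. Unity Catalog identifies a service principal in `GRANT` statements by its Application ID, so if the statement below fails with the display name, substitute the Application ID noted in Step 2: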

```sql
GRANT SELECT ON CATALOG system TO `dataflint-integration`;
```

To grant these permissions:

1. Go to your Databricks workspace → **Admin Settings** → **Service principals**
2. Add the service principal to the workspace
3. Assign the necessary permissions as listed above

> **Note:** DataFlint requires **read-only** access. It does not need permissions to create, modify, or delete any Databricks resources.

#### Step 5: Configure DataFlint

1. Log in to your DataFlint dashboard
2. Navigate to **Admin Panel** → **Environment Management** → **Add New**
3. Enter the following details:
   * **Workspace URL** — Your Databricks workspace URL (e.g., `https://adb-1234567890.12.azuredatabricks.net` or `https://dbc-abc123.cloud.databricks.com`)
   * **Client ID** — The Application ID of the service principal
   * **Client Secret** — The OAuth secret generated in Step 3
4. Click **Test Connection** to verify the setup
5. Click **Save** to enable the integration

#### Step 6: Verify the Integration

Once configured, DataFlint will begin collecting Spark event logs and metadata from your workspace. You can verify the integration is working by:

1. Running a Spark job on your Databricks workspace
2. Checking the DataFlint dashboard — the job should appear within a few minutes
3. Reviewing the optimization insights and cost analysis generated for the job
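
As a supplementary check, the run history that DataFlint reads can also be listed directly through `/api/2.1/jobs/runs/list` with the service principal's token. The sketch below walks the paginated response; `fetch_page` is a hypothetical callable that performs the HTTP call, and the `runs`, `has_more`, and `next_page_token` fields are assumed from the Jobs API 2.1 response format.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_runs(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every run across all pages of a runs/list response.

    fetch_page(page_token) is a hypothetical callable that performs
    GET /api/2.1/jobs/runs/list and returns the parsed JSON response.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        # Stop when the API reports no further pages or gives no token to follow.
        if not page.get("has_more") or not token:
            break
```

If the job you just ran shows up in this listing but not in DataFlint, the permissions are likely fine and the delay is on the collection side.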

***

### Authentication Details

#### OAuth M2M Flow

DataFlint uses the OAuth 2.0 Client Credentials grant (machine-to-machine) to authenticate with Databricks. Databricks recommends this approach for ISV integrations.

**Token endpoint format:**

```
POST https://<workspace-url>/oidc/v1/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials
&client_id=<service-principal-application-id>
&client_secret=<oauth-secret>
&scope=all-apis
```
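
The same request can be sketched with the Python standard library. This is a minimal illustration that builds the request without sending it; `build_token_request` is a hypothetical helper name, and the workspace URL and credentials are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(workspace_url: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) the client-credentials token request."""
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "all-apis",
    }).encode("ascii")
    return Request(
        url=f"{workspace_url.rstrip('/')}/oidc/v1/token",
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Sending the request, for example with `urllib.request.urlopen`, should return a JSON body that includes `access_token` and `expires_in`.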

**Token usage:**

* Access tokens are included in the `Authorization: Bearer <token>` header on all API requests
* Tokens are valid for **1 hour** and are automatically refreshed by DataFlint before expiry
* No manual token management is required after initial setup
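
Refreshing before expiry can be pictured as a small cache around the token call. The sketch below is not DataFlint's implementation: the `TokenCache` class, the `fetch` callable, and the 5-minute refresh margin are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        refresh_margin_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch              # returns (access_token, expires_in_seconds)
        self._margin = refresh_margin_s  # hypothetical safety margin before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one if the cached one is near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in
        return self._token
```

With a 1-hour token and a 5-minute margin, a new token would be requested roughly every 55 minutes, so a revoked secret or disabled service principal would surface as a failure at about that point.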

#### Partner Telemetry

DataFlint includes a `User-Agent` header on all Databricks API calls to identify itself as an integration partner:

```
User-Agent: dataflint/<version> (Databricks)
```

This allows Databricks to track partner integration usage and is a requirement of the Databricks Technology Partner Program.
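
Putting the bearer token and the `User-Agent` header together, an authenticated read-only call might be assembled as follows. This is a sketch under stated assumptions: `build_api_request` is a hypothetical helper, the version string is a placeholder, and the request is built but not sent.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(
    workspace_url: str,
    path: str,
    token: str,
    version: str,
    params: Optional[Mapping[str, str]] = None,
) -> Request:
    """Build (but do not send) a read-only GET request to a Databricks REST endpoint."""
    url = f"{workspace_url.rstrip('/')}{path}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(
        url=url,
        headers={
            "Authorization": f"Bearer {token}",
            "User-Agent": f"dataflint/{version} (Databricks)",
        },
        method="GET",
    )
```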

***

### Data Flow Architecture

```
┌──────────────────────┐     OAuth M2M       ┌──────────────────────┐
│                      │ ──────────────────► │                      │
│    DataFlint SaaS    │   REST API calls    │  Databricks          │
│                      │ ◄────────────────── │  Workspace           │
│  • Job analysis      │  Enriched logs &    │                      │
│  • Cost insights     │  cluster metadata   │  ┌─────────────────┐ │
│  • Optimization      │                     │  │ DataFlint Spark │ │
│    recommendations   │                     │  │ Plugin (OSS)    │ │
│                      │                     │  │ Enriches event  │ │
│                      │                     │  │ logs in-cluster │ │
│                      │                     │  └─────────────────┘ │
└──────────────────────┘                     └──────────────────────┘
```

**Key points:**

* The **DataFlint Spark Plugin** runs inside your Databricks clusters and enriches Spark event logs with optimization metadata — it does not send data externally
* The **DataFlint SaaS platform** connects via the REST API (OAuth M2M) to collect the enriched logs
* All communication is outbound from DataFlint to Databricks over HTTPS
* DataFlint uses read-only API access — no write operations are performed
* No business data (tables, files, query results) is accessed or transferred
* All credentials are encrypted at rest and in transit

***

### API Endpoints Used

DataFlint interacts with the following Databricks REST API endpoints:

| Endpoint                  | Method | Purpose                                              |
| ------------------------- | ------ | ---------------------------------------------------- |
| `/api/2.0/clusters/list`  | GET    | List workspace clusters                              |
| `/api/2.0/clusters/get`   | GET    | Get cluster configuration details                    |
| `/api/2.1/jobs/list`      | GET    | List workspace jobs                                  |
| `/api/2.1/jobs/runs/list` | GET    | List job runs                                        |
| `/api/2.1/jobs/runs/get`  | GET    | Get run details and metadata                         |
| `/api/2.0/dbfs/read`      | GET    | Read Spark event log files                           |
| `/api/2.0/sql/statements` | POST   | Query Unity Catalog system tables (billing, compute) |
| `/oidc/v1/token`          | POST   | OAuth token generation                               |
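
Because `/api/2.0/dbfs/read` returns base64-encoded data and limits a single read to 1 MB, reading a large event log means looping over offsets. The sketch below shows that loop; `fetch_chunk` is a hypothetical callable standing in for the HTTP call, and the code illustrates the endpoint's contract rather than DataFlint's actual collector.

```python
import base64
from typing import Callable, Dict, Iterator

MAX_READ_BYTES = 1024 * 1024  # the DBFS read endpoint caps a single read at 1 MB


def read_dbfs_file(
    fetch_chunk: Callable[[str, int, int], Dict],
    path: str,
    chunk_size: int = MAX_READ_BYTES,
) -> Iterator[bytes]:
    """Yield decoded chunks of a DBFS file until the end of the file is reached.

    fetch_chunk(path, offset, length) is a hypothetical callable that performs
    GET /api/2.0/dbfs/read and returns the parsed JSON response.
    """
    offset = 0
    while True:
        resp = fetch_chunk(path, offset, chunk_size)
        bytes_read = resp.get("bytes_read", 0)
        if bytes_read == 0:
            break
        yield base64.b64decode(resp["data"])
        offset += bytes_read
        if bytes_read < chunk_size:
            break  # a short read means the end of the file was reached
```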

***

### Troubleshooting

#### Connection Test Fails

**Symptom:** "Unable to connect to Databricks workspace" when testing the connection.

**Solutions:**

* Verify the workspace URL is correct and includes the full domain
* Confirm the Client ID and Client Secret are entered correctly
* Check that the service principal is added to the workspace (not just the account)
* Ensure the workspace accepts inbound HTTPS connections from DataFlint; firewalls, IP access lists, or private networking can prevent the SaaS platform from reaching the workspace URL
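
Many failed connection tests come down to a malformed workspace URL. The following sketch shows one way to sanity-check the value before entering it; `normalize_workspace_url` is a hypothetical helper, and the rules it applies (HTTPS only, host must contain a dot) are assumptions rather than DataFlint's own validation.

```python
from urllib.parse import urlsplit


def normalize_workspace_url(raw: str) -> str:
    """Return 'https://<host>' for a workspace URL, or raise ValueError if it looks malformed."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate  # assume HTTPS when the scheme is omitted
    parts = urlsplit(candidate)
    if parts.scheme != "https":
        raise ValueError("workspace URL must use https")
    host = parts.hostname or ""
    if "." not in host:
        raise ValueError("workspace URL must include the full domain")
    return f"https://{host}"
```

Paths and query strings, such as the `?o=` suffix often copied from the browser address bar, are dropped so that only the scheme and host remain.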

#### No Jobs Appearing in DataFlint

**Symptom:** The connection succeeds but no Spark jobs appear in DataFlint.

**Solutions:**

* Ensure the service principal has `CAN_VIEW` permissions on the relevant clusters and jobs
* Verify that cluster log delivery is enabled for your clusters
* Check that jobs have actually run since the integration was configured
* Allow up to 10 minutes for initial data collection

#### Permission Denied Errors

**Symptom:** "403 Forbidden" or "Permission denied" errors in DataFlint logs.

**Solutions:**

* Review the service principal's workspace permissions
* Ensure the service principal has access to the specific clusters and jobs you want to monitor
* If using IP access lists, add DataFlint's IP addresses to your workspace allow list

#### Token Refresh Issues

**Symptom:** Integration works initially but stops after about an hour.

**Solutions:**

* Verify the OAuth secret has not expired or been revoked
* Check that the service principal is still active in the Account Console
* Re-generate the OAuth secret and update it in DataFlint settings

***

### Security & Compliance

* **Encryption:** All data in transit is encrypted via TLS 1.2+. Credentials are encrypted at rest using AES-256
* **Access:** DataFlint uses read-only access to your Databricks workspace. No write, delete, or modify operations are performed
* **Data scope:** Only operational metadata (logs, cluster configs, job metadata) is collected. No business data, tables, or query results are accessed
* **Credential storage:** OAuth secrets are stored encrypted and are never logged or exposed in plaintext
* **Data retention:** Collected metadata is retained according to your DataFlint plan settings and can be deleted on request

***

### Support

For questions or issues with the DataFlint Databricks integration:

* **Email:** <support@dataflint.io>
* **Documentation:** [docs.dataflint.io](https://docs.dataflint.io)
* **Databricks Partner Operations:** <partnerops@databricks.com>
