
DataFlint Integration with Databricks


Overview

DataFlint is a production-aware AI copilot for Apache Spark that provides cost optimization and performance monitoring for Databricks workloads. DataFlint connects to your Databricks workspace via the REST API to collect Spark event logs and cluster metadata, then analyzes them to surface optimization opportunities, detect bottlenecks, and reduce infrastructure costs.

Supported Clouds

| Cloud Provider | Status |
| --- | --- |
| AWS (Databricks on AWS) | ✅ Supported |
| Azure (Azure Databricks) | ✅ Supported |
| GCP (Databricks on GCP) | ✅ Supported |

What DataFlint Collects

DataFlint has two components that work together:

  1. DataFlint Spark Plugin: an open-source Spark plugin (io.dataflint.spark.SparkDataflintPlugin) installed on your Databricks clusters that enriches Spark event logs with detailed execution metrics, query plans, and optimization metadata

  2. DataFlint SaaS Platform: connects to your Databricks workspace via the REST API to collect the enriched logs and metadata for analysis

Data collected

  • Enriched Spark event logs: execution plans, stage metrics, task-level statistics, shuffle data, and DataFlint optimization metadata (via the Spark plugin)

  • Cluster metadata: cluster configurations, node types, autoscaling settings, and runtime versions

  • Job and run metadata: job definitions, run history, execution durations, and status information

DataFlint does not access or read any of your business data, tables, or files stored in Unity Catalog, Delta Lake, or cloud storage.


Prerequisites

Before setting up the DataFlint integration, ensure you have:

  • An active Databricks workspace on AWS, Azure, or GCP

  • Workspace admin or account admin permissions (required to create a service principal)

  • A DataFlint account (sign up at dataflint.io)


Setup Guide

Step 1: Install the DataFlint Spark Plugin on Databricks

The DataFlint Spark plugin enriches your Spark event logs with detailed performance and optimization data. There are two installation methods:

Option A: Init Script (Recommended)

This method automatically installs the plugin on cluster startup.

  1. In your Databricks workspace, go to Workspace → Create → File

  2. Paste in an init script that downloads the DataFlint plugin JAR and registers it with Spark, then save the file
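A sketch of such an init script, assuming the plugin is published to Maven Central under io.dataflint:spark_2.12 (the version, download path, and config snippet below are assumptions; verify them against the DataFlint documentation for the current release):

```bash
#!/bin/bash
# Hypothetical DataFlint init script: coordinates and version are
# assumptions; check Maven Central / the DataFlint docs before use.
set -e

DATAFLINT_VERSION="0.2.2"   # assumed version
SCALA_VERSION="2.12"        # match your cluster's Scala version

# Download the plugin JAR into the directory Databricks loads jars from
wget -q -O /databricks/jars/dataflint-spark.jar \
  "https://repo1.maven.org/maven2/io/dataflint/spark_${SCALA_VERSION}/${DATAFLINT_VERSION}/spark_${SCALA_VERSION}-${DATAFLINT_VERSION}.jar"

# Register the plugin so Spark loads it on cluster startup
cat > /databricks/driver/conf/00-dataflint-spark.conf <<'EOF'
[driver] {
  "spark.plugins" = "io.dataflint.spark.SparkDataflintPlugin"
}
EOF
```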

  3. In your cluster configuration, go to Advanced → Init Scripts and add the path to your init script

  4. Restart the cluster

Note: Init scripts are not supported on Databricks Community Edition. Use Option B instead.

Option B: Notebook Installation

This method works on both Databricks Community Edition and paid versions.

  1. Go to your cluster → Libraries tab → Install New

  2. Choose Maven and enter the coordinates:
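The coordinates follow the plugin's package name; a plausible form (the version shown is an assumption, so check Maven Central for the latest DataFlint release, and pick the artifact matching your cluster's Scala version):

```
io.dataflint:spark_2.12:0.2.2
```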

  3. In a notebook attached to the cluster, run the plugin's install command (add %scala as the first line if using a Python notebook)
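A minimal install snippet, assuming the plugin exposes the SparkDataflint.install helper described in the open-source DataFlint README (verify the exact entry point against the version you installed):

```scala
// Assumed API from the DataFlint README; may differ between releases.
import io.dataflint.spark.SparkDataflint

// Attaches the DataFlint tab to the running cluster's Spark UI
SparkDataflint.install(spark.sparkContext)
```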

After installation, a DataFlint tab will appear in the Spark UI. Use the "Open in new tab" link for the best experience.

Note: The DataFlint Spark UI is only available while the cluster is running.

Step 2: Create a Databricks Service Principal

DataFlint authenticates using OAuth machine-to-machine (M2M) with a Databricks service principal. This is the recommended and most secure authentication method.

  1. Navigate to your Databricks Account Console → User management → Service principals

  2. Click Add service principal and give it a descriptive name (e.g., dataflint-integration)

  3. Note the Application ID; this will be your client_id

Step 3: Generate an OAuth Secret

  1. In the Account Console, select the service principal you just created

  2. Go to the Secrets tab

  3. Click Generate secret

  4. Copy and securely store both the Client ID and Client Secret; the secret is shown only once

Step 4: Grant Required Permissions

The DataFlint service principal needs read access to the following Databricks resources:

Workspace-level permissions:

| Permission | Purpose |
| --- | --- |
| CAN_VIEW on clusters | Read cluster configurations and metadata |
| CAN_VIEW on jobs | Access job definitions and run history |
| Access to cluster log delivery | Read Spark event logs |

To grant these permissions:

  1. Go to your Databricks workspace → Admin Settings → Service principals

  2. Add the service principal to the workspace

  3. Assign the necessary permissions as listed above
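As an alternative to the UI steps above, cluster-level permissions can be granted via the Databricks Permissions API; a sketch (all bracketed values are placeholders, and the payload shape should be checked against the Permissions API reference):

```bash
# Grant CAN_VIEW on a cluster to the DataFlint service principal.
# <workspace-url>, <admin-token>, <cluster-id>, and <sp-application-id>
# are placeholders for your values.
curl -s -X PATCH \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"access_control_list": [{"service_principal_name": "<sp-application-id>", "permission_level": "CAN_VIEW"}]}' \
  "https://<workspace-url>/api/2.0/permissions/clusters/<cluster-id>"
```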

Note: DataFlint requires read-only access. It does not need permissions to create, modify, or delete any Databricks resources.

Step 5: Configure DataFlint

  1. Log in to your DataFlint dashboard

  2. Navigate to Settings → Integrations → Databricks

  3. Enter the following details:

    • Workspace URL: your Databricks workspace URL (e.g., https://adb-1234567890.12.azuredatabricks.net or https://dbc-abc123.cloud.databricks.com)

    • Client ID: the Application ID of the service principal

    • Client Secret: the OAuth secret generated in Step 3

  4. Click Test Connection to verify the setup

  5. Click Save to enable the integration

Step 6: Verify the Integration

Once configured, DataFlint will begin collecting Spark event logs and metadata from your workspace. You can verify the integration is working by:

  1. Running a Spark job on your Databricks workspace

  2. Checking the DataFlint dashboard; the job should appear within a few minutes

  3. Reviewing the optimization insights and cost analysis generated for the job
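Any workload will do for step 1; for example, a throwaway aggregation in a notebook is enough to produce an event log for DataFlint to pick up (illustrative snippet, not a required job):

```scala
// Trivial Spark job: a wide-enough aggregation to produce stages,
// shuffle metrics, and an event log entry for DataFlint to analyze.
spark.range(0, 10000000)
  .selectExpr("id % 100 as key", "id as value")
  .groupBy("key")
  .sum("value")
  .show()
```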


Authentication Details

OAuth M2M Flow

DataFlint uses the OAuth 2.0 Client Credentials grant (machine-to-machine) to authenticate with Databricks. This is Databricks' recommended approach for ISV integrations.

Token endpoint format:
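The token endpoint sits under your workspace domain, matching the /oidc/v1/token entry in the API table later on this page. Illustratively (all bracketed values are placeholders), the request has this shape, with the service principal's credentials supplied via HTTP basic auth:

```bash
# OAuth 2.0 client_credentials request against the workspace token endpoint.
# <workspace-url>, <client-id>, and <client-secret> are placeholders.
curl -s -X POST \
  -u "<client-id>:<client-secret>" \
  -d "grant_type=client_credentials&scope=all-apis" \
  "https://<workspace-url>/oidc/v1/token"
# The JSON response carries access_token, token_type, and expires_in (seconds)
```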

Token usage:

  • Access tokens are included in the Authorization: Bearer <token> header on all API requests

  • Tokens are valid for 1 hour and are automatically refreshed by DataFlint before expiry

  • No manual token management is required after initial setup

Partner Telemetry

DataFlint includes a User-Agent header on all Databricks API calls to identify itself as an integration partner:
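The concrete product string is controlled by DataFlint and may change between releases; the header takes the standard form (the value below is a placeholder, not the literal string sent):

```
User-Agent: dataflint/<version>
```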

This allows Databricks to track partner integration usage and is a requirement of the Databricks Technology Partner Program.


Data Flow Architecture

Key points:

  • The DataFlint Spark Plugin runs inside your Databricks clusters and enriches Spark event logs with optimization metadata; it does not send data externally

  • The DataFlint SaaS platform connects via the REST API (OAuth M2M) to collect the enriched logs

  • All communication is outbound from DataFlint to Databricks over HTTPS

  • DataFlint uses read-only API access; no write operations are performed

  • No business data (tables, files, query results) is accessed or transferred

  • All credentials are encrypted at rest and in transit


API Endpoints Used

DataFlint interacts with the following Databricks REST API endpoints:

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/2.0/clusters/list | GET | List workspace clusters |
| /api/2.0/clusters/get | GET | Get cluster configuration details |
| /api/2.1/jobs/list | GET | List workspace jobs |
| /api/2.1/jobs/runs/list | GET | List job runs |
| /api/2.1/jobs/runs/get | GET | Get run details and metadata |
| /api/2.0/dbfs/read | GET | Read Spark event log files |
| /oidc/v1/token | POST | OAuth token generation |
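For reference, calls to these endpoints carry the bearer token obtained from the OAuth flow; for example (placeholder values):

```bash
# List clusters with the OAuth access token in the Authorization header.
# <workspace-url> and <access-token> are placeholders.
curl -s \
  -H "Authorization: Bearer <access-token>" \
  "https://<workspace-url>/api/2.0/clusters/list"
```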


Troubleshooting

Connection Test Fails

Symptom: "Unable to connect to Databricks workspace" when testing the connection.

Solutions:

  • Verify the workspace URL is correct and includes the full domain

  • Confirm the Client ID and Client Secret are entered correctly

  • Check that the service principal is added to the workspace (not just the account)

  • Ensure your network allows outbound HTTPS connections to the Databricks workspace URL

No Jobs Appearing in DataFlint

Symptom: The connection succeeds but no Spark jobs appear in DataFlint.

Solutions:

  • Ensure the service principal has CAN_VIEW permissions on the relevant clusters and jobs

  • Verify that cluster log delivery is enabled for your clusters

  • Check that jobs have actually run since the integration was configured

  • Allow up to 10 minutes for initial data collection

Permission Denied Errors

Symptom: "403 Forbidden" or "Permission denied" errors in DataFlint logs.

Solutions:

  • Review the service principal's workspace permissions

  • Ensure the service principal has access to the specific clusters and jobs you want to monitor

  • If using IP access lists, add DataFlint's IP addresses to your workspace allow list

Token Refresh Issues

Symptom: Integration works initially but stops after about an hour.

Solutions:

  • Verify the OAuth secret has not expired or been revoked

  • Check that the service principal is still active in the Account Console

  • Re-generate the OAuth secret and update it in DataFlint settings


Security & Compliance

  • Encryption: All data in transit is encrypted via TLS 1.2+. Credentials are encrypted at rest using AES-256

  • Access: DataFlint uses read-only access to your Databricks workspace. No write, delete, or modify operations are performed

  • Data scope: Only operational metadata (logs, cluster configs, job metadata) is collected. No business data, tables, or query results are accessed

  • Credential storage: OAuth secrets are stored encrypted and are never logged or exposed in plaintext

  • Data retention: Collected metadata is retained according to your DataFlint plan settings and can be deleted on request


Support

For questions or issues with the DataFlint Databricks integration, contact DataFlint support.
