DataFlint Integration with Databricks
Overview
DataFlint is a production-aware AI copilot for Apache Spark that provides cost optimization and performance monitoring for Databricks workloads. DataFlint connects to your Databricks workspace via the REST API to collect Spark event logs and cluster metadata, then analyzes them to surface optimization opportunities, detect bottlenecks, and reduce infrastructure costs.
Supported Clouds
AWS (Databricks on AWS): ✅ Supported
Azure (Azure Databricks): ✅ Supported
GCP (Databricks on GCP): ✅ Supported
What DataFlint Collects
DataFlint has two components that work together:
DataFlint Spark Plugin: An open-source Spark plugin (io.dataflint.spark.SparkDataflintPlugin) installed on your Databricks clusters that enriches Spark event logs with detailed execution metrics, query plans, and optimization metadata
DataFlint SaaS Platform: Connects to your Databricks workspace via the REST API to collect the enriched logs and metadata for analysis
Data collected
Enriched Spark event logs: Execution plans, stage metrics, task-level statistics, shuffle data, and DataFlint optimization metadata (via the Spark plugin)
Cluster metadata: Cluster configurations, node types, autoscaling settings, and runtime versions
Job and run metadata: Job definitions, run history, execution durations, and status information
DataFlint does not access or read any of your business data, tables, or files stored in Unity Catalog, Delta Lake, or cloud storage.
Prerequisites
Before setting up the DataFlint integration, ensure you have:
An active Databricks workspace on AWS, Azure, or GCP
Workspace admin or account admin permissions (required to create a service principal)
A DataFlint account (sign up at dataflint.io)
Setup Guide
Step 1: Install the DataFlint Spark Plugin on Databricks
The DataFlint Spark plugin enriches your Spark event logs with detailed performance and optimization data. There are two installation methods:
Option A: Init Script (Recommended)
This method automatically installs the plugin on cluster startup.
In your Databricks workspace, go to Workspace → Create → File
Paste the following init script and save it:
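As a hedged sketch only (the conf path, file name, and config format below are assumptions based on common Databricks init-script patterns; use the exact script from DataFlint's documentation), an init script of this kind writes a driver-side Spark config fragment that registers the DataFlint plugin:

```bash
#!/bin/bash
# Sketch of a DataFlint init script (illustrative, not the official script).
# Registers the DataFlint Spark plugin on the driver by dropping a
# Databricks Spark config fragment into the driver conf directory.
CONF_DIR="/databricks/driver/conf"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/00-dataflint.conf" <<'EOF'
[driver] {
  "spark.plugins" = "io.dataflint.spark.SparkDataflintPlugin"
}
EOF
```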
In your cluster configuration, go to Advanced → Init Scripts and add the path to your init script
Restart the cluster
Note: Init scripts are not supported on Databricks Community Edition. Use Option B instead.
Option B: Notebook Installation
This method works on both Databricks Community Edition and paid versions.
Go to your cluster → Libraries tab → Install New
Choose Maven and enter the coordinates:
In your notebook, run the following (add %scala if using a Python notebook):
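Per DataFlint's open-source README, the notebook step is an install call on the running SparkContext; sketched under that assumption (the Maven coordinates in the previous step are typically of the form io.dataflint:spark_2.12:&lt;latest version&gt; — verify both against DataFlint's docs):

```scala
// Registers DataFlint on the running SparkContext so the DataFlint tab
// appears in the Spark UI (per the open-source DataFlint README; verify
// against the plugin version you installed).
import io.dataflint.spark.SparkDataflint

SparkDataflint.install(spark.sparkContext)
```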
After installation, a DataFlint tab will appear in the Spark UI. Use the "Open in new tab" link for the best experience.
Note: The DataFlint Spark UI is only available while the cluster is running.
Step 2: Create a Databricks Service Principal
DataFlint authenticates using OAuth machine-to-machine (M2M) with a Databricks service principal. This is the recommended and most secure authentication method.
Navigate to your Databricks Account Console → User management → Service principals
Click Add service principal and give it a descriptive name (e.g., dataflint-integration)
Note the Application ID: this will be your client_id
Step 3: Generate an OAuth Secret
In the Account Console, select the service principal you just created
Go to the Secrets tab
Click Generate secret
Copy and securely store both the Client ID and Client Secret; the secret will only be shown once
Step 4: Grant Required Permissions
The DataFlint service principal needs read access to the following Databricks resources:
Workspace-level permissions:
CAN_VIEW on clusters
Read cluster configurations and metadata
CAN_VIEW on jobs
Access job definitions and run history
Access to cluster log delivery
Read Spark event logs
To grant these permissions:
Go to your Databricks workspace → Admin Settings → Service principals
Add the service principal to the workspace
Assign the necessary permissions as listed above
Note: DataFlint requires read-only access. It does not need permissions to create, modify, or delete any Databricks resources.
Step 5: Configure DataFlint
Log in to your DataFlint dashboard
Navigate to Settings → Integrations → Databricks
Enter the following details:
Workspace URL: Your Databricks workspace URL (e.g., https://adb-1234567890.12.azuredatabricks.net or https://dbc-abc123.cloud.databricks.com)
Client ID: The Application ID of the service principal
Client Secret: The OAuth secret generated in Step 3
Click Test Connection to verify the setup
Click Save to enable the integration
Step 6: Verify the Integration
Once configured, DataFlint will begin collecting Spark event logs and metadata from your workspace. You can verify the integration is working by:
Running a Spark job on your Databricks workspace
Checking the DataFlint dashboard; the job should appear within a few minutes
Reviewing the optimization insights and cost analysis generated for the job
Authentication Details
OAuth M2M Flow
DataFlint uses the OAuth 2.0 Client Credentials grant (machine-to-machine) to authenticate with Databricks. This is the approach Databricks recommends for ISV integrations.
Token endpoint format:
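The workspace-level token endpoint is the /oidc/v1/token path listed under API Endpoints Used below; a client-credentials request against it follows the standard OAuth 2.0 shape (header and body shown schematically):

```
POST https://<workspace-url>/oidc/v1/token
Authorization: Basic <base64(client_id:client_secret)>
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&scope=all-apis
```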
Token usage:
Access tokens are included in the Authorization: Bearer <token> header on all API requests
Tokens are valid for 1 hour and are automatically refreshed by DataFlint before expiry
No manual token management is required after initial setup
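The refresh-before-expiry behavior described above can be sketched as a small token cache (illustrative only; DataFlint's actual client code is not public, and the class and parameter names here are hypothetical):

```python
import time


class TokenCache:
    """Caches an OAuth access token and refreshes it before expiry.

    fetch_token is a callable returning (token, expires_in_seconds),
    e.g. a wrapper around the client-credentials request.
    """

    def __init__(self, fetch_token, skew_seconds=300):
        self._fetch = fetch_token
        self._skew = skew_seconds  # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self, now=None):
        """Return a valid token, fetching a fresh one when close to expiry."""
        if now is None:
            now = time.time()
        if self._token is None or now >= self._expires_at - self._skew:
            self._token, expires_in = self._fetch()
            self._expires_at = now + expires_in
        return self._token
```

With a 1-hour token and a 5-minute skew, a call made 55+ minutes after the last fetch transparently triggers a refresh, so callers never handle expiry themselves.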
Partner Telemetry
DataFlint includes a User-Agent header on all Databricks API calls to identify itself as an integration partner:
This allows Databricks to track partner integration usage and is a requirement of the Databricks Technology Partner Program.
Data Flow Architecture
Key points:
The DataFlint Spark Plugin runs inside your Databricks clusters and enriches Spark event logs with optimization metadata; it does not send data externally
The DataFlint SaaS platform connects via the REST API (OAuth M2M) to collect the enriched logs
All communication is outbound from DataFlint to Databricks over HTTPS
DataFlint uses read-only API access; no write operations are performed
No business data (tables, files, query results) is accessed or transferred
All credentials are encrypted at rest and in transit
API Endpoints Used
DataFlint interacts with the following Databricks REST API endpoints:
GET /api/2.0/clusters/list: List workspace clusters
GET /api/2.0/clusters/get: Get cluster configuration details
GET /api/2.1/jobs/list: List workspace jobs
GET /api/2.1/jobs/runs/list: List job runs
GET /api/2.1/jobs/runs/get: Get run details and metadata
GET /api/2.0/dbfs/read: Read Spark event log files
POST /oidc/v1/token: OAuth token generation
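As an illustration of how a read-only collector might page through one of these endpoints (the function and helper names are hypothetical; the offset/limit/has_more paging sketched here is one paging style the Jobs API supports, so verify against the Jobs API reference for your workspace):

```python
def list_all_runs(get_json, page_size=25):
    """Collect all job runs by paging through /api/2.1/jobs/runs/list.

    get_json(path, params) is assumed to perform an authenticated GET
    and return the parsed JSON response as a dict.
    """
    runs, offset = [], 0
    while True:
        page = get_json(
            "/api/2.1/jobs/runs/list",
            {"offset": offset, "limit": page_size},
        )
        batch = page.get("runs", [])
        runs.extend(batch)
        # Stop when the API signals no further pages (or returns nothing).
        if not page.get("has_more") or not batch:
            return runs
        offset += len(batch)
```

Only GET requests are issued, matching the read-only access model described above.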
Troubleshooting
Connection Test Fails
Symptom: "Unable to connect to Databricks workspace" when testing the connection.
Solutions:
Verify the workspace URL is correct and includes the full domain
Confirm the Client ID and Client Secret are entered correctly
Check that the service principal is added to the workspace (not just the account)
Ensure your network allows outbound HTTPS connections to the Databricks workspace URL
No Jobs Appearing in DataFlint
Symptom: The connection succeeds but no Spark jobs appear in DataFlint.
Solutions:
Ensure the service principal has CAN_VIEW permissions on the relevant clusters and jobs
Verify that cluster log delivery is enabled for your clusters
Check that jobs have actually run since the integration was configured
Allow up to 10 minutes for initial data collection
Permission Denied Errors
Symptom: "403 Forbidden" or "Permission denied" errors in DataFlint logs.
Solutions:
Review the service principal's workspace permissions
Ensure the service principal has access to the specific clusters and jobs you want to monitor
If using IP access lists, add DataFlint's IP addresses to your workspace allow list
Token Refresh Issues
Symptom: Integration works initially but stops after about an hour.
Solutions:
Verify the OAuth secret has not expired or been revoked
Check that the service principal is still active in the Account Console
Re-generate the OAuth secret and update it in DataFlint settings
Security & Compliance
Encryption: All data in transit is encrypted via TLS 1.2+. Credentials are encrypted at rest using AES-256
Access: DataFlint uses read-only access to your Databricks workspace. No write, delete, or modify operations are performed
Data scope: Only operational metadata (logs, cluster configs, job metadata) is collected. No business data, tables, or query results are accessed
Credential storage: OAuth secrets are stored encrypted and are never logged or exposed in plaintext
Data retention: Collected metadata is retained according to your DataFlint plan settings and can be deleted on request
Support
For questions or issues with the DataFlint Databricks integration:
Email: support@dataflint.io
Documentation: docs.dataflint.io
Databricks Partner Operations: partnerops@databricks.com