🍦 Spark on K8s SaaS Installation

Summary

This guide shows how to grant DataFlint read-only access to Spark event logs for Spark on Kubernetes deployments, or for any other Spark deployment that stores its event logs in S3.

You will:

  1. Find your Spark event log location (spark.eventLog.dir).

  2. Add an S3 bucket policy that lets DataFlint read that location.

  3. Add the bucket + path in the DataFlint UI (one config per location).

The entire process should take a few minutes.

This page documents the SaaS model where DataFlint reads event logs from your object store. For BYOC installations, the DataFlint AWS account / role can be different.

What DataFlint needs

Send DataFlint (or fill in the UI):

  • Region of the bucket.

  • Bucket name.

  • Path/prefix inside the bucket (from spark.eventLog.dir).

How it works

Spark writes event logs to the directory configured in spark.eventLog.dir.

DataFlint reads those logs (read-only) and builds run summaries and insights.

Common S3 layouts:
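For instance (illustrative paths only; substitute your own bucket and prefix):

```
s3a://my-spark-events/spark-events/
s3a://my-spark-events/prod/spark-events/
s3a://data-platform-logs/spark/eventlogs/
```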

Step 1: Find your spark.eventLog.dir

You need the exact bucket + prefix that Spark writes to.

Pick one way:

  1. Spark UI β†’ Environment tab.

    • Look for spark.eventLog.dir.

  2. Your Spark submit / operator manifest.

    • Look for --conf spark.eventLog.dir=...

    • Or spec.sparkConf.spark.eventLog.dir: ... (Spark Operator).

  3. Your Spark defaults (spark-defaults.conf).

    • Look for spark.eventLog.dir ...
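For example, with illustrative values, the setting might look like this in a spark-submit command or in spark-defaults.conf:

```
# spark-submit / operator conf (illustrative values)
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=s3a://my-spark-events/prod/spark-events/

# spark-defaults.conf (illustrative values)
spark.eventLog.enabled  true
spark.eventLog.dir      s3a://my-spark-events/prod/spark-events/
```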

Translate spark.eventLog.dir to bucket + path

If your value looks like:
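For instance, an s3a URI assembled from the bucket and prefix shown below:

```
s3a://my-spark-events/prod/spark-events/
```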

Then:

  • Bucket: my-spark-events

  • Path/prefix: prod/spark-events/
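As a sketch, the translation is just splitting the URI: the host part is the bucket and the remainder of the path is the prefix (`my-spark-events` and `prod/spark-events/` are the illustrative values from above):

```python
from urllib.parse import urlparse

def split_event_log_dir(event_log_dir: str) -> tuple[str, str]:
    """Split an s3:// or s3a:// spark.eventLog.dir value into (bucket, prefix)."""
    parsed = urlparse(event_log_dir)
    bucket = parsed.netloc             # host part of the URI is the bucket
    prefix = parsed.path.lstrip("/")   # everything after the bucket, no leading /
    return bucket, prefix

print(split_event_log_dir("s3a://my-spark-events/prod/spark-events/"))
# ('my-spark-events', 'prod/spark-events/')
```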

If you use a History Server, you might see the same path configured as spark.history.fs.logDirectory. In most setups it matches spark.eventLog.dir.

Step 2: Allow DataFlint to read the bucket (S3 bucket policy)

DataFlint reads Spark event logs via a dedicated role in the DataFlint AWS account.

Use the same principal as in EMR SaaS Installation:

  • DataFlint AWS account ID: 975050001706

  • DataFlint service role: arn:aws:iam::975050001706:role/eks-dataflint-service-role

Add this policy (or merge its statements) into the bucket policy of the bucket used by spark.eventLog.dir.

Replace:

  • YOUR_BUCKET_NAME

  • YOUR_PREFIX (no leading /)
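Since the exact policy text is not shown here, this is a minimal sketch: read-only access (`s3:GetObject` under the prefix, `s3:ListBucket` on the bucket) for the DataFlint service role listed above. The statement IDs are arbitrary.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataFlintReadEventLogs",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_PREFIX/*"
    },
    {
      "Sid": "DataFlintListEventLogs",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
    }
  ]
}
```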

In S3, β€œlist objects” is done via s3:ListBucket (there is no separate ListObject action).

Optional: restrict listing to only the event-log prefix

The policy above grants s3:ListBucket on the bucket. To restrict listing to the specific prefix, add this to the statement:
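A sketch of that restriction, added to the `s3:ListBucket` statement (assumes the same `YOUR_PREFIX` placeholder as above):

```json
"Condition": {
  "StringLike": {
    "s3:prefix": ["YOUR_PREFIX/*"]
  }
}
```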

Apply the bucket policy

  1. Open S3 β†’ Buckets β†’ YOUR_BUCKET_NAME.

  2. Go to Permissions.

  3. Under Bucket policy, click Edit.

  4. Paste or merge the policy from above.

  5. Click Save changes.

Step 3: Add the event log locations in DataFlint

Create one configuration per spark.eventLog.dir location.

You’ll typically provide:

  • Cloud: AWS

  • Region

  • Bucket

  • Path / prefix

Example

  1. Paste the bucket name from spark.eventLog.dir.

  2. Paste the path/prefix from spark.eventLog.dir.

  3. Choose the region where the bucket lives.

  4. Add the AWS account ID where the bucket lives.

  5. Add another config if you have another bucket/prefix.

Add one config per Spark event-log bucket + path. Use the values from spark.eventLog.dir.
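Putting the steps together, one configuration might look like this (illustrative values; the region and account ID here are assumptions for the example):

```
Cloud:           AWS
Region:          us-east-1
Bucket:          my-spark-events
Path / prefix:   prod/spark-events/
AWS account ID:  123456789012
```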

Misc

Optional: add an EKS role

You can also add an EKS (Kubernetes) role integration. It lets DataFlint enrich runs with more cluster metadata.

Ask DataFlint for the exact role requirements for your setup.

Optional: SQS-based ingestion

For high-volume environments, you can enable SQS-based ingestion. It reduces S3 listing and can speed up discovery of new event logs.

Ask DataFlint for the S3 β†’ SQS notification setup that matches your bucket layout.
