# Spark on K8s SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Spark event logs for **Spark on Kubernetes** deployments, or any other Spark deployment that writes Spark event logs to S3.

You will:

1. Find your Spark event log location (`spark.eventLog.dir`).
2. Add an S3 bucket policy that lets DataFlint read that location.
3. Add the bucket + path in the DataFlint UI (one config per location).

The entire process should take a few minutes.

{% hint style="info" %}
This page documents the **SaaS** model where DataFlint reads event logs from your object store. For **BYOC** installations, the DataFlint AWS account / role can be different.
{% endhint %}

### What DataFlint needs

Send DataFlint (or fill in the UI):

* **Region** of the bucket.
* **Bucket name**.
* **Path/prefix** inside the bucket (from `spark.eventLog.dir`).

{% hint style="warning" %}
Each **bucket + path** is a separate configuration. If you have multiple clusters or environments with different `spark.eventLog.dir`, add each one.
{% endhint %}

### How it works

Spark writes event logs to the directory configured in `spark.eventLog.dir`.

DataFlint reads those logs (read-only) and builds run summaries and insights.

Common S3 layouts:

```
s3a://my-spark-events/prod/
  application_*

s3a://my-spark-events/prod/spark-job-history/
  application_*
```

### Step 1: Find your `spark.eventLog.dir`

You need the **exact** bucket + prefix that Spark writes to.

Pick one way:

1. **Spark UI → Environment** tab.
   * Look for `spark.eventLog.dir`.
2. **Your Spark submit / operator manifest**.
   * Look for `--conf spark.eventLog.dir=...`
   * Or `spec.sparkConf.spark.eventLog.dir: ...` (Spark Operator).
3. **Your Spark defaults** (`spark-defaults.conf`).
   * Look for `spark.eventLog.dir ...`
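For example, if you have a local copy of `spark-defaults.conf` (e.g. from the Spark image or a mounted ConfigMap), you can pull the value out with `grep`/`awk`. The file contents below are illustrative:

```shell
# Illustrative spark-defaults.conf; replace with your real file.
cat > /tmp/spark-defaults.conf <<'EOF'
spark.eventLog.enabled  true
spark.eventLog.dir      s3a://my-spark-events/prod/spark-events/
EOF

# Print just the configured event-log directory.
grep '^spark\.eventLog\.dir' /tmp/spark-defaults.conf | awk '{print $2}'
```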

#### Translate `spark.eventLog.dir` to bucket + path

If your value looks like:

```
s3a://my-spark-events/prod/spark-events/
```

Then:

* Bucket: `my-spark-events`
* Path/prefix: `prod/spark-events/`

{% hint style="info" %}
If you use a History Server, you might see the same path configured as `spark.history.fs.logDirectory`. In most setups it matches `spark.eventLog.dir`.
{% endhint %}
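As a quick sanity check, the bucket/prefix split above can be done in the shell with parameter expansion (a sketch; `dir` here is the example value from above):

```shell
dir="s3a://my-spark-events/prod/spark-events/"

no_scheme="${dir#*://}"     # strip the s3a:// scheme
bucket="${no_scheme%%/*}"   # everything up to the first slash
prefix="${no_scheme#*/}"    # everything after the first slash

echo "Bucket: $bucket"
echo "Path/prefix: $prefix"
```

This prints `my-spark-events` as the bucket and `prod/spark-events/` as the path/prefix, which are the values to enter in the DataFlint UI.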

### Step 2: Allow DataFlint to read the bucket (S3 bucket policy)

DataFlint reads Spark event logs via a dedicated role in the DataFlint AWS account.

Use the same principal as in [EMR SaaS Installation](https://dataflint.gitbook.io/dataflint-for-spark/saas/emr-saas-installation):

* DataFlint AWS account ID: `975050001706`
* DataFlint service role: `arn:aws:iam::975050001706:role/eks-dataflint-service-role`

{% hint style="warning" %}
For BYOC installations, this principal can be different. If you’re not sure, ask DataFlint for the correct AWS account ID / role ARN.
{% endhint %}

#### Minimal bucket policy statement (recommended)

Add this policy (or merge its statements) into the bucket policy of the bucket used by `spark.eventLog.dir`.

Replace:

* `YOUR_BUCKET_NAME`
* `YOUR_PREFIX` (no leading `/`)

```json
{
  "Version": "2012-10-17",
  "Id": "DataFlintSparkEventLogsReadOnly",
  "Statement": [
    {
      "Sid": "AllowDataFlintReadSparkEventLogs",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_PREFIX*"
      ]
    }
  ]
}
```

{% hint style="info" %}
In S3, “list objects” is done via `s3:ListBucket` (there is no separate `ListObject` action).
{% endhint %}

<details>

<summary>Optional: restrict listing to only the event-log prefix</summary>

The policy above grants `s3:ListBucket` on the bucket. To restrict listing to the specific prefix, add this to the statement:

```json
"Condition": {
  "StringLike": {
    "s3:prefix": [
      "YOUR_PREFIX",
      "YOUR_PREFIX*"
    ]
  }
}
```

</details>

{% hint style="warning" %}
If the bucket uses SSE-KMS, you also need to grant the DataFlint role `kms:Decrypt` on the KMS key (via the key policy).
{% endhint %}
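For the SSE-KMS case, a statement like the following in the **KMS key policy** would allow decryption. This is a sketch: the `Sid` is arbitrary, and in a key policy `Resource` is `"*"` (meaning the key the policy is attached to). Adjust the principal if DataFlint gave you a different role ARN:

```json
{
  "Sid": "AllowDataFlintDecryptSparkEventLogs",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
  },
  "Action": "kms:Decrypt",
  "Resource": "*"
}
```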

#### Apply the bucket policy

{% tabs %}
{% tab title="AWS Console (UI)" %}

1. Open **S3 → Buckets → YOUR\_BUCKET\_NAME**.
2. Go to **Permissions**.
3. Under **Bucket policy**, click **Edit**.
4. Paste or merge the policy from above.
5. Click **Save changes**.

{% endtab %}

{% tab title="AWS CLI" %}
Fetch the current bucket policy:

```bash
aws s3api get-bucket-policy --bucket YOUR_BUCKET_NAME --query Policy --output text
```

Update the bucket policy from a local file:

```bash
aws s3api put-bucket-policy \
  --bucket YOUR_BUCKET_NAME \
  --policy file://bucket-policy.json
```

{% endtab %}
{% endtabs %}
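If you use the CLI path, you can sanity-check the policy file locally before applying it. The sketch below writes the policy to `bucket-policy.json` (still with the placeholders) and validates the JSON syntax:

```shell
# Write the policy file; replace YOUR_BUCKET_NAME / YOUR_PREFIX before use.
cat > bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Id": "DataFlintSparkEventLogsReadOnly",
  "Statement": [
    {
      "Sid": "AllowDataFlintReadSparkEventLogs",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
      },
      "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_PREFIX*"
      ]
    }
  ]
}
EOF

# Fail fast on malformed JSON before calling put-bucket-policy.
python3 -m json.tool bucket-policy.json > /dev/null && echo "policy JSON is valid"
```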

### Step 3: Add the event log locations in DataFlint

Create one configuration per `spark.eventLog.dir` location.

You’ll typically provide:

* **Cloud**: AWS
* **Region**
* **Bucket**
* **Path / prefix**

### Example (using the screenshot)

1. Paste the **bucket name** from `spark.eventLog.dir`.
2. Paste the **path/prefix** from `spark.eventLog.dir`.
3. Choose the **region** where the bucket lives.
4. Enter the **AWS account ID** that owns the bucket.
5. Add another config if you have another bucket/prefix.

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FJrB9xryS1jKhZnEXOqVQ%2Fimage.png?alt=media&#x26;token=8b8872f8-f9f8-4713-bfe3-7d360d6eb4bc" alt="" width="375"><figcaption><p>Add one config per Spark event-log bucket + path. Use the values from <code>spark.eventLog.dir</code>.</p></figcaption></figure>

## Misc

### Optional: add an EKS role

You can also add an **EKS (Kubernetes) role** integration. It lets DataFlint enrich runs with more cluster metadata.

Ask DataFlint for the exact role requirements for your setup.

### Optional: SQS-based ingestion

For high-volume environments, you can enable **SQS-based ingestion**. It reduces S3 listing traffic and can speed up discovery of new event logs.

Ask DataFlint for the S3 → SQS notification setup that matches your bucket layout.
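As a sketch of what that setup typically involves (the queue name and ARN below are hypothetical; confirm the real values with DataFlint), the bucket would publish `ObjectCreated` events for the event-log prefix to an SQS queue via `aws s3api put-bucket-notification-configuration`:

```json
{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:dataflint-event-log-notifications",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "YOUR_PREFIX" }
          ]
        }
      }
    }
  ]
}
```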
