# Spark on K8s SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Spark event logs for **Spark on Kubernetes** deployments, or any other Spark deployment that writes its event logs to S3.

You will:

1. Find your Spark event log location (`spark.eventLog.dir`).
2. Add an S3 bucket policy that lets DataFlint read that location.
3. Add the bucket + path in the DataFlint UI (one config per location).

The entire process should take a few minutes.

{% hint style="info" %}
This page documents the **SaaS** model where DataFlint reads event logs from your object store. For **BYOC** installations, the DataFlint AWS account / role can be different.
{% endhint %}

### What DataFlint needs

Send DataFlint the following (or enter it in the UI):

* **Region** of the bucket.
* **Bucket name**.
* **Path/prefix** inside the bucket (from `spark.eventLog.dir`).

{% hint style="warning" %}
Each **bucket + path** is a separate configuration. If you have multiple clusters or environments with different `spark.eventLog.dir`, add each one.
{% endhint %}

### How it works

Spark writes event logs to the directory configured in `spark.eventLog.dir`.

DataFlint reads those logs (read-only) and builds run summaries and insights.

Common S3 layouts:

```
s3a://my-spark-events/prod/
  application_*

s3a://my-spark-events/prod/spark-job-history/
  application_*
```
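
For reference, event logging is controlled by two Spark properties, `spark.eventLog.enabled` and `spark.eventLog.dir`. The Python sketch below assembles the matching `spark-submit` flags. The helper name `event_log_conf_args` and the example bucket are illustrative assumptions, not part of Spark or DataFlint.

```python
from __future__ import annotations


def event_log_conf_args(event_log_dir: str) -> list[str]:
    """Return spark-submit flags that enable event logging to event_log_dir."""
    return [
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={event_log_dir}",
    ]


# Example directory taken from the layouts above; replace with your own.
args = event_log_conf_args("s3a://my-spark-events/prod/")
print(" ".join(args))
```

If your jobs already write event logs, you do not need to change anything here; the flags only show where the directory comes from.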

### Step 1: Find your `spark.eventLog.dir`

You need the **exact** bucket + prefix that Spark writes to.

Pick one way:

1. **Spark UI → Environment** tab.
   * Look for `spark.eventLog.dir`.
2. **Your Spark submit / operator manifest**.
   * Look for `--conf spark.eventLog.dir=...`
   * Or `spec.sparkConf.spark.eventLog.dir: ...` (Spark Operator).
3. **Your Spark defaults** (`spark-defaults.conf`).
   * Look for `spark.eventLog.dir ...`
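
If you prefer to check a `spark-defaults.conf` file programmatically, a minimal sketch could look like the following. It assumes the simple `key value` or `key=value` line format and ignores comments; the function name `find_event_log_dir` and the sample file content are hypothetical.

```python
import re
from typing import Optional


def find_event_log_dir(conf_text: str) -> Optional[str]:
    """Return the last spark.eventLog.dir value found in the text, or None."""
    found = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split key from value on the first run of whitespace and/or '='.
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) == 2 and parts[0] == "spark.eventLog.dir":
            found = parts[1].strip()  # later entries override earlier ones
    return found


# Hypothetical file content, for illustration only.
sample_conf = """
# spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      s3a://my-spark-events/prod/spark-events/
"""

event_log_dir = find_event_log_dir(sample_conf)
print(event_log_dir)
```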

#### Translate `spark.eventLog.dir` to bucket + path

If your value looks like:

```
s3a://my-spark-events/prod/spark-events/
```

Then:

* Bucket: `my-spark-events`
* Path/prefix: `prod/spark-events/`
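
The same translation can be sketched in Python with the standard library. The function name `split_event_log_dir` is an assumption for illustration, and the accepted schemes (`s3`, `s3a`, `s3n`) are a guess at what your configuration may use.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_event_log_dir(event_log_dir: str) -> Tuple[str, str]:
    """Split an S3-style spark.eventLog.dir value into (bucket, prefix)."""
    parsed = urlparse(event_log_dir)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 location: {event_log_dir!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name: {event_log_dir!r}")
    bucket = parsed.netloc
    prefix = parsed.path.lstrip("/")  # the prefix has no leading '/'
    return bucket, prefix


bucket, prefix = split_event_log_dir("s3a://my-spark-events/prod/spark-events/")
print(bucket, prefix)
```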

{% hint style="info" %}
If you use a History Server, you might see the same path configured as `spark.history.fs.logDirectory`. In most setups it matches `spark.eventLog.dir`.
{% endhint %}

### Step 2: Allow DataFlint to read the bucket (S3 bucket policy)

DataFlint reads Spark event logs via a dedicated role in the DataFlint AWS account.

Use the same principal as in [EMR SaaS Installation](/dataflint-for-spark/saas/emr-saas-installation.md):

* DataFlint AWS account ID: `975050001706`
* DataFlint service role: `arn:aws:iam::975050001706:role/eks-dataflint-service-role`

{% hint style="warning" %}
For BYOC installations, this principal can be different. If you’re not sure, ask DataFlint for the correct AWS account ID / role ARN.
{% endhint %}

#### Minimal bucket policy statement (recommended)

Add this policy (or merge its statements) into the bucket policy of the bucket used by `spark.eventLog.dir`.

Replace:

* `YOUR_BUCKET_NAME`
* `YOUR_PREFIX` (no leading `/`)

```json
{
  "Version": "2012-10-17",
  "Id": "DataFlintSparkEventLogsReadOnly",
  "Statement": [
    {
      "Sid": "AllowDataFlintReadSparkEventLogs",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_PREFIX*"
      ]
    }
  ]
}
```

{% hint style="info" %}
In S3, “list objects” is done via `s3:ListBucket` (there is no separate `ListObject` action).
{% endhint %}
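
To avoid hand-editing the placeholders, the policy above can also be generated. This is a minimal sketch, assuming the SaaS role ARN listed on this page; the function name `build_read_only_policy` and the example bucket and prefix are illustrative.

```python
import json

# SaaS principal from this page; BYOC installations may use a different one.
DATAFLINT_ROLE_ARN = "arn:aws:iam::975050001706:role/eks-dataflint-service-role"


def build_read_only_policy(bucket: str, prefix: str) -> dict:
    """Build the minimal read-only bucket policy for one bucket + prefix."""
    prefix = prefix.lstrip("/")  # YOUR_PREFIX must not start with '/'
    return {
        "Version": "2012-10-17",
        "Id": "DataFlintSparkEventLogsReadOnly",
        "Statement": [
            {
                "Sid": "AllowDataFlintReadSparkEventLogs",
                "Effect": "Allow",
                "Principal": {"AWS": DATAFLINT_ROLE_ARN},
                "Action": [
                    "s3:GetObject",
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/{prefix}*",
                ],
            }
        ],
    }


policy = build_read_only_policy("my-spark-events", "prod/spark-events/")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and merged into your existing bucket policy.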

<details>

<summary>Optional: restrict listing to only the event-log prefix</summary>

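The policy above grants `s3:ListBucket` on the whole bucket. To restrict listing to the event-log prefix, move `s3:ListBucket` into its own statement (with only the bucket ARN as its resource) and add this condition to that statement. Keep `s3:GetObject` and `s3:GetBucketLocation` in a statement without the condition, because the `s3:prefix` key is only present in list requests: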

```json
"Condition": {
  "StringLike": {
    "s3:prefix": [
      "YOUR_PREFIX",
      "YOUR_PREFIX*"
    ]
  }
}
```
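
One way to lay out the statements so that the condition applies only to listing is sketched below. It assumes the `s3:prefix` condition key is only present on list requests, so `s3:ListBucket` gets its own statement. The `Sid` values and the function name `build_prefix_scoped_statements` are hypothetical.

```python
# SaaS principal from this page; BYOC installations may use a different one.
DATAFLINT_ROLE_ARN = "arn:aws:iam::975050001706:role/eks-dataflint-service-role"


def build_prefix_scoped_statements(bucket: str, prefix: str, principal_arn: str) -> list:
    """Return two statements: unconditioned reads, and listing limited to the prefix."""
    prefix = prefix.lstrip("/")
    principal = {"AWS": principal_arn}
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return [
        {
            # Object reads and bucket location lookup: no s3:prefix condition.
            "Sid": "AllowDataFlintReadSparkEventLogs",
            "Effect": "Allow",
            "Principal": principal,
            "Action": ["s3:GetObject", "s3:GetBucketLocation"],
            "Resource": [bucket_arn, f"{bucket_arn}/{prefix}*"],
        },
        {
            # Listing only, limited to keys under the event-log prefix.
            "Sid": "AllowDataFlintListSparkEventLogs",
            "Effect": "Allow",
            "Principal": principal,
            "Action": ["s3:ListBucket"],
            "Resource": [bucket_arn],
            "Condition": {"StringLike": {"s3:prefix": [prefix, f"{prefix}*"]}},
        },
    ]


statements = build_prefix_scoped_statements(
    "my-spark-events", "prod/spark-events/", DATAFLINT_ROLE_ARN
)
```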

</details>

{% hint style="warning" %}
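If the bucket uses SSE-KMS, the DataFlint role also needs `kms:Decrypt` on the KMS key that encrypts the event logs. Because the role lives in a different AWS account, grant this in the KMS key policy.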
{% endhint %}
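
As a rough sketch, a key policy statement granting that permission could be built like this. The `Sid` and the function name `build_kms_decrypt_statement` are hypothetical, and the statement is meant to be merged into the existing key policy, not to replace it.

```python
# SaaS principal from this page; BYOC installations may use a different one.
DATAFLINT_ROLE_ARN = "arn:aws:iam::975050001706:role/eks-dataflint-service-role"


def build_kms_decrypt_statement(principal_arn: str) -> dict:
    """Build a key policy statement that lets the principal decrypt with this key."""
    return {
        "Sid": "AllowDataFlintDecryptSparkEventLogs",  # hypothetical name
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "kms:Decrypt",
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


kms_statement = build_kms_decrypt_statement(DATAFLINT_ROLE_ARN)
```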

#### Apply the bucket policy

{% tabs %}
{% tab title="AWS Console (UI)" %}

1. Open **S3 → Buckets → YOUR\_BUCKET\_NAME**.
2. Go to **Permissions**.
3. Under **Bucket policy**, click **Edit**.
4. Paste or merge the policy from above.
5. Click **Save changes**.

{% endtab %}

{% tab title="AWS CLI" %}
Fetch the current bucket policy:

```bash
aws s3api get-bucket-policy --bucket YOUR_BUCKET_NAME --query Policy --output text
```

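Update the bucket policy from a local file. `put-bucket-policy` replaces the entire existing policy, so `bucket-policy.json` must also contain any existing statements you want to keep: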

```bash
aws s3api put-bucket-policy \
  --bucket YOUR_BUCKET_NAME \
  --policy file://bucket-policy.json
```

{% endtab %}
{% endtabs %}
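
When the bucket already has a policy, the new statements need to be merged into it, not pasted over it. A minimal sketch of that merge, assuming statements are matched by `Sid`, is shown below; the function name `merge_statements` and the sample statements are hypothetical.

```python
import json
from typing import Optional


def merge_statements(existing_policy_json: Optional[str], new_statements: list) -> dict:
    """Merge new statements into an existing policy, replacing any with the same Sid."""
    if existing_policy_json:
        policy = json.loads(existing_policy_json)
    else:
        policy = {"Version": "2012-10-17", "Statement": []}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be a bare object
    new_sids = {s["Sid"] for s in new_statements if s.get("Sid")}
    kept = [s for s in statements if s.get("Sid") not in new_sids]
    policy["Statement"] = kept + list(new_statements)
    return policy


# Hypothetical existing policy: one unrelated rule and an outdated DataFlint rule.
existing = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ExistingRule", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-spark-events/public/*"},
        {"Sid": "AllowDataFlintReadSparkEventLogs", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"},
         "Action": ["s3:GetObject"], "Resource": ["arn:aws:s3:::my-spark-events/old/*"]},
    ],
})
updated = {"Sid": "AllowDataFlintReadSparkEventLogs", "Effect": "Allow",
           "Principal": {"AWS": "arn:aws:iam::975050001706:role/eks-dataflint-service-role"},
           "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
           "Resource": ["arn:aws:s3:::my-spark-events",
                        "arn:aws:s3:::my-spark-events/prod/spark-events/*"]}

merged = merge_statements(existing, [updated])
print(json.dumps(merged, indent=2))
```

The printed result can be saved as `bucket-policy.json` and applied with either method above.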

### Step 3: Add the event log locations in DataFlint

Create one configuration per `spark.eventLog.dir` location.

You’ll typically provide:

* **Cloud**: AWS
* **Region**
* **Bucket**
* **Path / prefix**
* **AWS account ID** of the account that owns the bucket

#### Example (see the screenshot below)

1. Paste the **bucket name** from `spark.eventLog.dir`.
2. Paste the **path/prefix** from `spark.eventLog.dir`.
3. Choose the **region** where the bucket lives.
4. Add the AWS account ID where the bucket lives.
5. Add another config if you have another bucket/prefix.

<figure><img src="/files/yC8MCcY8shYTl9p41us1" alt="" width="375"><figcaption><p>Add one config per Spark event-log bucket + path. Use the values from <code>spark.eventLog.dir</code>.</p></figcaption></figure>

## Misc

### Optional: add an EKS role

You can also add an **EKS (Kubernetes) role** integration. It lets DataFlint enrich runs with more cluster metadata.

Ask DataFlint for the exact role requirements for your setup.

### Optional: SQS-based ingestion

For high-volume environments, you can enable **SQS-based ingestion**. It reduces S3 listing and can speed up discovery of new event logs.

Ask DataFlint for the S3 → SQS notification setup that matches your bucket layout.

