# Dataproc SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Spark event logs and cluster metadata in **Google Cloud Dataproc**.

For the broader SaaS threat model and stability notes, see [SaaS Security & Stability](https://dataflint.gitbook.io/dataflint-for-spark/saas/saas-security-and-stability).

You will:

1. Create a dedicated GCP service account.
2. Grant minimal IAM roles for Dataproc + Cloud Storage.
3. Generate a JSON key (or use your preferred secret flow).
4. Share the credentials with DataFlint.

The entire process should take a few minutes.

{% hint style="warning" %}
Service account keys are sensitive credentials. Store them like passwords and share them only over an approved secure channel.
{% endhint %}

### What DataFlint needs

Send DataFlint:

* **Project ID**
* **Region(s)** you run Dataproc in
* **Service account key (JSON)** *or* an agreed alternative credential method
* **Dataproc temp bucket name(s)** if you use custom buckets
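If `gcloud` is installed, these values can be looked up from the CLI. A sketch (the cluster name and region below are placeholders):

```bash
# Project ID currently configured in gcloud
gcloud config get-value project

# Clusters per region (repeat for each region you use)
gcloud dataproc clusters list --region=us-central1

# Temp bucket of a specific cluster ("my-cluster" is a placeholder)
gcloud dataproc clusters describe my-cluster --region=us-central1 \
  --format="value(config.tempBucket)"
```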

### How it works

Dataproc writes Spark event logs to a Cloud Storage (GCS) bucket. By default this is the cluster's Dataproc **temp bucket**.

Common layout:

```
gs://dataproc-temp-<region>-<project-number>-<suffix>/
  <cluster-uuid>/
    spark-job-history/
      application_*
```

DataFlint reads only these logs and cluster metadata.
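To confirm logs are landing where expected, you can list the bucket directly (the bucket name below is a placeholder):

```bash
# Event logs live under <cluster-uuid>/spark-job-history/ inside the temp bucket
gsutil ls "gs://dataproc-temp-us-central1-123456789-abcd/*/spark-job-history/"
```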

### Required IAM permissions (minimal)

Grant the DataFlint service account:

* Project-level: `roles/dataproc.viewer`
* Bucket-level (on the event-log buckets): `roles/storage.objectViewer`

{% hint style="info" %}
If you configured Spark event logs to a custom bucket via `spark:spark.eventLog.dir` or History Server settings, grant `storage.objectViewer` on that bucket too.
{% endhint %}
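One way to check whether a cluster overrides the event-log location is to inspect its configuration (cluster name and region are placeholders):

```bash
# Prints any eventLog overrides; the fallback message means the default temp bucket is used
gcloud dataproc clusters describe my-cluster --region=us-central1 --format=json \
  | grep -i "eventlog" || echo "no eventLog override found"
```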

## Installation

Pick one method. All methods create the same resources.

{% tabs %}
{% tab title="Google Cloud Console (UI)" %}

### Step 1: Create a service account

1. Open **IAM & Admin → Service Accounts**.
2. Click **Create service account**.
3. Use:
   * Name: `dataflint-events-reader`
   * Description: `Read-only access to Dataproc Spark event logs`
4. Click **Done**.

### Step 2: Create a service account key (JSON)

1. Open the service account.
2. Go to **Keys**.
3. Click **Add key → Create new key**.
4. Select **JSON**.
5. Download and store the file securely.
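A quick sanity check of the downloaded file can catch a truncated or wrong-account key early. A sketch (requires `python3`; the file name assumes you kept the download's default location and name):

```bash
# Sanity-check a downloaded service-account key: prints its type and the
# service account email it belongs to (requires python3).
check_key() {
  python3 -c 'import json,sys; k=json.load(open(sys.argv[1])); print(k["type"], k["client_email"])' "$1"
}

# Usage: check_key dataflint-sa-key.json
```

The output should show `service_account` followed by the `dataflint-events-reader@...` email.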

### Step 3: Grant Dataproc viewer permissions

1. Open **IAM & Admin → IAM**.
2. Click **Grant access**.
3. Principal: `dataflint-events-reader@<PROJECT_ID>.iam.gserviceaccount.com`
4. Role: **Dataproc Viewer** (`roles/dataproc.viewer`)

### Step 4: Grant bucket read access (event logs)

1. Open **Cloud Storage → Buckets**.
2. Find the Dataproc temp bucket.
   * It usually looks like `dataproc-temp-<region>-<project-number>-...`
3. Open **Permissions**.
4. Click **Grant access**.
5. Add the same principal as above.
6. Role: **Storage Object Viewer** (`roles/storage.objectViewer`)
{% endtab %}

{% tab title="gcloud / gsutil (CLI)" %}

### Prerequisites

```bash
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
```

### Create service account

```bash
gcloud iam service-accounts create dataflint-events-reader \
  --display-name="DataFlint Spark Events Reader" \
  --description="Read-only access to Dataproc Spark event logs"
```

### Create key

```bash
gcloud iam service-accounts keys create dataflint-sa-key.json \
  --iam-account=dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

### Grant Dataproc Viewer (project-level)

```bash
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataproc.viewer" \
  --condition=None
```

### Grant Storage Object Viewer (bucket-level)

```bash
# List candidate temp buckets
gsutil ls | grep dataproc-temp || true

# Grant read-only access (repeat for every relevant bucket)
gsutil iam ch \
  serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com:objectViewer \
  gs://dataproc-temp-REGION-PROJECT_NUMBER-SUFFIX
```

### Optional: one-shot setup script

{% code title="setup-dataflint-dataproc-reader.sh" %}

```bash
#!/bin/bash
set -euo pipefail

# EDIT THESE
PROJECT_ID="your-project-id"
REGION="us-central1"
SA_NAME="dataflint-events-reader"
KEY_OUTPUT_PATH="./dataflint-sa-key.json"

SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

echo "Creating service account (if missing)..."
gcloud iam service-accounts create "${SA_NAME}" \
  --display-name="DataFlint Spark Events Reader" \
  --description="Read-only access to Dataproc Spark event logs" \
  --project="${PROJECT_ID}" 2>/dev/null || true

echo "Creating key..."
gcloud iam service-accounts keys create "${KEY_OUTPUT_PATH}" \
  --iam-account="${SA_EMAIL}"

echo "Granting Dataproc Viewer..."
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dataproc.viewer" \
  --condition=None \
  --quiet

echo "Granting bucket access for dataproc temp buckets in region ${REGION}..."
for bucket in $(gsutil ls | grep "dataproc-temp-${REGION}" || true); do
  echo "  ${bucket}"
  gsutil iam ch "serviceAccount:${SA_EMAIL}:objectViewer" "${bucket}"
done

echo "Done."
echo "Service account: ${SA_EMAIL}"
echo "Key file: ${KEY_OUTPUT_PATH}"
```

{% endcode %}
{% endtab %}

{% tab title="Terraform" %}

### Terraform example

This creates:

* `dataflint-events-reader` service account
* a service account key
* `roles/dataproc.viewer` binding on the project
* `roles/storage.objectViewer` on your Dataproc temp bucket

{% code title="main.tf" %}

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP Project ID"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}

variable "dataproc_temp_bucket" {
  description = "Dataproc temp bucket name (without gs://)"
  type        = string
}

resource "google_service_account" "dataflint_events_reader" {
  account_id   = "dataflint-events-reader"
  display_name = "DataFlint Spark Events Reader"
  description  = "Read-only access to Dataproc Spark event logs"
}

resource "google_service_account_key" "dataflint_events_reader_key" {
  service_account_id = google_service_account.dataflint_events_reader.name
}

resource "google_project_iam_member" "dataproc_viewer" {
  project = var.project_id
  role    = "roles/dataproc.viewer"
  member  = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

resource "google_storage_bucket_iam_member" "bucket_viewer" {
  bucket = var.dataproc_temp_bucket
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

output "service_account_email" {
  value = google_service_account.dataflint_events_reader.email
}

output "service_account_key_base64" {
  value     = google_service_account_key.dataflint_events_reader_key.private_key
  sensitive = true
}
```

{% endcode %}

{% code title="terraform.tfvars" %}

```hcl
project_id            = "your-project-id"
region                = "us-central1"
dataproc_temp_bucket  = "dataproc-temp-us-central1-123456789-abcdefg"
```

{% endcode %}

### Apply and export the key

```bash
terraform init
terraform apply

terraform output -raw service_account_key_base64 | base64 -d > dataflint-sa-key.json
```

{% endtab %}
{% endtabs %}

## Validate the access (recommended)

Use the key to test that listing clusters and reading objects works.

```bash
gcloud config set project YOUR_PROJECT_ID
gcloud auth activate-service-account \
  --key-file="$PWD/dataflint-sa-key.json"

# Should succeed (needs roles/dataproc.viewer)
gcloud dataproc clusters list --region=YOUR_REGION

# Should succeed (needs roles/storage.objectViewer)
gsutil ls gs://YOUR_DATAPROC_TEMP_BUCKET/
```

{% hint style="warning" %}
If you don't activate the service account first, the test may succeed using your personal credentials and hide missing permissions.
{% endhint %}
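To confirm which identity `gcloud` is currently using:

```bash
# The active account should be the service account, not your user account
gcloud auth list --filter=status:ACTIVE --format="value(account)"
```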
