# Dataproc SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Spark event logs and cluster metadata in **Google Cloud Dataproc**.

For the broader SaaS threat model and stability notes, see [SaaS Security & Stability](/dataflint-for-spark/saas/saas-security-and-stability.md).

You will:

1. Create a dedicated GCP service account.
2. Grant minimal IAM roles for Dataproc + Cloud Storage.
3. Generate a JSON key (or use your preferred secret flow).
4. Share the credentials with DataFlint.

The entire process should take a few minutes.

{% hint style="warning" %}
Service account keys are sensitive credentials. Store them like passwords and share them only over an approved secure channel.
{% endhint %}

### What DataFlint needs

Send DataFlint:

* **Project ID**
* **Region(s)** you run Dataproc in
* **Service account key (JSON)** *or* an agreed alternative credential method
* **Dataproc temp bucket name(s)** if you use custom buckets

### How it works

Dataproc writes Spark event logs into a GCS bucket. By default this is the Dataproc **temp bucket**.

Common layout:

```
gs://dataproc-temp-<region>-<project-number>-<suffix>/
  <cluster-uuid>/
    spark-job-history/
      application_*
```

DataFlint reads only these logs and cluster metadata.
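
To make the layout concrete, here is a minimal sketch that builds the event-log prefix for one cluster. The function name `event_log_prefix` and the sample bucket and UUID values are illustrative assumptions, not part of any Dataproc or DataFlint tooling.

```shell
# Hypothetical helper: build the event-log prefix for one cluster,
# following the common layout shown above.
event_log_prefix() {
  bucket="$1"        # bucket name without the gs:// scheme
  cluster_uuid="$2"  # the cluster's UUID directory
  printf 'gs://%s/%s/spark-job-history/\n' "$bucket" "$cluster_uuid"
}

# Example with placeholder values (replace with your own):
event_log_prefix "my-temp-bucket" "my-cluster-uuid"
```

Passing the resulting prefix to `gsutil ls` should list the `application_*` logs, provided the caller can read objects in that bucket.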

### Required IAM permissions (minimal)

Grant the DataFlint service account:

* Project-level: `roles/dataproc.viewer`
* Bucket-level (on the event-log buckets): `roles/storage.objectViewer`

{% hint style="info" %}
If you configured Spark to write event logs to a custom bucket via `spark:spark.eventLog.dir` or History Server settings, grant `roles/storage.objectViewer` on that bucket too.
{% endhint %}
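
If you are unsure which buckets a cluster uses, the sketch below reads them from the cluster's configuration. It assumes the `config.tempBucket` and `config.softwareConfig.properties` fields of the Dataproc cluster resource. The function names and the cluster name in the usage comment are hypothetical.

```shell
# Print the temp bucket recorded in a cluster's configuration.
# Arguments: cluster name, region.
cluster_temp_bucket() {
  gcloud dataproc clusters describe "$1" \
    --region="$2" \
    --format="value(config.tempBucket)"
}

# Print the cluster's software properties as YAML.
# A custom event-log location would appear under spark:spark.eventLog.dir.
cluster_properties() {
  gcloud dataproc clusters describe "$1" \
    --region="$2" \
    --format="yaml(config.softwareConfig.properties)"
}

# Usage (placeholder names, requires gcloud and roles/dataproc.viewer):
#   cluster_temp_bucket my-cluster us-central1
#   cluster_properties my-cluster us-central1 | grep eventLog
```

If the second command shows no `eventLog` entry, the cluster is most likely using the default location in the temp bucket.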

## Installation

Pick one method. All methods create the same resources.

{% tabs %}
{% tab title="Google Cloud Console (UI)" %}

### Step 1: Create a service account

1. Open **IAM & Admin → Service Accounts**.
2. Click **Create service account**.
3. Use:
   * Name: `dataflint-events-reader`
   * Description: `Read-only access to Dataproc Spark event logs`
4. Click **Done**.

### Step 2: Create a service account key (JSON)

1. Open the service account.
2. Go to **Keys**.
3. Click **Add key → Create new key**.
4. Select **JSON**.
5. Download and store the file securely.

### Step 3: Grant Dataproc viewer permissions

1. Open **IAM & Admin → IAM**.
2. Click **Grant access**.
3. Principal: `dataflint-events-reader@<PROJECT_ID>.iam.gserviceaccount.com`
4. Role: **Dataproc Viewer** (`roles/dataproc.viewer`)

### Step 4: Grant bucket read access (event logs)

1. Open **Cloud Storage → Buckets**.
2. Find the Dataproc temp bucket.
   * It usually looks like `dataproc-temp-<region>-<project-number>-...`
3. Open **Permissions**.
4. Click **Grant access**.
5. Add the same principal as above.
6. Role: **Storage Object Viewer** (`roles/storage.objectViewer`)

{% endtab %}

{% tab title="gcloud / gsutil (CLI)" %}

### Prerequisites

```bash
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
```

### Create service account

```bash
gcloud iam service-accounts create dataflint-events-reader \
  --display-name="DataFlint Spark Events Reader" \
  --description="Read-only access to Dataproc Spark event logs"
```

### Create key

```bash
gcloud iam service-accounts keys create dataflint-sa-key.json \
  --iam-account=dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

### Grant Dataproc Viewer (project-level)

```bash
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataproc.viewer" \
  --condition=None
```

### Grant Storage Object Viewer (bucket-level)

```bash
# List candidate temp buckets
gsutil ls | grep dataproc-temp || true

# Grant read-only access (repeat for every relevant bucket)
gsutil iam ch \
  serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com:objectViewer \
  gs://dataproc-temp-REGION-PROJECT_NUMBER-SUFFIX
```
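
To confirm the binding landed, you can inspect the bucket's IAM policy. This is a minimal sketch: `bucket_has_member` is a hypothetical helper, and it only checks that the service account appears somewhere in the policy, not which role it holds.

```shell
# Hypothetical check: succeed if the service account appears in the
# bucket's IAM policy. Arguments: bucket URL (gs://...), service account email.
bucket_has_member() {
  # -F treats the email as a literal string, so dots are not wildcards.
  gsutil iam get "$1" | grep -qF "serviceAccount:$2"
}

# Usage (placeholder values, requires gsutil and permission to read the policy):
#   bucket_has_member gs://dataproc-temp-REGION-PROJECT_NUMBER-SUFFIX \
#     dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com \
#     && echo "binding present"
```

For a stricter check, read the full output of `gsutil iam get` and confirm the member is listed under `roles/storage.objectViewer`.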

### Optional: one-shot setup script

{% code title="setup-dataflint-dataproc-reader.sh" %}

```bash
#!/bin/bash
set -euo pipefail

# EDIT THESE
PROJECT_ID="your-project-id"
REGION="us-central1"
SA_NAME="dataflint-events-reader"
KEY_OUTPUT_PATH="./dataflint-sa-key.json"

SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

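echo "Creating service account (if missing)..."
# Check for the account first so real errors from `create` are not hidden.
if ! gcloud iam service-accounts describe "${SA_EMAIL}" \
  --project="${PROJECT_ID}" >/dev/null 2>&1; then
  gcloud iam service-accounts create "${SA_NAME}" \
    --display-name="DataFlint Spark Events Reader" \
    --description="Read-only access to Dataproc Spark event logs" \
    --project="${PROJECT_ID}"
fi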

echo "Creating key..."
gcloud iam service-accounts keys create "${KEY_OUTPUT_PATH}" \
  --iam-account="${SA_EMAIL}"

echo "Granting Dataproc Viewer..."
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dataproc.viewer" \
  --condition=None \
  --quiet

echo "Granting bucket access for dataproc temp buckets in region ${REGION}..."
for bucket in $(gsutil ls -p "${PROJECT_ID}" | grep "dataproc-temp-${REGION}" || true); do
  echo "  ${bucket}"
  gsutil iam ch "serviceAccount:${SA_EMAIL}:objectViewer" "${bucket}"
done

echo "Done."
echo "Service account: ${SA_EMAIL}"
echo "Key file: ${KEY_OUTPUT_PATH}"
```

{% endcode %}
{% endtab %}

{% tab title="Terraform" %}

### Terraform example

This creates:

* `dataflint-events-reader` service account
* a service account key
* `roles/dataproc.viewer` binding on the project
* `roles/storage.objectViewer` on your Dataproc temp bucket

{% code title="main.tf" %}

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP Project ID"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}

variable "dataproc_temp_bucket" {
  description = "Dataproc temp bucket name (without gs://)"
  type        = string
}

resource "google_service_account" "dataflint_events_reader" {
  account_id   = "dataflint-events-reader"
  display_name = "DataFlint Spark Events Reader"
  description  = "Read-only access to Dataproc Spark event logs"
}

resource "google_service_account_key" "dataflint_events_reader_key" {
  service_account_id = google_service_account.dataflint_events_reader.name
}

resource "google_project_iam_member" "dataproc_viewer" {
  project = var.project_id
  role    = "roles/dataproc.viewer"
  member  = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

resource "google_storage_bucket_iam_member" "bucket_viewer" {
  bucket = var.dataproc_temp_bucket
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

output "service_account_email" {
  value = google_service_account.dataflint_events_reader.email
}

output "service_account_key_base64" {
  value     = google_service_account_key.dataflint_events_reader_key.private_key
  sensitive = true
}
```

{% endcode %}

{% code title="terraform.tfvars" %}

```hcl
project_id            = "your-project-id"
region                = "us-central1"
dataproc_temp_bucket  = "dataproc-temp-us-central1-123456789-abcdefg"
```

{% endcode %}

### Apply and export the key

```bash
terraform init
terraform apply

terraform output -raw service_account_key_base64 | base64 -d > dataflint-sa-key.json
```

{% endtab %}
{% endtabs %}

## Validate the access (recommended)

Use the key to confirm that the service account can list clusters and read objects in the event-log bucket.

```bash
gcloud config set project YOUR_PROJECT_ID
gcloud auth activate-service-account \
  --key-file="$PWD/dataflint-sa-key.json"

# Should succeed (needs roles/dataproc.viewer)
gcloud dataproc clusters list --region=YOUR_REGION

# Should succeed (needs roles/storage.objectViewer)
gsutil ls gs://YOUR_DATAPROC_TEMP_BUCKET/
```

{% hint style="warning" %}
If you don't activate the service account, the test can succeed using your personal credentials.
{% endhint %}
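
One way to guard against that is to check which account `gcloud` is using before you run the tests. The sketch below is an assumption-laden example rather than a required step. The function names are hypothetical, and it relies on `gcloud auth list` reporting the active account.

```shell
# Print the account gcloud is currently using.
active_account() {
  gcloud auth list --filter="status:ACTIVE" --format="value(account)"
}

# Succeed only if the active account is the expected service account.
# Argument: expected service account email.
is_service_account_active() {
  [ "$(active_account)" = "$1" ]
}

# Usage (placeholder email):
#   is_service_account_active \
#     "dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
#     || echo "Warning: tests would run with a different account"
```

When you finish validating, `gcloud config set account YOUR_USER_ACCOUNT` switches back to your personal credentials.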

