💊 Dataproc SaaS Installation

Summary

This guide shows how to grant DataFlint read-only access to the Spark event logs and cluster metadata of your Google Cloud Dataproc clusters.

For the broader SaaS threat model and stability notes, see SaaS Security & Stability.

You will:

Create a dedicated GCP service account.
Grant minimal IAM roles for Dataproc + Cloud Storage.
Generate a JSON key (or use your preferred secret flow).
Share the credentials with DataFlint.

The entire process should take a few minutes.

Service account keys are sensitive credentials. Store them like passwords and share them only over an approved secure channel.

What DataFlint needs

Send DataFlint:

Project ID
Region(s) you run Dataproc in
Service account key (JSON) or an agreed alternative credential method
Dataproc temp bucket name(s) if you use custom buckets

How it works

Dataproc writes Spark event logs into a GCS bucket. By default this is the Dataproc temp bucket.

Common layout:

gs://dataproc-temp-<region>-<project-number>-<suffix>/
  <cluster-uuid>/
    spark-job-history/
      application_*

DataFlint reads only these logs and cluster metadata.
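
To make the layout above concrete, here is a minimal sketch that splits an event-log object path into its bucket, cluster UUID, and application ID. The function name parse_event_log_path and the sample path are hypothetical, and the sketch assumes the common layout shown above; a custom spark.eventLog.dir may use a different structure.

```shell
# Split an event-log object path into its parts.
# Assumes: gs://<bucket>/<cluster-uuid>/spark-job-history/<application-id>
parse_event_log_path() {
  path="${1#gs://}"            # drop the gs:// scheme
  bucket="${path%%/*}"         # text before the first slash
  rest="${path#*/}"            # everything after the bucket name
  cluster_uuid="${rest%%/*}"   # first path segment under the bucket
  app_id="${rest##*/}"         # last path segment
  printf 'bucket=%s\ncluster=%s\napp=%s\n' "$bucket" "$cluster_uuid" "$app_id"
}

# Example with a made-up path
parse_event_log_path "gs://dataproc-temp-us-central1-123456789-abcdefg/1111-2222/spark-job-history/application_1700000000000_0001"
```

The bucket value is the name you would grant roles/storage.objectViewer on.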

Required IAM permissions (minimal)

Grant the DataFlint service account:

Project-level: roles/dataproc.viewer
Bucket-level (on the event-log buckets): roles/storage.objectViewer

If you pointed Spark event logs at a custom bucket via spark:spark.eventLog.dir or History Server settings, grant roles/storage.objectViewer on that bucket too.
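
If you are unsure whether a cluster uses a custom event-log bucket, one way to check is to read the cluster's properties. The sketch below is an assumption-laden helper, not an official procedure: the function name custom_event_log_bucket is hypothetical, and it assumes the JSON output of gcloud dataproc clusters describe prints one property per line with the key "spark:spark.eventLog.dir". It prints nothing when the property is not set, which suggests the cluster uses the default location.

```shell
# Print the bucket name from a cluster's spark.eventLog.dir property, if set.
# Usage: custom_event_log_bucket CLUSTER_NAME REGION
custom_event_log_bucket() {
  cluster="$1"
  region="$2"
  # Describe the cluster as JSON, then pull the bucket out of the gs:// URI.
  gcloud dataproc clusters describe "$cluster" --region="$region" --format=json \
    | sed -n 's|.*"spark:spark\.eventLog\.dir": *"gs://\([^/"]*\).*|\1|p'
}

# Example (requires an authenticated gcloud):
#   custom_event_log_bucket my-cluster us-central1
```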

Installation

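Pick one method: the Google Cloud console (Steps 1 to 4 below), the gcloud CLI (from Prerequisites onward), or Terraform. All methods create the same resources.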

Step 1: Create a service account

Open IAM & Admin → Service Accounts.
Click Create service account.
Use:
- Name: dataflint-events-reader
- Description: Read-only access to Dataproc Spark event logs
Click Done.

Step 2: Create a service account key (JSON)

Open the service account.
Go to Keys.
Click Add key → Create new key.
Select JSON.
Download and store the file securely.
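
As a quick sanity check on the downloaded file, the sketch below confirms that it looks like a service account key, restricts its permissions to the current user, and prints the account email it belongs to. The function name check_key_file is hypothetical, and the check is a simple text match on the "type" and "client_email" fields rather than full JSON validation.

```shell
# Verify a downloaded key file and lock down its permissions.
# Usage: check_key_file ./dataflint-sa-key.json
check_key_file() {
  key_file="$1"
  [ -f "$key_file" ] || { echo "missing: $key_file" >&2; return 1; }
  # Service account keys carry "type": "service_account".
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file" \
    || { echo "not a service account key: $key_file" >&2; return 1; }
  # Readable and writable by the owner only.
  chmod 600 "$key_file"
  # Print the service account email stored in the key.
  sed -n 's/.*"client_email"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$key_file"
}
```

The printed email should match dataflint-events-reader@<PROJECT_ID>.iam.gserviceaccount.com.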

Step 3: Grant Dataproc viewer permissions

Open IAM & Admin → IAM.
Click Grant access.
Principal: dataflint-events-reader@<PROJECT_ID>.iam.gserviceaccount.com
Role: Dataproc Viewer (roles/dataproc.viewer)

Step 4: Grant bucket read access (event logs)

Open Cloud Storage → Buckets.
Find the Dataproc temp bucket.
- It usually looks like dataproc-temp-<region>-<project-number>-...
Open Permissions.
Click Grant access.
Add the same principal as above.
Role: Storage Object Viewer (roles/storage.objectViewer)

Prerequisites

gcloud auth login
gcloud config set project YOUR_PROJECT_ID

Create service account

gcloud iam service-accounts create dataflint-events-reader \
  --display-name="DataFlint Spark Events Reader" \
  --description="Read-only access to Dataproc Spark event logs"

Create key

gcloud iam service-accounts keys create dataflint-sa-key.json \
  --iam-account=dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com

Grant Dataproc Viewer (project-level)

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataproc.viewer" \
  --condition=None

Grant Storage Object Viewer (bucket-level)

# List candidate temp buckets
gsutil ls | grep dataproc-temp || true

# Grant read-only access (repeat for every relevant bucket)
gsutil iam ch \
  serviceAccount:dataflint-events-reader@YOUR_PROJECT_ID.iam.gserviceaccount.com:objectViewer \
  gs://dataproc-temp-REGION-PROJECT_NUMBER-SUFFIX

Optional: one-shot setup script

setup-dataflint-dataproc-reader.sh

#!/bin/bash
set -euo pipefail

# EDIT THESE
PROJECT_ID="your-project-id"
REGION="us-central1"
SA_NAME="dataflint-events-reader"
KEY_OUTPUT_PATH="./dataflint-sa-key.json"

SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

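echo "Creating service account (if missing)..."
# Check for the account first so that real errors (for example, missing
# permissions) are not silently swallowed.
if ! gcloud iam service-accounts describe "${SA_EMAIL}" \
  --project="${PROJECT_ID}" >/dev/null 2>&1; then
  gcloud iam service-accounts create "${SA_NAME}" \
    --display-name="DataFlint Spark Events Reader" \
    --description="Read-only access to Dataproc Spark event logs" \
    --project="${PROJECT_ID}"
fi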

echo "Creating key..."
gcloud iam service-accounts keys create "${KEY_OUTPUT_PATH}" \
  --iam-account="${SA_EMAIL}"

echo "Granting Dataproc Viewer..."
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dataproc.viewer" \
  --condition=None \
  --quiet

echo "Granting bucket access for dataproc temp buckets in region ${REGION}..."
# -p lists buckets in PROJECT_ID rather than in gsutil's default project
for bucket in $(gsutil ls -p "${PROJECT_ID}" | grep "dataproc-temp-${REGION}" || true); do
  echo "  ${bucket}"
  gsutil iam ch "serviceAccount:${SA_EMAIL}:objectViewer" "${bucket}"
done

echo "Done."
echo "Service account: ${SA_EMAIL}"
echo "Key file: ${KEY_OUTPUT_PATH}"
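
Note that the script creates a new key every time it runs. If you expect to rerun it, a guard like the one sketched below avoids piling up unused keys. The function name create_key_once is hypothetical; it only checks whether a non-empty file already exists at the output path, not whether that key is still valid.

```shell
# Create a service account key only if no key file exists at the target path.
# Usage: create_key_once SA_EMAIL KEY_OUTPUT_PATH
create_key_once() {
  sa_email="$1"
  key_path="$2"
  if [ -s "$key_path" ]; then
    echo "Key already exists at ${key_path}; skipping key creation."
    return 0
  fi
  gcloud iam service-accounts keys create "$key_path" \
    --iam-account="$sa_email"
}

# Example, as a replacement for the "Creating key..." step:
#   create_key_once "${SA_EMAIL}" "${KEY_OUTPUT_PATH}"
```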

Terraform example

This creates:

dataflint-events-reader service account
a service account key
roles/dataproc.viewer binding on the project
roles/storage.objectViewer on your Dataproc temp bucket

main.tf

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP Project ID"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}

variable "dataproc_temp_bucket" {
  description = "Dataproc temp bucket name (without gs://)"
  type        = string
}

resource "google_service_account" "dataflint_events_reader" {
  account_id   = "dataflint-events-reader"
  display_name = "DataFlint Spark Events Reader"
  description  = "Read-only access to Dataproc Spark event logs"
}

resource "google_service_account_key" "dataflint_events_reader_key" {
  service_account_id = google_service_account.dataflint_events_reader.name
}

resource "google_project_iam_member" "dataproc_viewer" {
  project = var.project_id
  role    = "roles/dataproc.viewer"
  member  = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

resource "google_storage_bucket_iam_member" "bucket_viewer" {
  bucket = var.dataproc_temp_bucket
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.dataflint_events_reader.email}"
}

output "service_account_email" {
  value = google_service_account.dataflint_events_reader.email
}

output "service_account_key_base64" {
  value     = google_service_account_key.dataflint_events_reader_key.private_key
  sensitive = true
}

terraform.tfvars

project_id            = "your-project-id"
region                = "us-central1"
dataproc_temp_bucket  = "dataproc-temp-us-central1-123456789-abcdefg"

Apply and export the key

terraform init
terraform apply

terraform output -raw service_account_key_base64 | base64 --decode > dataflint-sa-key.json

Validate the access (recommended)

Use the key to test that listing clusters and reading objects works.

gcloud config set project YOUR_PROJECT_ID
gcloud auth activate-service-account \
  --key-file="$PWD/dataflint-sa-key.json"

# Should succeed (needs roles/dataproc.viewer)
gcloud dataproc clusters list --region=YOUR_REGION

# Should succeed (needs roles/storage.objectViewer)
gsutil ls gs://YOUR_DATAPROC_TEMP_BUCKET/

If you don't activate the service account, the test can succeed using your personal credentials.
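
Activating the service account also changes the active gcloud account for later commands. The sketch below wraps the two checks and switches back to the previously active account afterwards. The function name validate_dataflint_access is hypothetical, and the sketch assumes gcloud and gsutil share the same credentials, as they do in a standard Google Cloud SDK install.

```shell
# Run the read checks as the service account, then restore the previous account.
# Usage: validate_dataflint_access KEY_FILE REGION BUCKET_NAME
validate_dataflint_access() {
  key_file="$1"
  region="$2"
  bucket="$3"
  # Remember which account was active before the test.
  previous="$(gcloud config get-value account 2>/dev/null)"
  gcloud auth activate-service-account --key-file="$key_file" || return 1
  status=0
  # Needs roles/dataproc.viewer
  gcloud dataproc clusters list --region="$region" >/dev/null || status=1
  # Needs roles/storage.objectViewer
  gsutil ls "gs://${bucket}/" >/dev/null || status=1
  # Switch back so later commands do not run as the service account.
  if [ -n "$previous" ]; then
    gcloud config set account "$previous" >/dev/null
  fi
  return "$status"
}

# Example:
#   validate_dataflint_access "$PWD/dataflint-sa-key.json" us-central1 YOUR_DATAPROC_TEMP_BUCKET
```

A non-zero exit status means at least one of the two checks failed.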

Last updated 21 days ago
