# EMR SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Amazon EMR metadata via **cross-account IAM role assumption**. It applies to **EMR on EC2**, **EMR Serverless**, and **EMR on EKS** (which IAM calls "EMR Containers" and addresses with the `emr-containers` action prefix).

For the broader SaaS threat model and stability notes, see [SaaS Security & Stability](https://dataflint.gitbook.io/dataflint-for-spark/saas/saas-security-and-stability).

You will:

1. Create a dedicated IAM role in your AWS account.
2. Add a trust policy that allows the DataFlint service role to assume it.
3. Attach a minimal read-only policy for EMR / EMR Containers APIs.
4. Share the role ARN + regions with DataFlint.

The entire process should take a few minutes.

{% hint style="info" %}
Repeat these steps for each AWS account you want DataFlint to read from.
{% endhint %}

{% hint style="warning" %}
Cross-account roles must use an **External ID**. Ask DataFlint for your `CUSTOMER_EXTERNAL_ID` and the DataFlint AWS account details.
{% endhint %}

### What DataFlint needs

Send DataFlint:

* **Role ARN** you created (one per AWS account).
* **Region(s)** where you run EMR (and/or EMR on EKS / EMR Containers).
* The **role name** (optional, helps troubleshooting).
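
Before sending the ARN, a quick format check can catch copy-paste mistakes. The following is a minimal sketch: the `is_role_arn` helper and its pattern are illustrative assumptions, not part of any DataFlint tooling, and the account ID shown is a placeholder.

```bash
# Hypothetical helper: succeeds when the argument looks like an IAM role ARN.
# The pattern is an assumption covering the aws, aws-cn, and aws-us-gov partitions.
is_role_arn() {
  printf '%s\n' "$1" |
    grep -Eq '^arn:aws(-cn|-us-gov)?:iam::[0-9]{12}:role/[A-Za-z0-9+=,.@_/-]+$'
}

# Example with a placeholder account ID.
if is_role_arn "arn:aws:iam::123456789012:role/dataflint-emr-read-only-role"; then
  echo "ARN format looks valid"
fi
```

The check only validates the shape of the string. It does not confirm that the role exists.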

### How it works

DataFlint assumes the role you create and calls **read-only** EMR APIs to:

* Discover clusters / virtual clusters.
* List and describe job runs and steps.
* Fetch application UI links when applicable (read-only).

{% hint style="info" %}
AWS classifies a few EMR “UI helper” APIs as write actions (for example `elasticmapreduce:CreatePersistentAppUI`). DataFlint uses them only to generate read-only UI access links.
{% endhint %}
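
As a rough illustration of the role assumption described in this section, the call made from the DataFlint side is conceptually similar to the sketch below. The function name and the session name are hypothetical. Only the principal named in your trust policy can run this call successfully, so it is shown here for understanding rather than for you to execute.

```bash
# Sketch of a cross-account assume-role call with an External ID.
# "dataflint-example-session" is a hypothetical session name.
assume_customer_role() {
  role_arn="$1"
  external_id="$2"
  aws sts assume-role \
    --role-arn "${role_arn}" \
    --role-session-name "dataflint-example-session" \
    --external-id "${external_id}" \
    --query 'Credentials' \
    --output json
}

# Usage (placeholder values):
#   assume_customer_role "arn:aws:iam::123456789012:role/dataflint-emr-read-only-role" "CUSTOMER_EXTERNAL_ID"
```

If the External ID does not match the `sts:ExternalId` condition in the trust policy, STS rejects the request, which is why the value must be kept in sync on both sides.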

### Required IAM permissions (minimal)

Use a dedicated policy attached to the role. This is the minimal set we currently require for EMR + EMR Containers read access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEmrContainersBasicAccess",
      "Effect": "Allow",
      "Action": [
        "emr-containers:ListVirtualClusters",
        "emr-containers:DescribeVirtualCluster",
        "emr-containers:ListJobRuns",
        "emr-containers:DescribeJobRun",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:DescribeEditor",
        "elasticmapreduce:DescribeJobFlows",
        "elasticmapreduce:DescribeSecurityConfiguration",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeReleaseLabel",
        "elasticmapreduce:GetBlockPublicAccessConfiguration",
        "elasticmapreduce:GetManagedScalingPolicy",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListEditors",
        "elasticmapreduce:ListInstanceFleets",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSecurityConfigurations",
        "elasticmapreduce:ListSteps",
        "elasticmapreduce:ListSupportedInstanceTypes",
        "elasticmapreduce:ViewEventsFromAllClustersInConsole"
      ],
      "Resource": "*"
    }
  ]
}
```

{% hint style="info" %}
If you want to scope down further (by region, tags, or resource ARNs), tell us your constraints and we’ll help tighten it.
{% endhint %}
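
To audit the policy before attaching it, you can list any action whose name does not start with a read-style verb. This is a rough sketch that assumes the policy is saved to a local file and that your `grep` supports `-o` (GNU and BSD grep do). The helper name is hypothetical. For the policy above, the only action it should print is `elasticmapreduce:CreatePersistentAppUI`.

```bash
# Hypothetical helper: print every EMR action in a policy file whose name
# does not start with a read-style verb (List, Describe, Get, View).
non_read_actions() {
  grep -oE '"(emr-containers|elasticmapreduce):[A-Za-z]+"' "$1" |
    tr -d '"' |
    grep -vE ':(List|Describe|Get|View)' || true
}

# Usage, assuming the policy was saved as permissions-policy.json:
#   non_read_actions permissions-policy.json
```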

## Installation

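Pick one method. All methods create an equivalent role with the same permissions. The CloudFormation template embeds the permissions as an inline policy, while the other methods create a separate managed policy.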

{% tabs %}
{% tab title="AWS Console (UI)" %}

### Step 1: Create the IAM policy

1. Open **IAM → Policies**.
2. Click **Create policy**.
3. Choose **JSON**.
4. Paste the minimal policy from **Required IAM permissions**.
5. Click **Next**.
6. Policy name: `DataflintEmrContainersReadOnly`
7. Create the policy.

### Step 2: Create the IAM role

1. Open **IAM → Roles**.
2. Click **Create role**.
3. Trusted entity type: **AWS account**.
4. Select **Another AWS account**.
5. Account ID: `DATAFLINT_ACCOUNT_ID` (get it from DataFlint).
6. Enable **Require external ID**.
7. External ID: `CUSTOMER_EXTERNAL_ID` (get it from DataFlint).
8. In **Add permissions**, attach `DataflintEmrContainersReadOnly`.
9. Role name: `dataflint-emr-read-only-role`
10. Create the role.

### Step 3: Update the trust policy (Principal role)

The console wizard typically creates the trust policy with the DataFlint account **root** principal (`arn:aws:iam::DATAFLINT_ACCOUNT_ID:root`). We require the DataFlint **service role** principal instead.

1. Open the new role.
2. Go to **Trust relationships → Edit trust policy**.
3. Use this trust policy (replace placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATAFLINT_ACCOUNT_ID:role/eks-dataflint-service-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CUSTOMER_EXTERNAL_ID"
        }
      }
    }
  ]
}
```
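
If you have the AWS CLI available, the sketch below is one way to confirm that the saved trust policy names a role principal rather than the account root. The helper name is hypothetical, and it assumes the trust policy has a single statement with a single `AWS` principal, as in the example above.

```bash
# Hypothetical helper: reports whether the role's trust policy names a role
# principal (expected) or the account root (needs the Step 3 edit).
check_trust_principal() {
  principal="$(aws iam get-role \
    --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument.Statement[0].Principal.AWS' \
    --output text)"
  case "${principal}" in
    *:role/*) echo "ok: ${principal}" ;;
    *:root)   echo "root principal found: update the trust policy"; return 1 ;;
    *)        echo "unexpected principal: ${principal}"; return 1 ;;
  esac
}

# Usage:
#   check_trust_principal dataflint-emr-read-only-role
```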

### Step 4: Copy the role ARN

Open the role summary and copy the **ARN**. You’ll share it with DataFlint.
{% endtab %}

{% tab title="AWS CLI" %}

### Prerequisites

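* AWS CLI v2 installed
* Credentials for the target AWS account, with permission to create IAM roles and policies

Confirm which account and identity the CLI is using: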

```bash
aws sts get-caller-identity
```
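
Because the setup is repeated per account, it can help to fail fast when the CLI points at the wrong one. A minimal sketch, assuming you know the 12-digit ID of the account you intend to configure (the helper name is hypothetical):

```bash
# Hypothetical helper: fails unless the active credentials belong to the expected account.
require_account() {
  actual="$(aws sts get-caller-identity --query 'Account' --output text)"
  if [ "${actual}" != "$1" ]; then
    echo "Active account is ${actual}, expected $1" >&2
    return 1
  fi
}

# Usage (placeholder account ID):
#   require_account 123456789012 && echo "Correct account"
```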

### Create the trust policy file

{% code title="trust-policy.json" %}

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATAFLINT_ACCOUNT_ID:role/eks-dataflint-service-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CUSTOMER_EXTERNAL_ID"
        }
      }
    }
  ]
}
```

{% endcode %}

### Create the permissions policy file

{% code title="permissions-policy.json" %}

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEmrContainersBasicAccess",
      "Effect": "Allow",
      "Action": [
        "emr-containers:ListVirtualClusters",
        "emr-containers:DescribeVirtualCluster",
        "emr-containers:ListJobRuns",
        "emr-containers:DescribeJobRun",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:DescribeEditor",
        "elasticmapreduce:DescribeJobFlows",
        "elasticmapreduce:DescribeSecurityConfiguration",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeReleaseLabel",
        "elasticmapreduce:GetBlockPublicAccessConfiguration",
        "elasticmapreduce:GetManagedScalingPolicy",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListEditors",
        "elasticmapreduce:ListInstanceFleets",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSecurityConfigurations",
        "elasticmapreduce:ListSteps",
        "elasticmapreduce:ListSupportedInstanceTypes",
        "elasticmapreduce:ViewEventsFromAllClustersInConsole"
      ],
      "Resource": "*"
    }
  ]
}
```

{% endcode %}

### Create role + policy and attach

Replace the placeholders in `trust-policy.json` before running:

* `DATAFLINT_ACCOUNT_ID`
* `CUSTOMER_EXTERNAL_ID`
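
If you prefer not to edit the file by hand, the substitution can be scripted. The sketch below uses `sed`. The helper name and the output file name are hypothetical, and it assumes neither value contains `|`, `&`, or a backslash, which `sed` would treat specially in this expression.

```bash
# Hypothetical helper: print a policy template with both placeholders filled in.
render_trust_policy() {
  template="$1"
  account_id="$2"
  external_id="$3"
  sed \
    -e "s|DATAFLINT_ACCOUNT_ID|${account_id}|g" \
    -e "s|CUSTOMER_EXTERNAL_ID|${external_id}|g" \
    "${template}"
}

# Usage (placeholder values):
#   render_trust_policy trust-policy.json 111122223333 example-external-id > trust-policy.rendered.json
```

If you write the result to a new file as in the usage line, point `--assume-role-policy-document` at that file instead of the template.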

```bash
ROLE_NAME="dataflint-emr-read-only-role"
POLICY_NAME="DataflintEmrContainersReadOnly"

aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role for DataFlint to access EMR in read-only mode"

POLICY_ARN="$(aws iam create-policy \
  --policy-name "${POLICY_NAME}" \
  --policy-document file://permissions-policy.json \
  --query 'Policy.Arn' \
  --output text)"

aws iam attach-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-arn "${POLICY_ARN}"

aws iam get-role --role-name "${ROLE_NAME}" --query 'Role.Arn' --output text
```

{% hint style="info" %}
If your org mandates a permissions boundary or specific tags, add them to the `create-role` call with `--permissions-boundary` and `--tags`.
{% endhint %}
{% endtab %}

{% tab title="Terraform" %}

### Terraform example

This creates:

* an IAM role `dataflint-emr-read-only-role`
* a minimal read-only IAM policy for EMR / EMR Containers
* the trust policy with `sts:ExternalId`

{% code title="main.tf" %}

```hcl
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {}

variable "dataflint_account_id" {
  description = "AWS account ID of DataFlint (provided by DataFlint)"
  type        = string
  default     = "975050001706"
}

variable "customer_external_id" {
  description = "External ID (provided by DataFlint)"
  type        = string
  sensitive   = true
}

variable "dataflint_service_role_name" {
  description = "Role name in the DataFlint account that will assume this role"
  type        = string
  default     = "eks-dataflint-service-role"
}

resource "aws_iam_role" "dataflint_emr_read_only" {
  name        = "dataflint-emr-read-only-role"
  description = "Role for DataFlint to access EMR / EMR Containers in read-only mode"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${var.dataflint_account_id}:role/${var.dataflint_service_role_name}"
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "sts:ExternalId" = var.customer_external_id
          }
        }
      }
    ]
  })

  tags = {
    Name      = "dataflint-emr-read-only-role"
    Purpose   = "DataFlint EMR Integration"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_policy" "emr_containers_read_only" {
  name        = "DataflintEmrContainersReadOnly"
  description = "Policy granting read-only access to EMR / EMR Containers for DataFlint"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEmrContainersBasicAccess"
        Effect = "Allow"
        Action = [
          "emr-containers:ListVirtualClusters",
          "emr-containers:DescribeVirtualCluster",
          "emr-containers:ListJobRuns",
          "emr-containers:DescribeJobRun",
          "elasticmapreduce:CreatePersistentAppUI",
          "elasticmapreduce:DescribePersistentAppUI",
          "elasticmapreduce:GetPersistentAppUIPresignedURL",
          "elasticmapreduce:DescribeCluster",
          "elasticmapreduce:DescribeEditor",
          "elasticmapreduce:DescribeJobFlows",
          "elasticmapreduce:DescribeSecurityConfiguration",
          "elasticmapreduce:DescribeStep",
          "elasticmapreduce:DescribeReleaseLabel",
          "elasticmapreduce:GetBlockPublicAccessConfiguration",
          "elasticmapreduce:GetManagedScalingPolicy",
          "elasticmapreduce:GetAutoTerminationPolicy",
          "elasticmapreduce:ListBootstrapActions",
          "elasticmapreduce:ListClusters",
          "elasticmapreduce:ListEditors",
          "elasticmapreduce:ListInstanceFleets",
          "elasticmapreduce:ListInstanceGroups",
          "elasticmapreduce:ListInstances",
          "elasticmapreduce:ListSecurityConfigurations",
          "elasticmapreduce:ListSteps",
          "elasticmapreduce:ListSupportedInstanceTypes",
          "elasticmapreduce:ViewEventsFromAllClustersInConsole"
        ]
        Resource = "*"
      }
    ]
  })

  tags = {
    Name      = "DataflintEmrContainersReadOnly"
    Purpose   = "DataFlint EMR Integration"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_role_policy_attachment" "dataflint_emr_policy_attachment" {
  role       = aws_iam_role.dataflint_emr_read_only.name
  policy_arn = aws_iam_policy.emr_containers_read_only.arn
}

output "role_arn" {
  description = "Provide this ARN to DataFlint"
  value       = aws_iam_role.dataflint_emr_read_only.arn
}
```

{% endcode %}

#### Apply

```bash
terraform init
terraform apply
```
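
The `customer_external_id` variable has no default, so Terraform prompts for it unless a value is supplied. One option, sketched below, is to pass it through a `TF_VAR_` environment variable and then read the `role_arn` output. The wrapper function name is hypothetical, and `terraform apply` still asks for confirmation as usual.

```bash
# Sketch: supply the external ID via the environment, apply, then print the role ARN.
# The body runs in a subshell so the exported variable does not leak into your session.
apply_dataflint_role() (
  export TF_VAR_customer_external_id="$1"
  terraform apply && terraform output -raw role_arn
)

# Usage (placeholder value):
#   apply_dataflint_role "CUSTOMER_EXTERNAL_ID"
```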

{% endtab %}

{% tab title="CloudFormation" %}

### CloudFormation template

This stack creates the role with an inline policy. Because the role has a custom name, deploying the stack requires the `CAPABILITY_NAMED_IAM` capability.

{% code title="dataflint-emr-read-only-role.yaml" %}

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "DataFlint - EMR / EMR Containers read-only cross-account role"

Parameters:
  RoleName:
    Type: String
    Default: dataflint-emr-read-only-role
    Description: "Name of the IAM role to create"

  DataflintAccountId:
    Type: String
    Default: "975050001706"
    Description: "AWS account ID of DataFlint"

  DataflintServiceRoleName:
    Type: String
    Default: eks-dataflint-service-role
    Description: "Role name in the DataFlint account that will assume this role"

  CustomerExternalId:
    Type: String
    NoEcho: true
    Description: "External ID (provided by DataFlint)"

Resources:
  DataflintEmrReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref RoleName
      Description: "Role for DataFlint to access EMR / EMR Containers in read-only mode"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${DataflintAccountId}:role/${DataflintServiceRoleName}"
            Action: "sts:AssumeRole"
            Condition:
              StringEquals:
                sts:ExternalId: !Ref CustomerExternalId
      Policies:
        - PolicyName: DataflintEmrContainersReadOnly
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Sid: AllowEmrContainersBasicAccess
                Effect: Allow
                Action:
                  - emr-containers:ListVirtualClusters
                  - emr-containers:DescribeVirtualCluster
                  - emr-containers:ListJobRuns
                  - emr-containers:DescribeJobRun
                  - elasticmapreduce:CreatePersistentAppUI
                  - elasticmapreduce:DescribePersistentAppUI
                  - elasticmapreduce:GetPersistentAppUIPresignedURL
                  - elasticmapreduce:DescribeCluster
                  - elasticmapreduce:DescribeEditor
                  - elasticmapreduce:DescribeJobFlows
                  - elasticmapreduce:DescribeSecurityConfiguration
                  - elasticmapreduce:DescribeStep
                  - elasticmapreduce:DescribeReleaseLabel
                  - elasticmapreduce:GetBlockPublicAccessConfiguration
                  - elasticmapreduce:GetManagedScalingPolicy
                  - elasticmapreduce:GetAutoTerminationPolicy
                  - elasticmapreduce:ListBootstrapActions
                  - elasticmapreduce:ListClusters
                  - elasticmapreduce:ListEditors
                  - elasticmapreduce:ListInstanceFleets
                  - elasticmapreduce:ListInstanceGroups
                  - elasticmapreduce:ListInstances
                  - elasticmapreduce:ListSecurityConfigurations
                  - elasticmapreduce:ListSteps
                  - elasticmapreduce:ListSupportedInstanceTypes
                  - elasticmapreduce:ViewEventsFromAllClustersInConsole
                Resource: "*"

Outputs:
  RoleArn:
    Description: "Provide this ARN to DataFlint"
    Value: !GetAtt DataflintEmrReadOnlyRole.Arn
```

{% endcode %}

### Deploy via Console

1. Open **CloudFormation → Create stack**.
2. Upload the template file.
3. Fill:
   * `DataflintAccountId`
   * `CustomerExternalId`
4. Acknowledge that the stack creates IAM resources with custom names.
5. Create stack.

### Deploy via AWS CLI

```bash
aws cloudformation create-stack \
  --stack-name dataflint-emr-read-only \
  --template-body file://dataflint-emr-read-only-role.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=DataflintAccountId,ParameterValue=DATAFLINT_ACCOUNT_ID \
    ParameterKey=CustomerExternalId,ParameterValue=CUSTOMER_EXTERNAL_ID \
    ParameterKey=RoleName,ParameterValue=dataflint-emr-read-only-role
```
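
`create-stack` returns before the resources exist. A minimal sketch for waiting on the stack and reading the `RoleArn` output follows. The function name is hypothetical, and the default stack name matches the command above.

```bash
# Sketch: wait for stack creation to finish, then print the RoleArn output.
stack_role_arn() {
  stack_name="${1:-dataflint-emr-read-only}"
  aws cloudformation wait stack-create-complete --stack-name "${stack_name}" || return 1
  aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query "Stacks[0].Outputs[?OutputKey=='RoleArn'].OutputValue" \
    --output text
}

# Usage:
#   stack_role_arn dataflint-emr-read-only
```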

{% endtab %}
{% endtabs %}

## Validate the setup (recommended)

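You can validate that the role exists and has the expected policy. The Console, CLI, and Terraform methods attach a managed policy, which appears under `list-attached-role-policies`. The CloudFormation method uses an inline policy, which appears under `list-role-policies` instead.

```bash
aws iam get-role --role-name dataflint-emr-read-only-role
aws iam list-attached-role-policies --role-name dataflint-emr-read-only-role
aws iam list-role-policies --role-name dataflint-emr-read-only-role
```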

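You can also validate permissions with the IAM policy simulator. Your own identity needs `iam:SimulatePrincipalPolicy`, and each simulated action should report `allowed`: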

```bash
ROLE_ARN="$(aws iam get-role --role-name dataflint-emr-read-only-role --query 'Role.Arn' --output text)"

aws iam simulate-principal-policy \
  --policy-source-arn "${ROLE_ARN}" \
  --action-names emr-containers:ListVirtualClusters elasticmapreduce:ListClusters \
  --output text
```
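
To turn the simulation into a pass/fail check, the results can be reduced to action and decision pairs and filtered. This is a sketch under the assumption that you add `--query 'EvaluationResults[].[EvalActionName,EvalDecision]'` to the command above, which produces one tab-separated row per action in text output. The helper name is hypothetical.

```bash
# Hypothetical helper: reads "action<TAB>decision" rows on stdin and fails
# if any decision is not "allowed" (for example implicitDeny or explicitDeny).
all_allowed() {
  awk -F '\t' '$2 != "allowed" { print "DENIED: " $1; bad = 1 } END { exit bad + 0 }'
}

# Usage:
#   aws iam simulate-principal-policy \
#     --policy-source-arn "${ROLE_ARN}" \
#     --action-names emr-containers:ListVirtualClusters elasticmapreduce:ListClusters \
#     --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
#     --output text | all_allowed && echo "All simulated actions are allowed"
```

Note that empty input also passes, so check that the simulate command itself succeeded.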

## Send the details to DataFlint

Share over your approved secure channel:

* Role ARN
* Regions for EMR / EMR Containers
