# EMR SaaS Installation

## Summary

This guide shows how to grant DataFlint **read-only** access to Amazon EMR metadata via **cross-account IAM role assumption**. It applies to **EMR on EC2**, **EMR Serverless**, and **EMR on EKS** (which IAM calls "EMR Containers"; its actions use the `emr-containers:` prefix).

For the broader SaaS threat model and stability notes, see [SaaS Security & Stability](/dataflint-for-spark/saas/saas-security-and-stability.md).

You will:

1. Create a dedicated IAM role in your AWS account.
2. Add a trust policy that allows the DataFlint service role to assume it.
3. Attach a minimal read-only policy for EMR / EMR Containers APIs.
4. Share the role ARN + regions with DataFlint.

The entire process should take a few minutes.

{% hint style="info" %}
Repeat this setup for each AWS account you want DataFlint to read from.
{% endhint %}

{% hint style="warning" %}
Cross-account roles must use an **External ID**. Ask DataFlint for your `CUSTOMER_EXTERNAL_ID` and the DataFlint AWS account details.
{% endhint %}

### What DataFlint needs

Send DataFlint:

* **Role ARN** you created (one per AWS account).
* **Region(s)** where you run EMR (and/or EMR on EKS / EMR Containers).
* The **role name** (optional, helps troubleshooting).
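
A quick format check before sending can catch copy-paste mistakes such as a truncated account ID. The following is a minimal sketch in Python (standard library only). The helper name `parse_role_arn` is hypothetical and not part of any DataFlint tooling, and it only checks the ARN's shape, not that the role exists.

```python
import re

# Shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account>:role/<optional path/><role name>
_ROLE_ARN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/"
    r"((?:[\w+=,.@-]+/)*)([\w+=,.@-]{1,64})",
    re.ASCII,
)


def parse_role_arn(arn: str) -> dict:
    """Split a role ARN into its parts, or raise ValueError if the shape is wrong."""
    match = _ROLE_ARN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    partition, account_id, path, role_name = match.groups()
    return {
        "partition": partition,
        "account_id": account_id,
        "path": "/" + path,
        "role_name": role_name,
    }


# 123456789012 is a placeholder account ID used only for illustration.
print(parse_role_arn("arn:aws:iam::123456789012:role/dataflint-emr-read-only-role"))
```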

### How it works

DataFlint assumes the role you create and calls **read-only** EMR APIs to:

* Discover clusters / virtual clusters.
* List and describe job runs and steps.
* Fetch application UI links when applicable (read-only).

{% hint style="info" %}
AWS classifies a few EMR “UI helper” APIs as write actions (for example `elasticmapreduce:CreatePersistentAppUI`). DataFlint uses them only to generate read-only UI access links.
{% endhint %}
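
For orientation, the role assumption described above is a standard AWS STS `AssumeRole` request in which the caller supplies the External ID. The sketch below only assembles the request parameters; `RoleArn`, `RoleSessionName`, `ExternalId`, and `DurationSeconds` are the STS parameter names. The helper name, the session name, and the sample values are assumptions for illustration. This is not DataFlint's actual client code.

```python
def build_assume_role_params(
    role_arn: str,
    external_id: str,
    session_name: str = "dataflint-example-session",  # hypothetical session name
    duration_seconds: int = 3600,
) -> dict:
    """Assemble keyword arguments for an STS AssumeRole request."""
    # Length limits below follow the STS AssumeRole API constraints.
    if not 2 <= len(external_id) <= 1224:
        raise ValueError("ExternalId must be 2-1224 characters long")
    if not 2 <= len(session_name) <= 64:
        raise ValueError("RoleSessionName must be 2-64 characters long")
    if duration_seconds < 900:
        raise ValueError("DurationSeconds must be at least 900")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "ExternalId": external_id,
        "DurationSeconds": duration_seconds,
    }


# Placeholder values for illustration only.
params = build_assume_role_params(
    "arn:aws:iam::123456789012:role/dataflint-emr-read-only-role",
    "example-external-id",
)
print(params)

# With boto3 installed and credentials configured, a caller would pass these as:
#   boto3.client("sts").assume_role(**params)
```

If the `ExternalId` in the request does not match the `sts:ExternalId` condition in the role's trust policy, STS denies the request.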

### Required IAM permissions (minimal)

Use a dedicated policy attached to the role. This is the minimal set we currently require for EMR + EMR Containers read access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEmrContainersBasicAccess",
      "Effect": "Allow",
      "Action": [
        "emr-containers:ListVirtualClusters",
        "emr-containers:DescribeVirtualCluster",
        "emr-containers:ListJobRuns",
        "emr-containers:DescribeJobRun",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:DescribeEditor",
        "elasticmapreduce:DescribeJobFlows",
        "elasticmapreduce:DescribeSecurityConfiguration",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeReleaseLabel",
        "elasticmapreduce:GetBlockPublicAccessConfiguration",
        "elasticmapreduce:GetManagedScalingPolicy",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListEditors",
        "elasticmapreduce:ListInstanceFleets",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSecurityConfigurations",
        "elasticmapreduce:ListSteps",
        "elasticmapreduce:ListSupportedInstanceTypes",
        "elasticmapreduce:ViewEventsFromAllClustersInConsole"
      ],
      "Resource": "*"
    }
  ]
}
```

{% hint style="info" %}
If you want to scope down further (by region, tags, or resource ARNs), tell us your constraints and we’ll help tighten it.
{% endhint %}
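
If you edit the policy (for example, to scope it down), a small review script can confirm that it still touches only the two EMR service prefixes and contains no wildcards. This is a sketch under stated assumptions: `review_policy` is a hypothetical helper, and the "read-style" check is a name-based heuristic (`List`, `Describe`, `Get`, `View`), not AWS's own access-level classification.

```python
ALLOWED_SERVICES = {"elasticmapreduce", "emr-containers"}
READ_STYLE_PREFIXES = ("List", "Describe", "Get", "View")


def _as_list(value):
    """IAM accepts either a single item or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def review_policy(policy: dict) -> dict:
    """Return actions outside the expected services and actions without a read-style name."""
    unexpected = []
    not_read_named = []
    for statement in _as_list(policy["Statement"]):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            service, _, name = action.partition(":")
            if service not in ALLOWED_SERVICES or "*" in action:
                unexpected.append(action)
            elif not name.startswith(READ_STYLE_PREFIXES):
                not_read_named.append(action)
    return {"unexpected": unexpected, "not_read_named": not_read_named}


# A shortened copy of the policy above, for illustration.
sample = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "emr-containers:ListVirtualClusters",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:CreatePersistentAppUI",
            ],
            "Resource": "*",
        }
    ],
}
print(review_policy(sample))
```

For the shortened sample, the only action reported under `not_read_named` is `elasticmapreduce:CreatePersistentAppUI`, one of the UI helper actions mentioned earlier.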

## Installation

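Pick one method. All methods create an equivalent role; the CloudFormation template attaches the permissions as an inline policy, while the other methods create a separate managed policy.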

{% tabs %}
{% tab title="AWS Console (UI)" %}

### Step 1: Create the IAM policy

1. Open **IAM → Policies**.
2. Click **Create policy**.
3. Choose **JSON**.
4. Paste the minimal policy from **Required IAM permissions**.
5. Click **Next**.
6. Policy name: `DataflintEmrContainersReadOnly`
7. Create the policy.

### Step 2: Create the IAM role

1. Open **IAM → Roles**.
2. Click **Create role**.
3. Trusted entity type: **AWS account**.
4. Select **Another AWS account**.
5. Account ID: `DATAFLINT_ACCOUNT_ID` (get it from DataFlint).
6. Enable **Require external ID**.
7. External ID: `CUSTOMER_EXTERNAL_ID` (get it from DataFlint).
8. In **Add permissions**, attach `DataflintEmrContainersReadOnly`.
9. Role name: `dataflint-emr-read-only-role`
10. Create the role.

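### Step 3: Update the trust policy (service role principal)

The console wizard typically sets the trust policy principal to the DataFlint account's **root** (`arn:aws:iam::DATAFLINT_ACCOUNT_ID:root`), which trusts any authorized identity in that account. We require the narrower DataFlint **service role** principal instead.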

1. Open the new role.
2. Go to **Trust relationships → Edit trust policy**.
3. Use this trust policy (replace placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATAFLINT_ACCOUNT_ID:role/eks-dataflint-service-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CUSTOMER_EXTERNAL_ID"
        }
      }
    }
  ]
}
```

### Step 4: Copy the role ARN

Open the role summary and copy the **ARN**. You’ll share it with DataFlint.
{% endtab %}

{% tab title="AWS CLI" %}

### Prerequisites

* AWS CLI v2 installed
* Credentials configured for the target AWS account (confirm the active identity with the command below)

```bash
aws sts get-caller-identity
```

### Create the trust policy file

{% code title="trust-policy.json" %}

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATAFLINT_ACCOUNT_ID:role/eks-dataflint-service-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CUSTOMER_EXTERNAL_ID"
        }
      }
    }
  ]
}
```

{% endcode %}
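
If you would rather generate the file than hand-edit the placeholders, the following sketch renders the same trust policy from the two values DataFlint provides. It is an optional convenience, not a required step. `build_trust_policy` is a hypothetical helper, and the account ID and External ID shown are placeholders.

```python
import json
import re


def build_trust_policy(
    dataflint_account_id: str,
    external_id: str,
    service_role_name: str = "eks-dataflint-service-role",
) -> dict:
    """Build the cross-account trust policy shown above as a Python dict."""
    if not re.fullmatch(r"\d{12}", dataflint_account_id):
        raise ValueError("AWS account IDs are exactly 12 digits")
    if not external_id or external_id == "CUSTOMER_EXTERNAL_ID":
        raise ValueError("replace the External ID placeholder with the value from DataFlint")
    principal = f"arn:aws:iam::{dataflint_account_id}:role/{service_role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration; use the real ones from DataFlint.
print(json.dumps(build_trust_policy("123456789012", "example-external-id"), indent=2))
```

Redirect the script's output to `trust-policy.json` to use it with the commands that follow.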

### Create the permissions policy file

{% code title="permissions-policy.json" %}

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEmrContainersBasicAccess",
      "Effect": "Allow",
      "Action": [
        "emr-containers:ListVirtualClusters",
        "emr-containers:DescribeVirtualCluster",
        "emr-containers:ListJobRuns",
        "emr-containers:DescribeJobRun",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:DescribeEditor",
        "elasticmapreduce:DescribeJobFlows",
        "elasticmapreduce:DescribeSecurityConfiguration",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeReleaseLabel",
        "elasticmapreduce:GetBlockPublicAccessConfiguration",
        "elasticmapreduce:GetManagedScalingPolicy",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListEditors",
        "elasticmapreduce:ListInstanceFleets",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSecurityConfigurations",
        "elasticmapreduce:ListSteps",
        "elasticmapreduce:ListSupportedInstanceTypes",
        "elasticmapreduce:ViewEventsFromAllClustersInConsole"
      ],
      "Resource": "*"
    }
  ]
}
```

{% endcode %}

### Create role + policy and attach

Replace these placeholders in `trust-policy.json` before running:

* `DATAFLINT_ACCOUNT_ID`
* `CUSTOMER_EXTERNAL_ID`

```bash
# Names used throughout this guide; change them if your org has naming rules.
ROLE_NAME="dataflint-emr-read-only-role"
POLICY_NAME="DataflintEmrContainersReadOnly"

# 1. Create the role with the cross-account trust policy.
aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role for DataFlint to access EMR in read-only mode"

# 2. Create the managed policy and capture its ARN.
POLICY_ARN="$(aws iam create-policy \
  --policy-name "${POLICY_NAME}" \
  --policy-document file://permissions-policy.json \
  --query 'Policy.Arn' \
  --output text)"

# 3. Attach the policy to the role.
aws iam attach-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-arn "${POLICY_ARN}"

# 4. Print the role ARN to share with DataFlint.
aws iam get-role --role-name "${ROLE_NAME}" --query 'Role.Arn' --output text
```

{% hint style="info" %}
If your org mandates a permissions boundary or specific tags, add them at `create-role`.
{% endhint %}
{% endtab %}

{% tab title="Terraform" %}

### Terraform example

This creates:

* an IAM role `dataflint-emr-read-only-role`
* a minimal read-only IAM policy for EMR / EMR Containers
* the trust policy with `sts:ExternalId`

{% code title="main.tf" %}

```hcl
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {}

variable "dataflint_account_id" {
  description = "AWS account ID of DataFlint (provided by DataFlint)"
  type        = string
  default     = "975050001706"
}

variable "customer_external_id" {
  description = "External ID (provided by DataFlint)"
  type        = string
  sensitive   = true
}

variable "dataflint_service_role_name" {
  description = "Role name in the DataFlint account that will assume this role"
  type        = string
  default     = "eks-dataflint-service-role"
}

resource "aws_iam_role" "dataflint_emr_read_only" {
  name        = "dataflint-emr-read-only-role"
  description = "Role for DataFlint to access EMR / EMR Containers in read-only mode"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${var.dataflint_account_id}:role/${var.dataflint_service_role_name}"
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "sts:ExternalId" = var.customer_external_id
          }
        }
      }
    ]
  })

  tags = {
    Name      = "dataflint-emr-read-only-role"
    Purpose   = "DataFlint EMR Integration"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_policy" "emr_containers_read_only" {
  name        = "DataflintEmrContainersReadOnly"
  description = "Policy granting read-only access to EMR / EMR Containers for DataFlint"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEmrContainersBasicAccess"
        Effect = "Allow"
        Action = [
          "emr-containers:ListVirtualClusters",
          "emr-containers:DescribeVirtualCluster",
          "emr-containers:ListJobRuns",
          "emr-containers:DescribeJobRun",
          "elasticmapreduce:CreatePersistentAppUI",
          "elasticmapreduce:DescribePersistentAppUI",
          "elasticmapreduce:GetPersistentAppUIPresignedURL",
          "elasticmapreduce:DescribeCluster",
          "elasticmapreduce:DescribeEditor",
          "elasticmapreduce:DescribeJobFlows",
          "elasticmapreduce:DescribeSecurityConfiguration",
          "elasticmapreduce:DescribeStep",
          "elasticmapreduce:DescribeReleaseLabel",
          "elasticmapreduce:GetBlockPublicAccessConfiguration",
          "elasticmapreduce:GetManagedScalingPolicy",
          "elasticmapreduce:GetAutoTerminationPolicy",
          "elasticmapreduce:ListBootstrapActions",
          "elasticmapreduce:ListClusters",
          "elasticmapreduce:ListEditors",
          "elasticmapreduce:ListInstanceFleets",
          "elasticmapreduce:ListInstanceGroups",
          "elasticmapreduce:ListInstances",
          "elasticmapreduce:ListSecurityConfigurations",
          "elasticmapreduce:ListSteps",
          "elasticmapreduce:ListSupportedInstanceTypes",
          "elasticmapreduce:ViewEventsFromAllClustersInConsole"
        ]
        Resource = "*"
      }
    ]
  })

  tags = {
    Name      = "DataflintEmrContainersReadOnly"
    Purpose   = "DataFlint EMR Integration"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_role_policy_attachment" "dataflint_emr_policy_attachment" {
  role       = aws_iam_role.dataflint_emr_read_only.name
  policy_arn = aws_iam_policy.emr_containers_read_only.arn
}

output "role_arn" {
  description = "Provide this ARN to DataFlint"
  value       = aws_iam_role.dataflint_emr_read_only.arn
}
```

{% endcode %}

#### Apply

```bash
terraform init
terraform apply
```

{% endtab %}

{% tab title="CloudFormation" %}

### CloudFormation template

This stack creates the role with the permissions attached as an **inline** policy. Because the role has a custom name, deployment requires the `CAPABILITY_NAMED_IAM` capability.

{% code title="dataflint-emr-read-only-role.yaml" %}

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "DataFlint - EMR / EMR Containers read-only cross-account role"

Parameters:
  RoleName:
    Type: String
    Default: dataflint-emr-read-only-role
    Description: "Name of the IAM role to create"

  DataflintAccountId:
    Type: String
    Default: "975050001706"
    Description: "AWS account ID of DataFlint"

  DataflintServiceRoleName:
    Type: String
    Default: eks-dataflint-service-role
    Description: "Role name in the DataFlint account that will assume this role"

  CustomerExternalId:
    Type: String
    NoEcho: true
    Description: "External ID (provided by DataFlint)"

Resources:
  DataflintEmrReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref RoleName
      Description: "Role for DataFlint to access EMR / EMR Containers in read-only mode"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${DataflintAccountId}:role/${DataflintServiceRoleName}"
            Action: "sts:AssumeRole"
            Condition:
              StringEquals:
                sts:ExternalId: !Ref CustomerExternalId
      Policies:
        - PolicyName: DataflintEmrContainersReadOnly
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Sid: AllowEmrContainersBasicAccess
                Effect: Allow
                Action:
                  - emr-containers:ListVirtualClusters
                  - emr-containers:DescribeVirtualCluster
                  - emr-containers:ListJobRuns
                  - emr-containers:DescribeJobRun
                  - elasticmapreduce:CreatePersistentAppUI
                  - elasticmapreduce:DescribePersistentAppUI
                  - elasticmapreduce:GetPersistentAppUIPresignedURL
                  - elasticmapreduce:DescribeCluster
                  - elasticmapreduce:DescribeEditor
                  - elasticmapreduce:DescribeJobFlows
                  - elasticmapreduce:DescribeSecurityConfiguration
                  - elasticmapreduce:DescribeStep
                  - elasticmapreduce:DescribeReleaseLabel
                  - elasticmapreduce:GetBlockPublicAccessConfiguration
                  - elasticmapreduce:GetManagedScalingPolicy
                  - elasticmapreduce:GetAutoTerminationPolicy
                  - elasticmapreduce:ListBootstrapActions
                  - elasticmapreduce:ListClusters
                  - elasticmapreduce:ListEditors
                  - elasticmapreduce:ListInstanceFleets
                  - elasticmapreduce:ListInstanceGroups
                  - elasticmapreduce:ListInstances
                  - elasticmapreduce:ListSecurityConfigurations
                  - elasticmapreduce:ListSteps
                  - elasticmapreduce:ListSupportedInstanceTypes
                  - elasticmapreduce:ViewEventsFromAllClustersInConsole
                Resource: "*"

Outputs:
  RoleArn:
    Description: "Provide this ARN to DataFlint"
    Value: !GetAtt DataflintEmrReadOnlyRole.Arn
```

{% endcode %}

### Deploy via Console

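1. Open **CloudFormation → Create stack**.
2. Upload the template file.
3. Fill in the parameters:
   * `DataflintAccountId`
   * `CustomerExternalId`
4. Acknowledge that the template creates IAM resources with custom names.
5. Create the stack, then copy `RoleArn` from the stack's **Outputs** tab.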

### Deploy via AWS CLI

```bash
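# Replace DATAFLINT_ACCOUNT_ID and CUSTOMER_EXTERNAL_ID before running.
aws cloudformation create-stack \
  --stack-name dataflint-emr-read-only \
  --template-body file://dataflint-emr-read-only-role.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=DataflintAccountId,ParameterValue=DATAFLINT_ACCOUNT_ID \
    ParameterKey=CustomerExternalId,ParameterValue=CUSTOMER_EXTERNAL_ID \
    ParameterKey=RoleName,ParameterValue=dataflint-emr-read-only-role

# Wait for creation to finish, then print the role ARN from the stack outputs.
aws cloudformation wait stack-create-complete --stack-name dataflint-emr-read-only
aws cloudformation describe-stacks \
  --stack-name dataflint-emr-read-only \
  --query "Stacks[0].Outputs[?OutputKey=='RoleArn'].OutputValue" \
  --output text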
```

{% endtab %}
{% endtabs %}

## Validate the setup (recommended)

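You can validate that the role exists and has the expected policy. The console, CLI, and Terraform methods attach a managed policy; the CloudFormation template uses an inline policy, so check the list that matches your method.

```bash
aws iam get-role --role-name dataflint-emr-read-only-role
# Managed policies (console, CLI, Terraform):
aws iam list-attached-role-policies --role-name dataflint-emr-read-only-role
# Inline policies (CloudFormation):
aws iam list-role-policies --role-name dataflint-emr-read-only-role
```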

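You can also validate permissions with the IAM policy simulator. The caller needs `iam:SimulatePrincipalPolicy`, and each listed action should evaluate to `allowed`: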

```bash
ROLE_ARN="$(aws iam get-role --role-name dataflint-emr-read-only-role --query 'Role.Arn' --output text)"

aws iam simulate-principal-policy \
  --policy-source-arn "${ROLE_ARN}" \
  --action-names emr-containers:ListVirtualClusters elasticmapreduce:ListClusters \
  --output text
```
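
To check the simulator's result programmatically, run the same command with `--output json` and scan the `EvaluationResults` entries. The sketch below assumes that JSON shape (`EvalActionName` and `EvalDecision` fields); `denied_actions` is a hypothetical helper, and the sample payload is hand-written for illustration, not real output.

```python
import json


def denied_actions(simulation_output: str) -> list:
    """Return the action names whose decision is anything other than 'allowed'."""
    results = json.loads(simulation_output)["EvaluationResults"]
    return [r["EvalActionName"] for r in results if r["EvalDecision"] != "allowed"]


# Hand-written sample in the shape of `simulate-principal-policy --output json`.
sample = json.dumps(
    {
        "EvaluationResults": [
            {"EvalActionName": "emr-containers:ListVirtualClusters", "EvalDecision": "allowed"},
            {"EvalActionName": "elasticmapreduce:ListClusters", "EvalDecision": "implicitDeny"},
        ]
    }
)
print(denied_actions(sample))
```

An empty list means every simulated action was allowed. An `implicitDeny` usually means the policy is missing or not attached to the role.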

## Send the details to DataFlint

Share over your approved secure channel:

* Role ARN
* Regions for EMR / EMR Containers

