# SaaS Security & Stability

## Summary

This document explains the DataFlint architecture and why it is both secure and will not hurt your Spark jobs' performance or stability.

This document refers to the DataFlint SaaS offering. For security & stability considerations of the open source offering, see [security-and-stability](https://dataflint.gitbook.io/dataflint-for-spark/overview/security-and-stability "mention")

## TLDR

1. **DataFlint SaaS does not access or export anything related to your data**, only non-sensitive metadata about your job performance. \[except for extreme edge cases - See [#appendix-edge-cases-when-pii-might-get-sent](#appendix-edge-cases-when-pii-might-get-sent "mention")]
2. DataFlint SaaS exports your job performance metadata via the **DataFlint open source library to an S3 bucket, using an IAM user token.**
   1. For Databricks, exporting is done via the **Databricks API**
3. **You don't need to give any permissions to your environment.**
4. You get access to DataFlint via a web portal with **OAuth authentication.**

## SaaS Interfaces

Deployment options:

1. Databricks
   1. Fully SaaS
   2. Vendor account for Databricks API access and processing
2. DataProc/K8s/Standalone
   1. Fully SaaS
   2. Vendor account for uploading run summary and processing
3. EMR
   1. Fully SaaS
   2. Vendor account for assuming EMR read-only role to production account and processing

### For EMR:

#### Fully SaaS Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FeFwM4hDFjDi8kG6nVncD%2Fimage.png?alt=media&#x26;token=ae135f7d-28fb-4874-8835-f480358fd098" alt=""><figcaption></figcaption></figure>

#### Vendor Account Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FhHW6HcJA5MVfRzfEf73m%2Fimage.png?alt=media&#x26;token=56f6f2fe-6189-497f-a0b8-b3a771caf6aa" alt=""><figcaption></figcaption></figure>

### For DataProc/K8s:

#### Fully SaaS Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FMwBYK7eE478T3wP6rDqZ%2Fimage.png?alt=media&#x26;token=dd292380-ba80-43e2-84d1-10a953907d61" alt=""><figcaption></figcaption></figure>

#### Vendor Account Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FYO6lOuetnbRTe7JG7MyQ%2Fimage.png?alt=media&#x26;token=362263c4-7cda-4996-b265-d3c31d343a40" alt=""><figcaption></figcaption></figure>

### For Databricks:

#### Fully SaaS Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FpiAIVMnv4rLzW9gfSjHP%2Fimage.png?alt=media&#x26;token=c1e1968f-705c-48f7-98df-682b54da614f" alt=""><figcaption></figcaption></figure>

#### Vendor Account Option:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FQJrdsoJRZaRDNJL1VmNg%2Fimage.png?alt=media&#x26;token=2a2aa26c-b65f-4cd3-ae77-54b3e48f3d79" alt=""><figcaption></figcaption></figure>

DataFlint SaaS exposes 2 interfaces to your organization:

1. Applications **Exporting** Performance Data
2. Developers **Accessing** DataFlint SaaS Web Client

Internally, in the DataFlint account, your data is processed to make it available to the web client. Your job performance metadata, and any derivatives of it, **does not leave the DataFlint AWS account and is not shared with any 3rd** party.

In both SaaS and Vendor Account deployments, your metrics and insights are stored in a tenant DB for access via the Web UI.

In Vendor Account mode, raw data handling (job summary metadata) and Databricks API access are done single-tenant from your environment, while in fully SaaS mode they are done from the DataFlint account in a multi-tenant environment.

## 1. Applications **Exporting** Performance Data

### For Databricks

The exporting of your application performance data is done via **Databricks API**.

The Databricks API key issued to DataFlint SaaS needs minimal permissions: only read-only access to cluster information (such as metadata, events and logs).

For more information, see the [Databricks Docs](https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html#cluster-permissions) on Cluster Access Control (especially the blue info box, which relates to the PII edge cases covered in [#appendix-edge-cases-when-pii-might-get-sent](#appendix-edge-cases-when-pii-might-get-sent "mention"))

It means:

1. The Databricks API key **does** **not have permission/access to your production** **data,** only metadata about your job's performance (i.e. any information already available in the clusters UI, Spark UI, or Spark event log)
2. The Databricks API key **is read-only, and cannot restart or affect clusters**
3. **You don't need to give any permissions to your cloud environment**.
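As an illustration of the kind of read-only access involved, a minimal sketch of a call to the Databricks Clusters API list endpoint is shown below. The host and token values are placeholders, and this is an illustration only, not DataFlint's actual client code:

```python
# Sketch of a read-only Databricks REST call (illustrative, not DataFlint code).
# The Clusters API list endpoint returns cluster metadata only - no job data.
import json
import urllib.request


def build_clusters_list_request(host: str, token: str) -> urllib.request.Request:
    """Build a GET request for the Clusters API list endpoint (read-only)."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_clusters(host: str, token: str) -> dict:
    """Fetch cluster metadata; a key without read access would get a 403 here."""
    with urllib.request.urlopen(build_clusters_list_request(host, token)) as resp:
        return json.loads(resp.read())
```

Because the request is a plain GET against a list endpoint, a key scoped this way cannot start, restart, or terminate clusters.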

### For EMR (both classic and serverless)

The exporting of your application performance data is done via **IAM assume role**.

The role we assume needs minimal permissions: just read-only permissions for EMR (such as metadata, events and logs).

See: <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-policy-readonly-v2.html>
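As a sketch of the cross-account setup, the trust policy on such a read-only role could look like the following. The account ID and external ID are illustrative placeholders, not real DataFlint values; the role's actual permissions would come from the AWS managed read-only policy linked above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "example-external-id" } }
    }
  ]
}
```

The trust policy controls only *who* may assume the role; *what* the role can do is limited by the attached read-only EMR policy, so the vendor account never gains write access to your environment.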

## 2. Developers **Accessing** the DataFlint SaaS Web Client

The DataFlint SaaS Web Client enables you to continuously monitor all of your Spark applications.

It is a managed SaaS portal (meaning the backend runs in the DataFlint AWS account) that you sign in to with OAuth (managed by Auth0), provided you are on an authorized-users list that you manage.

## Stability & Performance

1. Exporting is done at the end of the run, so **it won't affect your performance during the Spark job run**.
2. Exporting takes only a few seconds, and DataFlint logs how long the export took so you can track it, so it **has very low impact on your Spark job's resource usage**.
3. Exporting is done directly to S3, so the chance of data loss due to downtime is low.
4. If exporting fails, it won't fail your Spark job; it will only log an error.
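The fail-safe export behavior described above can be sketched as follows. This is a minimal illustration, not DataFlint's actual code; `upload` stands in for the real S3 upload call:

```python
# Sketch of a fail-safe, timed export step run once at the end of the job.
# Any failure is logged instead of raised, so it cannot fail the Spark job.
import logging
import time

logger = logging.getLogger("dataflint-export")


def export_run_summary(upload, payload: bytes) -> bool:
    """Attempt the export; return True on success, False (with a logged error) on failure."""
    start = time.monotonic()
    try:
        upload(payload)
    except Exception as exc:  # never propagate: exporting must not fail the job
        logger.error("Export failed: %s", exc)
        return False
    logger.info("Export took %.2f seconds", time.monotonic() - start)
    return True
```

Logging the elapsed time is what lets you verify, per run, that the export overhead really is only a few seconds.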

## Appendix: edge cases when PII might get sent

See this Medium article: <https://medium.com/@menishmueli/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c>\
The TLDR is that if you were to accidentally send PII to DataFlint SaaS, you would already be sending that PII to both logz.io/coralogix and your Spark event log store, neither of which is designed with PII in mind.

## Misc

### Review

DataFlint's architecture and security were reviewed by an AWS solutions architect. You can use him as a reference for DataFlint's security measures.

### Questions

If you have additional questions about security & stability, please contact us.
