πŸ”’ SaaS Security & Stability

Summary

This document explains the DataFlint architecture and why it is both secure and won't hurt your Spark jobs' performance or stability.

This document refers to the DataFlint SaaS offering. For security & stability considerations of the open source offering, see Security & Stability.

TLDR

  1. DataFlint SaaS does not access or export anything related to your data, only non-sensitive metadata about your job performance. [except for extreme edge cases - see Appendix: edge cases when PII might get sent]

  2. DataFlint SaaS will export your job performance metadata via the DataFlint open source library to an S3 bucket using an IAM user token.

    1. For Databricks, exporting is done via the Databricks API.

  3. You don't need to grant any permissions in your cloud environment.

  4. You get access to DataFlint via a web portal with OAuth authentication.

SaaS Interfaces

Deployment options:

  1. Databricks

    1. Fully SaaS

    2. Vendor account for Databricks API access and processing

  2. EMR/K8S/Standalone

    1. Fully SaaS

    2. Vendor account for uploading run summary and processing

For DataProc/EMR/K8s:

[Architecture diagrams: Fully SaaS option and Vendor Account option]

For Databricks:

[Architecture diagrams: Fully SaaS option and Vendor Account option]

DataFlint SaaS exposes two interfaces to your organization:

  1. Applications Exporting Performance Data

  2. Developers Accessing DataFlint SaaS Web Client

Internally, in the DataFlint account, your data is processed to be made available to the web client. Your job performance metadata, and any derivative of it, does not leave the DataFlint AWS account and is not shared with any 3rd party.

In both SaaS and Vendor Account deployments, your metrics and insights are stored in a tenant DB so they are accessible via the Web UI.

In Vendor Account mode, your raw data (job summary metadata) handling and Databricks API access are done single-tenant from your environment, while in fully SaaS mode they are done from the DataFlint account in a multi-tenant environment.

1. Applications Exporting Performance Data

For EMR/K8s/DataProc

The exporting of your application performance data is done via the DataFlint open source library (https://github.com/dataflint/spark), once you add an auth token (which is IAM user credentials) like this:
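Here is a minimal PySpark sketch of that setup. The plugin class and packages coordinate come from the DataFlint open source README; the token config key name and the package version shown here are illustrative assumptions, so check the DataFlint docs for the exact values:

```python
# Minimal sketch: attach the DataFlint open source plugin and a SaaS token.
# The token config key ("spark.dataflint.token") and version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-job")
    # DataFlint open source plugin (version is illustrative)
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.2")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    # Auth token issued by DataFlint (IAM user credentials)
    .config("spark.dataflint.token", "<your-dataflint-token>")
    .getOrCreate()
)
```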

Before your Spark session closes, DataFlint will summarize your job's performance and upload the data to an S3 bucket in the DataFlint AWS account using the DataFlint token.

It means that:

  1. The Spark job needs access to the AWS S3 endpoint. If you are in AWS, no traffic leaves AWS.

  2. You don't need to grant any permissions in your cloud environment.

  3. DataFlint does not export any of your production data, only metadata about your job's performance (i.e., any information that is available in the Spark UI/Spark event log).

  4. The code that runs in your clusters is open source, so you can read it and verify that it only exports performance-related data.

For Databricks

The exporting of your application performance data is done via the Databricks API.

The Databricks API key issued to DataFlint SaaS needs minimal permissions: only read-only access to cluster information (such as metadata, events, and logs).

For more information, see the Databricks Docs on Cluster Access Control (especially the blue info box, which is related to the PII edge cases covered in Appendix: edge cases when PII might get sent).
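To make the read-only scope concrete, here is a hedged sketch of the kind of calls such a key permits, using the public Databricks REST API (whether DataFlint calls exactly these endpoints is an assumption):

```python
# Sketch: read-only Databricks REST calls for cluster metadata and events.
# DATABRICKS_HOST / DATABRICKS_TOKEN are placeholders for your workspace URL
# and the API key; assumes at least one cluster exists in the workspace.
import os
import requests

host = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# List clusters and their metadata (read-only)
clusters = requests.get(f"{host}/api/2.0/clusters/list", headers=headers).json()

# Fetch events for the first cluster (read-only; the events endpoint is a POST)
cluster_id = clusters["clusters"][0]["cluster_id"]
events = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers=headers,
    json={"cluster_id": cluster_id},
).json()
```

None of these calls can start, restart, or terminate a cluster; those operations require separate write permissions.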

It means:

  1. The Databricks API key does not have permission/access to your production data, only metadata about your job's performance (i.e., any information that is available in the clusters UI/Spark UI/Spark event log).

  2. The Databricks API key is read-only, and cannot restart or affect clusters.

  3. You don't need to grant any permissions in your cloud environment.

2. Developers Accessing DataFlint SaaS Web Client

DataFlint SaaS Web Client enables you to continuously monitor all of your Spark applications.

It is a managed SaaS portal (meaning the backend runs in the DataFlint AWS account) that you connect to with OAuth (managed by Auth0), provided you are on an authorized users list that you manage.

Stability & Performance

  1. Exporting is done at the end of the run, so it won't affect your performance during the Spark job run.

  2. Exporting takes only a few seconds, and DataFlint logs how long the export took so you can track it; it therefore has a very low impact on your Spark job's resource usage.

  3. The exporting is done directly to S3, so the chance of data loss due to downtime is low.

  4. If exporting fails, it won't fail your Spark job; it will only log an error, as sketched below.
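For intuition, here is a conceptual sketch of that fail-safe pattern (not DataFlint's actual code; the function names are hypothetical):

```python
# Conceptual sketch of a fail-safe, end-of-run export: time the upload,
# log the duration, and never let a failure propagate to the Spark job.
import logging
import time

logger = logging.getLogger("dataflint")

def export_run_summary(summary: bytes, upload_to_s3) -> None:
    """Upload the run summary; log errors instead of raising them."""
    start = time.monotonic()
    try:
        upload_to_s3(summary)  # hypothetical uploader, e.g. a boto3 put_object
        logger.info("DataFlint export took %.1f seconds", time.monotonic() - start)
    except Exception:
        # An export failure must never fail the Spark job itself
        logger.exception("DataFlint export failed; continuing without it")
```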

Appendix: edge cases when PII might get sent

See this Medium article: https://medium.com/@menishmueli/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c The TLDR is that if you were to accidentally send PII to DataFlint SaaS, you would also already be sending that PII to both logz.io/Coralogix and your Spark event log store, which are not designed with PII in mind.

Misc

Review

The DataFlint architecture and security were reviewed by an AWS solutions architect. You can use him as a reference for DataFlint's security measures.

Alternatives

If you have additional security considerations and want alternative ways to export your jobs' performance metadata or access the DataFlint web portal, please contact us to see if we can adapt DataFlint to your security requirements.

Questions

If you have additional questions about security & stability, please contact us.
