πŸ”’ SaaS Security & Stability

Summary

This document explains the DataFlint architecture and why it is both secure and won't hurt your Spark jobs' performance or stability.

This document refers to the DataFlint SaaS offering. For security & stability considerations of the open source offering, see Security & Stability.

TLDR

  1. DataFlint SaaS does not access or export anything related to your data, only non-sensitive metadata about your job performance. [except for extreme edge cases - see Appendix: edge cases when PII might get sent]

  2. DataFlint SaaS will export your job performance metadata via the DataFlint open source library to an S3 bucket using an IAM user token.

    1. For Databricks, exporting is done via the Databricks API.

  3. You don't need to grant any permissions in your cloud environment.

  4. You get access to DataFlint via a web portal with OAuth authentication.

SaaS Interfaces

Deployment options:

  1. Databricks

    1. Fully SaaS

    2. Vendor account for Databricks API access and processing

  2. EMR/K8S/Standalone

    1. Fully SaaS

    2. Vendor account for uploading run summary and processing

For DataProc/EMR/K8s:

[Architecture diagrams: Fully SaaS option and Vendor Account option]

For Databricks:

[Architecture diagrams: Fully SaaS option and Vendor Account option]

DataFlint SaaS exposes two interfaces to your organization:

  1. Applications Exporting Performance Data

  2. Developers Accessing DataFlint SaaS Web Client

Internally, in the DataFlint account, your data is processed to be made available to the web client. Your job performance metadata, and any derivative of it, does not leave the DataFlint AWS account and is not shared with any 3rd party.

In both SaaS and Vendor Account deployments, your metrics and insights are stored in a tenant DB so they are accessible via the Web UI.

In Vendor Account mode, your raw data (job summary metadata) handling and Databricks API access are done single-tenant from your environment, while in fully SaaS mode they are done from the DataFlint account in a multi-tenant environment.

1. Applications Exporting Performance Data

For EMR/K8s/DataProc

The exporting of your application performance data is done via the DataFlint open source library (https://github.com/dataflint/spark), once you add an auth token (which is IAM user credentials) like this:
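Here is a minimal PySpark sketch of that setup. The plugin class and packages coordinate come from the DataFlint open source README; the token config key name and the package version shown here are illustrative assumptions, so check the DataFlint docs for the exact values:

```python
# Minimal sketch: attach the DataFlint open source plugin and a SaaS token.
# The token config key ("spark.dataflint.token") and version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-job")
    # DataFlint open source plugin (version is illustrative)
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.2")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    # Auth token issued by DataFlint (IAM user credentials)
    .config("spark.dataflint.token", "<your-dataflint-token>")
    .getOrCreate()
)
```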

Before your Spark session closes, DataFlint will summarize your job's performance and upload the data to an S3 bucket in the DataFlint AWS account using the DataFlint token.

It means that:

  1. The Spark job needs access to the AWS S3 endpoint. If you are in AWS, no traffic leaves AWS.

  2. You don't need to grant any permissions in your cloud environment.

  3. DataFlint does not export any of your production data, only metadata about your job's performance (i.e., any information that is available in the Spark UI/Spark event log).

  4. The code that runs in your clusters is open source, so you can read it and verify that it only exports performance-related data.

For Databricks

The exporting of your application performance data is done via the Databricks API.

The Databricks API key issued to DataFlint SaaS needs minimal permissions: only read-only access to cluster information (such as metadata, events, and logs).

For more information, see the Databricks Docs on Cluster Access Control (especially the blue info box, which is related to the PII edge cases covered in Appendix: edge cases when PII might get sent).
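To make the read-only scope concrete, here is a hedged sketch of the kind of calls such a key permits, using the public Databricks REST API (whether DataFlint calls exactly these endpoints is an assumption):

```python
# Sketch: read-only Databricks REST calls for cluster metadata and events.
# DATABRICKS_HOST / DATABRICKS_TOKEN are placeholders for your workspace URL
# and the API key; assumes at least one cluster exists in the workspace.
import os
import requests

host = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# List clusters and their metadata (read-only)
clusters = requests.get(f"{host}/api/2.0/clusters/list", headers=headers).json()

# Fetch events for the first cluster (read-only; the events endpoint is a POST)
cluster_id = clusters["clusters"][0]["cluster_id"]
events = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers=headers,
    json={"cluster_id": cluster_id},
).json()
```

None of these calls can start, restart, or terminate a cluster; those operations require separate write permissions.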

It means:

  1. The Databricks API key does not have permission/access to your production data, only metadata about your job's performance (i.e., any information that is available in the clusters UI/Spark UI/Spark event log).

  2. The Databricks API key is read-only, and cannot restart or affect clusters.

  3. You don't need to grant any permissions in your cloud environment.

2. Developers Accessing DataFlint SaaS Web Client

DataFlint SaaS Web Client enables you to continuously monitor all of your Spark applications.

It is a managed SaaS portal (meaning the backend runs in the DataFlint AWS account) that you connect to with OAuth (managed by Auth0), provided you are on an authorized users list that you manage.

Stability & Performance

  1. Exporting is done at the end of the run, so it won't affect your performance during the Spark job run.

  2. Exporting takes only a few seconds, and DataFlint logs how long the export took so you can track it; it therefore has a very low impact on your Spark job's resource usage.

  3. The exporting is done directly to S3, so the chance of data loss due to downtime is low.

  4. If exporting fails, it won't fail your Spark job; it will only log an error, as sketched below.
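For intuition, here is a conceptual sketch of that fail-safe pattern (not DataFlint's actual code; the function names are hypothetical):

```python
# Conceptual sketch of a fail-safe, end-of-run export: time the upload,
# log the duration, and never let a failure propagate to the Spark job.
import logging
import time

logger = logging.getLogger("dataflint")

def export_run_summary(summary: bytes, upload_to_s3) -> None:
    """Upload the run summary; log errors instead of raising them."""
    start = time.monotonic()
    try:
        upload_to_s3(summary)  # hypothetical uploader, e.g. a boto3 put_object
        logger.info("DataFlint export took %.1f seconds", time.monotonic() - start)
    except Exception:
        # An export failure must never fail the Spark job itself
        logger.exception("DataFlint export failed; continuing without it")
```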

Appendix: edge cases when PII might get sent

See this Medium article: https://medium.com/@menishmueli/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c The TLDR is that if you were to accidentally send PII to DataFlint SaaS, you would also already be sending that PII to both logz.io/Coralogix and your Spark event log store, which are not designed with PII in mind.

Misc

Review

The DataFlint architecture and security were reviewed by an AWS solutions architect. You can use him as a reference for DataFlint's security measures.

Alternatives

If you have additional security considerations and want alternative ways to export your jobs' performance metadata or access the DataFlint web portal, please contact us to see if we can adapt DataFlint to your security requirements.

Questions

If you have additional questions about security & stability, please contact us.
