Install on Spark
DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. All of the installation options below should take no more than a few minutes to set up.
For Scala 2.13 users: replace the artifactId spark_2.12 with spark_2.13.
For Apache Iceberg write-metrics support, set spark.dataflint.iceberg.autoCatalogDiscovery to true. For more details, see Apache Iceberg.
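For example, the flag can be passed via spark-submit:

```
--conf spark.dataflint.iceberg.autoCatalogDiscovery=true
```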
Option 1: With package installation and code changes (Scala only)
Add the following lines to your package manager:
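A minimal build.sbt sketch, assuming the package is published under the groupId io.dataflint (the spark_2.12/spark_2.13 artifactId comes from the note above; `%%` selects it based on your Scala version, and `<version>` is a placeholder for a real release):

```scala
// build.sbt -- assumed groupId; %% resolves to the spark_2.12 or spark_2.13 artifact
libraryDependencies += "io.dataflint" %% "spark" % "<version>"
```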
Then, add the following configuration to your code at startup:
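A minimal Scala sketch, assuming the plugin class is named io.dataflint.spark.SparkDataflintPlugin:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("MyApp")
  // assumed plugin class name -- verify against the DataFlint release you installed
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
  .getOrCreate()
```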
You can also supply this Spark config via spark-submit or spark-defaults.conf.
Option 2: No code changes, via spark-submit or spark-defaults.conf (Python & Scala)
DataFlint can be installed with no code changes!
Add these 2 lines to your spark-submit call:
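A sketch of the two flags, under the same assumed coordinates and plugin class as above (replace <version> with a real release; your-application.jar stands in for your own job):

```bash
spark-submit \
  --packages io.dataflint:spark_2.12:<version> \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  your-application.jar
```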
Note that this method requires internet access to Maven Central from wherever Spark is running.
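The same two settings can instead go in spark-defaults.conf:

```properties
spark.jars.packages  io.dataflint:spark_2.12:<version>
spark.plugins        io.dataflint.spark.SparkDataflintPlugin
```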
Option 3: With only code changes (Python only)
See the notes for Option 2, which apply here as well.
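A minimal PySpark sketch under the same assumptions (Maven coordinates and plugin class as in Option 2):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyApp")
    # assumed coordinates and plugin class -- same caveats as Option 2
    .config("spark.jars.packages", "io.dataflint:spark_2.12:<version>")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)
```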
Option 4: Download the JAR manually and add it to the classpath
You can manually download the JAR and add it to Spark's classpath:
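For example, a sketch with a locally downloaded JAR (the path and file name are hypothetical):

```bash
spark-submit \
  --jars /path/to/dataflint-spark_2.12-<version>.jar \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  your-application.jar
```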
Option 5: k8s Spark Operator
Add these lines to your Kubernetes SparkApplication manifest:
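A sketch of the relevant manifest fields, assuming the Spark Operator's v1beta2 SparkApplication API (deps.packages and sparkConf) and the same assumed coordinates:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  deps:
    packages:
      - io.dataflint:spark_2.12:<version>  # assumed coordinates
  sparkConf:
    "spark.plugins": "io.dataflint.spark.SparkDataflintPlugin"  # assumed plugin class
```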
Option 6: EMR
You can use any of Options 1-4. After installation, you can access the Spark UI and the DataFlint UI via the YARN ResourceManager proxy.