Install on Spark
DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. All installation options should take no more than a few minutes to implement.
DataFlint installation is very similar to that of other Spark libraries such as Delta Lake and Iceberg.
If you have long conditions in your queries, consider increasing the config spark.sql.maxMetadataStringLength to 1000 so that Spark logs your filter/select/join conditions without truncating them (see the snippet below).
For Scala 2.13 users: replace artifactId spark_2.12 to spark_2.13
For Iceberg support, set spark.dataflint.iceberg.autoCatalogDiscovery to true to enable Iceberg write-metrics support. For more details, see Apache Iceberg.
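Both optional settings can go in spark-defaults.conf (or be passed as --conf flags). A minimal sketch, using the values from the notes above:

    # Log full filter/select/join conditions instead of truncating them
    spark.sql.maxMetadataStringLength              1000
    # Enable DataFlint's Iceberg write-metrics support
    spark.dataflint.iceberg.autoCatalogDiscovery   true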
Option 1: With package installation and code changes (Scala only)
Add these lines to your package manager:
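A sketch of the sbt dependency, assuming the io.dataflint group ID published on Maven Central; replace the version placeholder with the latest DataFlint release listed in the project README:

    // build.sbt -- pulls the DataFlint Spark plugin from Maven Central
    libraryDependencies += "io.dataflint" %% "spark" % "<latest-version>"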
Then, add the following config to your code at startup:
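A minimal sketch in Scala, assuming the plugin class is io.dataflint.spark.SparkDataflintPlugin (check the DataFlint README for the exact class name):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("my-app")
      // Register the DataFlint plugin so its tab is added to the Spark UI
      .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
      .getOrCreate()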
You can also supply this Spark config via spark-submit or spark-defaults.conf.
Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)
DataFlint can be installed with no code changes!
Add these 2 lines to your spark-submit call:
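A sketch of the two flags, assuming the Maven coordinates io.dataflint:spark_2.12 and the plugin class named above; substitute the latest release version:

    spark-submit \
      --packages io.dataflint:spark_2.12:<latest-version> \
      --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
      <the rest of your usual spark-submit arguments>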
This method requires internet access to Maven Central from wherever Spark is running.
If you already have existing spark.jars.packages or spark.plugins entries, just separate the package/plugin names with commas (see the Spark documentation and the example below).
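For example, in spark-defaults.conf (the Delta Lake package and com.example.ExistingPlugin are placeholders for whatever you already have configured):

    # Existing entries stay first; DataFlint is appended after a comma
    spark.jars.packages   io.delta:delta-spark_2.12:<delta-version>,io.dataflint:spark_2.12:<dataflint-version>
    spark.plugins         com.example.ExistingPlugin,io.dataflint.spark.SparkDataflintPlugin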
Option 3: With only code changes (Python only)
See the notes for Option 2
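A minimal PySpark sketch, under the same assumptions about the Maven coordinates and plugin class as in Option 2:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
            .appName("my-app")
            # Fetch the DataFlint plugin jar from Maven Central at startup
            .config("spark.jars.packages", "io.dataflint:spark_2.12:<latest-version>")
            # Register the plugin so the DataFlint tab is added to the Spark UI
            .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
            .getOrCreate()
    )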
Option 4: Download the JAR manually and add it to the classpath
You can manually download the JAR and add it to Spark's classpath.
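A sketch, assuming the JAR is published under the standard Maven Central layout for io.dataflint:spark_2.12 (download it from Maven Central or the project's releases page, then point Spark at the local file):

    # Download the plugin jar (replace <version> with the latest release)
    wget https://repo1.maven.org/maven2/io/dataflint/spark_2.12/<version>/spark_2.12-<version>.jar

    # Add the local jar to the driver/executor classpath and register the plugin
    spark-submit \
      --jars /path/to/spark_2.12-<version>.jar \
      --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
      <the rest of your usual spark-submit arguments>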
Option 5: k8s Spark Operator
Add these lines to your Kubernetes kind: SparkApplication manifest:
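A sketch of the relevant part of the manifest, assuming the Spark Operator's spec.sparkConf field and the same coordinates and plugin class as above:

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    spec:
      sparkConf:
        # Fetch the DataFlint plugin from Maven Central and register it
        "spark.jars.packages": "io.dataflint:spark_2.12:<latest-version>"
        "spark.plugins": "io.dataflint.spark.SparkDataflintPlugin"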
Option 6: EMR
You can use any of Options 1-4. After installation, you can access the Spark UI & DataFlint UI via the YARN Resource Manager proxy.