⭐Install on Spark
DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. All installation options should take no more than a few minutes to implement.
For Scala 2.13 users: replace the artifactId spark_2.12 with spark_2.13.
For Apache Iceberg support, set spark.dataflint.iceberg.autoCatalogDiscovery to true to enable Iceberg write metrics. For more details, see Apache Iceberg.
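For example, the setting can be passed like any other Spark config (a minimal sketch via spark-submit; it works the same in code or in spark-defaults.conf):
spark-submit \
  --conf spark.dataflint.iceberg.autoCatalogDiscovery=true \
  ...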
Option 1: With package installation and code changes (Scala only)
Add this line to your sbt build definition:
libraryDependencies += "io.dataflint" % "spark_2.12" % "0.4.1"
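Per the Scala 2.13 note above, the same line becomes:
libraryDependencies += "io.dataflint" % "spark_2.13" % "0.4.1"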
Then add the following config to your code at startup:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
  ...
  .getOrCreate()
You can also supply this Spark config via spark-submit or spark-defaults.conf.
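For example, the equivalent spark-defaults.conf entries would be (a sketch; spark.jars.packages pulls the artifact from Maven Central):
spark.jars.packages  io.dataflint:spark_2.12:0.4.1
spark.plugins        io.dataflint.spark.SparkDataflintPlugin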
Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)
DataFlint can be installed with no code changes!
Add these two lines to your spark-submit call:
spark-submit \
  --packages io.dataflint:spark_2.12:0.4.1 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
This method requires internet access to Maven Central from wherever Spark is running.
Option 3: With only code changes (Python only)
import pyspark

builder = pyspark.sql.SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.4.1") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...

spark = builder.getOrCreate()
See the notes for Option 2 regarding internet access.
Option 4: Download the JAR manually and add it to the classpath
You can manually download the JAR and add it to Spark:
DATAFLINT_VERSION="0.4.1"
# download the JAR once and cache it in /tmp
if [ ! -f /tmp/spark_2.12-$DATAFLINT_VERSION.jar ]; then
  wget --quiet \
    -O /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
    https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
fi
# --driver-class-path puts the JAR on the driver's classpath; spark.jars distributes it to the executors
spark-submit \
  --driver-class-path /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.jars=file:///tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
Option 5: k8s Spark Operator
Add these lines to your Kubernetes SparkApplication manifest:
spec:
  deps:
    packages:
      - io.dataflint:spark_2.12:0.4.1
  sparkConf:
    spark.plugins: "io.dataflint.spark.SparkDataflintPlugin"
    # point Ivy's cache and home at a writable directory inside the container
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
Option 6: EMR
You can use any of Options 1-4. After installation, you can access the Spark UI and DataFlint UI via the YARN Resource Manager proxy.
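For example, with default YARN settings the application UI is typically proxied at a URL of this form (host and application ID are placeholders):
http://<resource-manager-host>:8088/proxy/<application-id>/
From there, open DataFlint from the Spark UI.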
