⭐Install on Spark

DataFlint is implemented as a Spark plugin. Spark plugins can be installed in a variety of ways, and all installation options should take no more than a few minutes to set up.

DataFlint installation is very similar to that of other Spark libraries such as Delta Lake and Iceberg.

If you have long conditions in your queries, consider increasing the config spark.sql.maxMetadataStringLength to 1000 so that Spark logs your filter/select/join conditions without truncating them.
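For example, as a spark-submit flag (a sketch; 1000 is just the value suggested above, and the flag can be passed the same way with any of the installation options below):

spark-submit \
--conf spark.sql.maxMetadataStringLength=1000 \
...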

For Scala 2.13 users: replace the artifactId spark_2.12 with spark_2.13.
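For example, the Option 1 dependency line for Scala 2.13 would become (a sketch, using the same version as below):

libraryDependencies += "io.dataflint" % "spark_2.13" % "0.2.7"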

For Iceberg support, set spark.dataflint.iceberg.autoCatalogDiscovery to true to enable Iceberg write metrics. For more details, see Apache Iceberg.
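For example, enabling both the plugin and the Iceberg flag in the SparkSession builder (a minimal sketch; it assumes your Iceberg catalog and runtime are already configured separately):

    val spark = SparkSession
      .builder
      .appName("MyIcebergApp")
      .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
      // assumption: Iceberg catalog/runtime settings are configured elsewhere
      .config("spark.dataflint.iceberg.autoCatalogDiscovery", "true")
      ...
      .getOrCreate()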

Option 1: With package installation and code changes (Scala only)

Add this line to your sbt build:

libraryDependencies += "io.dataflint" % "spark_2.12" % "0.2.7"

Then, add the following config to your SparkSession builder at startup:

    val spark = SparkSession
      .builder
      .appName("MyApp")
      .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
      ...
      .getOrCreate()

You can also supply this Spark config via spark-submit or spark-defaults.conf.
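For example, the equivalent spark-defaults.conf entry would be a single line (a sketch; with Option 1 the JAR itself is already on the classpath via your build tool):

spark.plugins    io.dataflint.spark.SparkDataflintPlugin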

Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)

DataFlint can be installed with no code changes!

Add these two lines to your spark-submit command:

spark-submit \
--packages io.dataflint:spark_2.12:0.2.7 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...

This method requires internet access to Maven Central from wherever Spark is running.

If you already have existing spark.jars.packages or spark.plugins entries, just separate the package/plugin names with commas; see the Spark documentation.
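For example, loading DataFlint alongside another package and plugin (the com.example coordinates and plugin class below are placeholders, not real artifacts):

spark-submit \
--packages io.dataflint:spark_2.12:0.2.7,com.example:other-lib_2.12:1.0.0 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin,com.example.OtherPlugin \
...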

Option 3: With only code changes (Python only)

builder = pyspark.sql.SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.7") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...

See the notes for Option 2

Option 4: Download the JAR manually and add it to the classpath

You can manually download the JAR and add it to Spark:

DATAFLINT_VERSION="0.2.7"
if [ ! -f /tmp/spark_2.12-$DATAFLINT_VERSION.jar ]; then
    wget --quiet \
    -O /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
    https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
fi

spark-submit \
--driver-class-path /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
--conf spark.jars=file:///tmp/spark_2.12-$DATAFLINT_VERSION.jar \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...

Option 5: k8s Spark Operator

Add these lines to your Kubernetes SparkApplication manifest:

spec:
  deps:
    packages:
      - io.dataflint:spark_2.12:0.2.7
  sparkConf:
    spark.plugins: "io.dataflint.spark.SparkDataflintPlugin"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"

Option 6: EMR

You can use any of Options 1-4. After installation, you can access the Spark UI & DataFlint UI via the YARN Resource Manager proxy.
