⭐ Install on Spark

DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. All installation options should take no more than a few minutes to implement.

Note: DataFlint installation is very similar to that of other Spark libraries such as Delta Lake and Iceberg.

Note: If you have long conditions in your queries, consider increasing the config spark.sql.maxMetadataStringLength to 1000 so that Spark logs your filter/select/join conditions without truncating them.
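For example, as a spark-submit flag:

    --conf spark.sql.maxMetadataStringLength=1000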

Warning: For Spark 4.0 users, replace spark_2.12 in the artifact/package name with dataflint_spark4_2.13.


Option 1: With package installation and code changes (Scala only)

Add this line to your build.sbt:

    libraryDependencies += "io.dataflint" %% "spark_2.12" % "0.8.2"

Then, add the following config to your code at startup:

    val spark = SparkSession
      .builder
      .appName("MyApp")
      .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
      ...
      .getOrCreate()

You can also supply this Spark config via spark-submit or spark-defaults.conf.

Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)

DataFlint can be installed with no code changes!

Add these two lines to your spark-submit call:
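A sketch of the two flags, reusing the package coordinates and plugin class from Option 1:

    --packages io.dataflint:spark_2.12:0.8.2 \
    --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin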

Note: If you already have existing spark.jars.packages or spark.plugins values, just separate the names with commas; see the Spark documentation.

Option 3: With only code changes (Python only)

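A minimal PySpark sketch, assuming the same package coordinates and plugin class as in Option 1 (the app name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("MyApp")
        # Fetch the DataFlint package at startup and enable the plugin
        .config("spark.jars.packages", "io.dataflint:spark_2.12:0.8.2")
        .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
        .getOrCreate()
    )

Note that spark.jars.packages only takes effect when the session is first created, so these configs cannot be applied to an already-running SparkSession.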

Option 4: Download the JAR manually and add it to the classpath

You can manually download the JAR and add it to Spark's classpath.
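For example, a sketch via spark-submit, where the local JAR path and application file are placeholders:

    spark-submit \
      --jars /path/to/dataflint-jar.jar \
      --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
      my_app.py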

Option 5: k8s Spark Operator

Add these lines to your Kubernetes kind: SparkApplication manifest:
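A sketch of the relevant fields, assuming the Spark Operator's SparkApplication CRD (its deps.packages and sparkConf fields), with the coordinates and plugin class from Option 1:

    spec:
      deps:
        packages:
          - io.dataflint:spark_2.12:0.8.2
      sparkConf:
        spark.plugins: "io.dataflint.spark.SparkDataflintPlugin"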

Option 6: EMR

You can use any of Options 1-4. After installation, you can access the Spark UI & DataFlint UI via the YARN ResourceManager proxy.
