# Install on Spark

DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. Any of the options below should take no more than a few minutes to set up.

{% hint style="info" %}
DataFlint's installation is very similar to that of other Spark libraries such as Delta Lake and Iceberg
{% endhint %}

{% hint style="info" %}
If your queries contain long conditions, consider increasing the config `spark.sql.maxMetadataStringLength` to **1000** so that Spark logs your filter/select/join conditions without truncating them
{% endhint %}
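
For example, a minimal sketch of raising that limit at submit time (you can equally set it in spark-defaults.conf or on the session builder):

```bash
spark-submit \
  --conf spark.sql.maxMetadataStringLength=1000 \
  ...
```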

{% hint style="warning" %}

### For Spark 4.0 users: **Replace spark\_2.12 in the artifact/package name with dataflint\_spark4\_2.13**

**For example:** libraryDependencies += "io.dataflint" % "dataflint\_spark4\_2.13" % "0.8.8"

For the package name: io.dataflint:dataflint\_spark4\_2.13:0.8.8
{% endhint %}

{% hint style="warning" %}
**For Scala 2.13 users**: replace the artifactId spark\_2.12 with spark\_2.13
{% endhint %}

{% hint style="warning" %}
For Iceberg support, set **`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`** to enable Iceberg write metrics. For more details, see [apache-iceberg](https://dataflint.gitbook.io/dataflint-for-spark/integrations/apache-iceberg "mention").
{% endhint %}
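
For example, a minimal spark-submit sketch combining the plugin with the Iceberg discovery flag:

```bash
spark-submit \
  --packages io.dataflint:spark_2.12:0.8.8 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  --conf spark.dataflint.iceberg.autoCatalogDiscovery=true \
  ...
```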

### Option 1: With package installation and code changes (Scala only)

Add this line to your build file:

{% tabs %}
{% tab title="sbt" %}

```scala
libraryDependencies += "io.dataflint" % "spark_2.12" % "0.8.8"
```

{% endtab %}

{% tab title="Gradle" %}

```gradle
implementation 'io.dataflint:spark_2.12:0.8.8'
```

{% endtab %}

{% tab title="Maven" %}

```xml
<dependency>
  <groupId>io.dataflint</groupId>
  <artifactId>spark_2.12</artifactId>
  <version>0.8.8</version>
</dependency>
```

{% endtab %}
{% endtabs %}

Then, add the following config to your code at startup:

```scala
val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
  ...
  .getOrCreate()
```

You can also supply this Spark config via spark-submit or spark-defaults.conf, as described in Option 2.

### Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)

DataFlint can be installed with no code changes!

{% tabs %}
{% tab title="spark-submit" %}
Add these 2 lines to your spark-submit call:

```bash
spark-submit \
  --packages io.dataflint:spark_2.12:0.8.8 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
```

{% endtab %}

{% tab title="spark-defaults.conf" %}
Add these 2 lines to your spark-defaults.conf:

```
spark.jars.packages io.dataflint:spark_2.12:0.8.8
spark.plugins io.dataflint.spark.SparkDataflintPlugin
```

{% endtab %}
{% endtabs %}

{% hint style="warning" %}
This method requires internet access to Maven Central from the machine where Spark runs. If your environment is offline, use Option 4 to download the JAR manually instead.
{% endhint %}

{% hint style="info" %}
If you already have existing `spark.jars.packages` or `spark.plugins` entries, just separate the values with commas, as in the sketch below; see the [spark documentation](https://spark.apache.org/docs/latest/configuration.html)
{% endhint %}
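
For example (a sketch; `com.example:other-lib:1.0.0` and `com.example.OtherPlugin` are placeholders for whatever you already have configured):

```
spark.jars.packages com.example:other-lib:1.0.0,io.dataflint:spark_2.12:0.8.8
spark.plugins com.example.OtherPlugin,io.dataflint.spark.SparkDataflintPlugin
```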

### Option 3: With only code changes (Python only)

Set the same configs directly on the PySpark session builder:

```python
builder = pyspark.sql.SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.8.8") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...
```
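
For reference, a minimal end-to-end sketch (assuming your app needs no other configs):

```python
import pyspark

# Build a session with the DataFlint plugin; the package is resolved
# from Maven Central at startup.
spark = pyspark.sql.SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.8.8") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    .getOrCreate()
```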

{% hint style="warning" %}
See the notes for Option 2
{% endhint %}

### Option 4: Download the JAR manually and add it to the classpath

You can manually download the JAR and add it to Spark's classpath:

```bash
DATAFLINT_VERSION="0.8.8"
if [ ! -f /tmp/spark_2.12-$DATAFLINT_VERSION.jar ]; then
  wget --quiet \
    -O /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
    https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
fi

spark-submit \
  --driver-class-path /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.jars=file:///tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
```

### Option 5: k8s Spark Operator

Add these lines to your Kubernetes **kind: SparkApplication** manifest:

```yaml
spec:
  deps:
    packages:
      - io.dataflint:spark_2.12:0.8.8
  sparkConf:
    spark.plugins: "io.dataflint.spark.SparkDataflintPlugin"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
```
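
The `-Divy.cache.dir=/tmp -Divy.home=/tmp` options point Ivy's dependency cache at a writable directory, a common workaround on Kubernetes where the container user often has no writable home directory.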

### Option 6: EMR

You can use any of options 1-4. After installation, you can access the Spark UI & DataFlint UI via the YARN ResourceManager proxy.
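
Alternatively, a sketch of enabling DataFlint cluster-wide through an EMR `spark-defaults` configuration classification (adjust the artifact name for your Scala/Spark version as noted above):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "io.dataflint:spark_2.12:0.8.8",
      "spark.plugins": "io.dataflint.spark.SparkDataflintPlugin"
    }
  }
]
```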

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FAubD4k1iGIw7EvxpWSZl%2FScreenshot%202024-02-06%20at%2018.56.52.png?alt=media&#x26;token=f51edd64-7c82-429c-abc7-899015a84a13" alt=""><figcaption></figcaption></figure>
