# Install on Spark

DataFlint is implemented as a Spark plugin, and Spark plugins can be installed in a variety of ways. All installation options should take no more than a few minutes to set up.

{% hint style="info" %}
DataFlint installation is very similar to that of other Spark libraries such as Delta Lake and Iceberg.
{% endhint %}

{% hint style="info" %}
If you have long conditions in your queries, consider increasing the config *spark.sql.maxMetadataStringLength* to **1000** so Spark will log your filter/select/join conditions without truncating them.
{% endhint %}
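
A minimal sketch of raising this limit on the SparkSession builder (the same config can also be passed via `--conf` on spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: raise the SQL metadata string limit so long filter/select/join
// conditions appear in full in the query plan metadata
val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.sql.maxMetadataStringLength", "1000")
  .getOrCreate()
```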

{% hint style="warning" %}

### For Spark 4.0 users: **Replace spark\_2.12 in the artifact/package name with dataflint\_spark4\_2.13**

**For example:** libraryDependencies += "io.dataflint" % "dataflint\_spark4\_2.13" % "0.9.0"

For the package name: io.dataflint:dataflint\_spark4\_2.13:0.9.0
{% endhint %}

{% hint style="warning" %}
**For Scala 2.13 users**: replace the artifactId spark\_2.12 with spark\_2.13
{% endhint %}
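
For example, a minimal sbt sketch using the Scala 2.13 artifact (assuming the same artifact naming pattern as above):

```scala
// Scala 2.13: the artifactId suffix changes from _2.12 to _2.13
libraryDependencies += "io.dataflint" % "spark_2.13" % "0.9.0"
```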

{% hint style="warning" %}
For Iceberg write metrics support, set **`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`**. For more details, see [apache-iceberg](https://dataflint.gitbook.io/dataflint-for-spark/integrations/apache-iceberg "mention").
{% endhint %}
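
A sketch of enabling it together with the plugin on the SparkSession builder (it can equally be passed as a `--conf` flag):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the DataFlint plugin plus Iceberg catalog discovery
val spark = SparkSession
  .builder
  .appName("MyIcebergApp")
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
  .config("spark.dataflint.iceberg.autoCatalogDiscovery", "true")
  .getOrCreate()
```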

### Option 1: With package installation and code changes (Scala only)

Add the following dependency to your build tool:

{% tabs %}
{% tab title="sbt" %}

```scala
libraryDependencies += "io.dataflint" % "spark_2.12" % "0.9.0"
```

{% endtab %}

{% tab title="Gradle" %}

```gradle
implementation 'io.dataflint:spark_2.12:0.9.0'
```

{% endtab %}

{% tab title="maven" %}

```xml
<dependency>
  <groupId>io.dataflint</groupId>
  <artifactId>spark_2.12</artifactId>
  <version>0.9.0</version>
</dependency>
```

{% endtab %}
{% endtabs %}

Then, add the following config to your code at startup:

```scala
val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
  ...
  .getOrCreate()
```

You can also supply this Spark config via spark-submit or spark-defaults.conf (see Option 2).

### Option 2: No-code via spark-submit or spark-defaults.conf (Python & Scala)

DataFlint can be installed with no code changes!

{% tabs %}
{% tab title="spark-submit" %}
Add these 2 lines to your spark-submit call:

```bash
spark-submit \
  --packages io.dataflint:spark_2.12:0.9.0 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
```

{% endtab %}

{% tab title="spark-defaults.conf" %}
Add these 2 lines to your spark-defaults.conf:

```
spark.jars.packages io.dataflint:spark_2.12:0.9.0
spark.plugins io.dataflint.spark.SparkDataflintPlugin
```

{% endtab %}
{% endtabs %}

{% hint style="warning" %}
This method requires internet access to Maven Central from the environment where Spark is running.
{% endhint %}

{% hint style="info" %}
If you already have existing `spark.jars.packages` or `spark.plugins` values, just separate the entries with commas; see the [Spark documentation](https://spark.apache.org/docs/latest/configuration.html).
{% endhint %}
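
For example, a sketch of combining DataFlint with another package and plugin in the same session; the `com.example` names below are hypothetical placeholders, not real artifacts:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: comma-separate multiple packages and plugins in the same configs.
// The com.example entries are hypothetical placeholders.
val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.jars.packages",
    "io.dataflint:spark_2.12:0.9.0,com.example:other-lib_2.12:1.0.0")
  .config("spark.plugins",
    "io.dataflint.spark.SparkDataflintPlugin,com.example.OtherSparkPlugin")
  .getOrCreate()
```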

### Option 3: With only code changes (Python only)

Add the DataFlint package and plugin configs to your SparkSession builder:

```python
import pyspark

builder = pyspark.sql.SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.9.0") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...

spark = builder.getOrCreate()
```

{% hint style="warning" %}
See the notes for Option 2; they apply here as well.
{% endhint %}

### Option 4: Download the JAR manually and add it to the classpath

You can manually download the JAR and add it to Spark:

```bash
DATAFLINT_VERSION="0.9.0"
if [ ! -f /tmp/spark_2.12-$DATAFLINT_VERSION.jar ]; then
  wget --quiet \
    -O /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
    https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
fi

spark-submit \
  --driver-class-path /tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.jars=file:///tmp/spark_2.12-$DATAFLINT_VERSION.jar \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...
```

### Option 5: k8s Spark Operator

Add these lines to your Kubernetes **kind: SparkApplication** manifest:

```yaml
spec:
  deps:
    packages:
      - io.dataflint:spark_2.12:0.9.0
  sparkConf:
    spark.plugins: "io.dataflint.spark.SparkDataflintPlugin"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
```

### Option 6: EMR

You can use any of Options 1-4. After installation, you can access the Spark UI & DataFlint UI via the YARN ResourceManager proxy.

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FAubD4k1iGIw7EvxpWSZl%2FScreenshot%202024-02-06%20at%2018.56.52.png?alt=media&#x26;token=f51edd64-7c82-429c-abc7-899015a84a13" alt=""><figcaption></figcaption></figure>

