DataFlint is implemented as a file system plugin that is loaded into the Spark History Server.
For Spark 4.0 users: replace spark_2.12 in the artifact/package name with dataflint_spark4_2.13.
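As a rough sketch of that substitution, the download URL can be built from a single artifact variable. The Maven path here only mirrors the layout used by the installation script on this page; whether version 0.8.3 is published under the Spark 4 artifact is an assumption to verify on Maven Central. The variable names DATAFLINT_ARTIFACT, DATAFLINT_JAR and DATAFLINT_URL are illustrative.

```shell
#!/bin/bash
# Assumption: the Spark 4 artifact follows the same Maven layout as spark_2.12.
DATAFLINT_VERSION="0.8.3"
DATAFLINT_ARTIFACT="dataflint_spark4_2.13"   # default artifact is "spark_2.12"

# Jar file name and download URL derived from the artifact name and version.
DATAFLINT_JAR="${DATAFLINT_ARTIFACT}-${DATAFLINT_VERSION}.jar"
DATAFLINT_URL="https://repo1.maven.org/maven2/io/dataflint/${DATAFLINT_ARTIFACT}/${DATAFLINT_VERSION}/${DATAFLINT_JAR}"

echo "$DATAFLINT_URL"
```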
Spark History Server Installation Script
#!/bin/bash
cd $SPARK_HOME
DATAFLINT_VERSION="0.8.3"
# Step 1: Download the jar to the history server machine
wget -O /tmp/spark_2.12-$DATAFLINT_VERSION.jar https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
# Step 2: Add the DataFlint jar to the classpath
export SPARK_DAEMON_CLASSPATH=/tmp/spark_2.12-$DATAFLINT_VERSION.jar
# Step 3: If the history server is already running, stop it and start it again
./sbin/stop-history-server.sh
./sbin/start-history-server.sh
Alternative installation
Instead of setting the environment variable, you can download the jar into the $SPARK_HOME/jars folder, from which the Spark History Server loads it automatically.
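A minimal sketch of this alternative, assuming the same version and Maven URL as the installation script. The helper names dataflint_jar_dest and install_dataflint_jar are hypothetical and not part of DataFlint.

```shell
#!/bin/bash
# Sketch: place the DataFlint jar in $SPARK_HOME/jars instead of setting
# SPARK_DAEMON_CLASSPATH. Helper names are illustrative only.
DATAFLINT_VERSION="0.8.3"
DATAFLINT_JAR="spark_2.12-${DATAFLINT_VERSION}.jar"
DATAFLINT_URL="https://repo1.maven.org/maven2/io/dataflint/spark_2.12/${DATAFLINT_VERSION}/${DATAFLINT_JAR}"

# Hypothetical helper: prints the destination path for a given Spark home,
# dropping one trailing slash from the argument if present.
dataflint_jar_dest() {
  printf '%s/jars/%s\n' "${1%/}" "$DATAFLINT_JAR"
}

# Hypothetical helper: downloads the jar into the jars folder of a Spark home.
install_dataflint_jar() {
  wget -O "$(dataflint_jar_dest "$1")" "$DATAFLINT_URL"
}
```

After sourcing the snippet, calling install_dataflint_jar "$SPARK_HOME" downloads the jar. The history server still has to be restarted with ./sbin/stop-history-server.sh and ./sbin/start-history-server.sh to pick it up.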
How it works
The jar includes a history server plugin that adds the DataFlint UI whenever a Spark application UI is loaded from event logs.
Unlike a live Spark application, the History Server does not support loading packages (via Apache Ivy), so you need to download the jar and load it into the history server manually.
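Because the jar has to be placed manually, it can help to confirm it is actually present before restarting the history server. The following is a small sketch of such a check. The function name has_dataflint_jar is hypothetical, and the file name patterns are assumed from the artifact names mentioned on this page.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a DataFlint jar is found in the given folder.
# File name patterns are assumed from the artifact names spark_2.12 and
# dataflint_spark4_2.13.
has_dataflint_jar() {
  local f
  for f in "$1"/spark_2.12-*.jar "$1"/dataflint_spark4_2.13-*.jar; do
    # An unmatched glob stays literal, so -e is false for it.
    [ -e "$f" ] && return 0
  done
  return 1
}
```

For example, has_dataflint_jar "$SPARK_HOME/jars" || echo "DataFlint jar not found" reports a missing jar in the jars folder.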
Install on EMR history server
This method does not work with the persistent EMR history server. For the on-cluster history server, the default AWS proxy currently does not work correctly, so you need to use port forwarding instead.
Location of the On-Cluster Spark History Server link you should use
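One way to do the port forward is an SSH tunnel to the primary node. The sketch below only prints the command to run; it assumes the history server listens on Spark's default port 18080 and that the cluster uses the standard hadoop login user. The helper name, key file path and DNS name are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: prints an SSH command that forwards local port 18080
# (the Spark History Server default) to the same port on the EMR primary node.
# Arguments: path to the EC2 key pair file, public DNS name of the primary node.
emr_history_port_forward_cmd() {
  local key_file="$1" primary_dns="$2"
  printf 'ssh -i %s -N -L 18080:localhost:18080 hadoop@%s\n' "$key_file" "$primary_dns"
}

# Placeholder values; substitute your own key file and primary node DNS name.
emr_history_port_forward_cmd "$HOME/my-key.pem" "my-emr-primary.example.com"
```

While the printed command is running, the on-cluster history server should be reachable at http://localhost:18080 in a local browser.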
Via bootstrap script
Via SSH
Connect to your EMR cluster via SSH and run the following commands:
From a terminated EMR cluster
Go to your terminated EMR cluster, open the Applications tab and press "Spark History Server" to open the persistent history server
Press the "Download" button on the relevant application or applications
Set the SPARK_HOME environment variable to the directory where you extracted Spark
sudo su
DATAFLINT_VERSION="0.8.3"
if sudo grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
    sudo wget \
        -O /usr/lib/spark/jars/spark_2.12-$DATAFLINT_VERSION.jar \
        https://repo1.maven.org/maven2/io/dataflint/spark_2.12/$DATAFLINT_VERSION/spark_2.12-$DATAFLINT_VERSION.jar
fi