Install on Spark History Server
DataFlint is implemented as a file system plugin that is loaded into the Spark History Server
Spark History Server Installation Script
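A minimal sketch of what such an install script might look like. The Maven coordinates, version number, and `DataflintHistoryProvider` class name below are assumptions, not confirmed values; check the DataFlint releases for the exact jar URL and provider class:

```shell
#!/bin/sh
# Sketch only: version, artifact path, and provider class are assumptions.
DATAFLINT_VERSION="0.2.2"  # hypothetical version, replace with the latest release
JAR_URL="https://repo1.maven.org/maven2/io/dataflint/spark_2.12/${DATAFLINT_VERSION}/spark_2.12-${DATAFLINT_VERSION}.jar"

# Only proceed when a Spark installation is actually present.
if [ -n "$SPARK_HOME" ] && [ -d "$SPARK_HOME/jars" ]; then
  # Download the DataFlint jar next to Spark's own jars.
  curl -L -o "$SPARK_HOME/jars/dataflint.jar" "$JAR_URL"
  # Point the History Server at the DataFlint provider (class name assumed).
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.provider=org.apache.spark.dataflint.DataflintHistoryProvider"
  "$SPARK_HOME/sbin/start-history-server.sh"
fi
```

After the History Server restarts, applications loaded from the event logs should expose the DataFlint UI alongside the regular Spark UI.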
Alternative installation
Instead of using the environment variable, you can:
Download the jar into the $SPARK_HOME/jars folder so it is loaded automatically by the Spark History Server
Configure the provider in the spark-defaults.conf file, like this:
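A sketch of the spark-defaults.conf entry; the provider class name is an assumption and should be checked against the DataFlint documentation:

```
# spark-defaults.conf -- provider class name is an assumption
spark.history.provider    org.apache.spark.dataflint.DataflintHistoryProvider
```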
How it works
The DataFlint plugin loads the event log files the same way as the default FsHistoryProvider. Once loading completes, it attaches the DataFlint UI to the loaded application's Spark UI.
Unlike a live Spark application, the History Server does not support package loading (Apache Ivy), so you need to download the jar and add it to the History Server manually
Install on EMR History Server
This method does not work on a persistent EMR History Server, and for the on-cluster History Server the default AWS proxy currently does not work correctly, so you need to set up port forwarding instead
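Port forwarding can be done over SSH. A sketch, assuming the History Server's default port (18080); the key file and master node address are placeholders you must replace:

```shell
#!/bin/sh
# Placeholders -- substitute your own key file and EMR master node address.
KEY_FILE="$HOME/my-emr-key.pem"
EMR_MASTER="hadoop@<master-public-dns>"
PORT=18080   # Spark History Server default port

# Build the tunnel command; run it manually to open the tunnel,
# then browse to http://localhost:18080 on your machine.
FORWARD_CMD="ssh -i $KEY_FILE -L $PORT:localhost:$PORT $EMR_MASTER"
echo "$FORWARD_CMD"
```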
Via bootstrap script
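A sketch of what an EMR bootstrap action for this might look like. The Maven coordinates and version are assumptions, and /usr/lib/spark is EMR's usual Spark install location:

```shell
#!/bin/sh
# Hypothetical EMR bootstrap action: download the DataFlint jar into
# Spark's jars folder. Coordinates and version are assumptions.
DATAFLINT_VERSION="0.2.2"  # hypothetical version, replace with the latest release
JAR_URL="https://repo1.maven.org/maven2/io/dataflint/spark_2.12/${DATAFLINT_VERSION}/spark_2.12-${DATAFLINT_VERSION}.jar"
SPARK_JARS_DIR="/usr/lib/spark/jars"   # EMR's default Spark location

if [ -d "$SPARK_JARS_DIR" ]; then
  sudo curl -L -o "$SPARK_JARS_DIR/dataflint.jar" "$JAR_URL"
fi
```

Register the script as a bootstrap action when creating the cluster so every node (including the master running the History Server) gets the jar.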
Via SSH
Connect to your EMR cluster via SSH and run the following commands:
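A sketch of the commands to run on the master node; the jar coordinates and the `spark-history-server` service name are assumptions:

```shell
#!/bin/sh
# Hypothetical commands for the EMR master node; coordinates, version,
# and service name are assumptions.
DATAFLINT_VERSION="0.2.2"  # hypothetical version, replace with the latest release
JAR_URL="https://repo1.maven.org/maven2/io/dataflint/spark_2.12/${DATAFLINT_VERSION}/spark_2.12-${DATAFLINT_VERSION}.jar"

if [ -d /usr/lib/spark/jars ]; then
  sudo curl -L -o /usr/lib/spark/jars/dataflint.jar "$JAR_URL"
  # Restart so the History Server picks up the new jar (assumed service name).
  sudo systemctl restart spark-history-server
fi
```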
From a terminated EMR cluster
Go to your terminated EMR cluster's page, open the Applications tab, and press "Spark History Server" to open the persistent History Server
Press the "Download" button for the relevant application or applications
Extract the zip file to /tmp/spark-events, and then follow the Install on Spark History Server steps locally on your machine
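The extract-and-replay step can be sketched as follows; the downloaded zip filename is an example, and /tmp/spark-events matches the directory mentioned above:

```shell
#!/bin/sh
# Replay downloaded EMR event logs locally.
mkdir -p /tmp/spark-events
# Extract the EMR download into the events folder (filename is an example):
# unzip ~/Downloads/application-logs.zip -d /tmp/spark-events

if [ -n "$SPARK_HOME" ]; then
  # Point the History Server at the extracted logs and start it.
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=/tmp/spark-events"
  "$SPARK_HOME/sbin/start-history-server.sh"
fi
```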
Running the Spark History Server on your machine
To run the Spark History Server on your machine, you first need to:
Have Java (version 8 or 11) installed on your machine, with JAVA_HOME set
Download Spark from https://spark.apache.org/downloads.html and extract the downloaded archive somewhere on your machine
Set the SPARK_HOME environment variable to the directory where you extracted Spark
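The checks above can be sketched as a small script; the example paths in the messages are illustrations, not required locations:

```shell
#!/bin/sh
# Verify the prerequisites before starting the History Server.
if [ -z "$JAVA_HOME" ]; then
  echo "JAVA_HOME is not set -- install Java 8 or 11 and export JAVA_HOME"
fi
if [ -z "$SPARK_HOME" ]; then
  echo "SPARK_HOME is not set -- extract the Spark download and export SPARK_HOME"
fi

# With both set, start the History Server; its UI defaults to port 18080:
HISTORY_UI_URL="http://localhost:18080"
# "$SPARK_HOME/sbin/start-history-server.sh"
```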