# DataFlint Spark Instrumentation

DataFlint provides optional, opt-in instrumentation that enhances Spark observability by injecting extra metrics and metadata into the Spark UI. All instrumentation is disabled by default.
This feature is currently experimental: DataFlint instruments your query's physical plan, so use it with caution.
## Usage examples
- Enable all Python UDF instrumentation (PySpark)
- Enable only mapInPandas instrumentation
- Enable only mapInArrow instrumentation
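A sketch of how the options above might be wired up in PySpark. The `spark.dataflint.*` keys below are invented placeholders, since the actual configuration names are not given in this section; check DataFlint's documentation for the real setting names.

```python
# NOTE: the configuration keys below are HYPOTHETICAL placeholders --
# the real DataFlint setting names are not specified here.
def dataflint_udf_conf(enable_all=False, map_in_pandas=False, map_in_arrow=False):
    """Build a dict of Spark conf entries opting in to UDF instrumentation."""
    conf = {}
    if enable_all:
        conf["spark.dataflint.udf.instrumentation.enabled"] = "true"
    if map_in_pandas:
        conf["spark.dataflint.udf.mapInPandas.enabled"] = "true"
    if map_in_arrow:
        conf["spark.dataflint.udf.mapInArrow.enabled"] = "true"
    return conf

# Each entry would then be applied with SparkSession.builder.config(key, value).
```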
## Spark Python UDF Instrumentation

### Overview
Spark's mapInPandas and mapInArrow execute Python UDFs once per partition, but Spark does not report per-partition Python UDF execution time by default.
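For context, a mapInPandas UDF receives an iterator of pandas DataFrames (one per batch in the partition) and yields transformed DataFrames. A minimal sketch, with the Spark invocation shown as a comment:

```python
import pandas as pd
from typing import Iterator

# A per-partition UDF of the kind passed to DataFrame.mapInPandas:
# it consumes an iterator of pandas DataFrames and yields new ones.
def double_values(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for batch in batches:
        yield batch.assign(value=batch["value"] * 2)

# In Spark this would be invoked as:
#   df.mapInPandas(double_values, schema="id long, value long")
```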
DataFlint replaces the built-in physical plan nodes `MapInPandasExec` and `PythonMapInArrowExec` with instrumented versions that add a duration SQL metric, visible in the SQL tab of the Spark UI.
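To illustrate the kind of measurement involved, here is a rough Python sketch that times a per-partition UDF by hand. This is not how DataFlint works internally (it patches the physical plan nodes and reports through Spark's SQL metrics); it only shows what "duration per partition" means for an iterator-based UDF.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

# Illustrative only: wrap an iterator-based UDF and measure how long the
# partition takes to process end to end.
def timed(udf: Callable[[Iterator[T]], Iterator[T]]) -> Callable[[Iterator[T]], Iterator[T]]:
    def wrapper(batches: Iterator[T]) -> Iterator[T]:
        start = time.perf_counter()
        try:
            yield from udf(batches)
        finally:
            elapsed = time.perf_counter() - start
            # A real job might report this via an accumulator or a log
            # line; it is printed here purely for illustration.
            print(f"partition UDF time: {elapsed:.4f}s")
    return wrapper

# Example: wrap a UDF that doubles each element of every batch.
@timed
def double_all(batch_iter):
    for batch in batch_iter:
        yield [x * 2 for x in batch]
```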