πŸ§ƒ Spark Instrumentation

Optional, opt-in instrumentation that adds extra metrics and metadata to Spark UI.

DataFlint Spark Instrumentation

DataFlint provides optional instrumentation that enhances Spark observability by injecting extra metrics and metadata into the Spark UI. All instrumentation is opt-in and disabled by default.


Spark Python UDF Instrumentation

Overview

Spark's mapInPandas and mapInArrow execute Python UDFs once per partition, but Spark does not report the Python UDF execution time per partition by default.

DataFlint replaces the built-in physical plan nodes:

  • MapInPandasExec / PythonMapInArrowExec

with instrumented versions that add a duration SQL metric, visible in the SQL tab of the Spark UI.

Usage examples

Enable all Python UDF instrumentation (PySpark)
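A minimal PySpark sketch, assuming DataFlint is already installed and attached to the application; only the configuration key comes from this page, the rest is illustrative:

```python
from pyspark.sql import SparkSession

# Global toggle: enables instrumentation for both mapInPandas and mapInArrow.
spark = (
    SparkSession.builder
    .config("spark.dataflint.instrument.spark.enabled", "true")
    .getOrCreate()
)
```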

Enable only mapInPandas instrumentation
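A minimal PySpark sketch, assuming DataFlint is already attached; only the configuration key comes from this page:

```python
from pyspark.sql import SparkSession

# Instruments only MapInPandasExec nodes; mapInArrow is left untouched.
spark = (
    SparkSession.builder
    .config("spark.dataflint.instrument.spark.mapInPandas.enabled", "true")
    .getOrCreate()
)
```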

Enable only mapInArrow instrumentation
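A minimal PySpark sketch, assuming DataFlint is already attached; only the configuration key comes from this page:

```python
from pyspark.sql import SparkSession

# Instruments only the mapInArrow nodes; mapInPandas is left untouched.
spark = (
    SparkSession.builder
    .config("spark.dataflint.instrument.spark.mapInArrow.enabled", "true")
    .getOrCreate()
)
```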

Configuration

All properties default to false.

  • spark.dataflint.instrument.spark.enabled

    • Global toggle.

    • Enables all Spark Python UDF instrumentation.

    • Equivalent to enabling both flags below.

  • spark.dataflint.instrument.spark.mapInPandas.enabled

    • Enables mapInPandas instrumentation only.

  • spark.dataflint.instrument.spark.mapInArrow.enabled

    • Enables mapInArrow instrumentation only.

Supported Spark versions

DataFlint ships version-specific implementations that match each Spark version's internal MapInBatchExec API.

| Spark version | MapInPandas implementation | MapInArrow implementation |
| --- | --- | --- |
| 3.3.x | DataFlintMapInPandasExec_3_3 | DataFlintPythonMapInArrowExec_3_3 |
| 3.4.x | DataFlintMapInPandasExec_3_4 | DataFlintPythonMapInArrowExec_3_4 |
| 3.5.x | DataFlintMapInPandasExec_3_5 | DataFlintPythonMapInArrowExec_3_5 |
| 4.0.x | DataFlintMapInPandasExec_4_0 | DataFlintPythonMapInArrowExec_4_0 |
| 4.1.x | DataFlintMapInPandasExec_4_1 | DataFlintPythonMapInArrowExec_4_1 |

How it works

  1. During SparkDataflintPlugin.init(), DataFlint checks instrumentation flags.

  2. If enabled, it registers DataFlintInstrumentationExtension into spark.sql.extensions.

  3. The extension injects a ColumnarRule.

  4. The rule runs during preColumnarTransitions.

    • It runs before columnar optimizations.

  5. The rule uses transformUp to walk the physical plan.

  6. It replaces matching nodes:

    • MapInPandasExec β†’ DataFlintMapInPandasExec

    • PythonMapInArrowExec / MapInArrowExec β†’ DataFlintPythonMapInArrowExec

  7. The instrumented nodes wrap the original doExecute() logic.

    • They measure System.nanoTime() before and after each partition evaluation.

    • They accumulate elapsed time into a SQL metric (milliseconds).

  8. DataFlint detects the Spark runtime version.

    • It picks the correct implementation for that version’s internal APIs.
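The per-partition timing in step 7 can be sketched in plain Python. This is a conceptual analogue, not DataFlint's actual Scala code; instrument_partition and the metric dict are illustrative names:

```python
import time

def instrument_partition(udf, metric):
    """Wrap a per-partition UDF so its wall-clock time is accumulated,
    in milliseconds, into a shared metric (analogous to a SQL metric)."""
    def timed(iterator):
        start = time.monotonic_ns()  # analogous to System.nanoTime()
        # Yield the wrapped UDF's output batches; the clock stops once
        # the partition's iterator is fully consumed.
        for batch in udf(iterator):
            yield batch
        metric["duration_ms"] += (time.monotonic_ns() - start) // 1_000_000
    return timed
```

In the real implementation the elapsed time feeds a SQL metric that Spark aggregates across partitions, which is why the UI shows one total duration per node.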

What you get

Once enabled, the SQL plan in the Spark UI shows a duration metric on the DataFlintMapInPandas / DataFlintMapInArrow nodes.

The metric is the total wall-clock time, in milliseconds, spent executing the Python UDF across all partitions.

This helps you confirm whether the Python UDF is the bottleneck.
