🦣 Window instrumentation

Spark Window Instrumentation

Overview

Spark window functions execute native/Java/Python aggregate functions per partition. By default, Spark does not report their execution time per partition.

DataFlint replaces the built-in physical plan nodes with instrumented versions:

  • DataFlintWindow / DataFlintWindowInPandas

The instrumented versions add a duration SQL metric, visible in the Spark UI SQL tab and in the DataFlint UI.

Usage examples

Enable all DataFlint Spark instrumentation (PySpark)
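The snippet itself did not survive on this page; a minimal sketch via spark-submit, assuming DataFlint is already attached to the application (app.py is a placeholder script name):

```shell
# Global toggle: turns on all DataFlint Spark instrumentation,
# window instrumentation included.
spark-submit \
  --conf spark.dataflint.instrument.spark.enabled=true \
  app.py
```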

Enable only Window instrumentation
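Again as a sketch (app.py is a placeholder): setting only the window-specific key leaves the rest of DataFlint's instrumentation off.

```shell
# Window instrumentation only; the global toggle stays at its
# default of false.
spark-submit \
  --conf spark.dataflint.instrument.spark.window.enabled=true \
  app.py
```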

Configuration

All properties default to false.

  • spark.dataflint.instrument.spark.enabled

    • Global toggle.

    • Enables all DataFlint Spark instrumentation.

  • spark.dataflint.instrument.spark.window.enabled

    • Enables window instrumentation only.

Supported Spark versions

DataFlint ships version-specific implementations that match each Spark version's internal window execution API (WindowInPandasExec / ArrowWindowPythonExec).

| Spark version | Window implementation | WindowInPandas implementation       |
| ------------- | --------------------- | ----------------------------------- |
| 3.x.x         | DataFlintWindowExec   | DataFlintWindowInPandasExec         |
| 4.0.x         | DataFlintWindowExec   | DataFlintWindowInPandasExec_4_0     |
| 4.1.x         | DataFlintWindowExec   | DataFlintArrowWindowPythonExec_4_1  |

How it works

  1. During SparkDataflintPlugin.init(), DataFlint checks the instrumentation flags.

  2. If enabled, it registers DataFlintInstrumentationExtension into spark.sql.extensions.

  3. The extension injects a planner Strategy via injectPlannerStrategy.

    • The strategy runs as part of the logical-to-physical plan transformation.

  4. The strategy pattern-matches on window nodes and replaces them:

    • WindowExec → DataFlintWindowExec

    • WindowInPandasExec / ArrowWindowPythonExec → DataFlintWindowInPandas

  5. The instrumented nodes wrap the original doExecute() logic.

    • They measure System.nanoTime() before and after each partition evaluation.

    • They accumulate the elapsed time into a SQL metric (milliseconds).

  6. DataFlint detects the Spark runtime version and picks the implementation that matches that version's internal APIs.
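The version-based selection can be sketched as follows — a plain-Python illustration using the class names from the table above, not DataFlint's actual Scala code:

```python
def pick_window_in_pandas_impl(spark_version: str) -> str:
    """Return the instrumented WindowInPandas implementation name
    for a given Spark runtime version (names from the table above)."""
    major, minor = spark_version.split(".")[:2]
    if major == "3":
        return "DataFlintWindowInPandasExec"
    if (major, minor) == ("4", "0"):
        return "DataFlintWindowInPandasExec_4_0"
    if (major, minor) == ("4", "1"):
        return "DataFlintArrowWindowPythonExec_4_1"
    raise ValueError(f"unsupported Spark version: {spark_version}")

pick_window_in_pandas_impl("3.5.1")  # -> "DataFlintWindowInPandasExec"
```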

What you get

Once enabled, the Spark UI SQL plan shows a duration metric. It appears on DataFlintWindow / DataFlintWindowInPandas nodes.

The metric is total wall-clock time (milliseconds). It covers window execution across all partitions.

The timer wraps the actual window execution within each partition, which:

  1. Pulls rows from its child (e.g., a SortExec)

  2. Accumulates a full partition group

  3. Computes window function values

  4. Emits output rows

Steps 1–3 all happen during the first iter.hasNext call. So the timer captures child-fetch time + window computation time combined.
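That timing pattern can be sketched in plain Python — DataFlint does this in Scala with System.nanoTime() around the partition iterator; the names here are illustrative:

```python
import time

def timed_partition(rows, metric):
    """Wrap a partition iterator, accumulating the wall-clock time spent
    pulling rows (child fetch + window computation) into metric, in ms."""
    it = iter(rows)
    while True:
        start = time.perf_counter_ns()
        try:
            row = next(it)  # the first call also triggers the heavy work
        except StopIteration:
            metric["duration_ms"] += (time.perf_counter_ns() - start) // 1_000_000
            return
        metric["duration_ms"] += (time.perf_counter_ns() - start) // 1_000_000
        yield row

metric = {"duration_ms": 0}
result = list(timed_partition(range(3), metric))  # -> [0, 1, 2]
```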

This helps you confirm whether the window function is the bottleneck.
