Window instrumentation
Overview
Spark window functions execute native/Java/Python aggregate functions per partition. By default, Spark does not report execution time per partition.
DataFlint replaces the built-in physical plan nodes with instrumented versions, DataFlintWindow / DataFlintWindowInPandas, that add a duration SQL metric. The metric is visible in the Spark UI SQL tab and in the DataFlint UI.
Usage examples
Enable all Python UDF instrumentation (PySpark)
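A minimal PySpark sketch of the global toggle (the config key comes from the Configuration section below; loading the DataFlint jar itself, e.g. via --packages, is assumed to happen separately):

```python
# Global toggle: enables all DataFlint Spark instrumentation.
# Spark conf values are strings, so "true" rather than True.
conf = {"spark.dataflint.instrument.spark.enabled": "true"}

# Applied when building the session, e.g.:
#   spark = (SparkSession.builder
#            .config("spark.dataflint.instrument.spark.enabled", "true")
#            .getOrCreate())
```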
Enable only Window instrumentation
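A minimal PySpark sketch of the window-only toggle (config key from the Configuration section below; loading the DataFlint jar is again assumed to happen separately):

```python
# Window-only toggle: instruments just the Window / WindowInPandas nodes,
# leaving the rest of DataFlint's Spark instrumentation off.
conf = {"spark.dataflint.instrument.spark.window.enabled": "true"}

# Applied the same way, e.g.:
#   spark = (SparkSession.builder
#            .config("spark.dataflint.instrument.spark.window.enabled", "true")
#            .getOrCreate())
```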
Configuration
All properties default to false.
spark.dataflint.instrument.spark.enabled: Global toggle. Enables all DataFlint Spark instrumentation.
spark.dataflint.instrument.spark.window.enabled: Enables window instrumentation only.
Supported Spark versions
DataFlint ships version-specific implementations that match each Spark version's internal window execution APIs (WindowExec, WindowInPandasExec / ArrowWindowPythonExec).
3.x.x: DataFlintWindowExec, DataFlintWindowInPandasExec
4.0.x: DataFlintWindowExec, DataFlintWindowInPandasExec_4_0
4.1.x: DataFlintWindowExec, DataFlintArrowWindowPythonExec_4_1
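A hedged Python sketch of the version dispatch the table implies (function name and selection logic are illustrative; DataFlint's actual dispatch lives in its Scala plugin):

```python
def pick_pandas_window_impl(spark_version: str) -> str:
    """Illustrative version dispatch for the pandas/Arrow window node;
    not DataFlint's actual code."""
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if (major, minor) >= (4, 1):
        return "DataFlintArrowWindowPythonExec_4_1"
    if (major, minor) >= (4, 0):
        return "DataFlintWindowInPandasExec_4_0"
    return "DataFlintWindowInPandasExec"
```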
How it works
During SparkDataflintPlugin.init(), DataFlint checks the instrumentation flags. If enabled, it registers DataFlintInstrumentationExtension into spark.sql.extensions.
The extension injects a planner Strategy via injectPlannerStrategy, which runs as part of the logical-to-physical plan transformation.
The strategy pattern-matches on the physical plan and replaces matching Window nodes:
WindowExec → DataFlintWindowExec
WindowInPandasExec / ArrowWindowPythonExec → DataFlintWindowInPandas
The instrumented nodes wrap the original doExecute() logic. They measure System.nanoTime() before and after each partition evaluation and accumulate the elapsed time into a SQL metric (milliseconds).
DataFlint detects the Spark runtime version and picks the correct implementation for that version's internal APIs.
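The wrap-and-measure step can be sketched in Python: wrap the partition's row iterator, time each pull with a nanosecond clock (the Python analogue of System.nanoTime()), and accumulate the elapsed time into a metric that is later reported in milliseconds. All names here are illustrative, not DataFlint's:

```python
import time

def timed_partition(rows, metric):
    """Wrap a partition's row iterator and accumulate the time spent
    pulling rows from it (illustrative sketch of the instrumented node)."""
    it = iter(rows)
    while True:
        start = time.perf_counter_ns()   # analogue of System.nanoTime()
        try:
            row = next(it)               # may trigger heavy work on first pull
        except StopIteration:
            metric["duration_ns"] += time.perf_counter_ns() - start
            return
        metric["duration_ns"] += time.perf_counter_ns() - start
        yield row

metric = {"duration_ns": 0}
out = list(timed_partition(range(5), metric))
duration_ms = metric["duration_ns"] // 1_000_000  # unit reported in the SQL metric
```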
What you get
Once enabled, the Spark UI SQL plan shows a duration metric on the DataFlintWindow / DataFlintWindowInPandas nodes. The metric is total wall-clock time in milliseconds, covering window execution across all partitions.
The timer wraps the actual window execution within each partition:
Pulls rows from its child (e.g., a SortExec)
Accumulates a full partition group
Computes window function values
Emits output rows
Steps 1-3 all happen during the first iter.hasNext call, so the timer captures child-fetch time and window computation time combined.
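A toy Python example of that laziness: a window-style operator must buffer the whole partition group before it can yield its first row, so the child iterator is fully drained inside the very first pull (all names are illustrative):

```python
pulls = []  # records each row handed over by the "child" operator

def child_rows():
    # Stand-in for the child (e.g. a SortExec) feeding the window node.
    for i in range(4):
        pulls.append(i)
        yield i

def window_over_partition(rows):
    group = list(rows)        # steps 1-2: drain the child, buffer the group
    total = sum(group)        # step 3: compute the window value
    for r in group:           # step 4: emit output rows
        yield (r, total)

out = window_over_partition(child_rows())
first = next(out)             # child is fully drained inside this first call
# pulls is already [0, 1, 2, 3] here, before any further output rows
```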
This helps you confirm whether the window function is the bottleneck.