Apache Iceberg
Summary
Apache Iceberg is a popular open table format for data lakes, with official support for Apache Spark via the Iceberg Spark library.
However, the official library does not expose write metrics, which makes it hard to analyze read and write behavior and turn it into actionable insights.
Adding the Iceberg metrics reporter
TL;DR: set spark.dataflint.iceberg.autoCatalogDiscovery to true for Iceberg support
To collect the additional write metrics, an Iceberg metrics reporter that gathers the relevant data needs to be registered.
The easiest way is to instruct DataFlint to find all Iceberg catalogs and add the reporter for you, by setting:
spark.dataflint.iceberg.autoCatalogDiscovery
to true
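For example, this can be set when building the Spark session (a minimal sketch; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Enable DataFlint's automatic Iceberg catalog discovery so it can
// attach its metrics reporter to every Iceberg catalog it finds.
val spark = SparkSession.builder()
  .appName("iceberg-example") // illustrative app name
  .config("spark.dataflint.iceberg.autoCatalogDiscovery", "true")
  .getOrCreate()
```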
The reporter can also be configured manually for each catalog, for example:
Set spark.sql.catalog.[catalog name].metrics-reporter-impl
to org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter
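For instance, for a catalog named my_catalog (a hypothetical name; replace it with your own), the manual configuration might look like this when building the session:

```scala
import org.apache.spark.sql.SparkSession

// "my_catalog" is a hypothetical catalog name; swap in your catalog's name.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.metrics-reporter-impl",
    "org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter")
  .getOrCreate()
```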
Potential problems
If you already have a metrics-reporter-impl configured, you will get a warning log and DataFlint's write metrics will not work. To keep both reporters, create an instance of DataflintIcebergMetricsReporter inside your own metrics reporter and call its report method whenever your implementation's report method is called.
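A minimal sketch of such a delegating reporter, assuming DataflintIcebergMetricsReporter has a no-argument constructor and implements Iceberg's MetricsReporter interface:

```scala
import org.apache.iceberg.metrics.{MetricsReport, MetricsReporter}
import org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter

// Wraps your existing reporting logic and forwards every report to DataFlint.
class CombinedMetricsReporter extends MetricsReporter {
  private val dataflint = new DataflintIcebergMetricsReporter() // assumed no-arg constructor

  override def report(report: MetricsReport): Unit = {
    // ... your existing handling of `report` goes here ...
    dataflint.report(report) // forward so DataFlint still receives write metrics
  }
}
```

You would then point metrics-reporter-impl at your combined class instead of the DataFlint one.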
There is a known bug in Iceberg: if your metrics reporter is implemented in a different class loader, Iceberg cannot load it. In this case DataFlint will log a warning; for more details see https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1711034217566819?thread_ts=1705849607.605659&cid=C025PH0G1D4
Write metrics
When writing to an Iceberg table (with insert/delete/update/merge operations), all you can see in the SQL plan is an AppendData/ReplaceData/WriteDelta node with no metrics at all.
For example, here is the end of the plan for this SQL query on an empty table:
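The query itself is not reproduced here; as an illustrative stand-in (the table and column names are made up), an append like the following produces such an AppendData node:

```scala
// Hypothetical Iceberg table partitioned by vendor_id
spark.sql("""
  CREATE TABLE my_catalog.db.trips (vendor_id INT, fare DOUBLE)
  USING iceberg
  PARTITIONED BY (vendor_id)
""")
// Appending rows for two vendor ids writes one file per partition (2 files)
spark.sql("INSERT INTO my_catalog.db.trips VALUES (1, 10.5), (1, 8.0), (2, 7.25)")
```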
DataFlint knows how to collect additional data on Iceberg writes and includes it in the SQL plan.
Here is the same append in DataFlint:
The table is partitioned by vendor id (the first value), so we got 2 files, and the percentage next to each metric is the share of files/records/bytes this write added to the table.
If we run the same code again we will get:
Now the percentages are 50%, because the data we added is 50% of the table (2 out of 4 files).
A delete query will show how much data from the table got deleted:
An update query will show how many records got replaced:
And a merge query will show both: how many files and records were removed or added.
Read metrics
Reading from an Apache Iceberg table looks like this in the SQL plan:
In DataFlint, we only show the most important metrics, and alert when you read small files.