Apache Iceberg
Summary
Apache Iceberg is a popular open table format for data lakes, with official support for Apache Spark via the Iceberg Spark library.
However, the official library does not expose write metrics, which makes it hard to analyze read and write behavior and turn it into actionable insights.
Adding the Iceberg metrics reporter
TL;DR: set spark.dataflint.iceberg.autoCatalogDiscovery to true for Iceberg support
To collect the additional write metrics, an Iceberg metrics reporter that gathers the relevant data needs to be registered.
The easiest way is to instruct DataFlint to find all Iceberg catalogs and add the reporter for you, by setting:
spark.dataflint.iceberg.autoCatalogDiscovery
to true
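For example, this can be set when building the Spark session (a minimal sketch; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Enable DataFlint's automatic Iceberg catalog discovery so it can
// attach its metrics reporter to every Iceberg catalog it finds.
val spark = SparkSession.builder()
  .appName("iceberg-example") // illustrative app name
  .config("spark.dataflint.iceberg.autoCatalogDiscovery", "true")
  .getOrCreate()
```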
The reporter can also be configured manually for each catalog, for example:
Set spark.sql.catalog.[catalog name].metrics-reporter-impl
to org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter
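For instance, for a catalog named my_catalog (a hypothetical name; replace it with your own), the manual configuration might look like this when building the session:

```scala
import org.apache.spark.sql.SparkSession

// "my_catalog" is a hypothetical catalog name; swap in your catalog's name.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.metrics-reporter-impl",
    "org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter")
  .getOrCreate()
```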
Potential problems
If you already have a metrics-reporter-impl configured, you will get a warning log and DataFlint's write metrics will not work. To keep both reporters, create an instance of DataflintIcebergMetricsReporter inside your own metrics reporter and call its report method whenever your implementation's report method is called.
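A minimal sketch of such a delegating reporter, assuming DataflintIcebergMetricsReporter has a no-argument constructor and implements Iceberg's MetricsReporter interface:

```scala
import org.apache.iceberg.metrics.{MetricsReport, MetricsReporter}
import org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter

// Wraps your existing reporting logic and forwards every report to DataFlint.
class CombinedMetricsReporter extends MetricsReporter {
  private val dataflint = new DataflintIcebergMetricsReporter() // assumed no-arg constructor

  override def report(report: MetricsReport): Unit = {
    // ... your existing handling of `report` goes here ...
    dataflint.report(report) // forward so DataFlint still receives write metrics
  }
}
```

You would then point metrics-reporter-impl at your combined class instead of the DataFlint one.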
There is a known bug in Iceberg: if your metrics reporter is implemented in a different class loader, Iceberg cannot load it. In this case DataFlint will log a warning; for more details see https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1711034217566819?thread_ts=1705849607.605659&cid=C025PH0G1D4
Write metrics
When writing to an Iceberg table (with insert/delete/update/merge operations), all you can see in the SQL plan is an AppendData/ReplaceData/WriteDelta node with no metrics at all.
For example, here is the end of the plan for this SQL query on an empty table:
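The query itself is not reproduced here; as an illustrative stand-in (the table and column names are made up), an append like the following produces such an AppendData node:

```scala
// Hypothetical Iceberg table partitioned by vendor_id
spark.sql("""
  CREATE TABLE my_catalog.db.trips (vendor_id INT, fare DOUBLE)
  USING iceberg
  PARTITIONED BY (vendor_id)
""")
// Appending rows for two vendor ids writes one file per partition (2 files)
spark.sql("INSERT INTO my_catalog.db.trips VALUES (1, 10.5), (1, 8.0), (2, 7.25)")
```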
DataFlint knows how to collect additional data on Iceberg writes and includes it in the SQL plan.
Here is the same append in DataFlint:
The table is partitioned by vendor id (the first value), so we got 2 files, and the percentage next to each metric is the share of files/records/bytes this write added to the table.
If we run the same code again we will get:
Now the percentages are 50%, because the data we added is 50% of the table (2 out of 4 files).
A delete query will show how much data from the table got deleted:
An update query will show how many records got replaced:
And a merge query will show both: how many files and records were removed or added.
Read metrics
Reading from an Apache Iceberg table looks like this in the SQL plan:
In DataFlint, we only show the most important metrics, and alert when you read small files.