# Apache Iceberg

### Summary

Apache Iceberg is a popular open table format for data lakes, and has official support for usage with Apache Spark via an official library.

However, the official library does not report write metrics to Spark, which makes it hard to analyze read and write behavior and turn it into actionable insights.

### Adding iceberg metric reporter

**TLDR**: set **`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`** for iceberg support

To collect the additional write metrics, we need to add an Iceberg metrics reporter that collects the relevant data.

To do so, you can instruct DataFlint to find all Iceberg catalogs and add the reporter for you, by setting:

**`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`**

Alternatively, you can set the Iceberg metrics reporter manually for each catalog, for example:

Set **`spark.sql.catalog.[catalog name].metrics-reporter-impl`** to **`org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter`**
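
As a sketch, both options could be applied when building the Spark session. The catalog name `demo` and the app name below are placeholders, not values required by DataFlint:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable DataFlint's Iceberg metrics, assuming a catalog named "demo".
val spark = SparkSession.builder()
  .appName("iceberg-dataflint-example")
  // Option 1: let DataFlint discover all Iceberg catalogs and attach its reporter
  .config("spark.dataflint.iceberg.autoCatalogDiscovery", "true")
  // Option 2: attach the reporter to a specific catalog by hand
  .config("spark.sql.catalog.demo.metrics-reporter-impl",
    "org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter")
  .getOrCreate()
```

In practice you would use only one of the two options for a given catalog.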

#### Potential problems

1. If you already have a `metrics-reporter-impl` set up, you will get a warning log and the metrics will not work. In that case, you can create an instance of **DataflintIcebergMetricsReporter** inside your own metrics reporter and call its report method whenever your implementation's report method is called.
2. There is a known bug in Iceberg: if your metrics reporter is implemented in a different class loader, Iceberg cannot load it. In this case DataFlint will log a warning; for more details see <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1711034217566819?thread_ts=1705849607.605659&cid=C025PH0G1D4>
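
For the first case, one way to chain DataFlint's reporter is to delegate to it from your own. Below is a minimal sketch, assuming Iceberg's `MetricsReporter` interface (from `org.apache.iceberg.metrics`) and a no-argument constructor for `DataflintIcebergMetricsReporter`:

```scala
import org.apache.iceberg.metrics.{MetricsReport, MetricsReporter}
import org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter

// Sketch: forward every report to DataFlint in addition to your own handling.
class ChainedMetricsReporter extends MetricsReporter {
  private val dataflint = new DataflintIcebergMetricsReporter()

  override def report(report: MetricsReport): Unit = {
    dataflint.report(report) // let DataFlint collect its metrics
    // ... your existing reporting logic goes here ...
  }
}
```

You would then point `metrics-reporter-impl` at this chained class instead of your original one.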

### Write metrics

When writing to an Iceberg table (with insert/delete/update/merge operations), all you can see in the SQL plan is an AppendData/ReplaceData/WriteDelta node with no metrics at all.

For example, here is the end of the plan for this SQL query on an empty table:

```sql
INSERT INTO demo.nyc.taxis
VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N');
```

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FUoyEvuIPHxl0C22oOl97%2Fimage.png?alt=media&#x26;token=2196356b-64dc-4f30-842f-5fa52a09d2ed" alt="" width="273"><figcaption></figcaption></figure>

DataFlint knows how to collect additional data on Iceberg writes and includes it in the SQL plan.

Here is the same append in DataFlint:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FRWFuYmy1SYMdQDJe8QWO%2Fimage.png?alt=media&#x26;token=27d14b2f-2b8b-4a7e-b3a9-a71dd14ec611" alt="" width="375"><figcaption></figcaption></figure>

The table is partitioned by vendor id (the first value), so we got 2 files. The percentage next to each metric is the share of files/records/bytes that this write added to the table.

If we run the same code again we will get:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FssEad3BLRQDA0oQEU4I2%2Fimage.png?alt=media&#x26;token=256c90f4-414d-4eec-b197-a31437459b94" alt="" width="375"><figcaption></figcaption></figure>

Now the percentages are 50%, because the data we added is 50% of the table (2 out of 4 files).

A delete query will show how much data from the table got deleted:
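
For instance, a delete on the example table might look like this (the `vendor_id` column name is an assumption based on the Iceberg quickstart schema, since the table is partitioned by vendor id):

```sql
DELETE FROM demo.nyc.taxis
WHERE vendor_id = 1;
```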

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FZjm1kDnuOT2Fr5QWyP9x%2Fimage.png?alt=media&#x26;token=6e9f56e5-6146-4611-bc5e-6f151c07cc13" alt="" width="281"><figcaption></figcaption></figure>

An update query will show how many records got replaced:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FbVddaoOhjFjVhLvFJqq7%2Fimage.png?alt=media&#x26;token=290f548b-7a63-4633-8a93-21d5698a47e6" alt="" width="292"><figcaption></figcaption></figure>

And a merge query will show both: how many files and records were removed or added.

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FdMJ7bc8xED7zEKyPjcaY%2Fimage.png?alt=media&#x26;token=cdb51410-4481-46b1-83ea-45cbc3a1f2be" alt="" width="266"><figcaption></figcaption></figure>

### Read metrics

Reading an Apache Iceberg table produces a SQL node that looks like this:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2F5DLQZg3iKRflq1azeNqs%2Fimage.png?alt=media&#x26;token=064862d4-0eab-4060-9078-6f56b9c3ad0c" alt="" width="286"><figcaption></figcaption></figure>

In DataFlint, we show only the most important metrics, and alert when you are reading small files:

<figure><img src="https://2982210886-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fcg8pTm3VgVaeMncRl8LP%2Fuploads%2FgCRbtOzYWVCDDQON5zW3%2Fimage.png?alt=media&#x26;token=79a28d30-5c82-453e-874e-1b81bce2dfe7" alt="" width="375"><figcaption></figcaption></figure>
