# Apache Iceberg

### Summary

Apache Iceberg is a popular file format for data lake, and has support for usage with Apache Spark via an official library.

But, the official library does not have metrics on write metrics, and lacking in analyzing read and write metrics and making them into actionable items.

### Adding iceberg metric reporter

**TLDR**: set **`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`** for iceberg support

To collect the additional needed write metric, we need to add the an iceberg metric reporter that collects the relevant data

To do it, we could instruct DataFlint to find all iceberg catalogs and add it for us, by enabling:

&#x20;**`spark.dataflint.iceberg.autoCatalogDiscovery`** to **`true`**\
\
It can also be set manually by setting the iceberg metric reporter manually for each catalog, for example:&#x20;

Set **`spark.sql.catalog.[catalog name].metrics-reporter-impl`**  to pl **`org.apache.spark.dataflint.iceberg.DataflintIcebergMetricsReporter`**

#### Potential problems

1. If you already have a metrics-reporter-impl set up you will get a warning log, and metrics will not work. You can create an instance of **DataflintIcebergMetricsReporter** in your metric reporter and call the report method when your implementation report method is being called
2. There is a known bug in Iceberg that if your metric reporter is implemented in a different class loader it cannot load it. In this case dataflint will throw a warning log, for more details see [https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1711034217566819?thread\_ts=1705849607.605659\&cid=C025PH0G1D4](<https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1711034217566819?thread_ts=1705849607.605659\&cid=C025PH0G1D4&#xA;>)

### Write metrics

When trying to write to an iceberg table (with insert/delete/update/merge operations) in the SQL plan all you can see is an AppendData/ReplaceData/WriteDelta node with no metrics at all

For example, here is the end of the plan for this SQL query for an empty table:

```sql
INSERT INTO demo.nyc.taxis
VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N');
```

<figure><img src="/files/NAPp0Oq2kGPIgWVpmdnw" alt="" width="273"><figcaption></figcaption></figure>

DataFlint knows to collect additional data on Iceberg writes and includes them in the SQL plan

Here is the same append in DataFlint

<figure><img src="/files/4Mq6MKzNe4W5V7HbT1am" alt="" width="375"><figcaption></figcaption></figure>

The table is partitioned by vendor id (the first value) so we got 2 files, and the percentage need the metric means the amount of files/records/bytes added to the table.

If we run the same code again we will get:

<figure><img src="/files/EoykS3d8m3kuUc6Bsnht" alt="" width="375"><figcaption></figcaption></figure>

Now the percentages are 50% because the data we added is 50% of the table (2 our of 4 files)

A delete query will show how much data from the table got deleted:

<figure><img src="/files/KK0syDP6XEG1eudUBhqG" alt="" width="281"><figcaption></figcaption></figure>

Update query will show how many records got replaced:

<figure><img src="/files/bhFBn2sn0wmqYAAffR4m" alt="" width="292"><figcaption></figcaption></figure>

And merge query will show both - how many files records were removed or added

<figure><img src="/files/Ng6bT776K5UpcnhIO6sF" alt="" width="266"><figcaption></figcaption></figure>

### Read metrics

Reading an apache iceberg SQL node looks like this:

<figure><img src="/files/oinNNE9fNt8jWfDqpFOd" alt="" width="286"><figcaption></figcaption></figure>

In DataFlint, we only show the most important metrics, and alert when you read small files

<figure><img src="/files/nFEJwSZ10muAsaXrGZZD" alt="" width="375"><figcaption></figcaption></figure>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://dataflint.gitbook.io/dataflint-for-spark/integrations/apache-iceberg.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
