Why Apache Iceberg is the center of attention in data platforms
In a Q&A, consultant Donald Farmer explains why vendors are rushing to support Apache Iceberg and discusses its capabilities and deployment issues for data teams.
Data platform vendors are lining up behind Apache Iceberg, with several recently announcing new or expanded support for the open table format used to manage large analytics datasets in data lakes and lakehouses.
For example, Snowflake made support for Version 3 of the Iceberg specification generally available in May, a month after enabling users to store Iceberg tables in its platform. These moves further deepen the full embrace of the table format that Snowflake announced in April 2025, after previously offering only limited support for Iceberg.
Also in May, Databricks added a broad set of Iceberg capabilities to its data lakehouse platform, including Iceberg V3 support and the ability to create Iceberg tables in its Unity Catalog software. Its support, which began with the 2024 acquisition of a startup whose founders included Iceberg's creators, is particularly notable because Databricks initially developed the alternative Delta Lake table format.
SAP is also going the acquisition route: In May, it said it's buying Iceberg-based data lakehouse vendor Dremio. The deal will extend the reach of SAP's Business Data Cloud to new external data sources without forcing users to move data into the SAP platform.
In an interview with TechTarget, Donald Farmer, principal of consulting firm TreeHive Strategy, said these and various other vendor announcements demonstrate that Iceberg has become the de facto industry-standard table format, beating out both Delta Lake and Apache Hudi. Vendors really have no choice but to support it, he added.
Farmer also discussed Iceberg's capabilities and issues that data leaders and teams need to consider when planning deployments of the table format.
Editor's note: This Q&A has been edited for clarity and conciseness.
Why are data platform vendors rushing to add or expand support for Iceberg?
Donald Farmer: Partly, it's because the question of table format is now settled. Iceberg has won, and it is now the default. Once that happens, as a vendor, you can't sit outside that process. You have to adopt it.
The first thing is wanting to defend against being excluded [by technology buyers]. You always want to be in on the RFP, so if Iceberg is in the RFP, you need to support it. Also, no matter how big you are as a vendor, once something becomes commoditized in that way, you have to support it. Look at SAP, which acquired Dremio. They didn't have an Iceberg-native engine. They had to find one.
If you think of that as a negative framing -- "I don't want to be excluded" -- the positive framing is that the Iceberg-native data lakehouse is now the substrate for AI agents. One of the problems with agentic AI is that people see the data architecture that underlies the agents as being fragmented and siloed in their organizations. The Iceberg-native lakehouse pools all that together and provides a fairly neutral environment for agents to run over, rather than trying to build agents over this fragmented, siloed architecture.
For data leaders and teams, what capabilities does Iceberg provide that aren't supported by what could perhaps be called traditional data lakes at this point?
Farmer: The traditional data lake is more or less just files in an object store -- particularly, more recently, Parquet files in an object store, with a Hive-style directory. Iceberg adds a metadata tier over that. It's not just file storage; there's the metadata tier, and also a sort of transaction layer. To a certain extent, you can have ACID transactions, for example.
Iceberg also supports schema evolution, so you can add columns, you can rename them, maybe even change the type of a column, and you don't have to rewrite the existing data files to do it. And it has similar support for partition evolution, so you can change the partitioning scheme without rewriting the existing data. Capabilities like those go beyond what you could call the dumb data lake, which is just files in an object store with a loose structure around them.
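To make the schema evolution point concrete, here is a minimal Python sketch of one way a table format can rename and add columns without touching data files: the table metadata maps stable column IDs to current names, and older files are read through that mapping. This is a toy model, not Iceberg's actual implementation or API; the `Field`, `TableSchema` and `read_row` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    field_id: int  # stable ID that never changes once assigned
    name: str      # current column name; free to change in metadata


@dataclass
class TableSchema:
    fields: list
    next_id: int

    def add_column(self, name):
        # Metadata-only change: old data files simply lack this ID.
        self.fields.append(Field(self.next_id, name))
        self.next_id += 1

    def rename_column(self, old, new):
        # Metadata-only change: the field ID stays the same.
        for f in self.fields:
            if f.name == old:
                f.name = new
                return
        raise KeyError(old)


def read_row(schema, stored_row):
    # stored_row is keyed by field ID, as written when the file was created.
    # Columns added after the file was written come back as None.
    return {f.name: stored_row.get(f.field_id) for f in schema.fields}


schema = TableSchema([Field(1, "id"), Field(2, "amount")], next_id=3)
old_file_row = {1: 42, 2: 9.5}  # written before any schema change

schema.rename_column("amount", "total")
schema.add_column("currency")

result = read_row(schema, old_file_row)
print(result)  # {'id': 42, 'total': 9.5, 'currency': None}
```

The old row is never rewritten; only the schema in the metadata changes, which is why this kind of evolution can be cheap even on very large tables.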
Why has Iceberg become the default table format instead of Delta Lake or Hudi, or the three being more equal competitors?
Farmer: To be fair, Delta still has a very large installed base. It's the native default format in Microsoft Fabric, for example. It's not as if Delta has gone away; it's just not the open industry standard.
Part of this is governance. Iceberg is part of the Apache Software Foundation, while Delta came from Databricks -- and even though it's open source, you still get the impression that there's single-vendor control. I think that's an issue -- people are very averse to being locked into any technologies. Also, if you look at the catalog layer, Iceberg has a REST catalog API, while Delta is cataloged through Databricks's Unity Catalog, which is powerful but still vendor-specific.
As for Hudi, it's really good in certain scenarios, like high-volume, high-frequency streaming or change data capture. It's got really good record-level indexing that enables that, and a merge-on-read system that enables it to keep data current with very high performance for streaming scenarios.
That sounds like a mixed story, but it really isn't. Iceberg is the standard. It's dominant.
Apache XTable is an incubating technology that supports interoperability between the three table formats. Delta Lake also has a Universal Format, or UniForm, feature that lets Iceberg and Hudi clients read its tables. Do you expect to see many mixed environments with the different table formats?
Farmer: I expect we are going to see convergence. Version 3 of Iceberg has done a pretty good job of unifying the data layer. It's got features like row lineage, for example, and the Variant data type, which is important for storing semistructured data. And Databricks has said pretty clearly that eventually, Delta and Iceberg will just use the same metadata and share tables.
Yeah, there are some bridges like UniForm and XTable, which has good backing -- Microsoft and Google are backing it. But I think the pattern [of sharing metadata and tables] is probably the way things are going, rather than having a lasting split between the three different formats with integrations between them. Iceberg becomes the lingua franca of data.
Does Iceberg have any limitations or challenges that data teams looking to deploy it should be aware of? Migration costs, for example?
Farmer: There is a migration cost, and if you are doing it at scale -- migrating petabytes, which is absolutely possible nowadays -- that's a huge migration. And there are some things that people get stuck on -- issues with partitioning and file paths, for example, that make it messy to migrate. But even without that messiness, it can just be a big job.
I think maintenance is a bigger issue, though. The way Iceberg works is that it writes new metadata and new data files for every change. That creates the potential for performance deterioration over time. You end up with something a little bit like the old days when you had to defragment your disk drive, if you remember doing that. You have this proliferation of very small files, and you need to schedule maintenance of that. You've got to do file compaction, and you have to do expiration of data snapshots that have been taken. You need to clean all that out.
As a result, you have to build maintenance into your operational plan for Iceberg. Too many people discover that only after the system has started to degrade. No doubt this will get fixed over time. But right now, there is a maintenance cost for Iceberg that there is not for Delta. That, I think, is one of the reasons we haven't seen a wholesale migration from Delta to Iceberg.
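The two maintenance tasks Farmer describes, file compaction and snapshot expiration, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not how any particular engine does it: the function names, the 128 MB target file size and the seven-day retention window are all hypothetical choices, and real deployments would normally rely on the maintenance routines provided by their query engine or catalog.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into batches that would each be rewritten as one file."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if size >= target_mb:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop such groups.
    return [g for g in groups if len(g) > 1]


def expire_snapshots(snapshots, max_age_days=7):
    """Split (snapshot_id, age_days) pairs into kept and expired IDs.

    The newest snapshot is always kept so the table stays readable.
    """
    newest_id = min(snapshots, key=lambda s: s[1])[0]
    keep = [sid for sid, age in snapshots
            if age <= max_age_days or sid == newest_id]
    expired = [sid for sid, _ in snapshots if sid not in keep]
    return keep, expired


groups = plan_compaction([4, 8, 16, 100, 200, 60])
print(groups)  # [[4, 8, 16, 60]]

keep, expired = expire_snapshots([(1, 30), (2, 10), (3, 1)])
print(keep, expired)  # [3] [1, 2]
```

Running jobs like these on a schedule, rather than after queries have already slowed down, is the kind of operational planning the answer above calls for.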
Any additional advice or best practices for data teams on deploying and managing Iceberg?
Farmer: My advice to people who ask me about these formats is that it's not just one decision. There's the choice of the table format, there's the catalog, there's a query engine, there's the governance and maintenance you have to do. Iceberg is a pretty straightforward choice now. But the catalog, in particular, is a sticky decision that you need to get right.
Iceberg's REST catalog spec is very portable and has a great API. You could use Apache Polaris, which is an open source catalog purpose-built for Iceberg. But you have other options, which could include Unity Catalog, Snowflake Horizon Catalog, Dremio Open Catalog, AWS Glue, Hive Metastore and other catalogs. It's a really important decision. The choice of a catalog is going to define how Iceberg integrates into your enterprise environment.
Craig Stedman is an industry editor at TechTarget who edits and writes articles on data technologies and processes. He has covered enterprise IT for more than 40 years.