Komprise adds Iceberg for unstructured data management

Data storage vendor Komprise integrates Apache Iceberg to help enterprises analyze unstructured data without incurring data movement costs, addressing AI readiness.

Data storage company Komprise has added Apache Iceberg support to its data management platform, enabling data scientists and engineers to analyze unstructured data without moving it from existing storage systems.

The update addresses a growing challenge for enterprise IT leaders: making vast swaths of unstructured data accessible for AI and analytics initiatives without increasing the costs associated with moving massive amounts of data.

"It is very hard to figure out what unstructured data is about," said Krishna Subramanian, Komprise co-founder and chief operating officer. "And it's also too big to move, because if you actually took petabytes of unstructured data and tried to copy it into a lakehouse, it's going to take you months to go do that. And by the time you do it, the data has changed."

The platform update builds on the company's technology base that helps storage administrators manage their resources. It compiles metadata about what is being stored in users' object and file-based storage repositories, allowing admins to make smart decisions and automate processes to ensure data lands in the most cost-effective locations.

The company's new Komprise Transparent File Tables (TFT), built on Apache Iceberg, exposes a structured view of unstructured data, so it can be ingested by AI, business intelligence and analytics platforms such as Snowflake and Databricks.

Overall, unstructured data accounts for over 80% of an enterprise's data footprint, IDC has estimated. Komprise itself has estimated from customer data that approximately 70% of this data has not been accessed for over a year. Once Komprise identifies this older data, it can be moved to more cost-effective storage media.

For Subramanian, the move into data analytics made sense. The company's software is already deployed across many sectors, especially in data-intensive markets in healthcare, financial services, media and energy.

The software, she notes, indexes all the unstructured data within the enterprise, including all data amassed in NFS, SMB/CIFS and S3/Object stores, located both on-premises and in the cloud. The index includes a wealth of information about each file, including its privacy permissions. 

Transparent tables open more data to analysis

The tables are populated with Komprise's contextually tagged metadata and pointers to the original data, making them accessible to any Iceberg-based query engine. In effect, they act as a direct conduit, allowing customers to feed their dark, unstructured data directly into existing data lakehouses.

"We've indexed all the unstructured data. We created a table, a tabular representation of unstructured data as an Iceberg table. Now, any data engineer can build a dashboard to query against this table," Subramanian said.
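To make the idea concrete, here is a minimal sketch of a table that holds metadata and pointers rather than the files themselves. It is an illustration only: the article does not describe the actual TFT schema, so the table name `file_index`, its columns and the sample paths are all hypothetical, and Python's built-in SQLite stands in for an Iceberg-compatible query engine.

```python
import sqlite3

# Hypothetical schema: the real TFT columns are not described in the article.
# SQLite is only a stand-in for an Iceberg-compatible query engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_index ("
    "pointer TEXT, size_bytes INTEGER, owner TEXT, tag TEXT)"
)

# Each row describes one file and points to where it already lives.
# The paths below are invented sample values.
rows = [
    ("nfs://lab-filer/runs/assay_001.raw", 4_000_000_000, "lab", "instrument"),
    ("s3://media-archive/promo_2019.mov", 12_000_000_000, "media", "video"),
    ("smb://finance/q3_report.pdf", 2_000_000, "finance", "report"),
]
conn.executemany("INSERT INTO file_index VALUES (?, ?, ?, ?)", rows)


def find_pointers(tag):
    """Return the storage locations of all files carrying the given tag."""
    cur = conn.execute(
        "SELECT pointer FROM file_index WHERE tag = ? ORDER BY pointer",
        (tag,),
    )
    return [row[0] for row in cur.fetchall()]


print(find_pointers("instrument"))
```

The query touches only the metadata table. The files stay where they are, and the result is a list of pointers that a downstream tool could follow.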

Apache Iceberg, an open source project, provides a vendor-neutral table format for organizing large analytic data sets, which are typically stored in columnar files such as Apache Parquet.

When a user issues a query in Komprise's setup, only the needed data is retrieved from the source. That substantially reduces data transfer costs compared with the usual, and often expensive, approach of moving entire data sets into Iceberg tables.
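The rough arithmetic behind that claim can be sketched as follows. The file sizes, tags and the `FileEntry` structure are invented for illustration and do not reflect Komprise's implementation; the point is only that a query matching a subset of files moves a fraction of the bytes a full copy would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One hypothetical index row: where a file lives, its size and a tag."""
    pointer: str
    size_bytes: int
    tag: str


def bytes_to_transfer(index, wanted_tag):
    """Bytes moved when only files matching the query's tag are fetched."""
    return sum(e.size_bytes for e in index if e.tag == wanted_tag)


def total_bytes(index):
    """Bytes moved when the entire data set is copied up front."""
    return sum(e.size_bytes for e in index)


# Invented sample data, not real measurements.
INDEX = [
    FileEntry("nfs://lab-filer/runs/assay_001.raw", 4_000_000_000, "instrument"),
    FileEntry("nfs://lab-filer/runs/assay_002.raw", 1_000_000_000, "instrument"),
    FileEntry("s3://media-archive/promo_2019.mov", 12_000_000_000, "video"),
]

selective = bytes_to_transfer(INDEX, "instrument")
full = total_bytes(INDEX)
print(f"selective fetch: {selective} bytes, full copy: {full} bytes")
```

In this toy example, the selective fetch moves 5 GB, while copying everything would move 17 GB.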

In addition to the metadata that Komprise captures automatically, an organization can augment the file data with subject-oriented tags that can also be queried. Komprise AI Preparation and Process Automation (KAPPA) data services and Komprise Smart Data Workflows can both help in this task.

Historically, Komprise metadata was accessible only to the storage or system administrator, who used it to make storage allocation decisions and write automation scripts.

"Nobody comes and tells the storage guy, 'Hey, I'm adding 100 terabytes of data.' Storage people have to understand what's happening with the data. We give them analytics on what's happening to the data. How fast is it growing? Who's using it? What is it? Where is it? How much is it costing you?" Subramanian said.

The TFT, however, opens this set of metadata up to a wider audience of data scientists and analysts, according to the company.

Enterprise use cases

By focusing on unstructured data, Komprise could potentially unlock significant value, Subramanian said. Komprise's approach enables organizations to extract insights from unstructured data repositories without the time and cost penalties associated with traditional data migrations.

A pharmaceutical company analyst could, for example, create dashboards in Snowflake or Databricks to monitor instrument and lab activity, enriching the data with output from queries against financial data.

Working as part of a content generation pipeline, a media AI agent could use the data to pinpoint which content could be repurposed, and then reformatted, to address a current topic.

A data governance officer could generate a single view of all sensitive data in an organization and how it flows across systems, making trouble spots easy to identify and remediate. 

According to Komprise, it also has technology to copy the needed files at twice standard data transfer speeds, so the user feels no extra latency when executing queries. This also reduces the computational workload on the back-end storage system, which would otherwise have to field all the queries itself.

Storage expands into data management

Komprise is one of several storage management companies expanding their portfolios and expertise to serve the AI-driven analytics market. Competitors often cite zero data movement as a way to cut data transfer and ingestion costs.

Last week, Everpure (formerly Pure Storage) debuted Data Stream, which prepares data for AI workloads, while keeping that data at the primary source of fast SSD and flash arrays. NetApp's intelligent data infrastructure provides a roadmap for bridging data residing on-premises to cloud AI services.

Komprise's technology is available in early preview, with a full release expected by the end of the year.  

Freelance news writer Joab Jackson has been writing about back-end IT technologies for the past three decades. His grandfather programmed mainframes, and his father wrote computer games for hobbyist programming magazines in the 1980s.
