Unstructured data needed, but often untapped, for agentic AI

AI development initiatives hinge on the quality and completeness of the underlying data, but research from BARC shows that many organizations struggle to operationalize key data.

Unstructured data is critical in the age of AI.

However, just as it was when most enterprise data initiatives focused on building analytics tools such as reports and dashboards, unstructured data remains underutilized now that organizations are building agents and other AI applications capable of autonomously generating insights and executing business processes.

BARC U.S. analyst Kevin Petrie

While structured data such as financial records and point-of-sale transactions provides key information and is critical when building analytics and AI tools, unstructured data such as text in documents and emails and audio from customer interactions adds vital context that structured data can't provide.

Autonomous AI applications require vast amounts of high-quality, contextually relevant data to deliver trustworthy outputs. Without enough data, agentic AI tools won't understand an organization's unique characteristics and will make up responses. Often, those made-up outputs are so bizarre that they're easy for humans to dismiss. Sometimes, however, they are plausible enough to fool people, which can lead to misinformation that causes a business significant harm.

It's estimated that unstructured data now makes up as much as 90% of all data. Unstructured data, therefore, can be the difference between an agent that doesn't have enough context to properly perform and never makes it beyond the pilot stage, and an agent that generates substantial business value for an organization.

According to a report from research and advisory firm BARC titled "Harnessing Unstructured Data for AI Innovation: Problems, Practices, and Principles for Success," nearly three-quarters of organizations report that less than 50% of their unstructured data is discoverable and can be used to inform decisions.

That means that even as investments in AI development continue to surge and organizations deploying agents have potential competitive advantages, most enterprises are still unprepared for building agents.

In addition to finding that many organizations lack systems that enable them to discover their unstructured data for AI, the report, authored by BARC analysts Kevin Petrie and Merv Adrian and co-sponsored by Datahub and Ohalo, found that data quality problems such as inaccuracy and inconsistency hinder the operationalization of unstructured data.

In a recent interview, Petrie discussed the report, including the importance of unstructured data in AI development and why organizations still struggle to make appropriate use of it. In addition, he spoke about the consequences of failing to operationalize unstructured data for AI, the benefits of successfully using unstructured data to inform AI, and how organizations can go about tapping into their unstructured data.

Editor's note: This Q&A has been edited for clarity and conciseness.

Why is unstructured data important for agents -- what is it about agents and other AI applications that makes structured data not enough?

Kevin Petrie: Structured data -- in other words, tables -- remain the top input for AI models overall because they contain the most easily verifiable facts. However, unstructured objects such as documents, images, emails and so on offer rich context that AI agents cannot get from tables. Unstructured data describes user intentions, stakeholder behavior patterns, company processes, customer sentiment, corporate values and myriad other factors that influence the decisions and actions of a business manager each day. Without this rich context, AI agents will be unable to reason like humans. They will fail to generate trustworthy outputs, take safe actions and deliver business value outside of a narrow range of use cases.

This long tail of unstructured data represents the beating heart and conscience of a business. You have to capture and make sense of it to differentiate yourself and create true competitive advantage with agentic AI.

Why do many organizations struggle to harness their unstructured data and derive value from it -- what are the problems that prevent them from tapping into unstructured data as a resource?

Petrie: Unstructured data has piled up in siloed and far-flung systems for years. We found that 52% of organizations have unstructured data in on-premises or hybrid database environments, and another 16% have it sitting in multiple cloud database platforms. Given this, 70% of organizations say that less than half of their unstructured data is discoverable and usable.

Other obstacles include skill gaps, privacy requirements and immature data quality controls. Most data teams have focused their initial agentic AI projects on operational tables and a selection of trusted documents. That's a safe way to start. But over time they clearly need to cast a wider net to address more sophisticated, value-generating use cases.

As agentic AI becomes more ubiquitous, what are the competitive consequences of not tapping into unstructured data?

Petrie: The reality is that large language models themselves do not create competitive advantage. Any enterprise can subscribe to Anthropic Claude and use it to make their knowledge workers more productive -- in fact, that's now table stakes to survive. But to truly differentiate your enterprise in the modern era, you need to integrate smart multimodal agents into your proprietary business processes. That requires the context that you can only get from the unstructured data sitting behind your firewall.

If you cannot harness that unstructured data and tap that value, your agentic AI initiative will fail to differentiate your organization. You will be limited to lower-value use cases that your competitors will match.

Conversely, as more organizations move agents into production, what are the competitive benefits of operationalizing unstructured data as a contextual source for agents?

Petrie: That's where things get interesting. If you can classify, validate, and derive meaning from your customer service records, you can start to have agents prioritize and escalate complaints on a real-time basis. If a hospital chain or pharmaceutical company can analyze more doctors' notes in less time, its caregivers might identify new methods of improving patient treatments.

[Benefits include] use cases that can help improve customer satisfaction, reduce costs and increase revenue.

For organizations still working to manage and operationalize unstructured data for AI, what is a blueprint -- what are the steps they need to take and the ideal technology stack they need to put together to build a system that organizes and prepares unstructured data for AI?

Petrie: Merv Adrian and I recommend some specific steps for data and AI leaders to harness their unstructured data for AI projects.

First, they must find, prioritize and catalog all this stuff. The more they can organize critical metadata for mission-critical documents and text records that humans consume on a regular basis, the better they can feed agents the necessary context to add value in business processes. Amazingly, only 38% of survey respondents have cataloged their unstructured data for AI.

Second, they must extend their data governance programs to address these unstructured objects, with smart human oversight. We found especially concerning gaps in data bias and lineage for unstructured data that will create agent chaos if left unaddressed. Only half of organizations have bias controls in place, and less than half trace lineage.

Third, and perhaps most importantly, we recommend that data teams create an independent semantic layer that can query and make sense of data wherever it lives. This is required because migration complexity, data gravity, and sovereignty concerns make full consolidation a non-starter for most AI adopters.
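
As a rough illustration of the steps Petrie describes, the Python sketch below shows one way a team might catalog unstructured objects in place, record basic lineage and measure how much of the catalog is discoverable. It is a minimal sketch under stated assumptions, not a method from the BARC report: the `CatalogEntry` fields, the discoverability rule and the example URIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogEntry:
    """Metadata for one unstructured object (all field names are hypothetical)."""
    uri: str                             # where the object lives; it is not moved
    kind: str                            # e.g. "email", "pdf", "call_audio"
    owner: Optional[str] = None          # accountable team, if known
    source_system: Optional[str] = None  # lineage: the system it came from
    tags: list[str] = field(default_factory=list)

    def is_discoverable(self) -> bool:
        # Assumed rule: an object counts as discoverable only if it has
        # an owner, a known source system and at least one tag.
        return bool(self.owner and self.source_system and self.tags)


def discoverable_share(entries: list[CatalogEntry]) -> float:
    """Fraction of cataloged objects that pass the discoverability rule."""
    if not entries:
        return 0.0
    return sum(e.is_discoverable() for e in entries) / len(entries)


def find(entries: list[CatalogEntry], tag: str) -> list[str]:
    """Return URIs of discoverable objects with the given tag, wherever they live."""
    return [e.uri for e in entries if e.is_discoverable() and tag in e.tags]


# Hypothetical catalog spanning cloud storage, a file share and a mail server.
catalog = [
    CatalogEntry("s3://support/tickets/1001.txt", "ticket", "support", "helpdesk", ["complaint"]),
    CatalogEntry("file:///shares/legal/contract.pdf", "pdf"),
    CatalogEntry("imap://mail/inbox/42", "email", "sales", "exchange", ["complaint", "renewal"]),
    CatalogEntry("s3://calls/2024/rec.wav", "call_audio", "support"),
]

print(f"Discoverable share: {discoverable_share(catalog):.0%}")
print(find(catalog, "complaint"))
```

Because each entry stores only a URI and metadata, the lookup works across storage locations without consolidating the underlying files, which loosely mirrors the idea of querying data wherever it lives. A real semantic layer would involve far more than tag matching.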

With so many organizations -- 70% with less than half their unstructured data AI-ready -- still struggling to overcome the barriers that hold them back, how long do you think it will take for that number to reverse itself and most organizations to have AI-ready unstructured data?

Petrie: That's a great question because it's hard to predict major shifts like this. I don't believe enterprises should try to reach 100% readiness, because inevitably some portion of that unstructured data will not add value for AI initiatives. But given the huge focus on context engineering, I expect that most companies will have discovered and classified most of their unstructured data within the next 24 months.

What is the state of AI-readiness when it comes to structured data -- are many organizations struggling to even manage their structured data for AI, or do they have a much better handle on that than they do their unstructured data?

Petrie: While our survey did not investigate this, it's fair to say that structured data is overall much more ready. AI teams tend to use structured data first because it is cleaner, more organized and more accessible than any other data type. Database tables are the lifeblood of any organization, driving business functions such as finance, sales, operations, and so on. While data quality issues continue to plague most database environments, unstructured files pose a bigger challenge and require more preparation.

Eric Avidon is a senior news writer for Informa TechTarget and a journalist with more than three decades of experience. He covers analytics and data management.
