
AI data governance guidance that gets you to the finish line

As organizations dive into AI adoption, many realize the first real bottleneck is not the model but how to prepare their information so it can be used effectively in AI workflows.

Enterprises are rapidly adopting AI to unlock its potential, but many fail to address a key area: preparing vast, fragmented data sets so models can use them effectively.

AI efforts often slow to a crawl -- or fail entirely -- because teams quickly discover how hard it is to turn raw data into something leaders and systems trust. The real work isn't the modeling but finding the right data, cleaning and governing it, and enforcing standards to keep it consistent and reusable. Enterprises that sustain momentum do so by continuously monitoring, refining and validating their data. Those practices build the hard-won trust AI needs to produce accurate, relevant results that push projects from experimentation to production.

"Prior to the arrival of AI, corporate decision making was centered around the trustworthiness of your existing data, and most people did not [trust their data]," said Stephen Catanzano, an analyst at Omdia, a division of Informa TechTarget. "And our current research shows most people still don't fully trust their data. So, the question remains: can I give my data to an AI agent and have that agent make decisions for my company, like changing processes? Well, you can't. The definition of AI-ready data starts and ends with trust."

Why AI data readiness matters now

This trust factor underscores that AI can't deliver meaningful business value until enterprises address their longstanding gaps in data quality and management.

AI data readiness is increasingly recognized as the foundation of successful corporate AI initiatives. Analysts highlight its strategic importance: Gartner forecasts that 60% of AI projects will be abandoned by the end of 2026 due to inadequate data management, and by 2027, the failure rate for GenAI projects is expected to climb to 80%, driven by deficiencies in data quality, governance and trust.

According to Gartner, siloed data that prevents AI from seeing across multiple CRM, ERP and regulatory systems is a common barrier. Ungoverned data introduces compliance risks and can expose mission-critical data.

There's a growing consensus that scalable AI architectures rely on consistent standards to ensure data accuracy, accessibility and compliance. While no single governance model dominates, ISO/IEC 42001 -- the international standard for AI management systems -- offers structured guidance for responsible AI development and oversight. Enterprises often pair it with semantic frameworks, such as Resource Description Framework (RDF) and the Web Ontology Language (OWL). Combined, these approaches strengthen AI data governance, encourage ethical data practices and support scalability.

Making data trustworthy takes work

For many organizations, simply identifying what data they have and where it lives remains the biggest obstacle before they can start the refining process.

"It's all well and good to get your data AI ready, but if you don't know where your data resides, it's really hard to do that," said Jack Gold, principal analyst with J. Gold Associates. "Companies have isolated or siloed data stashed all over the place. Unfortunately, a lot of companies are still in the middle of that process. If the majority of users were actually prioritizing this, these data lake companies would be worth trillions."

Once organizations know what data they have, governance becomes the next hurdle.

"AI systems do not just use data -- they learn from it, and that makes governance critical," Catanzano said. "Poorly governed data leads to biased, insecure, and/or noncompliant AI."

Governance provides lineage visibility, allowing teams to trace how data moves and changes across systems. It also enforces access controls to limit exposure of sensitive information and helps organizations meet regulations, such as HIPAA, GDPR and the EU AI Act.

"Adding in lineage and observability tools is becoming really important," Catanzano said. "They allow you to actually see the data and look for governance challenges, along with being able to map out compliance requirements for data-specific challenges. It has taken people a while to figure out the importance of these tools and how to achieve it. Frankly, most users haven't got to this stage yet."

How embeddings make AI useful

The next step is making that data relevant at scale. AI performs better when it understands the data's context. Embeddings turn words, images and logs into vectors, so systems retrieve the right content rather than guessing.

Paired with strong metadata, embeddings help AI return the right information at the right time. AI is shifting toward metadata-rich, vector-based retrieval, Catanzano said.

"We have been moving towards higher levels of metadata and larger amounts of the vectorization of data," Catanzano noted. "So now, users want to use more AI because vectors create relevancy, which means AI can find the most relevant data based on vector scores and so improve the quality of data being searched for."

Vectorization is becoming a baseline requirement for effective AI implementation, Catanzano noted. Converting unstructured data into embeddings and combining them with high-fidelity metadata optimizes retrieval precision and increases confidence in the accuracy of the information delivered to the end user.
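The retrieval pattern Catanzano describes can be sketched in a few lines: embed documents and a query as vectors, then rank documents by cosine similarity so the most relevant one surfaces first. The three-dimensional vectors and document names below are toy stand-ins for illustration only; a production embedding model produces vectors with hundreds of dimensions, and each would carry metadata (source, owner, lineage) alongside it.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings -- in practice these come from an embedding model.
documents = {
    "q3_sales_report": [0.9, 0.1, 0.2],
    "hr_onboarding_guide": [0.1, 0.8, 0.3],
    "supplier_contract": [0.2, 0.3, 0.9],
}

# e.g. the embedded question "How did sales do last quarter?"
query_vector = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query -- the "vector scores"
# Catanzano refers to above.
ranked = sorted(documents.items(),
                key=lambda item: cosine_similarity(query_vector, item[1]),
                reverse=True)
print(ranked[0][0])  # q3_sales_report scores highest
```

The same ranking step is what a vector database performs at scale, typically with approximate nearest-neighbor indexes rather than a full sort.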

How tokenization improves performance

With retrieval grounded in embeddings and metadata, the next lever on performance is how text is prepared for the model. Using enterprise data effectively in AI workflows often requires converting it into formats language models can understand.

Tokenization is a key part of the pipeline, but it's only the first step. Once tokenized, the model applies learned patterns and relationships to analyze content, generate responses or make predictions. Efficient tokenization reduces the number of tokens the system must handle, improving response times and lowering compute and inference costs in the production environment.
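The cost arithmetic behind that last point can be made concrete with a deliberately naive sketch. Real models use subword tokenizers (byte-pair encoding and similar schemes), not whitespace splitting, and the per-1,000-token price below is invented purely for illustration; the point is only that fewer tokens means lower cost per request.

```python
# Naive illustration of why token counts matter for inference cost.
# Real tokenizers are subword-based; this one splits on whitespace,
# and the price constant is hypothetical.

def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

PRICE_PER_1K_TOKENS = 0.002  # invented rate, for arithmetic only

verbose = ("Please be advised that, pursuant to the attached report, "
           "sales for the third quarter were up ten percent.")
concise = "Q3 sales were up 10%."

for label, text in [("verbose", verbose), ("concise", concise)]:
    tokens = count_tokens(text)
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{label}: {tokens} tokens, ${cost:.6f}")
```

Multiplied across millions of production requests, the gap between the two phrasings is what "lowering compute and inference costs" refers to in practice.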

In modern AI workflows, organizations often convert documents and other unstructured data into vector embeddings. This change makes the information available for wider use, enabling more precise insights tied to specific business needs.

"Developers and users have to transform their data located, for instance, in a database, into a [format] that can travel across platforms," said Frank Dzubeck, president of Communications Network Architects. "Companies, in the pharmaceutical industry, for instance, are doing that now. It changes the way they can look at data because they can create [embeddings] that specifically address problems they are researching in their industry."

What are the building blocks of AI-ready data?

Beyond AI data governance requirements, enterprises achieve the best results when embeddings rest on a strong data layer consisting of several key elements.

  • Standardized structures. Common formats, such as CSV and JSON, help keep data portable, but their real value is applying consistency to the information they hold.
  • Smart labeling. Tagging and annotating data ensure AI models can interpret the raw values and the intended meaning behind them.
  • A shared language. Semantic frameworks, such as RDF and Shapes Constraint Language (SHACL), act as a translator, giving data sets a common structure that promotes interoperability.
  • Deep context. Logic tools, such as OWL, supply the context and meaning needed to form relationships across data sets.
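The semantic-layer idea behind those last two elements can be sketched with plain tuples: RDF models everything as (subject, predicate, object) triples, so two systems that share identifiers can be queried as one graph. Every identifier below is invented for illustration; a real deployment would use an RDF store with OWL vocabularies and validate the shapes with SHACL rather than raw Python tuples.

```python
# Triples in the RDF spirit: (subject, predicate, object).
# All identifiers are hypothetical examples.
crm_triples = [
    ("customer:42", "hasName", "Acme Corp"),
    ("customer:42", "locatedIn", "region:emea"),
]
erp_triples = [
    ("order:981", "placedBy", "customer:42"),
    ("order:981", "totalAmount", "12000"),
]

# Because both systems use the same identifier for the customer,
# their triples merge into one traversable graph.
graph = crm_triples + erp_triples

def objects(graph, subject, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Cross-system question: which region did order 981 come from?
customer = objects(graph, "order:981", "placedBy")[0]
region = objects(graph, customer, "locatedIn")[0]
print(region)  # region:emea
```

That two-hop traversal, from an ERP order to a CRM customer to a region, is exactly the kind of cross-silo relationship a shared semantic structure makes possible.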

What to ask vendors before you buy

Don't be swayed by a polished demo. Press for specifics about how the system handles your data today and what could break tomorrow.

When evaluating vendors, anchor your questions in your governance model and data flows. Start with data use and ownership. How is proprietary data isolated? Will your content be used to train shared models, or will it only serve your tenants? What are the defaults for retention, deletion and cross-tenant safeguards?

Models trained on your organization's data tend to generate more accurate results that are less susceptible to biases inherent in internet-based data. Ask how the tuning is done and how quality is measured over time.

It's also critical to understand the data preparation workflow and the tools involved. What formats are supported? How is the data transformed? What lineage, logging and rollback is available? What are the commitments for backward compatibility as the platform evolves?

Without these answers, organizations risk getting locked into proprietary formats that might not work in later versions.

"They need to actually show users their various transformation tools for databases, searching and other functions because they are all different," said Dzubeck. "As for future proofing, that's a tough question for users to get an answer to. They are only going to give you answers that directly link to their products and strategies like Google, which is focused on search, and large database makers like Oracle."

Ed Scannell is a freelance writer and journalist based in Needham, Mass. He reports on a wide range of technologies and issues related to corporate IT. He can be reached at [email protected].
