How to get data ready for AI development
A commitment to data quality, the proper technology and pertinent processes are key to preparing data to feed AI models and applications.
As enterprise interest in developing AI tools rises, making sure data is ready to train models and applications is a critical first step.
Getting data ready to inform AI tools, however, is not simple. It takes a combination of committing to data quality, using proper tools and implementing appropriate organizational processes, according to a panel of experts speaking during Impact 2024, a virtual conference hosted by data observability specialist Monte Carlo in mid-November.
If organizations don't properly prepare their data before using it to train AI models and applications, consequences could be significant, including financial loss, violations of regulatory requirements and substantial embarrassment.
For example, in February 2024, Air Canada was forced to compensate a customer who was misled by Air Canada's AI chatbot into paying full price for a ticket after the death of their grandmother when a bereavement rate was available.
"Garbage in leads to garbage out. So you need to [prioritize] what's happening to your model and what's coming out of your model to make sure it stays healthy," said Casey Maskalenko, senior lead data engineer at DraftKings.
In addition to Maskalenko, the panel featured Stefanie Tignor, head of product, data science and engineering at Grammarly, and Sri Subramanian, director of data engineering at SurveyMonkey.
Each discussed how their companies are using AI. Beginning with simply defining AI-ready data, they outlined a basic strategy for properly preparing data for AI tools.
Defining AI-ready data
At its most basic, AI-ready data is data that will produce the most accurate outputs possible.
AI is not infallible, even when trained with high-quality data. AI applications -- generative AI in particular -- make mistakes called hallucinations. Sometimes they're ridiculous, even offensive, and they are easy to spot. Other times, however, they seem plausible; if not checked carefully, they can lead to misinformed decisions.
While not a perfect solution, the most obvious way to reduce the likelihood of AI hallucinations is to train models and applications with high volumes of properly prepared data. The more high-quality data used to train an AI tool, the more accurate and less prone to hallucinations it will be.
Establishing and understanding what constitutes high-quality data is, therefore, a significant part of making sure data is ready for AI, according to Tignor.
"AI-ready data are reliable. They're high-quality. They're trustworthy," she said. "We need all that for data to be AI-ready."
However, determining the trustworthiness and reliability of data doesn't end when it's used to train a model, Tignor continued. Instead, organizations should establish metrics to determine the accuracy and effectiveness of AI tools after they've been developed. By looking at outputs and ensuring they meet certain standards, developers can gain insight into the underlying data that feeds AI tools.
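As a rough illustration of what an output metric might look like, the following Python sketch tracks how often users accept an AI tool's output and checks that rate against a minimum standard. The `OutputRecord` type, the `accepted` signal and the 0.8 threshold are all hypothetical, chosen only for the example; they were not described by the panel.

```python
from dataclasses import dataclass


@dataclass
class OutputRecord:
    output_id: str
    accepted: bool  # hypothetical signal: did the user accept the AI output?


def acceptance_rate(records):
    """Return the share of outputs that users accepted, or None if there are no records."""
    if not records:
        return None
    return sum(r.accepted for r in records) / len(records)


def meets_standard(records, minimum_rate=0.8):
    """Return True only if there is data and the acceptance rate reaches the minimum."""
    rate = acceptance_rate(records)
    return rate is not None and rate >= minimum_rate


sample = [
    OutputRecord("a", True),
    OutputRecord("b", True),
    OutputRecord("c", True),
    OutputRecord("d", False),
]
print(acceptance_rate(sample))  # 0.75
print(meets_standard(sample))   # False, because 0.75 is below the assumed 0.8 minimum
```

A falling acceptance rate does not identify the cause on its own, but it gives developers a reason to look back at the data feeding the tool.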
"People focus [on input data] a lot -- which, of course, is very important," Tignor said. "But you also need to spend a lot of time instrumenting effective metrics to understand whether AI outputs are good [and] useful [as well as] whether they're doing what you hoped they would do. It can be easy to ignore, but in my opinion, it's really important."
Subramanian similarly said reliability is the essence of AI-ready data.
Data remains isolated in many organizations. When separated from an organization's data governance framework or inconsistent across an enterprise's different domains, data is unreliable. But when consistent and properly governed, it can be trusted.
"You don't want data to be in silos," Subramanian said. "You want to focus on a single source of truth that is highly reliable, very structured and … well-governed in the sense that it can be easily accessed by people who are authorized. That is a good focus for making data truly AI-ready."
The role of technology
With data quality foundational for successful AI development, technology now plays a critical role in data preparation.
The advent of cloud data warehouses, an increasing emphasis on real-time analytics and now a growing demand for AI development have changed data quality monitoring. Before the cloud, organizations kept all their data on-premises, where it was overseen by teams of trained data experts.
Given the absence of near-universal connectivity, data sources were limited, but the data volume was manageable due to the finite number of sources.
Meanwhile, with analytics often constituting predictable weekly, monthly, quarterly and yearly reports, data teams had time to carefully develop the reports, including checking the underlying data informing reports for accuracy and other quality standards.
Now, however, data volume is rising exponentially each year, while data complexity, including unstructured data types such as text and audio files, is also increasing. In addition, predictable scheduled reporting is no longer sufficient for enterprises to remain competitive. They need real-time data to act and react instantly.
That combination of petabytes of data, various forms of data and the need for data to be ready for analysis at a moment's notice makes it impossible for even teams of humans to check for quality.
Instead, data quality initiatives need to be automated, whether using homegrown technology or tools from vendors. Data observability specialists such as Monte Carlo and Acceldata surface anomalies and data changes, while engineering platforms such as DBT Labs uncover errors during testing.
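To make the idea of automated anomaly detection concrete, here is a minimal Python sketch of a homegrown check that flags days when a table's row count strays far from its recent history. It is not how any vendor's product works; the function name, the seven-day window and the z-score threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_volume_anomalies(daily_row_counts, window=7, threshold=3.0):
    """Return the indexes of days whose row count deviates sharply from the trailing window."""
    anomalies = []
    for i in range(window, len(daily_row_counts)):
        history = daily_row_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as an anomaly.
            if daily_row_counts[i] != mu:
                anomalies.append(i)
            continue
        z_score = abs(daily_row_counts[i] - mu) / sigma
        if z_score > threshold:
            anomalies.append(i)
    return anomalies


counts = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 120, 1003]
print(find_volume_anomalies(counts))  # [8], the day the load dropped to 120 rows
```

A production check would also need to keep flagged days out of later windows, since one bad day inflates the spread that the following days are compared against.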
"The combination allows us to be proactive and reactive," Subramanian said.
In addition, AI itself is an increasingly effective means of monitoring data to make sure it's ready to inform analytics and AI tools, he continued.
For example, DBT Labs enables engineers to develop tests and documentation using generative AI, while Monte Carlo provides users with generative AI capabilities that let them use natural language rather than code to make fixes. Both are about saving engineers and other data experts significant time by taking on time-consuming tasks.
Eventually, however, AI will likely become more autonomous. Agents that can act independently will monitor data and applications for errors, anomalies and changes as well as take on some of the preparation and restoration work.
"AI-driven self-healing of data pipelines and AI-driven data cleansing and profiling are going to be extremely critical because as [enterprises] scale their needs will increase and they will need to keep up," Subramanian said. "You'll need to have something automatically doing stuff for you, and those will be AI agents profiling the data and cleaning the data."
People and processes
While technology provides the means for carrying out data preparation and quality initiatives, organizational policies are also a significant aspect of making sure data is ready to inform AI models and applications, according to Tignor.
"We need the right tooling and we need the right infrastructure, but so many data quality problems are culture problems and making sure we have the right people and processes in place," she said. "That's often overlooked, but it's so important and a critical ingredient. Without it, nothing really works."
One way to create a culture that values data quality is to assign ownership of each metric to someone on the data team, Tignor continued.
Assigning ownership to an individual, who is then responsible for overseeing a metric's performance and regularly reporting on it to the rest of the team, makes clear who is accountable for each aspect of data quality. In addition, it raises the stakes for those individuals.
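One way to picture metric ownership in practice is a simple lookup that sends each automated alert to the person responsible for that metric. The Python sketch below assumes a hypothetical `METRIC_OWNERS` table, made-up metric and owner names, and a fallback on-call address; none of these come from the panelists' systems.

```python
# Hypothetical ownership table: metric name -> person responsible for it.
METRIC_OWNERS = {
    "daily_active_users": "alice",
    "orders_null_rate": "bob",
}


def route_alert(metric_name, message, owners=METRIC_OWNERS, fallback="data-team-oncall"):
    """Build an alert addressed to the metric's owner, or to a fallback if nobody owns it."""
    return {
        "to": owners.get(metric_name, fallback),
        "metric": metric_name,
        "message": message,
        "unowned": metric_name not in owners,  # highlights gaps in ownership
    }


print(route_alert("orders_null_rate", "Null rate rose above 5%"))
print(route_alert("refund_total", "Value outside expected range"))
```

The `unowned` flag is a design choice in this sketch: it makes metrics without a steward visible rather than letting their alerts disappear into a shared inbox.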
"Even though we can automate a lot of the alerting, you still need someone who feels it's their responsibility to be the steward of the metric -- the 'understander-in-chief' of that metric -- because when alerts happen, they have to go somewhere," Tignor said. "That's the point of the alerts: that someone acts on them."
Maskalenko similarly highlighted the importance of ownership and communication.
"You need to have someone who understands and is responsible for [data quality metrics]," he said. "And to continue that, we focus on … having someone facilitate the conversation so that stakeholders can know that if 'X' happened, then 'Y' is going to happen, and we can either be fine with that or mediate it."
Beyond ownership, implementing a strong data governance framework is important, according to Subramanian.
Data governance frameworks are where enterprises control who can access what data, ensuring it doesn't fall into the wrong hands. They're where specifications can be set so that no data that fails to meet certain standards is used to inform analytics and AI tools. Definitions can be standardized so all of an organization's data is uniform, and data catalogs can be implemented to make datasets and tools easy to find.
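The two controls described above, access rules and quality standards, can be sketched in a few lines of Python. Everything here is hypothetical: the `Dataset` fields, the role names and the 5% null-rate limit are stand-ins for whatever an organization's governance framework actually specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    null_rate: float  # share of missing values, between 0.0 and 1.0
    allowed_roles: set = field(default_factory=set)


def can_access(dataset, role):
    """Access control: only roles listed on the dataset may use it."""
    return role in dataset.allowed_roles


def passes_quality_gate(dataset, max_null_rate=0.05):
    """Quality standard: reject datasets with too many missing values."""
    return dataset.null_rate <= max_null_rate


def datasets_for_training(datasets, role, max_null_rate=0.05):
    """Return the names of datasets this role may use that also meet the quality standard."""
    return [
        d.name
        for d in datasets
        if can_access(d, role) and passes_quality_gate(d, max_null_rate)
    ]


catalog = [
    Dataset("orders", 0.01, {"ml_engineer"}),
    Dataset("clickstream", 0.20, {"ml_engineer"}),
    Dataset("payroll", 0.00, {"finance"}),
]
print(datasets_for_training(catalog, "ml_engineer"))  # ['orders']
```

In this example, the clickstream data is excluded for quality reasons and the payroll data for access reasons, so only one dataset reaches the model.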
"Having a governance framework goes a long way," Subramanian said.
A final organizational must for ensuring data quality is that a human needs to have the final say on whether data is ready before being used to train AI models and applications, according to Tignor. Machines, including AI-powered tools, can make mistakes. It's up to people to catch them before they do damage.
"Folks get really excited about all this automation and AI, but sometimes that can lead to people going 100 miles per hour and taking all human processes out," Tignor said. "You need human evaluations, subject-matter expertise, and … critical thought. That's how you will improve your system and know whether it is executing on what you want it to do."
Looking ahead
As AI evolves, so must the technology used to prepare data for training models and applications and the teams overseeing the development, deployment and operationalization of AI tools.
Enterprise interest in AI tools has surged in the two years since OpenAI's November 2022 launch of ChatGPT marked a significant improvement in generative AI capabilities. As that interest grows, data teams will evolve to become enablers of AI, according to Subramanian.
"We will be driving all the AI initiatives, working to produce high-quality data for specific models and making sure there is an AI-ready infrastructure," he said.
An AI-ready infrastructure is one that can scale up as data volumes continue to grow, Subramanian continued, noting that SurveyMonkey's data workloads have increased fivefold over the past two years.
"I can only imagine five years from now how much the volume is going to be and how many datasets we are going to produce for all the different initiatives," he said. "It's important to scale the infrastructure."
Maskalenko, meanwhile, noted that regardless of how much data volume continues to increase and how many new AI initiatives an enterprise invests in, data teams will continue to be responsible for making sure the data used to inform AI tools is truly ready.
"We've all heard the phrase, 'Data is the new oil,'" he said. "I see data engineering teams fitting perfectly into that analogy. Data is powerful and it's available, but you need to refine it."
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.