Data observability specialist Monte Carlo on Wednesday unveiled new features designed to ensure data quality, including integrations with vector databases and Apache Kafka.
Data observability is the process of monitoring data as it progresses through the pipeline from ingestion through analysis to make sure the data used to inform decisions is accurate and up to date.
When organizations collected data from only a few sources and stored their data in on-premises databases, data observability was relatively simple. Now, however, organizations collect data from myriad sources, so the data itself can be widely disparate in structure and stored in multiple locations.
As a result, with data becoming much more difficult to monitor for quality, vendors such as San Francisco-based Monte Carlo and Acceldata now specialize in data observability.
In addition to the integrations, Monte Carlo introduced Performance Monitoring, a dashboard that customers can use to find inefficiencies in data pipelines, and Data Product Dashboard to enable users to track the reliability of data products, including AI and machine learning models.
Monte Carlo revealed the new features during Impact 2023, a virtual data observability conference hosted by the vendor.
The integrations are scheduled for general availability in early 2024, while the new data observability tools are available now.
Vector databases have gained importance in the year since OpenAI released ChatGPT, which marked a significant improvement in generative AI and large language model (LLM) technology.
Generative AI technology is now enabling organizations to develop their own LLMs, which require large amounts of data that vector databases can help organizations discover and combine.
Using starter code from ChatGPT, Google Bard, Azure OpenAI from Microsoft and other generative AI platforms, organizations are using their own data to develop and train models tailored to their own specific needs.
However, generative AI models need to be trained with huge amounts of data to deliver accurate responses.
Unlike traditional AI and ML models, generative AI models will deliver outputs whether or not they have the requisite information to respond accurately. When they have the right data, they will deliver an accurate output. But when they don't have the data to answer a user's question, they will fabricate an answer that may seem plausible enough to fool a user and lead to decisions and actions based on an AI hallucination.
Therefore, the more data used to train a model, the better the chance of accurate responses.
Vectors help organizations discover enough data to train generative AI models. Vectors, which are numerical representations of data, give structure to previously unstructured data such as text, audio files and videos. That previously unstructured data can then be combined with an organization's traditional structured data to inform models.
In addition, vectors enable similarity searches that make it easier for data engineers to discover all the data needed to train an LLM.
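The similarity searches described above typically rank stored vectors by how closely they point in the same direction as a query vector, often using cosine similarity. The following is a minimal, illustrative sketch in plain Python; the three-dimensional "embeddings" and document names are invented for the example, since real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (purely illustrative values).
documents = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.2, 0.8, 0.1],
    "return process": [0.8, 0.2, 0.1],
}

# Hypothetical embedding of the query "how do I return an item?"
query = [0.78, 0.2, 0.1]

# Rank documents by similarity to the query, as a vector database would.
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query, documents[d]),
                reverse=True)
print(ranked[0])  # the closest match: "return process"
```

A production vector database performs the same kind of ranking, but over millions of high-dimensional vectors using approximate nearest-neighbor indexes rather than a brute-force sort.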
Meanwhile, because of all the data needed to train an LLM, data quality remains imperative. If the training data is inaccurate, the model's outputs likely will be too. Only they won't be AI hallucinations; they'll be the result of bad data.
Now, through an integration with vector database vendor Pinecone, Monte Carlo is enabling customers to apply data observability capabilities to pipelines that include vector databases, a development that Kevin Petrie, an analyst at Eckerson Group, called significant.
Eckerson Group research shows that fewer than a quarter of data experts rate their data governance and data quality controls as sufficient for AI and machine learning initiatives, including generative AI, he noted.
"Monte Carlo is taking a significant step to help them ensure accuracy by observing the quality of text files as they are transformed into vectors and embedded into vector databases," Petrie said. "They also ensure the quality of database records that will complement generative AI in many of these initiatives."
In addition to the integration with vector databases, Monte Carlo will soon offer an integration with Apache Kafka, an open source platform that enables organizations to ingest data in real time.
The integration will allow joint Monte Carlo and Kafka users to apply Monte Carlo's data observability tools to Kafka pipelines to make sure the real-time data being used to update AI/ML models, including LLMs, is reliable.
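Observability checks on a streaming pipeline like Kafka commonly boil down to questions such as: is data still arriving on time (freshness), and has message volume dropped unexpectedly? The sketch below illustrates those two checks in plain Python over simulated values; the function names, thresholds and numbers are invented for illustration and do not represent Monte Carlo's actual product or API.

```python
import time

def check_freshness(last_event_ts, now, max_lag_seconds=300):
    """Pass if at least one event arrived within the allowed window."""
    return (now - last_event_ts) <= max_lag_seconds

def check_volume(counts, window=5, drop_ratio=0.5):
    """Pass unless the latest interval's message count falls below
    drop_ratio times the average of the preceding `window` intervals."""
    if len(counts) < window + 1:
        return True  # not enough history to judge
    baseline = sum(counts[-(window + 1):-1]) / window
    return counts[-1] >= drop_ratio * baseline

now = time.time()
print(check_freshness(now - 60, now))    # True: event 1 minute ago
print(check_freshness(now - 3600, now))  # False: last event an hour ago

# Per-minute message counts; the final interval drops sharply.
print(check_volume([100, 98, 102, 97, 101, 45]))  # False: anomaly
```

In a real deployment, the event timestamps and counts would come from the Kafka topic's consumer offsets and broker metrics rather than simulated values, and a failed check would trigger an alert to the pipeline's owners.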
Lior Gavish, Monte Carlo's founder and CTO, noted that Kafka is one of the most popular streaming data ingestion platforms. As organizations develop retrieval-augmented generation (RAG) pipelines that feed generative AI models, Kafka is becoming even more ubiquitous.
"It's almost the de facto standard for feeding data from various sources and feeding them into vector databases. So we felt the need was greater than ever to add observability on that," Gavish said. "It serves a lot of existing use cases, but generative AI was front and center when we decided to build this integration."
Customer demand, therefore, was the primary impetus for developing the integration with Kafka, he continued.
The integration with Pinecone, meanwhile, was more internally motivated, resulting from Monte Carlo's recognition of increasing demand for vector databases as organizations begin developing their own generative AI models.
"Vector databases are earlier in the adoption curve with customers," Gavish said. "They're starting to look at them. We're hearing that customers are building pipelines for generative AI and need to add observability into that. But the landscape there is early. It's not like everyone is using vector databases yet."
Beyond the integrations, Monte Carlo's two new data observability dashboards are designed to better help customers monitor the health of their data throughout its lifecycle, including its use in informing AI models and other data products.
Performance Monitoring is a dashboard that displays the performance of data and AI pipelines and is aimed at helping organizations monitor and control computing costs.
When pipelines run slower than usual, costs increase. Therefore, the sooner organizations detect and resolve inefficiencies, the better they can keep cloud computing costs under control.
Performance Monitoring enables users to run queries against specific directed acyclic graphs (DAGs) that show an organization's data models and the connections between them. It also surfaces individual models, warehouses, data sets and users so engineers can discover the root causes of inefficiencies and fix them.
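To see why a DAG view helps with root-cause analysis, consider how the slowest dependency chain feeding a data product can be computed from the graph. The sketch below is a simplified illustration, not Monte Carlo's implementation; the model names and runtimes are hypothetical.

```python
# Hypothetical pipeline DAG: each model maps to its upstream dependencies.
upstreams = {
    "stg_orders": [],
    "stg_users": [],
    "orders_enriched": ["stg_orders", "stg_users"],
    "revenue_dashboard": ["orders_enriched"],
}

# Observed runtime of each model in seconds (illustrative numbers).
runtime = {
    "stg_orders": 30,
    "stg_users": 600,   # this model is the bottleneck
    "orders_enriched": 45,
    "revenue_dashboard": 10,
}

def critical_path(model):
    """Return (total_seconds, path) for the slowest dependency chain
    ending at `model`, found by recursing through its upstreams."""
    best = (0, [])
    for dep in upstreams[model]:
        candidate = critical_path(dep)
        if candidate[0] > best[0]:
            best = candidate
    return (best[0] + runtime[model], best[1] + [model])

total, path = critical_path("revenue_dashboard")
print(total, path)  # 655 ['stg_users', 'orders_enriched', 'revenue_dashboard']
```

Walking the DAG this way immediately points to `stg_users` as the model to optimize first, which is the kind of insight a performance dashboard surfaces without manual graph traversal.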
Just as Performance Monitoring enables users to view the performance of an organization's data and AI pipelines, Data Product Dashboard lets users track the performance of an organization's data and AI products.
Using the dashboard, customers can identify the data used to inform individual AI and ML models, dashboards and applications to quickly resolve any problems.
Now that Monte Carlo's integrations with Pinecone and Kafka are slated for release early in 2024 and the new data observability dashboards are available, the vendor is planning to develop integrations with more vector database vendors, according to Gavish.
Other areas of focus include adding support for more tools within the data stack to make Monte Carlo a bigger part of an enterprise's data operations and expanding the vendor's cloud presence beyond its current support for AWS and hybrid environments.
"We're going to support more clouds to make it fully up to the customer where they want to run Monte Carlo," Gavish said.
Meanwhile, Petrie said he'd like Monte Carlo to go even further with its support for vector databases.
Beyond vendors such as Pinecone and Chroma that specialize in vector databases, many database and broader data management vendors offer vector databases within their larger offerings. That could be an opportunity for Monte Carlo.
"Now that they've taken this first step with vector databases, Monte Carlo should round out its support by integrating with the full range of vector databases, including both dedicated platforms and broader platforms that include vector capabilities," Petrie said.
In addition, applying data observability to metadata management is an opportunity for Monte Carlo to further integrate with vector databases, he continued.
"I'll be interested to see whether they enrich metadata management to help companies select and curate the text files that they feed into vector databases, either via fine-tuning or prompt enrichment," Petrie said.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.