Databricks unveiled a new portfolio aimed at helping users customize generative AI applications with their own data through retrieval-augmented generation.
Retrieval-augmented generation (RAG) is an AI technique that gathers data from databases and other data storage repositories to supplement the data an application was trained on, improving the application's output.
While not new, RAG pipelines have gained popularity in the year since OpenAI released ChatGPT, the first mainstream application of a large language model (LLM).
LLMs such as ChatGPT and Google Bard are trained on public data and can be useful for information searches and content generation. They don't, however, have the data to know the details of a given enterprise's business, so they can't be used to help inform business decisions.
As a result, some organizations are augmenting LLMs from vendors such as OpenAI and Hugging Face with proprietary data so that the LLMs have domain-specific information to aid the decision-making process. Other organizations, meanwhile, are developing their own domain-specific language models from scratch.
To do either, developers need to build RAG pipelines that discover, retrieve and load the needed data.
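At its core, such a pipeline retrieves documents relevant to a user's question and injects them into the prompt sent to the model. The sketch below illustrates the idea, using a toy keyword-overlap retriever in place of a real vector store and stopping at prompt construction rather than calling an actual LLM API; the documents and function names are illustrative assumptions, not Databricks' implementation.

```python
# Toy corpus standing in for an enterprise's proprietary documents.
DOCUMENTS = [
    "Q3 revenue in the EMEA region grew 12% year over year.",
    "The new onboarding workflow reduced support tickets by 30%.",
    "Headcount in engineering is budgeted to grow 8% next year.",
]

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by keyword overlap with the query (stand-in for vector search)."""
    query_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Augment the user's question with the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# The augmented prompt would then be sent to an LLM instead of the raw question.
query = "EMEA revenue growth"
print(build_prompt(query, retrieve(query, DOCUMENTS)))
```

A production pipeline replaces the keyword scorer with embedding-based similarity search, but the retrieve-then-augment flow is the same.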
Unveiled on Dec. 6, Databricks' new suite of tools, which builds on the vendor's recent focus on generative AI development, is designed to help developers do just that. As a result, the new suite is a significant addition for Databricks users, according to Donald Farmer, founder and principal of TreeHive Strategy.
"This is a great step forward," he said.
Before introducing its new portfolio, Databricks acquired MosaicML for $1.3 billion in June to add generative AI development capabilities, unveiled new LLM and GPU optimization capabilities in October to help users improve their generative AI outcomes, and revealed plans in November to combine its existing lakehouse platform with AI and rename its flagship tool the Data Intelligence Platform.
Generative AI has the potential to make the data management and analytics processes both more efficient and more accurate.
To do so, however, the data used to train the generative AI must be accurate itself and sufficient in volume to ensure a model or application has the information to respond to a user's query.
Generative AI models and applications are trained to deliver outcomes whether they have sufficient underlying data or not. Sometimes, those responses are obviously wrong, and users quickly discard them. Other times, however, incorrect responses closely resemble what an accurate output might look like and can fool the end user into basing a decision on the incorrect response.
Therefore, it's critical that generative AI have all the information needed to deliver outputs.
Vector search is a key element of helping RAG pipelines provide models and applications that information. Vectors are numerical representations of unstructured data, such as text and audio, that enable unstructured data to be searched and discovered. Vectors also enable the discovery of similar data, so a large pool of relevant data can be found and used to train generative AI.
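The "similar data" discovery described above typically works by comparing vectors with a distance measure such as cosine similarity. A minimal sketch, assuming hypothetical three-dimensional embeddings (real systems use model-generated vectors with hundreds or thousands of dimensions, plus an approximate-nearest-neighbor index):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: semantically similar documents get nearby vectors.
EMBEDDINGS = {
    "quarterly sales report": [0.9, 0.1, 0.2],
    "annual revenue summary": [0.8, 0.2, 0.3],
    "employee parking policy": [0.1, 0.9, 0.7],
}

def nearest(query_vector: list[float], top_k: int = 2) -> list[str]:
    """Return the top_k documents whose vectors point closest to the query's."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda doc: cosine_similarity(query_vector, EMBEDDINGS[doc]),
        reverse=True,
    )
    return ranked[:top_k]

# A query about finance lands near the two finance documents.
print(nearest([0.85, 0.15, 0.25]))
```

Because the two finance documents have vectors pointing in nearly the same direction as the query, they rank above the unrelated policy document even though none share exact keywords.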
Databricks' new tools include vector search, feature and function search to find structured data, prebuilt foundation models that can be augmented with proprietary data, a quality monitoring interface so users can monitor the performance of RAG pipelines, and a set of development tools.
Kevin Petrie, an analyst at Eckerson Group, noted that enterprises are developing domain-specific models in several different ways. Some are building models completely on their own; others are fine-tuning pre-existing models, such as Llama or BLOOM; and still others are enriching LLMs via RAG pipelines.
Enriching LLMs via RAG pipelines is most popular so far because it requires less computing cost and data science experience, according to Petrie. As a result, Databricks' new tools are now an important part of its overall platform.
"RAG makes sense for use cases that don't use language in specialized ways and, therefore, don't require fine-tuning but do need to reduce hallucinations by grounding LM responses in trustworthy content," Petrie said. "Databricks now makes it easier for early adopters to implement domain-specific language models with RAG."
While Databricks has made generative AI a priority in recent months and is in the process of building a set of tools that enable users to develop generative AI models and applications, it is not the only vendor doing so.
Snowflake, perhaps Databricks' primary rival, is similarly making generative AI a focal point, highlighted by its May acquisition of AI search engine vendor Neeva, improved containerization capabilities, and vector search capabilities. Similarly, tech giants AWS, Google and Microsoft have all introduced a spate of generative AI-related features.
Farmer, however, noted that Databricks, to date, has been among the most aggressive in adding tools that enable generative AI development.
"Compared to other data management vendors, I would say Databricks' efforts in GenAI appear to be quite ambitious and executed with urgency," he said. "They are focusing on cutting-edge machine learning techniques but integrating this with data management, emphasizing quality assurance and management."
Petrie similarly noted that Databricks is among a group of vendors not only speaking about generative AI as a priority but demonstrating it.
"As with the cloud hyper-scalers, Databricks continues to enrich its solution suite to support the full GenAI lifecycle, including model development, training, deploying, monitoring and optimization," he said.
While an important addition, Databricks' new suite enabling developers to build RAG pipelines is only part of what the vendor still needs to develop to enable customers to develop, deploy and manage generative AI applications, according to Farmer.
The new suite addresses the development aspect but not the others.
"In the future, we will still need more tools for easier model customization and deployment," Farmer said.
Petrie, meanwhile, pointed to data governance as an area where Databricks still needs to improve its platform.
He noted that the vendor's data lineage capabilities help organizations ensure the quality of the data used to train generative AI models and applications. But data lineage alone can't guarantee data quality.
Databricks already provides significant data governance capabilities, according to Petrie. However, additional measures would be beneficial.
"I'll be interested to see what steps Databricks takes to help companies validate and govern the data inputs they feed to language models," he said. "For example, companies need to carefully tag unstructured content, such as text files and images, and ensure proper use of master data."
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.