Databricks acquiring MosaicML to add more generative AI

The data lakehouse vendor's purchase of the generative AI vendor will enable customers to build and train language models specific to their needs by using their own data.

Databricks on Monday reached an agreement to acquire MosaicML for $1.3 billion in a move aimed at adding new generative AI capabilities.

MosaicML is a generative AI vendor whose platform enables organizations to develop and secure generative AI and language models using their own data rather than relying on data supplied by public generative AI tools and large language models (LLMs) such as ChatGPT and Google Bard.

The San Francisco-based vendor was founded in 2021 and had raised $37 million in venture capital.

Databricks, meanwhile, is a machine learning and data lakehouse vendor whose main lakehouse platform combines the structured data storage capabilities of data warehouses with the unstructured data storage capabilities of data lakes.

The fusion enables organizations to join structured, unstructured and semistructured data in one system rather than having to move data back and forth between systems to combine different types of data in preparation for analysis.

Additive capabilities

Once MosaicML's platform is combined with Databricks, the vendor's customers will be able to securely develop and train language models specific to their own needs by using their own data housed within the secure Databricks environment.

According to Eckerson Group analyst Kevin Petrie, the acquisition is significant in part because it lets organizations develop what he terms "small language models," which focus on an organization's relevant data rather than on vast amounts of public data.

"This acquisition shows that Databricks is serious about helping companies build and train language models on its lakehouse platform," he said. "It also aligns with the rise of … 'small language models,' which are domain-specific models that improve governance and the ability to support specialized use cases."

Beyond offering better security than public LLMs, the small language models that users can develop with MosaicML improve the accuracy of outputs.

One of the problems with ChatGPT and other LLMs is that they don't always return accurate responses to queries, and those AI hallucinations can have a significant negative impact if they are used to inform a business decision, Petrie continued.

"MosaicML helps companies train and fine-tune language models on their own data, improving the accuracy of their outputs and reducing the risk of hallucinations," he said. "These capabilities, along with optimized model training, will make it easier and cheaper for companies to build small language models."

"They don't need to boil the ocean with hundreds of billions of parameters like ChatGPT-4," he added, referring to ChatGPT creator OpenAI's most powerful LLM.

Donald Farmer, founder and principal of TreeHive Strategy, similarly said Databricks' acquisition of MosaicML will enable Databricks customers to build and train their own language models.

His clients frequently ask how they can develop their own LLM using data actually relevant to their needs, he noted.

"The answer has often been Mosaic, based on the scenario," Farmer said. "With Mosaic integrated fully with Databricks, companies should be able to train their own LLMs and, most importantly, manage the lifecycle of the LLM using the same tools as they use today for other machine learning data engineering. So, a win all around."

More than just technology

Beyond acquiring the MosaicML platform, Databricks will inherit the MosaicML leadership team, including co-founder and CEO Naveen Rao.

Joel Minnick, vice president of marketing at Databricks, said MosaicML's open source approach to development aligns with Databricks' approach to product development.

In addition, he noted that MosaicML's sense that customers want to use their own data to build language models fits with Databricks' belief that LLMs' access to public data, while perhaps beneficial in some instances, is not as important as an organization's use of its own data to inform and train models.

"Customers … don't need all of the content of the internet if they're trying to build a large language model to answer questions about their customers' health insurance policies, for example," Minnick said.

Beyond enabling use of only relevant data, MosaicML technology makes model development more cost-effective than developing LLMs informed by public data, he continued.

"Across the vision, the technology and the team, we saw lots of synergies," Minnick said. "To bring that kind of [model] training platform into the lakehouse where customers are able to bring all their data together and have it highly governed and visible … will enable customers to do their best work as we go forward into this age of generative AI."

Data and generative AI

The promise of generative AI and LLM technology for analytics and data management is that it will broaden the use of analytics within organizations beyond just data experts and that it will make data management more efficient.

The spread of analytics use within organizations has been stagnant for decades, stuck somewhere around one-quarter of employees. Even recent technological advances such as natural language processing (NLP) and low-code/no-code tools have failed to make data analysis accessible beyond those with data literacy training.

Insufficient NLP vocabularies proved a hindrance, and even low-code/no-code tools require training to be used securely and effectively.

Now, however, generative AI and LLM technology have the potential to eliminate the data literacy training previously required to work with data.

ChatGPT, launched in November 2022, and other generative AI platforms have much larger vocabularies, which enable freeform language use rather than specific business phrasing.

That will perhaps enable more people within organizations to work with data. Data experts, meanwhile, stand to benefit by no longer being required to write the copious amounts of code required to develop data pipelines and build and train data models.

As a result, numerous data management and analytics vendors have unveiled capabilities combining their existing tools with generative AI in the months since ChatGPT was first released, including Databricks.

But while many other vendors are adding generative AI through integrations -- including Microsoft, a major investor in OpenAI -- Databricks is developing its own.

Three months before agreeing to acquire MosaicML, the vendor unveiled Dolly, an open source LLM similar to ChatGPT. Databricks' acquisition of MosaicML seemingly continues its strategy of producing its own tools -- either through product development or acquisition -- rather than adding generative AI and other machine learning capabilities through integrations with third parties.

"Databricks has been establishing its role as the leading data engineering platform, creating the most compelling machine learning lifecycle platform," Farmer said. "So this acquisition makes a lot of sense."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
