The hype surrounding generative AI, large language models and the technologies' potential impact on how enterprises use data is becoming a reality.
That's according to a panel of industry experts speaking on June 29 during Snowflake Summit, the annual user conference hosted by the data cloud vendor in Las Vegas.
Ever since OpenAI's November 2022 launch of ChatGPT -- which marked a substantial increase in generative AI and large language model (LLM) proficiency -- data management and analytics vendors have introduced tools promising to augment their existing capabilities with the power of generative AI.
The result, they suggest, will transform business.
Few of the tools, however, have made it beyond the private preview stage. And those that have are mostly rudimentary, limited to automatically converting text to code or generating summaries of text reports.
But already, generative AI is moving beyond the hype stage and beginning to materially affect what organizations can do with their data, the industry experts noted.
AI has long held the potential to transform business, according to Ali Dalloul, vice president of Microsoft's Azure AI platform. The release of ChatGPT and similar generative AI tools and LLMs is making that a reality.
"The architecture and the science have caught up to the hype," he said. "They have multipurpose, rich models that in and of themselves are platforms. It is, in our view, a fundamental shift in the industry."
In fact, Dalloul predicted that within the next 12 months, generative AI and LLM technology will become so ubiquitous in enterprises that AI assistants will be the norm for data workers.
"We're going to make tremendous progress on the explainability of these models and the level of assurance that society will have that they are equal to creating value for the enterprise and the individual," he said. "My prediction is that each [data worker] next year will have an AI assistant helping them do their job."
At the core of the hype about generative AI's ability to change how organizations use data are the promises of increased efficiency and expanded use of analytics to inform decisions.
For decades, BI use in the enterprise has been stuck. Despite technological advances such as natural language processing (NLP) and low-code/no-code capabilities, only about a quarter of all employees within organizations use analytics tools, according to studies.
There are multiple reasons. BI platforms -- even those that include a smattering of tools built for business users -- have historically been designed for data experts. In addition, the NLP and low-code/no-code tools built by BI and data management vendors still require data literacy training to use.
ChatGPT, Google Bard and other new generative AI platforms might eliminate the need for users to be data literate.
Their LLMs have far more extensive vocabularies than the NLP tools built by data management and analytics vendors, thus enabling them to understand true natural language. And that, when applied to an organization's data, can open analytics to virtually anyone and lead to more informed decisions and better business outcomes.
Generative AI could also lighten the burden on data experts. When others within an enterprise can use data to inform decisions, data experts are freed from responding to a litany of requests for reports and models and can do more in-depth analysis.
Meanwhile, just as eliminating the need to code could open data analysis to a broader audience and make data analysts more efficient, it also promises to make data engineers more efficient.
Building data pipelines requires copious amounts of code. But if the need to code is eliminated or substantially lessened, the time and effort required to ingest, integrate and prepare data can be greatly reduced.
"Now, we're entering an era where computers are coming to where humans are," said Jonathan Cohen, vice president of applied research at Nvidia. "Computers can now interact with us in ways that are natural to us. If you now make it so any person on the planet can access all that power without needing any of the expertise you needed even five years ago, it's going to be completely transformational."
Dalloul similarly said that the early excitement about generative AI and its potential to change how effectively organizations use their data is being realized.
Data -- especially unstructured data such as text, audio and images -- requires extensive human effort to prepare for analysis. Generative AI can greatly reduce the burden on humans and allow them to eliminate the mundane tasks that dominate their time so that they can focus on deep analysis.
"It is very real," Dalloul said. "The use cases that emerge are very durable. It's a paradigm shift in the industry."
AI and unstructured data
Some of the first data-related tools powered by generative AI are becoming available.
For example, in June, Monte Carlo launched Fix with AI and Generate with AI. Both are text-to-code translators that enable data engineers working with Monte Carlo's data observability platform to type commands in natural language that automatically get converted to code that the platform understands.
Also in June, Dremio launched Text-to-SQL, a tool that similarly converts text to code to save users of the vendor's lakehouse platform from having to code every query or command.
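To make the idea concrete, the following is a minimal sketch of the user-facing flow such text-to-SQL tools aim for: a question goes in, a query comes out and runs against the data. It is not how Monte Carlo's or Dremio's products work internally. The lookup table `CANNED_TRANSLATIONS`, the functions `text_to_sql`, `run_question` and `demo_connection`, and the `sales` table are all hypothetical, and the lookup table stands in for the generative model that a real tool would call.

```python
import sqlite3

# Hypothetical stand-in for a generative model: a real tool would send the
# question (plus schema details) to an LLM and receive SQL back.
CANNED_TRANSLATIONS = {
    "total sales by region": (
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY region"
    ),
}


def text_to_sql(question: str) -> str:
    """Return SQL for a natural language question (illustrative lookup only)."""
    key = question.strip().lower().rstrip("?")
    if key not in CANNED_TRANSLATIONS:
        raise ValueError(f"No translation available for: {question!r}")
    return CANNED_TRANSLATIONS[key]


def run_question(conn: sqlite3.Connection, question: str) -> list:
    """Translate the question to SQL, run it and return the result rows."""
    return conn.execute(text_to_sql(question)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 100.0), ("west", 250.0), ("east", 50.0)],
    )
    return conn
```

Calling `run_question(demo_connection(), "Total sales by region?")` returns one row per region. The user never writes the `SELECT` statement, which is the time savings the vendors are promising.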
But the early applications of generative AI and LLMs go well beyond translating text to code.
As Dalloul noted, using unstructured data is far more complicated than using structured data. Structured data conforms to a predefined format, such as the rows and columns of a table. Unstructured data, meanwhile, has no predefined format and includes data types such as text, audio and images.
To make use of unstructured data and combine it with structured data for a more complete picture of an organization's operation, unstructured data essentially has to be given structure. To do that, unstructured data has to be assigned a numerical representation called a vector.
The process of assigning vectors has to be carried out for every single piece of unstructured data. And with unstructured data amounting to some 80% of all the data now being generated, that's a monumental task that often never gets done.
Unless generative AI is involved.
Generative AI can be programmed to automatically assign vectors to unstructured data as it gets ingested, potentially enabling enterprises to access vast troves of data now sitting untouched.
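As a rough illustration of what assigning a vector means, the sketch below turns short pieces of text into numerical vectors and compares them. It uses simple word counts over a shared vocabulary as a stand-in for the learned embeddings a generative AI model would produce, so it matches on shared words rather than on meaning. All function names and the sample documents are assumptions made for this example.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents: list[str]) -> dict[str, int]:
    """Assign each distinct word in the documents a position in the vector."""
    tokens = sorted({t for doc in documents for t in tokenize(doc)})
    return {token: i for i, token in enumerate(tokens)}


def vectorize(text: str, vocabulary: dict[str, int]) -> list[float]:
    """Represent text as word counts; a real system would use a learned model."""
    vec = [0.0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vec[vocabulary[token]] += 1.0
    return vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, documents: list[str]) -> str:
    """Return the document whose vector is closest to the query's vector."""
    vocabulary = build_vocabulary(documents)
    q = vectorize(query, vocabulary)
    return max(
        documents,
        key=lambda d: cosine_similarity(q, vectorize(d, vocabulary)),
    )
```

Once every document has a vector, finding relevant material becomes a matter of comparing numbers. For example, `most_similar("overdue invoice", docs)` picks whichever entry in `docs` shares the most weight with the query. Vector search tools apply the same principle at much larger scale with far richer vectors.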
Therefore, as hype evolves into reality, one of the first material ways that enterprises will likely use generative AI is to gain more access to their unstructured data -- and begin using that data to help train data models and inform decisions, according to Andrew Ng, founder and CEO of data-centric AI vendor Landing AI.
One tool designed to make unstructured data usable is Atlas Vector Search from MongoDB, now in preview. Organizations can also build their own integrations with ChatGPT and other generative AI platforms to convert unstructured data.
"When you think about the AI journey ... so far, a lot of the value has been on structured data," Ng said. "It turns out that the majority of the world's data is unstructured data. The exciting opportunity that's ahead of all of us is loading unstructured data and letting people have at it. I think most people underestimate the magnitude of the transformation that's going to come."
Likewise, Cohen pointed to the vast amount of untapped unstructured data as a clear starting point for organizations looking to apply generative AI to their data operations.
"There's so much value locked in data -- and especially unstructured data that we've logged and archived, but not done anything with," he said. "Now, you can take all that unstructured data and customize a model, build AI that has access to all that information, and formulate answers and responses and look for patterns. That's accessible today and very achievable."
Large to small
Another way enterprises will use generative AI as it moves from hype to reality will be developing language models designed for specific purposes, according to Dalloul.
LLMs are built for the masses. They are general-purpose platforms trained on public data that's largely extraneous to the needs of a given enterprise. While some of that public data might be relevant and add context to an organization's own data, the vast majority of it is irrelevant.
Therefore, as generative AI continues to evolve, organizations will augment LLMs by building their own language models using tools provided by data management vendors.
In addition, vendors will design domain-specific models that their customers can deploy. Many data management and analytics vendors already provide domain-specific versions of their data storage and analysis tools. They'll likely do the same as their generative AI capabilities move from preview to public availability.
"There will be a few very large models that will continue, and they will be relevant," Dalloul said. "But the market and the needs and the use cases are rich enough for there to be other models, fine-tuned models on private data sets within the enterprise."
Open source LLMs also are now emerging for certain vertical industries and applications, he added.
Beyond the efficiency that customized language models will foster, another reason enterprises might begin opting for smaller models rather than general-purpose LLMs is to reduce their exposure to security breaches, Cohen noted.
LLMs have so far struggled to keep data secure. If an organization combines its data with an LLM, it risks having that data exposed.
"If I train a model on a whole lot of data, I don't know how I can guarantee at runtime that the data doesn't leak to someone who's not supposed to have access to that data," Cohen said. "That's a pretty strong argument that's going to push toward lots of custom models fine-tuned on custom data that is insulated and [managed] with access controls and governance policies."
Databricks is one vendor already moving beyond the hype of generative AI and working to enable customers to develop their own language models that augment the capabilities of LLMs.
The data lakehouse specialist reached an agreement to acquire MosaicML on June 26 for the specific purpose of adding tools that enable users to develop and secure generative AI and language models built on their own data.
"There will be both [large and small models]," Ng said. "The large models let you build something quite quickly, but I'm seeing more businesses get extra performance by fine-tuning and customizing models."
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.