Date: 2024-02-12

With multimodal generative AI, teams can create machine learning models that support multiple data types, such as text, images and audio. These new capabilities enable content creation, customer service, and research and development.

Many generative AI offerings from Google, Microsoft, AWS, OpenAI and the open source community now support at least text and images within a single model. Efforts are also underway to support other inputs, such as data from IoT devices, robot controls, enterprise records and code.

"Multimodality in AI for business applications is best understood by first recognizing the variety and complexity of data types businesses deal with every day," said Christian Ward, executive vice president and chief data officer at digital experience platform Yext.

Multimodal generative AI can help with financial data, customer profiles, store statistics, geographical information, search trends and marketing insights -- all of which are stored in diverse forms, including images, charts, text, voice and dialogues. Multimodal AI can automatically find connections among different data sets representing entities such as customers, equipment and processes.

"We are so used to seeing these data sets as separate, often different software packages, but multimodality is also about merging and meshing this into completely new output forms," Ward said.

Getting started with multimodal models

Major AI services, including OpenAI's GPT-4 and Google's Gemini, are starting to support multimodal capabilities. These models can understand and generate content across multiple formats, including text, images and audio.

"The advent of capable generative multimodal models, such as GPT-4 and Gemini, marks a significant milestone in AI development," said Samuel Hamway, research analyst at technology research firm Nucleus Research.

Hamway recommends that businesses start by exploring and experimenting with consumer-available chatbots such as ChatGPT and Gemini, formerly called Bard. With their multimodal functionality, these platforms provide an excellent opportunity for businesses to enhance their productivity in several areas. For example, ChatGPT and Gemini can automate routine customer interactions, assist in creative content generation, simplify complex data analysis and interpret visual data in conjunction with text queries.

Despite recent progress, multimodal AI is generally less mature than LLMs, primarily due to challenges related to obtaining high-quality training data. In addition, multimodal models can incur a higher cost of training and computation compared with traditional LLMs.

Vishal Gupta, partner at advisory firm Everest Group, observed that current multimodal AI models predominantly focus on text and images, with some models including speech at experimental stages. That said, Gupta expects that the market will gain momentum in the coming years, given multimodal AI's broad applicability across industries and job functions.