Explore real-world use cases for multimodal generative AI

Multimodal generative AI can integrate and interpret multiple data types within a single model, offering enterprises a new way to improve everyday business processes.

George Lawton

Published: 20 Mar 2025

With multimodal AI systems based on generative artificial intelligence (GenAI), data science teams can create machine learning models that support multiple data types, such as text, images and audio. These new capabilities enable enhancements in areas such as content creation, customer service, and research and development.

Many generative AI applications from Google, Microsoft, AWS, Anthropic, OpenAI and the open source community now support at least text and images within a single model. Efforts are also underway to support other inputs, such as data from IoT devices, robot controls, enterprise records and code.

"Multimodality in AI for business applications is best understood by first recognizing the variety and complexity of data types businesses deal with every day," said Christian Ward, executive vice president and chief data officer at digital experience platform Yext.

Multimodal generative AI models can help with financial data, customer profiles, store statistics, geographical information, search trends and marketing insights -- all of which are stored in diverse forms, including images, charts, text, voice and dialogues. Multimodal AI can automatically find connections among different data sets representing entities such as customers, equipment and processes.

"We are so used to seeing these data sets as separate, often different software packages, but multimodality is also about merging and meshing this into completely new output forms," Ward said.

Getting started with multimodal models

Major AI services, including OpenAI's GPT-4.5 and Google's Gemini, are starting to support different modalities. These models can understand and generate content across multiple formats, including text, images and audio.

"The advent of capable generative multimodal models, such as GPT-4.5 and Gemini, marks a significant milestone in AI development," said Samuel Hamway, product associate at healthcare data analytics company Cohere Health.

Hamway recommends that businesses start by exploring and experimenting with consumer-available chatbots such as ChatGPT and Gemini. With their multimodal functionality, these platforms provide an excellent opportunity for businesses to enhance their productivity in several areas. For example, ChatGPT and Gemini can automate routine customer interactions, assist in creative content generation, simplify complex data analysis and interpret visual data in conjunction with text queries.

Despite recent progress, multimodal AI models are generally less mature than large language models (LLMs), primarily because high-quality training data is hard to obtain. In addition, multimodal models can cost more to train and run than traditional LLMs.

Vishal Gupta, partner at advisory firm Everest Group, observed that current multimodal AI models predominantly focus on text and images, with some models supporting speech at an experimental stage. That said, Gupta added that the market will gain momentum in the coming years, given multimodal AI's broad applicability across industries and job functions.

8 multimodal generative AI use cases

Here are eight real-world use cases where multimodal generative AI algorithms can provide more value to enterprises than traditional AI, either today or in the near future.

1. Marketing and advertising

Marketing content creation is one of the multimodal generative AI use cases gaining the most traction, Gupta said. Multimodal models can integrate audio, images, video and text to help develop dynamic images and videos for marketing campaigns.

"This has huge potential to further elevate the customer experience by dynamically personalizing content for users, as well as improving efficiency and productivity for content teams," Gupta said.

However, enterprises need to balance personalization with privacy concerns, Hamway cautioned. In addition, they must develop data infrastructures capable of effectively managing large and diverse data sets to glean actionable insights.

2. Image and video labeling

Multimodal generative AI models can generate text descriptions for sets of images, Gupta said. This capability can be applied to caption videos, notate and label images, generate product descriptions for e-commerce, and generate medical reports.

3. Customer support and interactions

Yaad Oren, managing director of SAP Labs U.S. and global head of SAP BTP Innovation, believes that the most promising multimodal generative AI use case is customer support. Multimodal generative AI can enhance customer support interactions by simultaneously analyzing text, images and voice data, leading to more context-aware and personalized responses that improve the overall customer experience.

Chatbots can also use multiple modalities to understand and respond to customer queries in a more nuanced manner by incorporating visual and contextual information. One key challenge, however, is ensuring accurate and ethical handling of diverse data types, especially with sensitive customer information.
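As a rough illustration of the two ideas above, the following Python sketch bundles a customer's text, image and voice transcript into one request and masks email addresses before anything leaves the business. The `SupportTicket` class, the `build_parts` function and the part format are all hypothetical; they do not correspond to any vendor's API, and a real deployment would need far broader handling of sensitive data than a single email pattern.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Simple illustrative pattern; real systems need much broader PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses before text is sent to a model."""
    return EMAIL.sub("[redacted email]", text)


@dataclass
class SupportTicket:
    """Hypothetical container for one customer inquiry."""
    message: str
    image_path: Optional[str] = None        # e.g., a photo of the faulty product
    voice_transcript: Optional[str] = None  # transcript of a voice message


def build_parts(ticket: SupportTicket) -> List[Dict[str, str]]:
    """Combine the available modalities into one ordered list of parts."""
    parts = [{"type": "text", "text": redact(ticket.message)}]
    if ticket.image_path:
        parts.append({"type": "image", "path": ticket.image_path})
    if ticket.voice_transcript:
        parts.append(
            {"type": "audio_transcript", "text": redact(ticket.voice_transcript)}
        )
    return parts


ticket = SupportTicket(
    message="My email is jo@example.com and the router light is red",
    image_path="router_photo.jpg",
)
for part in build_parts(ticket):
    print(part)
```

Keeping redaction inside the function that assembles the request means every modality passes through the same privacy check, which is one simple way to approach the handling challenge described above.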

[Image: GenAI business challenges. Implementing generative AI is not just about technology. Businesses must also consider its impact on people and processes.]

4. Supply chain optimization

Multimodal generative AI can optimize supply chain processes by analyzing text and image data to provide real-time insights into inventory management, demand forecasting and quality control. Oren said SAP Labs U.S. is exploring image analysis for quality assurance in manufacturing processes, with the aim of identifying defects or irregularities. The company is also examining how natural language processing models can analyze textual data from various sources to predict demand fluctuations and optimize inventory levels.

5. Improved healthcare

Taylor Dolezal, CEO chief of staff at machine learning programming company Merly, sees considerable promise in the healthcare sector for integrating various data types to enable more accurate diagnostics and personalized patient care. Multimodal generative AI is particularly useful for diagnostic tools, surgical robots and remote monitoring devices.

"While these advancements promise improved patient outcomes and accelerated medical research, they pose challenges in data integration, accuracy and patient privacy," Dolezal said.

6. Improving manufacturing and product design

Multimodal generative AI can improve manufacturing and design processes, Dolezal said. Models trained on design and manufacturing data, defect reports, and customer feedback can enhance the design process, increase quality control and improve manufacturing efficiency.

In product design, AI can analyze market trends and consumer feedback. In manufacturing processes, it can support quality control and predictive maintenance. The main challenge lies in integrating multiple data sources and ensuring the interpretability of AI decisions, Dolezal said.

7. Employee training

Multimodal generative AI can enhance learning and mastery in employee training programs, Ward said. By drawing on diverse instructional materials and data to generate content, AI can create a custom experience for each role. From here, employees can "teach" the material back to the AI through an audio or video recording to create an interactive feedback mechanism. As employees articulate their understanding of the material to the AI system, it assesses their comprehension and identifies learning gaps.

Ward cautioned that this approach could face challenges, particularly in human adoption of AI feedback. Nevertheless, it promises a more personalized and effective learning experience.

8. Multimodal question answering

Ajay Divakaran, senior technical director of SRI International, said the nonprofit scientific research institute is working on how to improve question answering through the combination of images and text, as well as audio.

This multimodal feature is particularly useful for applications that involve carrying out ordered steps. For example, someone querying an AI system with a home repair question could receive a combination of textual steps along with generated images and videos, with the text and visuals working together to explain the steps to the user.
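One way to picture such an answer is as an ordered list of steps, each pairing text with an optional visual. The Python sketch below is illustrative only: the `Step` and `MultimodalAnswer` classes and the image file names are hypothetical and are not drawn from SRI International's work or any particular product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One ordered instruction, optionally paired with a visual."""
    text: str
    image_ref: Optional[str] = None  # hypothetical name of a generated image


@dataclass
class MultimodalAnswer:
    """An answer made of ordered steps that mix text and visuals."""
    question: str
    steps: List[Step] = field(default_factory=list)

    def render(self) -> str:
        """Return the answer as numbered lines, noting any attached image."""
        lines = [f"Q: {self.question}"]
        for number, step in enumerate(self.steps, start=1):
            line = f"{number}. {step.text}"
            if step.image_ref:
                line += f" [see image: {step.image_ref}]"
            lines.append(line)
        return "\n".join(lines)


answer = MultimodalAnswer(
    question="How do I replace a faucet washer?",
    steps=[
        Step("Shut off the water supply.", image_ref="step1.png"),
        Step("Unscrew the faucet handle."),
        Step("Swap the worn washer for a new one.", image_ref="step3.png"),
    ],
)
print(answer.render())
```

Keeping each visual attached to its own step preserves the ordering, so the text and images stay aligned however the answer is displayed.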

George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
