
multimodal AI

What is multimodal AI?

Multimodal AI is artificial intelligence that combines multiple types, or modes, of data to create more accurate determinations, draw insightful conclusions or make more precise predictions about real-world problems. Multimodal AI systems train with and use video, audio, speech, images, text and a range of traditional numerical data sets. Most importantly, multimodal AI means numerous data types are used in tandem to help AI establish content and better interpret context, something missing in earlier AI.

How does multimodal AI differ from other AI?

At its core, multimodal AI follows the familiar AI approach founded on AI models and machine learning.

AI models are the algorithms that define how data is learned and interpreted, as well as how responses are formulated based on that data. Data, once ingested by the model, both trains and builds the underlying neural network, establishing a baseline of suitable responses. The AI itself is the software application that builds on the underlying machine learning models. The ChatGPT AI application, for example, is currently built on the GPT-4 model.

As new data is ingested, the AI makes determinations and generates responses from that data for the user. That output -- along with the user's approval or other rewards -- is looped back into the model to help the model continue to refine and improve.

The fundamental difference between multimodal AI and traditional single-modal AI is the data. A single-modal AI is generally designed to work with a single source or type of data. For example, a financial AI uses business financial data, along with broader economic and industrial sector data, to perform analyses, make financial projections or spot potential financial problems for the business. That is, the single-modal AI is tailored to a specific task.

On the other hand, multimodal AI ingests and processes data from multiple sources, including video, images, speech, sound and text, allowing more detailed and nuanced perceptions of the particular environment or situation. In doing this, multimodal AI more closely simulates human perception.

What technologies are associated with multimodal AI?

Multimodal AI systems are typically built from a series of three main components:

  • An input module is the series of neural networks responsible for ingesting and processing, or encoding, different types of data such as speech and vision. Each type of data is generally handled by its own separate neural network, so expect numerous unimodal neural networks in any multimodal AI input module.
  • A fusion module is responsible for combining, aligning and processing the relevant data from each modality -- speech, text, vision, etc. -- into a cohesive data set that utilizes the strengths of each data type. Fusion is performed using a variety of mathematical and data processing techniques, such as transformer models and graph convolutional networks.
  • An output module is responsible for creating the output from the multimodal AI, including making predictions or decisions or recommending other actionable output the system or a human operator can utilize.
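
The three components above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the encoders are hand-written stand-ins for trained neural networks, fusion is plain concatenation rather than a transformer or graph convolutional network, and every function name here is hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical stand-in encoders. A real input module would use a trained
# neural network per modality instead of these hand-written features.
def encode_text(text: str) -> Vector:
    words = text.split()
    return [float(len(words)), float(sum(len(w) for w in words))]

def encode_audio(samples: List[float]) -> Vector:
    energy = sum(s * s for s in samples) / max(len(samples), 1)
    return [energy, float(len(samples))]

ENCODERS: Dict[str, Callable] = {"text": encode_text, "audio": encode_audio}

def input_module(raw: Dict[str, object]) -> Dict[str, Vector]:
    """Encode each modality with its own dedicated encoder."""
    return {name: ENCODERS[name](data) for name, data in raw.items()}

def fusion_module(encoded: Dict[str, Vector]) -> Vector:
    """Fuse by concatenating the per-modality vectors in a fixed (sorted) order."""
    fused: Vector = []
    for name in sorted(encoded):
        fused.extend(encoded[name])
    return fused

def output_module(fused: Vector, weights: Vector, bias: float = 0.0) -> str:
    """Turn the fused vector into a decision using a simple linear score."""
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return "alert" if score > 0 else "ok"

def run_pipeline(raw: Dict[str, object], weights: Vector, bias: float = 0.0) -> str:
    return output_module(fusion_module(input_module(raw)), weights, bias)

if __name__ == "__main__":
    sample = {"text": "help me now", "audio": [0.5, -0.5]}
    # The weights are illustrative only; a trained system would learn them.
    print(run_pipeline(sample, weights=[1.0, 0.0, 0.0, 0.0], bias=-0.1))
```

Keeping each stage behind its own function mirrors the separation described above: a new modality needs only a new encoder, while the fusion and output stages stay unchanged.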

Typically, a multimodal AI system includes a variety of components or technologies across its stack, such as the following:

  • Natural language processing (NLP) technologies provide speech recognition and speech-to-text capabilities, along with speech output or text-to-speech capabilities. NLP technologies can also detect vocal inflections, such as stress or sarcasm, adding context to the processing.
  • Computer vision technologies for image and video capture support object detection and recognition, including human recognition, and differentiate activities such as running or jumping.
  • Text analysis allows the system to read and understand written language and intent.
  • Integration systems allow the multimodal AI to align, combine, prioritize and filter data inputs across its various data types. This is the key to multimodal AI because integration is central to developing context and context-based decision-making.
  • Storage and compute resources for data mining, processing and result generation are vital to ensure quality real-time interactions and results.

Figure: Elements of NLP. There are several uses for natural language processing (NLP).

What are the use cases for multimodal AI?

Multimodal AI yields a range of use cases that make it more valuable than unimodal AI. Common applications of multimodal AI include the following:

Computer vision

The future of computer vision goes far beyond just identifying objects. Combining multiple data types helps the AI identify the context of an image and make more accurate determinations. For example, the image of a dog combined with the sounds of a dog is more likely to result in the accurate identification of the object as a dog. As another possibility, facial recognition paired with NLP may result in better identification of an individual.
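
The dog example can be made concrete with a small late-fusion sketch. It assumes each modality has already produced class probabilities, that the modalities are conditionally independent given the class, and that the prior over classes is uniform. These are simplifying assumptions, and the labels and numbers below are invented for illustration.

```python
from typing import Dict

def fuse_probabilities(*modalities: Dict[str, float]) -> Dict[str, float]:
    """Multiply per-modality class probabilities and renormalize.

    Assumes conditional independence between modalities and a uniform prior.
    """
    if not modalities:
        raise ValueError("at least one modality is required")
    # Only labels that every modality scored can be fused.
    labels = set(modalities[0])
    for scores in modalities[1:]:
        labels &= set(scores)
    combined = {label: 1.0 for label in labels}
    for scores in modalities:
        for label in labels:
            combined[label] *= scores[label]
    total = sum(combined.values())
    if total == 0:
        raise ValueError("no label has non-zero probability in every modality")
    return {label: p / total for label, p in combined.items()}

if __name__ == "__main__":
    image_scores = {"dog": 0.6, "wolf": 0.4}   # hypothetical vision output
    audio_scores = {"dog": 0.8, "wolf": 0.2}   # hypothetical audio output
    print(fuse_probabilities(image_scores, audio_scores))
```

With these made-up inputs, the fused probability of "dog" rises to roughly 0.86, higher than either modality reported on its own, which is the effect the paragraph describes.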


Industry

Multimodal AI has a wide range of workplace applications. An industrial vertical uses multimodal AI to oversee and optimize manufacturing processes, improve product quality or reduce maintenance costs. A healthcare vertical harnesses multimodal AI to process a patient's vital signs, diagnostic data and records to improve treatment. The automotive vertical uses multimodal AI to watch a driver for signs of fatigue, such as closing eyes and lane departures, and then interact with the driver to recommend rest or a change of drivers.

Language processing

Multimodal AI performs NLP tasks such as sentiment analysis. For example, a system identifies signs of stress in a user's voice and combines that with signs of anger in the user's facial expression to tailor or temper responses to the user's needs. Similarly, combining text with the sound of speech can help an AI improve pronunciation and speech in other languages.


Robotics

Multimodal AI is central to robotics development because robots must interact with real-world environments, with humans and with a wide range of objects such as pets, cars, buildings and their access points. Multimodal AI uses data from cameras, microphones, GPS and other sensors to create a detailed understanding of the environment and interact with it more successfully.

Multimodal AI challenges

Multimodal AI's potential and promise come with challenges for developers, particularly with data quality and interpretation. Common challenges include the following:

  • Data volume. The data sets needed to operate a multimodal AI, driven by the sheer variety of data involved, pose serious challenges for data quality, storage and redundancy. Such data volumes are expensive to store and costly to process.
  • Learning nuance. Teaching an AI to distinguish different meanings from identical input can be problematic. Consider a person who says "Wonderful." The AI understands the word, but "wonderful" can represent sarcastic disapproval. Other context, such as speech inflections or facial cues, helps differentiate the meanings and create an accurate response.
  • Data alignment. Properly aligning meaningful data from multiple data types -- data that represents the same time and space -- is difficult.
  • Limited data sets. Not all data is complete or easily available. Limited data, such as public data sets, is often difficult and expensive to find. Many data sets also involve significant aggregation from multiple sources. Consequently, data completeness, integrity and bias can be problems for AI model training.
  • Missing data. Multimodal AI depends on data from multiple sources, so a missing data source can result in AI malfunctions or misinterpretations. For example, if an audio input fails and provides no audio, or only noise such as whining or static, how the AI will recognize and respond to the missing data is unknown.
  • Decision-making complexity. The neural networks that develop through training can be difficult to understand and interpret, making it hard for humans to determine exactly how AI evaluates data and makes decisions. Yet this insight is critical for fixing bugs and eliminating data and decision-making bias. At the same time, even extensively trained models use a finite data set, and it is difficult to know how unknown, unseen or otherwise new data might affect the AI and its decision-making. This can make multimodal AI unreliable or unpredictable, resulting in undesirable outcomes for AI users.
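
The data alignment and missing data challenges above can be illustrated with a minimal sketch that pairs each video frame with the nearest audio reading in time. It assumes both streams carry timestamps in seconds and are already sorted by time; the function names, the tolerance value and the sample data are hypothetical. A frame with no audio reading inside the tolerance gets an explicit None, so later stages can handle the gap deliberately rather than consume mismatched data.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Reading = Tuple[float, str]  # (timestamp in seconds, payload)

def nearest(readings: List[Reading], t: float, tolerance: float) -> Optional[str]:
    """Return the payload closest in time to t, or None if none is within tolerance.

    Assumes readings are sorted by timestamp.
    """
    if not readings:
        return None
    times = [ts for ts, _ in readings]
    i = bisect_left(times, t)
    # The closest reading is either just before or just after position i.
    candidates = readings[max(i - 1, 0): i + 1]
    ts, payload = min(candidates, key=lambda r: abs(r[0] - t))
    return payload if abs(ts - t) <= tolerance else None

def align(video: List[Reading], audio: List[Reading],
          tolerance: float = 0.02) -> List[Tuple[float, str, Optional[str]]]:
    """Pair every video frame with its nearest audio reading, or None if missing."""
    return [(ts, frame, nearest(audio, ts, tolerance)) for ts, frame in video]

if __name__ == "__main__":
    video = [(0.00, "v0"), (0.04, "v1"), (0.08, "v2")]
    audio = [(0.01, "a0"), (0.05, "a1")]  # the audio stream stops early
    for row in align(video, audio):
        print(row)
```

Marking the gap explicitly is a design choice: it keeps the decision about a missing modality (skip, impute or fall back to the remaining modalities) visible to whatever consumes the aligned data.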
This was last updated in May 2023
