Multimodal AI is a relatively new development that combines AI techniques such as natural language processing, computer vision and machine learning to build a richer understanding of a given subject or situation. It does this by analyzing different data types simultaneously to make predictions, take actions or interact more appropriately in context.
More fundamentally, humans want AI to behave in a human-like manner because that simplifies communication and enables better mutual understanding. To do so, AI must use multiple modalities (e.g., video, text, audio or images), much as humans use multiple senses.
"What's happening with multimodal AI is that different types of data are being mixed in the inputs to multimodal AI models to generate more nuance and the ability to answer complicated questions with AI," said Bob Rogers, CEO of Oii, a data science company specializing in supply chain modeling.
Automobile companies and autonomous vehicles use multimodal AI
Multimodal AI applications already have practical uses in various industries. In the automotive industry, multimodal AI is being used in three primary ways: internal operations, customer-facing use cases and manufacturing.
For example, auto manufacturers are automating supply chain operations, such as sending car replacement parts directly from suppliers to consumers without human intervention. Multimodal AI is also being used to automate various tasks such as the following:
- handling customer requests and responding via text or voice;
- collecting and verifying customer IDs;
- automating a recall process; and
- collecting text and filling in forms for customers to sign remotely.
Multimodal AI also helps shorten production cycles by automating traditionally manual tasks. Finally, auto manufacturers are using it to make cars safer, such as in driver assistance systems that detect sleep, fatigue, distraction or attention loss.
"The main benefit of multimodal AI is that it allows organizations to become autonomous enterprises that can automate a large portion of the work process and communications while keeping humans in the loop," said Yaniv Hakim, founder and CEO at AI-powered omnichannel communication platform CommBox.
Healthcare becomes more personalized
Stanford University and UST, a global digital transformation solutions provider, have partnered on multimodal AI to understand how people react when they're subjected to trauma or have suffered an adverse healthcare event, such as a heart attack, using a combination of IoT sensors, audio, images and video.
"It's called 'a weighted combination of networks,'" said Adnan Masood, chief architect of AI and machine learning at UST. "That helps us do a correlation analysis, called 'collusion analysis,' which is a very important thing in multimodal AI where you take these weighted combination networks. A neural network understands what is most important in different modalities and then co-learns based on this information."
If a person suffers an adverse health event, ER personnel can determine whether the patient needs immediate care or whether the patient's behavior is atypical -- for a COVID-19 patient, for example. Oii's Rogers said multimodal AI is routinely used in patient diagnosis, particularly patient imaging.
"You can do an ultrasound to understand whether there's internal bleeding, but it's a very noisy piece of information," Rogers said. "[Multimodal] AI is reading the imaging, but it's also pulling in patient history via text and possibly even details around the kind of impact the patient experienced to interpret the ultrasound. AI combines this knowledge to build a decision path for how to treat that patient."
Multimodal AI in media and telecom
UST worked with a large telecom company to implement multimodal AI with the goal of determining the next best action, such as automatically notifying customers of a service outage.
Telecom companies are also using multimodal AI for fraud detection. In this case, AI identifies the heaviest bandwidth users by combining data from multimodal sensors in cell towers, customer behavior across the internet and data usage patterns. From there, it identifies new users who are likely to exhibit the same kind of behavior. Finally, AI applies predetermined targeting rules and thresholds to flag suspect accounts.
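The thresholding step might look something like the sketch below, which flags an account only when several independent signals cross preset limits. The field names, thresholds and two-signal rule are assumptions for illustration:

```python
# Sketch of threshold-based flagging over combined usage signals.
# Field names, thresholds and the two-signal rule are hypothetical.

def flag_account(account: dict) -> bool:
    """Flag accounts whose combined signals cross preset thresholds."""
    heavy_bandwidth = account["daily_gb"] > 50
    many_towers = account["towers_per_day"] > 20   # unusual mobility
    odd_hours = account["night_traffic_ratio"] > 0.8
    # Require at least two independent signals before flagging,
    # so no single noisy modality triggers an alert on its own.
    return sum([heavy_bandwidth, many_towers, odd_hours]) >= 2

suspicious = flag_account(
    {"daily_gb": 80, "towers_per_day": 25, "night_traffic_ratio": 0.3}
)
print(suspicious)  # True
```

Requiring agreement across modalities is one way such systems keep false positives manageable when scaled to millions of users.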
"There are millions of users nationwide, so applying that model [across] a large [number] of users was a fairly daunting challenge and we were able to do it using multimodal AI," Masood said.
Media and entertainment companies are analyzing different media feeds using multimodal AI. The models used learn from various data sets and attempt to understand what an image or set of images contains.
"[Multimodal AI] is heavily used for combining different media feeds and doing analysis on them," Masood said. "So, if you see a visual image, you can ask if what's going on in a sequence is appropriate for a certain audience, whether that sequence has some sort of image in it that is not suitable for a certain audience, or whether the sequence has a certain celebrity in it."
Common challenges with multimodal AI applications
For starters, processing power is an issue: multimodal AI must process terabytes of data in real time from multiple systems and databases, which demands substantial compute and infrastructure upgrades. Another major challenge is the successful transfer of knowledge between modalities, also known as co-learning.
"Due to the high diversity of questions and lack of high-quality data, some AI models might make educated guesses by relying on statistics, changing the final outcome," CommBox's Hakim said.
Because multimodal AI is relatively new, it's not yet fully understood, nor are its use cases and potential benefits. Data professionals are so accustomed to working on models that focus on a single modality that they may not appreciate the importance of multimodal causality and correlation analysis.
"We know an event has happened but we don't know why. If you work with multimodal data sets, the causality and inference become much easier," Masood said. "We're creating a temporal timeline of events that's pieced together by multiple models -- video, audio and sensors. A lot of algorithmic work has to happen."
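The temporal timeline Masood mentions can be thought of as merging timestamped events emitted by separate per-modality models into one ordered sequence. A minimal sketch, with illustrative event contents:

```python
# Sketch of stitching a temporal timeline from events emitted by
# separate per-modality models. Event contents are illustrative.

def build_timeline(*event_streams):
    """Merge (timestamp, modality, event) tuples into one ordered timeline."""
    merged = [e for stream in event_streams for e in stream]
    return sorted(merged, key=lambda e: e[0])

video_events = [(12.0, "video", "person enters frame")]
audio_events = [(11.5, "audio", "door opens"), (13.2, "audio", "speech begins")]
sensor_events = [(11.4, "sensor", "motion detected")]

timeline = build_timeline(video_events, audio_events, sensor_events)
print([e[2] for e in timeline])
# → ['motion detected', 'door opens', 'person enters frame', 'speech begins']
```

Once events from all modalities sit on one timeline, causal and inferential questions ("what preceded the event we detected?") become much easier to pose, which is the algorithmic work Masood alludes to.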
Vertical markets are optimistic about the future of their multimodal AI applications, given that the technology is already assisting their operations, and many have concluded that the long-term benefits outweigh the short-term challenges. AI practitioners will be watching this nascent branch of AI and the value it adds across industries.