Multimodal machine learning in healthcare aids patient consults
In this Q&A, Louis-Philippe Morency talks about how he's building algorithms that capture and analyze the three V's of communication -- verbal, vocal and visual.
Louis-Philippe Morency is on a mission to build technology that can better understand human behavior in face-to-face communication.
By using specialized cameras and a kind of artificial intelligence called multimodal machine learning in healthcare settings, Morency, associate professor at Carnegie Mellon University (CMU) in Pittsburgh, is training algorithms to analyze the three V's of communication: verbal (words), vocal (tone) and visual (body posture and facial expressions).
"Our main goal is to use this technology in the healthcare field, specifically in mental health, to help build an objective measure of mental illness symptoms," Morency said.
In this Q&A, Morency, a member of the CMU School of Computer Science's Language Technologies Institute and former research faculty at the University of Southern California (USC) Computer Science Department, talks about his earlier work on virtual agents, the challenges involved with multimodal machine learning in healthcare, why a hybrid approach to machine learning is necessary and how patients and providers interact with the technology.
Some of your work at USC focused on virtual agents. What about your work at CMU?
Louis-Philippe Morency: What we are developing at CMU is complementary to the virtual human, which is how we quantify what we call the three V's of communication: verbal, vocal and visual. In other words, what you say or the words you say, how you say those words and also the facial expression and body gestures that you do at the same time.
I've seen a term connected with your work -- multimodal machine learning. What is that?
Morency: This is an exciting and also challenging aspect of artificial intelligence, and it brings us closer to human intelligence. How do you bring together these three modalities -- the visual, verbal and vocal? Because it's not just the words: I can say the same word, 'yeah,' with different tones of voice, which also means there are different meanings behind the word or different intents. Also your facial expressions become important to complement what you are saying. And so multimodal machine learning is the idea of building algorithms that reflect human intelligence in the sense that they are integrating the information to better get an estimate of people's emotions, intent or even traits and personalities behind what they say and how they behave.
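One common way to integrate modalities, as Morency describes, is to combine per-modality features into a single joint representation before classification. The sketch below shows the simplest such scheme -- feature-level (late) fusion by concatenation. The feature values and their meanings are illustrative assumptions, not Morency's actual pipeline:

```python
# Hypothetical per-modality feature vectors for one utterance (the numbers
# are illustrative; a real system would extract them with trained encoders).
verbal = [0.2, 0.9, 0.1]        # e.g., a summary of word embeddings
vocal = [0.7, 0.3]              # e.g., pitch and energy statistics
visual = [0.5, 0.1, 0.8, 0.4]   # e.g., facial action-unit intensities

def late_fusion(*modalities):
    """Concatenate modality features into one joint representation."""
    joint = []
    for features in modalities:
        joint.extend(features)
    return joint

joint = late_fusion(verbal, vocal, visual)
print(len(joint))  # 9
```

A downstream classifier would then learn from the fused vector, so a smile, a flat tone and the word 'yeah' are interpreted together rather than in isolation.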
How do you go about building algorithms for multimodal machine learning in healthcare?
Morency: We do this in two ways: We build some algorithms specific to healthcare. Knowing that sometimes we have a limited amount of recorded data, we have to adapt and learn quickly from a small amount of data. Complementary to that, we are doing research that uses huge amounts of data available online, on YouTube and other similar sources like social networks. These document how people are expressing their opinions when they're sad, happy, when they're from the U.S., from Canada, from other countries.
It would be challenging for us to get data only from the hospital to understand the breadth and variation. That's why we complement our research with a larger source of information, which allows us to look at how people are behaving in a social environment. This is slightly different from an interaction with a doctor, but we can adapt later on to the hospital environment.
Do you select the specific videos you use to train a multimodal machine learning algorithm?
Morency: We started simple because there is so much data there. We decided to start with videos of people talking directly to the camera and expressing their opinions. There is a category of videos called vlogs, video blogs, and, so, how can we use this data? It turned out they were very easy to find. People love talking about what they feel and their opinions and what happened during their day. We take all of this data, a lot of it is already transcribed either manually or automatically, and we start annotating with local annotators but also using services like Mechanical Turk to really get a feeling of the sentiments expressed, the emotion, as well as fine-grained data about personality.
So you go through a transcript and find a place where the subject is expressing, say, happiness, and you label that part of the video?
Morency: Yes, using this technique, we've been able to annotate 25,000 videos. It is large from a healthcare perspective. It is small compared to the internet, where usually we talk about millions. This is why we're scaling this even more and we are also looking at different languages and different cultures to expand this effort.
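When several crowd annotators label the same video segment, their judgments have to be reconciled into one training label. A minimal way to do that -- assuming simple majority voting, which is one common aggregation scheme and not necessarily the one Morency's team uses -- looks like this (clip names and labels are invented for illustration):

```python
from collections import Counter

# Hypothetical sentiment labels from several Mechanical Turk annotators
# for three video segments (clip IDs and labels are illustrative).
annotations = {
    "clip_001": ["happy", "happy", "neutral"],
    "clip_002": ["sad", "sad", "sad"],
    "clip_003": ["neutral", "happy", "happy"],
}

def majority_label(labels):
    """Pick the label most annotators agreed on."""
    return Counter(labels).most_common(1)[0][0]

gold = {clip: majority_label(labels) for clip, labels in annotations.items()}
print(gold)  # {'clip_001': 'happy', 'clip_002': 'sad', 'clip_003': 'happy'}
```

At the scale Morency describes -- 25,000 videos -- aggregation like this is what turns noisy individual judgments into a consistent labeled dataset.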
How does the patient interact with the AI? Does it collect data while the patient is with a doctor? Or is it used in a pre-interview with the patient?
Morency: We don't want to change what the clinician is already doing. Or if there are changes, they should be as minimal as possible. So as the healthcare provider is talking with their patient, we simply have a camera on the table. It's important to have the patient aware of it so that they can accept or [reject] the recording. But as long as the patient agrees, the recording can happen. The rest of the discussion happens the same way it used to between the two participants.
During that time, we record both sides, because it's not about just looking at the patient. Maybe the patient is smiling because the interviewer smiled, and so we need to also look at what's called the dyadic aspect of the interaction -- or how the two people interact together.
Do you have to think about making a patient feel comfortable -- that the technology is not a lie detector test but an assistant?
Morency: At the beginning, we worried people would look at the camera more often, would change their behavior. Usually what we do is before their first interaction, we take maybe 30 seconds or a minute to explain the technology. But quickly, in almost all of our recordings, people forget about it. These are interactions that are centered on them, on their emotions and on questions about themselves. They quickly become involved in the conversation and forget about the hardware.
How do you provide the results to healthcare providers?
Morency: At the end of a session, we produce a report. It's an interactive report that can be accessed on an iPad, iPhone or on their computer. It will summarize at least three different aspects. The first is about confidence. Before you start looking at what the computer has to say, it's important to know how confident the computer is at this point -- what aspects of the report are clear from the computer's perspective and what aspects still need feedback from a human to confirm. This is the first aspect, which is important because it's about trust and confidence. The second aspect looks at what we call behavioral biomarkers or behavior markers: What did we see, what are the changes in interesting behaviors we saw. Finally, we look at which bracket of symptom intensity this person displays. For example, for psychosis, are the observed symptoms more positive or more negative? This is a holistic kind of assessment.
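The confidence-first structure Morency describes can be sketched as a simple triage over report entries: findings the model is sure about are surfaced directly, while low-confidence ones are flagged for a clinician to confirm. The marker names, confidence values and threshold below are all assumptions for illustration:

```python
# Hypothetical post-session report entries: each behavior marker comes
# with the model's confidence in that observation (values are invented).
markers = [
    {"marker": "reduced smile frequency", "confidence": 0.91},
    {"marker": "flattened vocal pitch", "confidence": 0.55},
    {"marker": "longer response latency", "confidence": 0.88},
]

def triage(entries, threshold=0.8):
    """Split entries into confident findings vs. ones a human should confirm."""
    clear = [e["marker"] for e in entries if e["confidence"] >= threshold]
    review = [e["marker"] for e in entries if e["confidence"] < threshold]
    return clear, review

clear, review = triage(markers)
print(clear)   # ['reduced smile frequency', 'longer response latency']
print(review)  # ['flattened vocal pitch']
```

Separating the two lists is what makes the report trustworthy: the clinician sees not just what the system observed, but how much weight each observation deserves.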
What kinds of machine learning do you use? Deep neural nets, I assume, but what else?
Morency: We use a hybrid approach. Deep learning is useful when we have objective measures. For example, I want to know if you're smiling or not. This is something humans can annotate precisely and reliably. Deep neural networks are good for that.
But when we have things that are a little bit fuzzier, where a judgment is needed, then we start looking at graphical models or probabilistic models because they allow us to model the uncertainty of a measure. That becomes important when you have measures that are either subjective or carry inherent uncertainty.
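To make the contrast concrete: where a deep network would output a single hard label, a probabilistic model keeps an explicit uncertainty estimate. The sketch below uses a Beta posterior over annotator votes for a subjective judgment -- an assumed, textbook formulation chosen for illustration, not a description of Morency's actual models:

```python
import math

# Sketch of the uncertainty-aware side of a hybrid approach: model a fuzzy,
# subjective judgment -- e.g., "does this utterance sound distressed?" --
# as a Beta posterior over annotator votes rather than a single hard label.
def beta_posterior(yes_votes, no_votes, prior_a=1.0, prior_b=1.0):
    """Return posterior mean and standard deviation of the 'yes' probability."""
    a = prior_a + yes_votes
    b = prior_b + no_votes
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Three of five annotators said "distressed": the estimate now carries
# explicit uncertainty instead of a bare 0.6.
mean, std = beta_posterior(3, 2)
print(round(mean, 3), round(std, 3))  # 0.571 0.175
```

The standard deviation is what a downstream report can use to decide whether a finding is "clear" or still "needs feedback from a human," in the sense Morency describes.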