'Virtual humans' pick up on social cues
Carnegie Mellon University's Justine Cassell talks about her efforts to turn software into 'virtual humans.'
In the fifth episode of Schooled in AI, Justine Cassell, a professor at Carnegie Mellon University's Human-Computer Interaction Institute, talks about building socially aware robots and the importance of rapport in artificial intelligence.
In this episode, you'll learn:
- How "virtual humans" could benefit from the art of small talk
- Why Cassell's research starts with observing human-human interactions
- How machine learning has benefited and challenged her work
To learn more about Cassell and her research, listen to the podcast by clicking on the player above or read the full transcript below.
Transcript - 'Virtual humans' pick up on social cues
Justine Cassell: Something I've learned more recently about behavior both in human-human interaction and in human-robot interaction is that part of rapport building is being predictable.
Cassell said it was a graduate student's research that led to the observation.
Cassell: People come into an interaction with a set of expectations. And adhering to those expectations -- whatever they are -- increases their felt rapport.
The discovery has not only helped Cassell become a better teacher, but it's helped her build better robots -- machines that are reliable, predictable, transparent -- even admitting when they've hit their programming wall. So, they'll tell the user that they'll …
Cassell: … make sure my developers pay more attention to that skill when they do the next revision of my software. People love that. They really respond positively to a robot that expresses its limitations. And that's reliability.
OK, so, in this episode, we're going to touch on something a little different -- we're going to talk about social awareness in artificial intelligence and why things like trust and reliability and even small talk are important components to consider when building what Cassell calls 'virtual humans,' or, as they're more commonly referred to these days, 'virtual digital assistants.'
Cassell's virtual humans are trained to read between the lines, to reason when an insult is a tease and not a put-down. They're modeled on human behavior in an effort to build connections -- to build rapport -- with the people they're working with.
Cassell: I try in my work to find aspects of human interaction that, really, in some way defines what it means to be human and, yet, has been to date ignored by the human-computer interaction community. That's been the case with social language.
Decoding human behavior
The first virtual human Cassell built, with the help of a Ph.D. student, was a real estate agent. The work was based on intensive study of the human version and the objective was to observe and document how a real estate agent builds trust with her clients.
Cassell: After roughly six months of studying a real estate agent and then just to give us a kind of balanced view, also eyeglass salesmen who sell to optometrists, we discovered that small talk -- the little social talk that we engage in every day with people we know or don't know very well -- can be predicted.
It was an important finding because social interactions can establish connections between two people that influence how work gets done. Here's how Cassell put it.
Cassell: Human-human social interaction greases the wheels of task interaction. It facilitates the work we do with other people.
So, they built a small-talk feature into their virtual real estate agent and discovered that clients, especially those who are extroverts, felt they were better known by the agent and felt that the transaction process was smoother.
Cassell: And that was the kind of effect that sent me off on the path of looking at other aspects of human social interaction and thinking through how to best use them in a human-computer interaction.
Her more recent iterations of virtual humans include an algebra tutor and a virtual personal assistant. And, like the real estate agent, the work that goes into building a virtual human doesn't start with technology -- but with people.
Cassell: My work always starts with human-human interaction.
We're talking thousands of hours of conversations that are video recorded and then annotated in one-thirtieth-of-a-second slices.
Cassell: What I mean by that is we had to transcribe the words and the gestures and the eye gaze and the head nods -- everything we were interested in -- into text. And then we go through and mark the places that we thought were important.
Things like small talk and praise or something like negative self-disclosure, which teenagers are prone to use and, when used correctly, can build rapport between two people.
Cassell: So, they say things like, 'Oh, don't worry. I suck at math, too.' That's a negative self-disclosure where you disclose something about one's self that's negative.
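As a rough illustration of that annotation scheme, the Python sketch below stores each label against one-thirtieth-of-a-second slices of a recording. The Annotation class, its field names and the sample values are hypothetical; they are not taken from Cassell's tooling or data.

```python
from dataclasses import dataclass

SLICES_PER_SECOND = 30  # one slice = 1/30 of a second, as described above


@dataclass(frozen=True)
class Annotation:
    """One labeled span of a recording (hypothetical schema)."""
    speaker: str      # e.g. "tutor" or "tutee"
    channel: str      # e.g. "speech", "gaze", "head_nod", "face"
    label: str        # e.g. "negative_self_disclosure"
    start_slice: int  # first slice covered, inclusive
    end_slice: int    # last slice covered, inclusive

    def duration_seconds(self) -> float:
        """Length of the span in seconds."""
        return (self.end_slice - self.start_slice + 1) / SLICES_PER_SECOND


def overlapping(a: Annotation, b: Annotation) -> bool:
    """True when two annotations share at least one slice."""
    return a.start_slice <= b.end_slice and b.start_slice <= a.end_slice


# Illustrative values only, not real data.
disclosure = Annotation("tutee", "speech", "negative_self_disclosure", 90, 149)
smile = Annotation("tutor", "face", "smile", 140, 170)
```

Keeping speech and nonverbal behavior on the same slice grid makes it straightforward to ask whether, say, a smile overlapped a self-disclosure.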
Microworkers and machine learning
Initially, she hired undergraduates and research assistants to annotate her video and audio recordings.
Cassell: More recently, I use microworkers such as Mechanical Turk workers. People online who I'm never going to meet and who will do small pieces of work for small sums of money.
And in the past, she would translate all of the annotations into hard code for the robot, writing a rule that when rapport is low, for example, the algebra tutor should engage in negative self-disclosure.
Cassell: These days, however, I use machine learning, and increasingly, deep learning on the human-human data to automatically choose where negative self-disclosure should be performed. And to automatically make the robot perform that negative self-disclosure.
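For contrast, the earlier hand-coded approach amounts to a rule like the one in this Python sketch. The function name, the move labels and the numeric cutoff are assumptions made for illustration; the source does not say what counted as low rapport.

```python
RAPPORT_LOW_THRESHOLD = 3  # hypothetical cutoff; the source gives no number


def choose_tutor_move(rapport: int) -> str:
    """Hand-written rule: low rapport triggers negative self-disclosure."""
    if rapport < RAPPORT_LOW_THRESHOLD:
        return "negative_self_disclosure"
    return "continue_task"
```

A rule like this is easy to read and explain, but someone has to write every condition by hand.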
The machine learning technologies include dialogue systems as well as deep neural networks, which are used to understand the context in which a behavior occurred. One example is Long Short-Term Memory, or LSTM, a type of recurrent neural network.
Cassell: So, something called LSTM, for example, looks at the context in which negative self-disclosure occurs and keeps that context in mind when making the robot perform a similar behavior.
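To give a feel for what keeping context in mind means, here is a minimal Python sketch of a single LSTM unit with scalar inputs. It follows the standard LSTM gate equations, but the weights are arbitrary and untrained, and nothing here reflects the models Cassell's group actually uses.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell with scalar input and state."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: the carried-over context
    h = o * math.tanh(c)     # hidden state: what the cell exposes
    return h, c


def run(sequence, w):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h


# Arbitrary, untrained weights chosen only for illustration.
WEIGHTS = {"wf": 0.5, "uf": 0.1, "bf": 1.0,
           "wi": 0.8, "ui": 0.2, "bi": 0.0,
           "wo": 0.6, "uo": 0.3, "bo": 0.0,
           "wg": 1.0, "ug": 0.5, "bg": 0.0}

# Same final input, different history: the outputs differ because the
# cell and hidden states still carry information about earlier inputs.
after_positive_history = run([1.0, 1.0, 0.5], WEIGHTS)
after_negative_history = run([-1.0, -1.0, 0.5], WEIGHTS)
```

The last input is identical in both runs, yet the final output is not, which is the sense in which the network remembers what came before.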
Cassell said machine learning is a double-edged sword in her line of work. It can uncover rules at a much more granular level automatically -- rules that are more concrete than she could write by hand. But there are drawbacks, too.
Cassell: One is you need an awful lot of data to get a stable and reliable machine learning rule.
The data that Cassell uses to train a machine learning algorithm isn't easy to come by. Hours and hours of video-recorded conversations still need to be annotated by humans, and that's a time-consuming, laborious task -- even with an army of microworkers at her disposal.
Cassell: It's a challenge that my students and I face every day.
The second drawback is the so-called black box. With machine learning models, the inputs and the outputs are known. But how a model arrives at the output is not known. It's another way of saying that it's unclear how a machine learning model determines the rules for engagement.
Cassell: Sometimes, when I look at the errors that the system has made, when it says something really outrageous, something that really doesn't make sense in the context, I look at the nature of that error and find out that the machine learning paid attention to something that I know to be irrelevant.
Cassell and her students are trying to find a way to crack open the black box -- at least a little -- and create a kind of human-machine learning cycle where machine learning models benefit from human expertise and where human expertise benefits from machine learning models.
Cassell: I want the goal of my research to teach us things that can help psychiatrists, that can help real estate agents, that especially can help teachers and tutors in classrooms. And with machine learning, there's really no way to help people because, since that machine learning algorithm is a black box, I don't know why those rules have been chosen and what the rules really are.
A model for social awareness
One of the ways she's doing this is by building theories of human behavior and conversational strategies into machine learning models, essentially using those theories and strategies to filter recommendations. That has required Cassell to, in her words, 'build a social awareness system that can understand language, recognize the kinds of behaviors that build rapport, reason about what an appropriate response would be and then generate the response and speak it using speech synthesis.'
Cassell: That's quite a task. It's an end-to-end dialogue system, which means that it has a bunch of modules. And we've built three modules that have never been built before and added them to a fairly classic dialogue system.
The first module is a conversational strategy classifier.
Cassell: That means it listens to what the person said and what they've done with their nonverbal behavior, because it turns out that that helps.
It even listens for the quality of the person's voice -- is the person speaking loudly or softly.
Cassell: And it uses machine learning to automatically detect which kind of conversational strategy the human has uttered. Are they self-disclosing? Are they praising? Are they violating social norms? And so forth.
It sends the relevant information to a rapport estimator module, which uses something called temporal association rules.
Cassell: That's a kind of machine learning rule that can predict the outcome of a series of events with a temporal relationship amongst those events.
Let's take the algebra tutor, for example. If the student violates social norms, teases the tutor and then smiles and the tutor also smiles, rapport will be high.
Cassell: But when the tutee violates social norms and smiles and the tutor does not smile but interrupts to keep speaking about the task at hand, rapport will be low. Isn't that cool?
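The two tutoring examples can be written down as ordered patterns over timestamped events. The Python sketch below is only an illustration of that idea: the rules are typed in by hand rather than learned from data, and the Event and TemporalRule types, the behavior labels and the 10-second window are all assumptions.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    time: float    # seconds from the start of the session
    actor: str     # "tutee" or "tutor"
    behavior: str  # e.g. "violates_norm", "smiles", "interrupts"


class TemporalRule(NamedTuple):
    pattern: tuple  # ordered (actor, behavior) pairs
    window: float   # max seconds between first and last matched event
    outcome: str    # predicted rapport level


def matches(rule: TemporalRule, events: list) -> bool:
    """True if the pattern occurs in order within the rule's time window."""
    ordered = sorted(events)  # tuples sort by time first
    for start_index, first in enumerate(ordered):
        if (first.actor, first.behavior) != rule.pattern[0]:
            continue
        position = 1  # next pattern element to look for
        for event in ordered[start_index + 1:]:
            if event.time - first.time > rule.window:
                break
            if (position < len(rule.pattern)
                    and (event.actor, event.behavior) == rule.pattern[position]):
                position += 1
        if position == len(rule.pattern):
            return True
    return False


def predict_rapport(events: list, rules: list) -> Optional[str]:
    """Return the outcome of the first matching rule, or None."""
    for rule in rules:
        if matches(rule, events):
            return rule.outcome
    return None


# Hand-written stand-ins for the two examples in the text.
RULES = [
    TemporalRule((("tutee", "violates_norm"), ("tutee", "smiles"),
                  ("tutor", "smiles")), 10.0, "high"),
    TemporalRule((("tutee", "violates_norm"), ("tutee", "smiles"),
                  ("tutor", "interrupts")), 10.0, "low"),
]

playful = [Event(1.0, "tutee", "violates_norm"),
           Event(1.5, "tutee", "smiles"),
           Event(2.0, "tutor", "smiles")]
brushed_off = [Event(1.0, "tutee", "violates_norm"),
               Event(1.5, "tutee", "smiles"),
               Event(2.0, "tutor", "interrupts")]
```

Here predict_rapport(playful, RULES) returns "high" and predict_rapport(brushed_off, RULES) returns "low", mirroring the two cases Cassell describes.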
The third module is a social reasoner. It takes input from the conversational strategy classifier about which strategy the person is using, takes information from the rapport estimator and then ranks the level of rapport on a scale from one to seven. It then reasons about what a good response might be.
Cassell: Now, rapport always goes up if you use the same strategy as the other person -- that's called reciprocity and reciprocity always raises rapport or usually raises rapport. So, the reasoner might think, 'Hmm, in the absence of any other data, reciprocity would be good. So, if the human self-discloses, I'll self-disclose.' And that actually turns out to be a good way to raise rapport.
Or the system might notice that, in spite of the high number of self-disclosures, rapport is still low. And so it changes its tactic.
Cassell: And so the social reasoner is going back and forth amongst those different strategies and, finally, settles on one -- perhaps praise. And it's using a kind of neural network called a spreading activation network to choose amongst those strategies.
That's then sent to the natural language generation part of the dialogue system, which generates some kind of praise. And then, and this is critical, the cycle repeats itself.
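The reasoning described above, reciprocity by default and a change of tactic when rapport stays low, can be sketched in a few lines of Python. This is plain if/else logic, not the spreading activation network Cassell mentions, and the strategy list, the threshold of 4 on the one-to-seven scale and the repeat limit are all assumptions.

```python
STRATEGIES = ["self_disclosure", "praise", "small_talk"]  # hypothetical set


def choose_strategy(user_strategy, rapport, history,
                    low_rapport=4, max_repeats=2):
    """Pick the agent's next conversational strategy.

    user_strategy: label from the strategy classifier
    rapport: estimated rapport on a 1-to-7 scale
    history: strategies the agent has already used, most recent last
    """
    recent = history[-max_repeats:]
    # Reciprocity has been tried repeatedly and rapport is still low.
    reciprocity_exhausted = (
        rapport < low_rapport
        and len(recent) == max_repeats
        and all(s == user_strategy for s in recent)
    )
    # Default: mirror the user's strategy (reciprocity).
    if user_strategy in STRATEGIES and not reciprocity_exhausted:
        return user_strategy
    # Otherwise switch to the first strategy not used recently.
    for candidate in STRATEGIES:
        if candidate not in recent:
            return candidate
    return STRATEGIES[0]
```

With an empty history, a self-disclosing user gets a self-disclosure back. After two self-disclosures in a row with rapport still at 3, the same call returns "praise" instead.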
Cassell: Now, I don't want to tell tales out of school, but we're collecting data on this right now, and it does look as if we may have found that when rapport rises over the course of an interaction that, in the tutoring case, the student learns more than when rapport doesn't rise. And so, we go back to the start and we say, 'OK, we've got some good data here, but we've got some other places where some things aren't working. Let's go back and revise this.' And then we go round and round or, as I say, we rinse and repeat.