Voice-enabled devices of all sorts are increasingly finding their way into our daily lives. What may have started out as smart speakers sitting on our desks or kitchen countertops has rapidly evolved into a diverse collection of devices and embedded technologies that provide valuable voice assistant capabilities in a wide range of interfaces.
Devices from a variety of vendors are now appearing in an increasing array of locations. You can speak to your television, interact with your toaster oven, talk to your car and, perhaps soon, even have a conversation with your bed. The use of intelligent, conversational systems powered by AI and machine learning is starting to become ubiquitous.
However, as these conversational systems spread into increasingly diverse settings, users are looking to take the benefits of voice assistant technology into new and more challenging realms. Rather than serving as simple music players and query tools, as they did in their original role, these systems are being asked to control various interfaces, provide more complicated responses and deliver more value for their users.
What might have been an acceptable level of intelligence just a year or two ago is now becoming a hindrance. It is no longer acceptable for these devices to respond unintelligently to queries, to push users to a web search or to respond that they can't help. These systems are being asked to be more intelligent and, as such, are starting to push the limits of what the back-end AI systems can do.
Benchmarking voice assistant intelligence
In 2018, AI research and advisory firm Cognilytica began measuring the intelligence of voice assistant devices to determine their knowledge and reasoning capabilities. This past week, Cognilytica released its most recent update of the benchmark, showing increasing capability and intelligence of the devices across a range of measures.
The benchmark measures conversational system intelligence by asking 120 questions grouped into 12 categories of varying levels of cognitive challenge. For example, one question asks, "Should I put a wool sweater in the dryer?" Another poses a more complicated formulation: "Paul tried to call George on the phone, but he wasn't successful. Who wasn't successful?"
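To make the grading concrete, here is a hypothetical sketch of how responses to the 120 questions might be aggregated into an overall score. The report grades each response into a category (Category 0 through Category 3 are mentioned later), but the rubric, the adequacy threshold and the sample data below are all assumptions for illustration only, not Cognilytica's actual methodology.

```python
# Hypothetical aggregation of per-question grades into an "adequate
# response" percentage. The threshold at which a response counts as
# adequate is an assumption, not the benchmark's published rubric.

def percent_adequate(scores, threshold=2):
    """Share of responses graded at or above a category threshold."""
    adequate = sum(1 for s in scores if s >= threshold)
    return 100.0 * adequate / len(scores)

# Invented sample grades for 120 questions: mostly failures, with some
# partial and perfect answers.
sample_scores = [0] * 70 + [1] * 20 + [2] * 18 + [3] * 12

print(f"{percent_adequate(sample_scores):.1f}% adequate")  # → 25.0% adequate
```

With this toy data, 30 of 120 responses clear the threshold, yielding 25% adequate -- the kind of figure the benchmark reports per vendor.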
The aim of the benchmark isn't to test the speech recognition skills of the various devices. With proper tuning and training, these devices are capable of handling almost any voice in many languages. Rather, the benchmark aims to determine how intelligent the back-end AI system is that's responsible for understanding the question being asked, formulating a response and then generating that response back to the user. The intelligence of the back end plays a crucial role in whether devices are able to deliver on the benefits of voice assistant technology.
While the speech recognition capabilities of voice assistants are often fairly simple and use technology that has evolved over the past few decades, the more difficult cognitive conversational question answering uses rapidly evolving machine learning technology that sits in the cloud infrastructure operated by voice assistant vendors. In essence, the benchmark isn't evaluating the devices themselves, but rather the intelligent capability of the AI cloud infrastructure that supports those devices.
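The split described above -- speech recognition at the front end, cognitive question answering in the vendor's cloud -- can be sketched roughly as follows. All function names and the tiny lookup "knowledge base" are invented for illustration; no vendor's actual API is shown.

```python
# A minimal, hypothetical sketch of the two-stage pipeline: the device
# converts speech to text, then hands the question to a cloud-hosted
# AI back end for understanding and response generation.

def speech_to_text(audio: str) -> str:
    # Stand-in for on-device speech recognition; for demonstration,
    # the "audio" is already a transcript that we just normalize.
    return audio.strip().lower()

def cloud_answer(question: str) -> str:
    # Stand-in for the cloud back end that interprets the question and
    # formulates a response. A tiny lookup plays the role of the
    # vendor's knowledge and reasoning layer.
    knowledge = {
        "should i put a wool sweater in the dryer?":
            "No -- heat can shrink wool. Lay the sweater flat to dry.",
    }
    return knowledge.get(question, "Sorry, I can't help with that.")

def assistant(audio: str) -> str:
    return cloud_answer(speech_to_text(audio))

print(assistant("Should I put a wool sweater in the dryer?"))
```

The point of the sketch is that the device itself contributes little intelligence: everything the benchmark measures happens inside the `cloud_answer` stage.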
Surprising differences in voice assistant capabilities
In the 2018 version of the benchmark, voice assistants as a whole earned a failing grade, with Amazon's Alexa mustering the greatest number of adequate responses -- only 25% of the total questions asked. Google was a close second, with 23% adequate responses. Microsoft's Cortana and Apple's Siri trailed far behind, with only 12% and 11% of responses categorized as adequate, respectively.
In the 2019 iteration of the report, voice assistants have shown dramatic improvements. Amazon's Alexa still comes out on top with the greatest number of adequate responses, at 34.7% of total questions asked. Google and Microsoft devices are close behind, at 34.0% and 31.9%, respectively. Apple's Siri still trails behind, at 24.3% adequate responses.
While these conversational systems have shown substantial improvement since the first iteration of the benchmark, the devices as a whole are still far from delivering the promised benefits of voice assistant technology. Not a single system can muster suitable responses to at least half of the questions asked. This brings up a big question: Are these devices adequate for the tasks people are using them for?
Cognilytica's benchmark shows they still miss the mark with regard to many routine and expected questions users might ask today -- and perhaps even more so for the sorts of questions users might ask tomorrow, given the places these conversational systems are being put into use.
Growing the knowledge graph
In addition to speech-to-text and natural language understanding capabilities, making these conversational systems able to respond to complex queries requires the creation of deep repositories of information from which these systems can draw, as well as knowledge graphs that connect concepts together in a way that machines can understand. While there is an almost limitless amount of information available on the web from a wide variety of sources from which conversational systems can draw, the same cannot be said for knowledge graphs.
Machines use knowledge graphs to reason about connections between different words and concepts and to construct meaningful replies that are relevant to what is being asked. Because knowledge graphs are so important to the quality of the responses, each of the conversational system vendors is working on building its own cloud-based knowledge graphs to power its systems.
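The kind of reasoning a knowledge graph enables can be illustrated with a toy example. Facts are stored as (subject, relation, object) triples, and a simple graph walk connects concepts -- for instance, linking both "wool sweater" and "dryer" to "high heat," which is the chain a system would need to answer the benchmark's sweater question. The triples and relation names below are invented for illustration; real knowledge graphs are vastly larger and richer.

```python
# A toy triple store and a breadth-first walk over it, illustrating how
# a knowledge graph lets a machine connect concepts. All facts and
# relation names here are invented for the example.

from collections import deque

TRIPLES = [
    ("wool sweater", "made_of", "wool"),
    ("wool", "damaged_by", "high heat"),
    ("dryer", "produces", "high heat"),
]

def related(entity):
    """Yield the objects directly linked from an entity."""
    for subj, _rel, obj in TRIPLES:
        if subj == entity:
            yield obj

def connected(start, target):
    """Return True if a chain of triples links start to target."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for obj in related(node):
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return False

# Both the sweater and the dryer connect to "high heat", so a system
# can infer that the dryer is a bad idea for the sweater.
print(connected("wool sweater", "high heat"))  # → True
print(connected("dryer", "high heat"))         # → True
```

A web search can return documents about wool and dryers, but only this kind of connected structure lets a machine infer the answer rather than merely retrieve text -- which is why vendors invest so heavily in building these graphs.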
According to Amazon, there are over 10,000 workers in its Alexa division alone, many of whom no doubt are helping to create, manage and power those knowledge graphs. Google, Microsoft and Apple have similar staffing numbers and are furiously building up their knowledge graphs to handle the increasingly complex requirements of their rapidly growing user bases.
In fact, Amazon, Apple and Microsoft have each faced scrutiny over their use of humans in the loop to help power their devices. While many have accused these firms of not disclosing that humans are listening in to parts of voice assistant conversations, the reality is that humans are needed to help build, maintain and fix the knowledge graphs over time and make them more useful.
Indeed, while performing the latest benchmark, Cognilytica analysts noticed that Amazon Alexa's responses to one of the questions changed after it was asked multiple times, with initial Category 0 responses later improving to perfect Category 3 responses. This might be a result of Amazon's recently announced Answer Updates feature, which sends failed responses back to its internal teams to resolve and update so that a more meaningful response can be delivered in the future.
While these voice assistants might not currently get a passing grade in even a kindergarten class, it's clear devices continue to get more intelligent over time, and the vendors are determined to make them an intelligent part of our daily lives. With continued improvement, these devices may soon deliver the promised benefits of voice assistant technology.