The future of voice assistants is multiturn conversations

The future looks promising for voice assistants, but for them to really live up to the hype, they are going to have to improve at true multiturn conversations.

It's hard to believe that Amazon's Alexa device has only been around for five years. In just these few short years, conversational interface-based devices have quickly gained popularity. But, while adoption of voice assistant-based devices is increasing, there are some key areas in which the technology needs to grow for the future of voice assistants to reach its full potential.

While some in the industry call these devices smart speakers, the better nomenclature is voice assistants, which more accurately defines these as voice-based conversational interfaces paired with intelligent cloud-based back ends.

Voice assistants, such as Amazon Alexa, Google Assistant (Google Home), Apple Siri, Microsoft Cortana, Samsung Bixby and an increasing number of entrants into the space, are becoming a part of people's daily routines. These voice-activated assistants enable a wide range of capabilities, such as answering questions on a variety of topics, helping with conversational commerce, playing music, handling personal and business assistant tasks, and other related activities. As the adoption of voice assistants continues to increase, users are expecting more functionality in these assistants.

Complicating voice interactions: Multiturn conversations

One key area in which voice assistants need to improve is multiturn conversations. Since humans feel most comfortable conversing in natural language rather than typing, clicking and swiping, it's no surprise that people generally feel comfortable conversing with voice assistants. However, unlike a human conversation partner, natural language processing (NLP) technology has typically only been able to answer one conversational interaction at a time rather than enabling longer-form, multiple back-and-forth conversational interactions.

Without the ability to engage in longer conversations, virtual agents struggle to carry the context of one interaction or question to the next. The ability to handle multiple conversational interactions in one context may seem like a simple problem to solve for advanced NLP technology, but figuring out how to get machines to understand spoken words and generate understanding of the communicator's intent is actually quite hard. The challenge lies in the fact that these voice assistants need to take multiturn interactions and connect them together in a cohesive manner. Rather than just responding to individual phrases or sentences, users want to have longer-form, back-and-forth, multiturn conversations.
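To make the idea of carrying context from one turn to the next concrete, here is a minimal sketch in Python. It is an illustration only, not how any vendor's assistant works: the `DialogueContext` class, the `handle_turn` function and the intent and slot names are all hypothetical, and the sketch assumes that speech recognition and intent detection have already happened upstream.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueContext:
    """Details remembered across turns (hypothetical structure)."""
    slots: dict = field(default_factory=dict)


def handle_turn(context: DialogueContext, intent: str, new_slots: dict) -> dict:
    """Merge this turn's details into the running context and return
    the fully resolved request for the given intent."""
    context.slots.update(new_slots)
    return {"intent": intent, **context.slots}


context = DialogueContext()

# Turn 1: "What's the weather in Seattle today?"
first = handle_turn(context, "get_weather", {"city": "Seattle", "day": "today"})

# Turn 2: "What about tomorrow?" supplies only the detail that changed;
# the city is carried over from the previous turn.
second = handle_turn(context, "get_weather", {"day": "tomorrow"})
```

A single-turn system would treat "What about tomorrow?" as an incomplete request. Here, the remembered city fills the gap, which is the essence of connecting turns in a cohesive manner.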

Figure: Data on how people use voice assistants

To tackle the problem of multiturn conversations, voice assistant vendors, such as Amazon and Microsoft, are looking to AI and machine learning. With the ability to handle longer-form interactions, these voice assistant devices will simplify interaction with voice applications. Rather than requiring users to open and query multiple skills -- which are essentially applications run by the voice assistant -- for various needs and ask questions one at a time, multiturn, conversation-enabled devices can connect together multiple voice applications or skills as part of one conversational interaction and create a seamless experience for the user.

At the Amazon Re:Mars conference, held in Las Vegas in June, Amazon demonstrated Alexa Conversations, which enables a conversational thread across multiple skills, all in one coherent conversation. The "night out" conversation lets a user purchase movie tickets, make dinner reservations and request an Uber ride all in one conversation.

Similarly, Microsoft is showcasing these kinds of multiturn, multi-domain and multi-agent experiences. Last year, the company acquired Semantic Machines, an AI startup that provides next-generation conversational AI technology. Microsoft plans to use the technology to support multiturn interactions for business users, helping them do things like book a conference room and update participant schedules, all with one sentence.

Google has also been working to make its voice assistant capable of holding longer conversations. When asking multiple questions in a row, users no longer need to say the activation phrase "OK, Google" or "Hey, Google" every time they give a command. Users can also ask multiple questions at once, and Google Assistant will respond to all of them. For example, users can ask about both the weather and a trivia question, and Google will provide answers to both questions.

Dealing with voice application disintermediation

As virtual assistants handle more complex interactions with less friction, because the user no longer has to open a different skill for each action, the question arises: Which skills will my voice assistant use? If the future of voice assistants is one in which the program will choose for you which applications or skills to use as part of a multiturn interaction, this will, in some ways, disintermediate the user by giving the voice assistant the power to choose which skills to use for any particular context.

The increasing control of the voice assistant device over voice application choice might concern voice application developers who want to open up their skills to a broad audience. If the voice assistant can choose which skills to use, will this undermine the ability of these voice application developers to get users?

For example, with the current, more simplistic conversation interaction modes, if you want to order a pizza using the voice assistant, you need to select the appropriate pizza voice application and then enable it for use on your device. When you then order your pizza with voice commands, you use the pizza skill you selected or perhaps one of many that are enabled on the device.

However, in the example Amazon showed with the night out experience, it used the Atom Tickets skill when ordering movie tickets. At no point in the interaction did Alexa ask the user which movie ticket system they wanted to use. Does this mean that Amazon will somehow become the arbiter of which skills will win out to simplify multiturn conversational interaction? Will companies not included in these multiturn experiences lose out on customers?

While there's no doubt that people want increasingly powerful and simpler interactions with their voice assistants, it's also clear that we're still in the early days for these multiturn experiences and the impact they will have on the market. The future of voice assistants will largely turn on how these issues resolve.
