While AI-powered speech recognition and voice technology have changed how consumers interact with their devices and how businesses capture and process audio data, the technologies still have some limitations.
In particular, some speech recognition platforms struggle to accurately transcribe speakers with heavy accents or speakers conversing in a language other than English. While there are many products on the market that work well in languages other than English, it's rare for a single platform to perform well with multiple languages.
Also, platforms generally struggle to understand speaker intent, limiting the amount of automation a user can perform on a document.
In this Q&A, Wilfried Schaffner, CTO of Speech Processing Solutions (SPS), discussed some of these limitations.
While SPS doesn't build its own speech recognition engines -- it instead relies on third-party speech recognition software -- it makes hardware, including AI-powered microphones and handheld recorders to capture speech, as well as document workflow software.
The COVID-19 pandemic has caused a spike in the use of speech recognition and voice technology as organizations speed up digital transformation processes, Schaffner noted.
In particular, Schaffner said he has seen more people in healthcare turn to speech technologies as their already busy schedules have become busier.
So, perhaps now, more than ever before, it's important for organizations to understand the benefits and the limitations of speech technologies.
What do you see as the current limitations of AI and speech technology for enterprises?
Wilfried Schaffner: First and foremost, I would say look at the consumer area. Voice recognition is huge with products like Google Home and Alexa. But, think of when you speak to Alexa. You've regularly got a situation where Alexa just doesn't understand you. In professional environments, 90% accuracy is not enough. To really make a business impact, you need more than 90% reliability. We see a need of 98% reliability in our studies.
If you build a service as a business where you would like to optimize and cut time, you don't want to work [on the transcription after it has been transcribed]. What I see as a limitation is quality. On one side, quality is determined by the actual speech recognition engine. On the other end, you need to get the right microphones and hardware, as they make a big difference. Once reliability gets better, then the deliberations will go away, more and more.
I'm pretty certain that this is what's happening right now. I think we are providing better microphones; we are providing better algorithms and better noise-canceling functionalities, and that should help to actually improve the recognition rate of a speech recognition engine.
It seems that many speech recognition vendors struggle to accurately recognize languages other than English, or someone speaking with a heavy accent. Do you see this as a problem enterprises are facing?
Schaffner: For sure. You know, there are two sorts of AI. One version of AI is a trained model; it works the way it's trained. Then there are more advanced AI solutions that actually train themselves by learning the text. When you correct the text, you have a learning loop in there. So, there are really good solutions using that already that can learn accents.
So, I think it's just a matter of time [until this isn't a problem anymore]. But currently it's quite fragmented I'd say, due to all the different languages out there. Instead of one engine that can recognize every language, we see lots of engines popping up in the markets [targeting specific languages]. We see a lot of engines popping up in the Middle East.
What are businesses doing with all of their recorded audio? Some enterprises appear to struggle to store their recorded conversations. What are your thoughts?
Schaffner: For sure. This is why we need to bring businesses together with state-of-the-art software.
In medical, legal or insurance, areas that we work in heavily, this is quite common, as they have many recordings come in. That's why we have workflow solutions where you drop all the recordings in and process them and store them in the right place. But then you have to decide what to do with the recording. You stored the recording, but can you understand intent from the recording? This is where AI has to become much stronger, because the intent is useful, not just the recording. But that's a whole process. For this problem you just mentioned, there are enough standard solutions around to actually process the whole pile of recordings.
You mentioned intent. AI still needs to advance a long way to properly capture intent.
Schaffner: You're right, it's a lot of work. But I also think we, like many things in life, should not shoot for the moon immediately. I think it's step by step.
You could transcribe an audio letter into a text letter, but then what do you do with it? You still need to save it for a certain client. You have to click into your CRM [customer relationship management] system and search for the right client, and that takes four or five or six clicks. It adds on 30 seconds, 80 seconds, whatever, until you have it saved. So, we try to solve small intents at the beginning of the process. The solution can listen in to the letter, determine who the client is, find the client in the CRM system, and attach the recording to the correct client. This is the starting point of where we are at. But you're right, at this time, having a system that can fully automate all of your intents is a bit far-fetched. Right now, it's about starting at the beginning.
What industries are currently using AI and speech technology the most?
Schaffner: As I mentioned before, there are the medical and legal markets, and they are the two top users. If you look at Nuance, which is a market leader in speech recognition software, those are the two markets they serve, pretty much. The biggest after that is insurance and then law enforcement.
We recently built a new AI-based microphone that's capable of separating two speakers. Think of a doctor-patient situation, where the doctor is talking to the patient, then turns around and dictates into a microphone the text he wants transcribed. But why not record the conversation between the patient and doctor? Why not have AI then process the conversation and create a text?
The big problem there is that the conversation is not recorded. See, it's quite difficult, even for a human, to separate two speakers when two people speak at the same time. We built a product with a special array of microphones and AI that is capable of actually separating two speakers, even when they do speak at the same time. The separation is done at a quality that enables speech recognition software to process both streams. Currently, we can only separate out two speakers, but we are working on more.
What this means is that [technology like this] can open up a huge new segment of users, because, suddenly, you can document conversations. This is needed in insurance, to capture an insurance agent selling something, or in the financial industry, to capture a financial advisor giving advice so you can document what advice you're given, because maybe it's the wrong advice. There's a huge segment opening up with the power of AI, with additional computational power, where we are able to record conversations. I think this is a huge additional market.
Editor's note: This interview has been edited for clarity and conciseness.