For the 11 million US residents who are hard of hearing or deaf, the development of AI speech recognition technology is not just hype. It's a sign of hope for tools that may eventually help them get through the day.
Automated speech recognition technologies can range from simple transcription services in call centers to dating app matching algorithms, but a burgeoning consumer use case focuses on software and apps aimed at people who are deaf and hard of hearing.
The road to speech to text
Bern Elliot, research vice president and distinguished analyst for artificial intelligence and customer service at Gartner, said the first generation of rule-based speech technology was made up of separate algorithms: one transformed WAV audio data into sound units, and another mapped those sounds to words.
Statistical grammar models then learned to recognize those mapped words as speech and produce a transcription. Now the term "speech to text" (STT) reigns supreme, and speech-to-text technology excels at captioning in real time.
Most current STT algorithms use neural networks and machine learning models to create transcripts. For example, Microsoft's STT uses two machine learning models that work together -- the first takes spoken language and translates it to text, and the second takes that written language and translates it into the form people would want to read, Elliot said.
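The two-model arrangement Elliot describes can be sketched as a simple pipeline: a first stage emits raw spoken-form text, and a second stage rewrites it into readable form. The rule table and function names below are invented for illustration and are not Microsoft's actual implementation.

```python
# Minimal sketch of a two-stage STT pipeline, per Elliot's description.
# Stage 1 output and the rewrite rules are illustrative placeholders.

RAW_ASR_OUTPUT = "i paid twenty two dollars on may third"  # stage 1: spoken words as text

# Stage 2 rules: rewrite spoken forms into the way people expect to read them
# (often called inverse text normalization).
ITN_RULES = [
    ("twenty two dollars", "$22"),
    ("may third", "May 3"),
]

def normalize(text: str) -> str:
    """Apply display-formatting rules to raw spoken-form ASR text."""
    out = text.capitalize()  # sentence casing first, so rules see lowercase words
    for spoken, written in ITN_RULES:
        out = out.replace(spoken, written)
    return out

print(normalize(RAW_ASR_OUTPUT))  # -> "I paid $22 on May 3"
```

Real second-stage models learn these rewrites from data rather than from hand-written rules, but the division of labor is the same: one model hears, the other formats.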
Elliot said everyone who has been to an airport or a bar, or has used closed captioning on the news, is familiar with speech recognition technology, usually at the sentence level.
Automated speech recognition platforms frequently automate at the sentence level so that a machine learning algorithm can incorporate context. When translating or transcribing at the word level, an algorithm goes word by word to create an output, which can result in broken sentences or mismatched meanings. At the sentence level, algorithms still build transcriptions word by word, but then readjust to form a complete sentence, often changing earlier words after considering the semantic and contextual meaning of the full thought.
All but one of the STT vendors featured in Gartner's market report use real-time technology at the sentence level, the word level or both.
"When you're doing it at the word level, there are some words that sound the same, and until you have the sentence context, it may be difficult to know for sure. So [the algorithm] does a best guess," Elliot said.
"It knows when it gets further in, 'I should have said this.' So it changes [the word] in sentence level."
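Elliot's homophone example can be illustrated with a toy sketch: at the word level, the algorithm commits to a best guess immediately; at the sentence level, it revisits earlier guesses once the full context arrives. The sound token, candidate list and context rule below are all invented for this sketch.

```python
# Toy illustration of word-level best guess vs. sentence-level revision.
# "<tu:>" stands in for an ambiguous sound with several homophone spellings.

CANDIDATES = {"<tu:>": ["to", "too", "two"]}  # one sound, three possible words

def word_level(sounds):
    """Go word by word: commit to the first (most common) candidate immediately."""
    return [CANDIDATES.get(s, [s])[0] for s in sounds]

def sentence_level(sounds):
    """Start from the word-level guesses, then revise with full-sentence context."""
    words = word_level(sounds)
    for i, w in enumerate(words):
        # Crude context rule: "to" directly before a countable noun was likely "two".
        if w == "to" and i + 1 < len(words) and words[i + 1] in {"tickets", "dollars"}:
            words[i] = "two"
    return words

heard = ["i", "need", "<tu:>", "tickets"]
print(" ".join(word_level(heard)))      # word level commits early: "i need to tickets"
print(" ".join(sentence_level(heard)))  # sentence level revises: "i need two tickets"
```

Production systems do this with language models scoring whole candidate sentences rather than hand-written rules, but the behavior Elliot describes -- a best guess that gets corrected further into the sentence -- is the same.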
ASR and daily functioning
Automated speech recognition and STT technologies vary widely by use case, but consumers frequently engage with transcription services -- some provide quick results, while others may take a few minutes to transcribe text. For accessibility uses, STT transcription must be nearly immediate.
Michael Conley, a deaf San Diego museum worker who uses STT technologies like Innocaption, a California-based mobile captioning app, said that real-time captioning allows him to complete activities like filling prescriptions, holding interviews and having long phone calls.
"I've talked to people and they don't know until afterwards that the call has gone through artificial intelligence or a live stenographer," Conley said. "There's never been a situation where I needed to reveal that I was using an AI-based ASR."
After recently losing his job, which provided him access to desktop-based tools, getting a mobile app was a high priority for Conley. His experience highlights the disparity between desktop tools, which are common, and mobile apps, which are rarer. Many STT technologies are limited to certain devices, apps or processing systems.
Elliot said speech-to-text technology for deaf and hard-of-hearing people needs to be multimodal, meaning it can be used across devices seamlessly. He predicted that this shift will happen within the next five years.
Limitations of STT
One of the larger issues with creating and implementing STT technologies is training algorithms.
Elliot said that, oftentimes, people don't speak the way they would write, and vice versa. There are colloquial terms, inferences, inflections of voice and other nuances that change a word's meaning. Training models on written data for speech output, or on speech data for written output, doesn't always work. The intricacies of human language all need to be represented in the data sets used to train the machine learning algorithm.
"I've had vendors tell me that it used to be a battle of the algorithms. But a lot of algorithms are now open source. The difference is increasingly who has better data for training," Elliot said.
Elliot added that STT can be hard to get right from a developer point of view, which is why the apps preferred by customers are constantly changing.
"[STT models] take a lot of capabilities, a lot of technical ability and a lot of data. You have to have a lot of skill to make these work," Elliot said. "However, it's within reach of a knowledgeable data science and machine learning [developer], because a lot of the algorithms are public now."
Another limitation is that developers need a slightly different mindset when building speech-to-text tools for deaf people. Data scientists typically strive for the highest possible accuracy when developing machine learning models, but Conley noted that for end users, any level of audio transcription automation is helpful. This means developers should focus on producing tools that are useful, even if they're not perfect.
Accessibility beyond the pandemic
Speech-to-text apps, devices and tools for deaf or hard-of-hearing employees solve a key issue with accessibility -- inclusion for those along the disability spectrum.
"[Many corporate accessibility options are] 'we could put you in touch with an American Sign Language [ASL] interpreter.' That's not a solution. Not everybody who has hearing loss knows ASL, by a long shot," Conley said.
Now with so many people staying home due to ongoing COVID-19-related restrictions, and with the introduction of face masks that hinder lip reading, deaf and hard-of-hearing people are turning to technology to assist in daily activities.
Gartner predicts speech to text and automated natural language generation will keep growing over the next 10 years. Elliot also sees a rising trend toward open source, as the AI labs of tech giants like Microsoft and Google keep their models open in order to attract new talent, researchers and students.